summaryrefslogtreecommitdiff
path: root/arch/x86/include/asm
AgeCommit message (Collapse)Author
2025-07-01x86/sev: Use TSC_FACTOR for Secure TSC frequency calculationNikunj A Dadhania
When using Secure TSC, the GUEST_TSC_FREQ MSR reports a frequency based on the nominal P0 frequency, which deviates slightly (typically ~0.2%) from the actual mean TSC frequency due to clocking parameters. Over extended VM uptime, this discrepancy accumulates, causing clock skew between the hypervisor and a SEV-SNP VM, leading to early timer interrupts as perceived by the guest. The guest kernel relies on the reported nominal frequency for TSC-based timekeeping, while the actual frequency set during SNP_LAUNCH_START may differ. This mismatch results in inaccurate time calculations, causing the guest to perceive hrtimers as firing earlier than expected. Utilize the TSC_FACTOR from the SEV firmware's secrets page (see "Secrets Page Format" in the SNP Firmware ABI Specification) to calculate the mean TSC frequency, ensuring accurate timekeeping and mitigating clock skew in SEV-SNP VMs. Use early_ioremap_encrypted() to map the secrets page as ioremap_encrypted() uses kmalloc() which is not available during early TSC initialization and causes a panic. [ bp: Drop the silly dummy var: https://lore.kernel.org/r/20250630192726.GBaGLlHl84xIopx4Pt@fat_crate.local ] Fixes: 73bbf3b0fbba ("x86/tsc: Init the TSC for Secure TSC guests") Signed-off-by: Nikunj A Dadhania <nikunj@amd.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/20250630081858.485187-1-nikunj@amd.com
2025-06-30Move FCH header to a location accessible by all archsMario Limonciello
A new header fch.h was created to store registers used by different AMD drivers. This header was included by i2c-piix4 in commit 624b0d5696a8 ("i2c: piix4, x86/platform: Move the SB800 PIIX4 FCH definitions to <asm/amd/fch.h>"). To prevent compile failures on non-x86 archs i2c-piix4 was set to only compile on x86 by commit 7e173eb82ae9717 ("i2c: piix4: Make CONFIG_I2C_PIIX4 dependent on CONFIG_X86"). This was not a good decision because loongarch and mips both actually support i2c-piix4 and set it enabled in the defconfig. Move the header to a location accessible by all architectures. Fixes: 624b0d5696a89 ("i2c: piix4, x86/platform: Move the SB800 PIIX4 FCH definitions to <asm/amd/fch.h>") Suggested-by: Hans de Goede <hansg@kernel.org> Signed-off-by: Mario Limonciello <mario.limonciello@amd.com> Reviewed-by: Hans de Goede <hansg@kernel.org> Link: https://lore.kernel.org/r/20250610205817.3912944-1-superm1@kernel.org Reviewed-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com> Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
2025-06-30Merge 6.16-rc4 into tty-nextGreg Kroah-Hartman
We need the tty/serial fixes in here as well. Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-06-29Merge tag 'x86_urgent_for_v6.16_rc4' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 fixes from Borislav Petkov: - Make sure DR6 and DR7 are initialized to their architectural values and not accidentally cleared, leading to misconfigurations * tag 'x86_urgent_for_v6.16_rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/traps: Initialize DR7 by writing its architectural reset value x86/traps: Initialize DR6 by writing its architectural reset value
2025-06-29serial: 8250: Move CE4100 quirks to a module under 8250 driverAndy Shevchenko
There is inconvenient for maintainers and maintainership to have some quirks under architectural code. Move it to the specific quirk file like other 8250-compatible drivers do. Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Link: https://lore.kernel.org/r/20250627182743.1273326-1-andriy.shevchenko@linux.intel.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-06-27x86/sev: Let sev_es_efi_map_ghcbs() map the CA pages tooGerd Hoffmann
OVMF EFI firmware needs access to the CA page to do SVSM protocol calls. For example, when the SVSM implements an EFI variable store, such calls will be necessary. So add that to sev_es_efi_map_ghcbs() and also rename the function to reflect the additional job it is doing now. [ bp: Massage. ] Signed-off-by: Gerd Hoffmann <kraxel@redhat.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/20250626114014.373748-4-kraxel@redhat.com
2025-06-25VFIO: KVM: x86: Drop kvm_arch_{start,end}_assignment()Sean Christopherson
Drop kvm_arch_{start,end}_assignment() and all associated code now that KVM x86 no longer consumes assigned_device_count. Tracking whether or not a VFIO-assigned device is formally associated with a VM is fundamentally flawed, as such an association is optional for general usage, i.e. is prone to false negatives. E.g. prior to commit 2edd9cb79fb3 ("kvm: detect assigned device via irqbypass manager"), device passthrough via VFIO would fail to enable IRQ bypass if userspace omitted the formal VFIO<=>KVM binding. And device drivers that *need* the VFIO<=>KVM connection, e.g. KVM-GT, shouldn't be relying on generic x86 tracking infrastructure. Cc: Jim Mattson <jmattson@google.com> Link: https://lore.kernel.org/r/20250523011756.3243624-6-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-25Merge branch 'kvm-x86 mmio'Sean Christopherson
Merge the MMIO stale data branch with the device posted IRQs branch to provide a common base for removing KVM's tracking of "assigned" devices. Link: https://lore.kernel.org/all/20250523011756.3243624-1-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-25KVM: VMX: Apply MMIO Stale Data mitigation if KVM maps MMIO into the guestSean Christopherson
Enforce the MMIO State Data mitigation if KVM has ever mapped host MMIO into the VM, not if the VM has an assigned device. VFIO is but one of many ways to map host MMIO into a KVM guest, and even within VFIO, formally attaching a device to a VM via KVM_DEV_VFIO_FILE_ADD is entirely optional. Track whether or not the guest can access host MMIO on a per-MMU basis, i.e. based on whether or not the vCPU has a mapping to host MMIO. For simplicity, track MMIO mappings in "special" rools (those without a kvm_mmu_page) at the VM level, as only Intel CPUs are vulnerable, and so only legacy 32-bit shadow paging is affected, i.e. lack of precise tracking is a complete non-issue. Make the per-MMU and per-VM flags sticky. Detecting when *all* MMIO mappings have been removed would be absurdly complex. And in practice, removing MMIO from a guest will be done by deleting the associated memslot, which by default will force KVM to re-allocate all roots. Special roots will forever be mitigated, but as above, the affected scenarios are not expected to be performance sensitive. Use a VMX_RUN flag to communicate the need for a buffers flush to vmx_vcpu_enter_exit() so that kvm_vcpu_can_access_host_mmio() and all its dependencies don't need to be marked __always_inline, e.g. so that KASAN doesn't trigger a noinstr violation. Cc: Pawan Gupta <pawan.kumar.gupta@linux.intel.com> Cc: Borislav Petkov <bp@alien8.de> Fixes: 8cb861e9e3c9 ("x86/speculation/mmio: Add mitigation for Processor MMIO Stale Data") Tested-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com> Link: https://lore.kernel.org/r/20250523011756.3243624-4-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-24x86/fpu/xstate: Add CET supervisor xfeature support as a guest-only featureYang Weijiang
== Background == CET defines two register states: CET user, which includes user-mode control registers, and CET supervisor, which consists of shadow-stack pointers for privilege levels 0-2. Current kernels disable shadow stacks in kernel mode, making the CET supervisor state unused and eliminating the need for context switching. == Problem == To virtualize CET for guests, KVM must accurately emulate hardware behavior. A key challenge arises because there is no CPUID flag to indicate that shadow stack is supported only in user mode. Therefore, KVM cannot assume guests will not enable shadow stacks in kernel mode and must preserve the CET supervisor state of vCPUs. == Solution == An initial proposal to manually save and restore CET supervisor states using raw RDMSR/WRMSR in KVM was rejected due to performance concerns and its impact on KVM's ABI. Instead, leveraging the kernel's FPU infrastructure for context switching was favored [1]. The main question then became whether to enable the CET supervisor state globally for all processes or restrict it to vCPU processes. This decision involves a trade-off between a 24-byte XSTATE buffer waste for all non-vCPU processes and approximately 100 lines of code complexity in the kernel [2]. The agreed approach is to first try this optimal solution [3], i.e., restricting the CET supervisor state to guest FPUs only and eliminating unnecessary space waste. The guest-only xfeature infrastructure has already been added. Now, introduce CET supervisor xstate support as the first guest-only feature to prepare for the upcoming CET virtualization in KVM. Signed-off-by: Yang Weijiang <weijiang.yang@intel.com> Signed-off-by: Chao Gao <chao.gao@intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Rick Edgecombe <rick.p.edgecombe@intel.com> Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com> Reviewed-by: John Allen <john.allen@amd.com> Link: https://lore.kernel.org/kvm/ZM1jV3UPL0AMpVDI@google.com/ [1] Link: https://lore.kernel.org/kvm/1c2fd06e-2e97-4724-80ab-8695aa4334e7@intel.com/ [2] Link: https://lore.kernel.org/kvm/2597a87b-1248-b8ce-ce60-94074bc67ea4@intel.com/ [3] Link: https://lore.kernel.org/all/20250522151031.426788-7-chao.gao%40intel.com
2025-06-24x86/fpu/xstate: Introduce "guest-only" supervisor xfeature setYang Weijiang
In preparation for upcoming CET virtualization support, the CET supervisor state will be added as a "guest-only" feature, since it is required only by KVM (i.e., guest FPUs). Establish the infrastructure for "guest-only" features. Define a new XFEATURE_MASK_GUEST_SUPERVISOR mask to specify features that are enabled by default in guest FPUs but not in host FPUs. Specifically, for any bit in this set, permission is granted and XSAVE space is allocated during vCPU creation. Non-guest FPUs cannot enable guest-only features, even dynamically, and no XSAVE space will be allocated for them. The mask is currently empty, but this will be changed by a subsequent patch. Co-developed-by: Chao Gao <chao.gao@intel.com> Signed-off-by: Chao Gao <chao.gao@intel.com> Signed-off-by: Yang Weijiang <weijiang.yang@intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Rick Edgecombe <rick.p.edgecombe@intel.com> Reviewed-by: John Allen <john.allen@amd.com> Link: https://lore.kernel.org/all/20250522151031.426788-6-chao.gao%40intel.com
2025-06-24x86/fpu/xstate: Differentiate default features for host and guest FPUsChao Gao
Currently, guest and host FPUs share the same default features. However, the CET supervisor xstate is the first feature that needs to be enabled exclusively for guest FPUs. Enabling it for host FPUs leads to a waste of 24 bytes in the XSAVE buffer. To support "guest-only" features, add a new structure to hold the default features and sizes for guest FPUs to clearly differentiate them from those for host FPUs. Add two helpers to provide the default feature masks for guest and host FPUs. Default features are derived by applying the masks to the maximum supported features. Note that, 1) for now, guest_default_mask() and host_default_mask() are identical. This will change in a follow-up patch once guest permissions, default xfeatures, and fpstate size are all converted to use the guest defaults. 2) only supervisor features will diverge between guest FPUs and host FPUs, while user features will remain the same [1][2]. So, the new vcpu_fpu_config struct does not include default user features and size for the UABI buffer. An alternative approach is adding a guest_only_xfeatures member to fpu_kernel_cfg and adding two helper functions to calculate the guest default xfeatures and size. However, calculating these defaults at runtime would introduce unnecessary overhead. Suggested-by: Chang S. Bae <chang.seok.bae@intel.com> Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Chao Gao <chao.gao@intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: John Allen <john.allen@amd.com> Link: https://lore.kernel.org/kvm/aAwdQ759Y6V7SGhv@google.com/ [1] Link: https://lore.kernel.org/kvm/9ca17e1169805f35168eb722734fbf3579187886.camel@intel.com/ [2] Link: https://lore.kernel.org/all/20250522151031.426788-2-chao.gao%40intel.com
2025-06-24x86/traps: Initialize DR7 by writing its architectural reset valueXin Li (Intel)
Initialize DR7 by writing its architectural reset value to always set bit 10, which is reserved to '1', when "clearing" DR7 so as not to trigger unanticipated behavior if said bit is ever unreserved, e.g. as a feature enabling flag with inverted polarity. Signed-off-by: Xin Li (Intel) <xin@zytor.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: H. Peter Anvin (Intel) <hpa@zytor.com> Reviewed-by: Sohil Mehta <sohil.mehta@intel.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Sean Christopherson <seanjc@google.com> Tested-by: Sohil Mehta <sohil.mehta@intel.com> Cc:stable@vger.kernel.org Link: https://lore.kernel.org/all/20250620231504.2676902-3-xin%40zytor.com
2025-06-24KVM: x86: Use kvzalloc() to allocate VM structSean Christopherson
Allocate VM structs via kvzalloc(), i.e. try to use a contiguous physical allocation before falling back to __vmalloc(), to avoid the overhead of establishing the virtual mappings. For non-debug builds, The SVM and VMX (and TDX) structures are now just below 7000 bytes in the worst case scenario (see below), i.e. are order-1 allocations, and will likely remain that way for quite some time. Add compile-time assertions in vendor code to ensure the size of the structures, sans the memslot hash tables, are order-0 allocations, i.e. are less than 4KiB. There's nothing fundamentally wrong with a larger kvm_{svm,vmx,tdx} size, but given that the size of the structure (without the memslots hash tables) is below 2KiB after 18+ years of existence, more than doubling the size would be quite notable. Add sanity checks on the memslot hash table sizes, partly to ensure they aren't resized without accounting for the impact on VM structure size, and partly to document that the majority of the size of VM structures comes from the memslots. Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com> Link: https://lore.kernel.org/r/20250523001138.3182794-4-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-24KVM: x86/mmu: Dynamically allocate shadow MMU's hashed page listSean Christopherson
Dynamically allocate the (massive) array of hashed lists used to track shadow pages, as the array itself is 32KiB, i.e. is an order-3 allocation all on its own, and is *exactly* an order-3 allocation. Dynamically allocating the array will allow allocating "struct kvm" using kvmalloc(), and will also allow deferring allocation of the array until it's actually needed, i.e. until the first shadow root is allocated. Opportunistically use kvmalloc() for the hashed lists, as an order-3 allocation is (stating the obvious) less likely to fail than an order-4 allocation, and the overhead of vmalloc() is undesirable given that the size of the allocation is fixed. Cc: Vipin Sharma <vipinsh@google.com> Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com> Link: https://lore.kernel.org/r/20250523001138.3182794-3-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-24KVM: x86/hyper-v: Use preallocated per-vCPU buffer for de-sparsified vCPU masksSean Christopherson
Use a preallocated per-vCPU bitmap for tracking the unpacked set of vCPUs being targeted for Hyper-V's paravirt TLB flushing. If KVM_MAX_NR_VCPUS is set to 4096 (which is allowed even for MAXSMP=n builds), putting the vCPU mask on-stack pushes kvm_hv_flush_tlb() past the default FRAME_WARN limit. arch/x86/kvm/hyperv.c:2001:12: error: stack frame size (1288) exceeds limit (1024) in 'kvm_hv_flush_tlb' [-Werror,-Wframe-larger-than] 2001 | static u64 kvm_hv_flush_tlb(struct kvm_vcpu *vcpu, struct kvm_hv_hcall *hc) | ^ 1 error generated. Note, sparse_banks was given the same treatment by commit 7d5e88d301f8 ("KVM: x86: hyper-v: Use preallocated buffer in 'struct kvm_vcpu_hv' instead of on-stack 'sparse_banks'"), for the exact same reason. Reported-by: Abinash Lalotra <abinashsinghlalotra@gmail.com> Closes: https://lore.kernel.org/all/20250613111023.786265-1-abinashsinghlalotra@gmail.com Link: https://lore.kernel.org/all/aEylI-O8kFnFHrOH@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-23KVM: x86: Rename kvm_set_msi_irq() => kvm_msi_to_lapic_irq()Sean Christopherson
Rename kvm_set_msi_irq() to kvm_msi_to_lapic_irq() to better capture what it actually does, e.g. it's _really_ easy to conflate kvm_set_msi_irq() with kvm_set_msi(). Opportunistically delete the public declaration and export, as they are no longer used/needed. No functional change intended. Link: https://lore.kernel.org/r/20250611224604.313496-64-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-23KVM: SVM: Generate GA log IRQs only if the associated vCPUs is blockingSean Christopherson
Configure IRTEs to GA log interrupts for device posted IRQs that hit non-running vCPUs if and only if the target vCPU is blocking, i.e. actually needs a wake event. If the vCPU has exited to userspace or was preempted, generating GA log entries and interrupts is wasteful and unnecessary, as the vCPU will be re-loaded and/or scheduled back in irrespective of the GA log notification (avic_ga_log_notifier() is just a fancy wrapper for kvm_vcpu_wake_up()). Use a should-be-zero bit in the vCPU's Physical APIC ID Table Entry to track whether or not the vCPU's associated IRTEs are configured to generate GA logs, but only set the synthetic bit in KVM's "cache", i.e. never set the should-be-zero bit in tables that are used by hardware. Use a synthetic bit instead of a dedicated boolean to minimize the odds of messing up the locking, i.e. so that all the existing rules that apply to avic_physical_id_entry for IS_RUNNING are reused verbatim for GA_LOG_INTR. Note, because KVM (by design) "puts" AVIC state in a "pre-blocking" phase, using kvm_vcpu_is_blocking() to track the need for notifications isn't a viable option. Link: https://lore.kernel.org/r/20250611224604.313496-63-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-23iommu/amd: KVM: SVM: Allow KVM to control need for GA log interruptsSean Christopherson
Add plumbing to the AMD IOMMU driver to allow KVM to control whether or not an IRTE is configured to generate GA log interrupts. KVM only needs a notification if the target vCPU is blocking, so the vCPU can be awakened. If a vCPU is preempted or exits to userspace, KVM clears is_run, but will set the vCPU back to running when userspace does KVM_RUN and/or the vCPU task is scheduled back in, i.e. KVM doesn't need a notification. Unconditionally pass "true" in all KVM paths to isolate the IOMMU changes from the KVM changes insofar as possible. Opportunistically swap the ordering of parameters for amd_iommu_update_ga() so that the match amd_iommu_activate_guest_mode(). Note, as of this writing, the AMD IOMMU manual doesn't list GALogIntr as a non-cached field, but per AMD hardware architects, it's not cached and can be safely updated without an invalidation. Link: https://lore.kernel.org/all/b29b8c22-2fd4-4b5e-b755-9198874157c7@amd.com Cc: Vasant Hegde <vasant.hegde@amd.com> Cc: Joao Martins <joao.m.martins@oracle.com> Link: https://lore.kernel.org/r/20250611224604.313496-62-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-23KVM: x86: Decouple device assignment from IRQ bypassSean Christopherson
Use a dedicated counter to track the number of IRQs that can utilize IRQ bypass instead of piggybacking the assigned device count. As evidenced by commit 2edd9cb79fb3 ("kvm: detect assigned device via irqbypass manager"), it's possible for a device to be able to post IRQs to a vCPU without said device being assigned to a VM. Leave the calls to kvm_arch_{start,end}_assignment() alone for the moment to avoid regressing the MMIO stale data mitigation. KVM is abusing the assigned device count when applying mmio_stale_data_clear, and it's not at all clear if vDPA devices rely on this behavior. This will hopefully be cleaned up in the future, as the number of assigned devices is a terrible heuristic for detecting if a VM has access to host MMIO. Link: https://lore.kernel.org/r/20250611224604.313496-55-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-23iommu/amd: KVM: SVM: Set pCPU info in IRTE when setting vCPU affinitySean Christopherson
Now that setting vCPU affinity is guarded with ir_list_lock, i.e. now that avic_physical_id_entry can be safely accessed, set the pCPU info straight-away when setting vCPU affinity. Putting the IRTE into posted mode, and then immediately updating the IRTE a second time if the target vCPU is running is wasteful and confusing. This also fixes a flaw where a posted IRQ that arrives between putting the IRTE into guest_mode and setting the correct destination could cause the IOMMU to ring the doorbell on the wrong pCPU. Link: https://lore.kernel.org/r/20250611224604.313496-44-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-23iommu: KVM: Split "struct vcpu_data" into separate AMD vs. Intel structsSean Christopherson
Split the vcpu_data structure that serves as a handoff from KVM to IOMMU drivers into vendor specific structures. Overloading a single structure makes the code hard to read and maintain, is *very* misleading as it suggests that mixing vendors is actually supported, and bastardizing Intel's posted interrupt descriptor address when AMD's IOMMU already has its own structure is quite unnecessary. Tested-by: Sairaj Kodilkar <sarunkod@amd.com> Link: https://lore.kernel.org/r/20250611224604.313496-33-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-23KVM: x86: Dedup AVIC vs. PI code for identifying target vCPUSean Christopherson
Hoist the logic for identifying the target vCPU for a posted interrupt into common x86. The code is functionally identical between Intel and AMD. Tested-by: Sairaj Kodilkar <sarunkod@amd.com> Link: https://lore.kernel.org/r/20250611224604.313496-30-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-22Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvmLinus Torvalds
Pull kvm fixes from Paolo Bonzini: "ARM: - Fix another set of FP/SIMD/SVE bugs affecting NV, and plugging some missing synchronisation - A small fix for the irqbypass hook fixes, tightening the check and ensuring that we only deal with MSI for both the old and the new route entry - Rework the way the shadow LRs are addressed in a nesting configuration, plugging an embarrassing bug as well as simplifying the whole process - Add yet another fix for the dreaded arch_timer_edge_cases selftest RISC-V: - Fix the size parameter check in SBI SFENCE calls - Don't treat SBI HFENCE calls as NOPs x86 TDX: - Complete API for handling complex TDVMCALLs in userspace. This was delayed because the spec lacked a way for userspace to deny supporting these calls; the new exit code is now approved" * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: KVM: TDX: Exit to userspace for GetTdVmCallInfo KVM: TDX: Handle TDG.VP.VMCALL<GetQuote> KVM: TDX: Add new TDVMCALL status code for unsupported subfuncs KVM: arm64: VHE: Centralize ISBs when returning to host KVM: arm64: Remove cpacr_clear_set() KVM: arm64: Remove ad-hoc CPTR manipulation from kvm_hyp_handle_fpsimd() KVM: arm64: Remove ad-hoc CPTR manipulation from fpsimd_sve_sync() KVM: arm64: Reorganise CPTR trap manipulation KVM: arm64: VHE: Synchronize CPTR trap deactivation KVM: arm64: VHE: Synchronize restore of host debug registers KVM: arm64: selftests: Close the GIC FD in arch_timer_edge_cases KVM: arm64: Explicitly treat routing entry type changes as changes KVM: arm64: nv: Fix tracking of shadow list registers RISC-V: KVM: Don't treat SBI HFENCE calls as NOPs RISC-V: KVM: Fix the size parameter check in SBI SFENCE calls
2025-06-21x86/CPU/AMD: Add CPUID faulting supportBorislav Petkov (AMD)
Add CPUID faulting support on AMD using the same user interface. Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/20250528213105.1149-1-bp@kernel.org
2025-06-20KVM: VMX: Move enable_ipiv knob to common x86Sean Christopherson
Move enable_ipiv to common x86 so that it can be reused by SVM to control IPI virtualization when AVIC is enabled. SVM doesn't actually provide a way to truly disable IPI virtualization, but KVM can get close enough by skipping the necessary table programming. Link: https://lore.kernel.org/r/20250611224604.313496-18-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-20KVM: SVM: Inhibit AVIC if ID is too big instead of rejecting vCPU creationSean Christopherson
Inhibit AVIC with a new "ID too big" flag if userspace creates a vCPU with an ID that is too big, but otherwise allow vCPU creation to succeed. Rejecting KVM_CREATE_VCPU with EINVAL violates KVM's ABI as KVM advertises that the max vCPU ID is 4095, but disallows creating vCPUs with IDs bigger than 254 (AVIC) or 511 (x2AVIC). Alternatively, KVM could advertise an accurate value depending on which AVIC mode is in use, but that wouldn't really solve the underlying problem, e.g. would be a breaking change if KVM were to ever try and enable AVIC or x2AVIC by default. Cc: Maxim Levitsky <mlevitsk@redhat.com> Tested-by: Sairaj Kodilkar <sarunkod@amd.com> Link: https://lore.kernel.org/r/20250611224604.313496-14-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-20KVM: SVM: Drop pointless masking of kernel page pa's with AVIC HPA masksSean Christopherson
Drop AVIC_HPA_MASK and all its users, the mask is just the 4KiB-aligned maximum theoretical physical address for x86-64 CPUs, as x86-64 is currently defined (going beyond PA52 would require an entirely new paging mode, which would arguably create a new, different architecture). All usage in KVM masks the result of page_to_phys(), which on x86-64 is guaranteed to be 4KiB aligned and a legal physical address; if either of those requirements doesn't hold true, KVM has far bigger problems. Drop masking the avic_backing_page with AVIC_PHYSICAL_ID_ENTRY_BACKING_PAGE_MASK for all the same reasons, but keep the macro even though it's unused in functional code. It's a distinct architectural define, and having the definition in software helps visualize the layout of an entry. And to be hyper-paranoid about MAXPA going beyond 52, add a compile-time assert to ensure the kernel's maximum supported physical address stays in bounds. The unnecessary masking in avic_init_vmcb() also incorrectly assumes that SME's C-bit resides between bits 51:11; that holds true for current CPUs, but isn't required by AMD's architecture: In some implementations, the bit used may be a physical address bit Key word being "may". Opportunistically use the GENMASK_ULL() version for AVIC_PHYSICAL_ID_ENTRY_BACKING_PAGE_MASK, which is far more readable than a set of repeating Fs. Tested-by: Sairaj Kodilkar <sarunkod@amd.com> Reviewed-by: Naveen N Rao (AMD) <naveen@kernel.org> Link: https://lore.kernel.org/r/20250611224604.313496-11-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-20KVM: SVM: Drop pointless masking of default APIC base when setting V_APIC_BARSean Christopherson
Drop VMCB_AVIC_APIC_BAR_MASK, it's just a regurgitation of the maximum theoretical 4KiB-aligned physical address, i.e. is not novel in any way, and its only usage is to mask the default APIC base, which is 4KiB aligned and (obviously) a legal physical address. No functional change intended. Tested-by: Sairaj Kodilkar <sarunkod@amd.com> Reviewed-by: Naveen N Rao (AMD) <naveen@kernel.org> Link: https://lore.kernel.org/r/20250611224604.313496-10-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-20KVM: Pass new routing entries and irqfd when updating IRTEsSean Christopherson
When updating IRTEs in response to a GSI routing or IRQ bypass change, pass the new/current routing information along with the associated irqfd. This will allow KVM x86 to harden, simplify, and deduplicate its code. Since adding/removing a bypass producer is now conveniently protected with irqfds.lock, i.e. can't run concurrently with kvm_irq_routing_update(), use the routing information cached in the irqfd instead of looking up the information in the current GSI routing tables. Opportunistically convert an existing printk() to pr_info() and put its string onto a single line (old code that strictly adhered to 80 chars). Link: https://lore.kernel.org/r/20250611224604.313496-5-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-20KVM: x86: Move IRQ mask notifier infrastructure to I/O APIC emulationSean Christopherson
Move the IRQ mask logic to ioapic.c as KVM's only user is its in-kernel I/O APIC emulation. In addition to encapsulating more I/O APIC specific code, trimming down irq_comm.c helps pave the way for removing it entirely. Acked-by: Kai Huang <kai.huang@intel.com> Link: https://lore.kernel.org/r/20250611213557.294358-18-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-20KVM: x86: Add CONFIG_KVM_IOAPIC to allow disabling in-kernel I/O APICSean Christopherson
Add a Kconfig to allow building KVM without support for emulating a I/O APIC, PIC, and PIT, which is desirable for deployments that effectively don't support a fully in-kernel IRQ chip, i.e. never expect any VMM to create an in-kernel I/O APIC. E.g. compiling out support eliminates a few thousand lines of guest-facing code and gives security folks warm fuzzies. As a bonus, wrapping relevant paths with CONFIG_KVM_IOAPIC #ifdefs makes it much easier for readers to understand which bits and pieces exist specifically for fully in-kernel IRQ chips. Opportunistically convert all two in-kernel uses of __KVM_HAVE_IOAPIC to CONFIG_KVM_IOAPIC, e.g. rather than add a second #ifdef to generate a stub for kvm_arch_post_irq_routing_update(). Acked-by: Kai Huang <kai.huang@intel.com> Link: https://lore.kernel.org/r/20250611213557.294358-15-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-20KVM: x86: Don't clear PIT's IRQ line status when destroying PITSean Christopherson
Don't bother clearing the PIT's IRQ line status when destroying the PIT, as userspace can't possibly rely on KVM to lower the IRQ line in any sane use case, and it's not at all obvious that clearing the PIT's IRQ line is correct/desirable in kvm_create_pit()'s error path. When called from kvm_arch_pre_destroy_vm(), the entire VM is being torn down and thus {kvm_pic,kvm_ioapic}.irq_states are unreachable. As for the error path in kvm_create_pit(), the only way the PIT's bit in irq_states can be set is if userspace raises the associated IRQ before KVM_CREATE_PIT{2} completes. Forcefully clearing the bit would clobber userspace's input, nonsensical though that input may be. Not to mention that no known VMM will continue on if PIT creation fails. Acked-by: Kai Huang <kai.huang@intel.com> Link: https://lore.kernel.org/r/20250611213557.294358-12-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-20KVM: x86: Hardcode the PIT IRQ source ID to '2'Sean Christopherson
Hardcode the PIT's source IRQ ID to '2' instead of "finding" that bit 2 is always the first available bit in irq_sources_bitmap. Bits 0 and 1 are set/reserved by kvm_arch_init_vm(), i.e. long before kvm_create_pit() can be invoked, and KVM allows at most one in-kernel PIT instance, i.e. it's impossible for the PIT to find a different free bit (there are no other users of kvm_request_irq_source_id(). Delete the now-defunct irq_sources_bitmap and all its associated code. Acked-by: Kai Huang <kai.huang@intel.com> Link: https://lore.kernel.org/r/20250611213557.294358-11-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-20KVM: x86: Drop superfluous kvm_set_pic_irq() => kvm_pic_set_irq() wrapperSean Christopherson
Drop the superfluous and confusing kvm_set_pic_irq() => kvm_pic_set_irq() wrapper, and instead wire up ->set() directly to its final destination. Opportunistically move the declaration kvm_pic_set_irq() to irq.h to start gathering more of the in-kernel APIC/IO-APIC logic in irq.{c,h}. No functional change intended. Acked-by: Kai Huang <kai.huang@intel.com> Link: https://lore.kernel.org/r/20250611213557.294358-3-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-20KVM: x86: Rename msr_filter_changed() => recalc_msr_intercepts()Sean Christopherson
Rename msr_filter_changed() to recalc_msr_intercepts() and drop the trampoline wrapper now that both SVM and VMX use a filter-agnostic recalc helper to react to the new userspace filter. No functional change intended. Reviewed-by: Xin Li (Intel) <xin@zytor.com> Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com> Link: https://lore.kernel.org/r/20250610225737.156318-21-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-20KVM: VMX: Preserve host's DEBUGCTLMSR_FREEZE_IN_SMM while running the guestMaxim Levitsky
Set/clear DEBUGCTLMSR_FREEZE_IN_SMM in GUEST_IA32_DEBUGCTL based on the host's pre-VM-Enter value, i.e. preserve the host's FREEZE_IN_SMM setting while running the guest. When running with the "default treatment of SMIs" in effect (the only mode KVM supports), SMIs do not generate a VM-Exit that is visible to host (non-SMM) software, and instead transitions directly from VMX non-root to SMM. And critically, DEBUGCTL isn't context switched by hardware on SMI or RSM, i.e. SMM will run with whatever value was resident in hardware at the time of the SMI. Failure to preserve FREEZE_IN_SMM results in the PMU unexpectedly counting events while the CPU is executing in SMM, which can pollute profiling and potentially leak information into the guest. Check for changes in FREEZE_IN_SMM prior to every entry into KVM's inner run loop, as the bit can be toggled in IRQ context via IPI callback (SMP function call), by way of /sys/devices/cpu/freeze_on_smi. Add a field in kvm_x86_ops to communicate which DEBUGCTL bits need to be preserved, as FREEZE_IN_SMM is only supported and defined for Intel CPUs, i.e. explicitly checking FREEZE_IN_SMM in common x86 is at best weird, and at worst could lead to undesirable behavior in the future if AMD CPUs ever happened to pick up a collision with the bit. Exempt TDX vCPUs, i.e. protected guests, from the check, as the TDX Module owns and controls GUEST_IA32_DEBUGCTL. WARN in SVM if KVM_RUN_LOAD_DEBUGCTL is set, mostly to document that the lack of handling isn't a KVM bug (TDX already WARNs on any run_flag). Lastly, explicitly reload GUEST_IA32_DEBUGCTL on a VM-Fail that is missed by KVM but detected by hardware, i.e. in nested_vmx_restore_host_state(). Doing so avoids the need to track host_debugctl on a per-VMCS basis, as GUEST_IA32_DEBUGCTL is unconditionally written by prepare_vmcs02() and load_vmcs12_host_state(). For the VM-Fail case, even though KVM won't have actually entered the guest, vcpu_enter_guest() will have run with vmcs02 active and thus could result in vmcs01 being run with a stale value. Cc: stable@vger.kernel.org Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com> Co-developed-by: Sean Christopherson <seanjc@google.com> Link: https://lore.kernel.org/r/20250610232010.162191-9-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-20KVM: VMX: Allow guest to set DEBUGCTL.RTM_DEBUG if RTM is supportedSean Christopherson
Let the guest set DEBUGCTL.RTM_DEBUG if RTM is supported according to the guest CPUID model, as debug support is supposed to be available if RTM is supported, and there are no known downsides to letting the guest debug RTM aborts. Note, there are no known bug reports related to RTM_DEBUG, the primary motivation is to reduce the probability of breaking existing guests when a future change adds a missing consistency check on vmcs12.GUEST_DEBUGCTL (KVM currently lets L2 run with whatever hardware supports; whoops). Note #2, KVM already emulates DR6.RTM, and doesn't restrict access to DR7.RTM. Fixes: 83c529151ab0 ("KVM: x86: expose Intel cpu new features (HLE, RTM) to guest") Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20250610232010.162191-5-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-20KVM: x86: Drop kvm_x86_ops.set_dr6() in favor of a new KVM_RUN flagSean Christopherson
Instruct vendor code to load the guest's DR6 into hardware via a new KVM_RUN flag, and remove kvm_x86_ops.set_dr6(), whose sole purpose was to load vcpu->arch.dr6 into hardware when DR6 can be read/written directly by the guest. Note, TDX already WARNs on any run_flag being set, i.e. will yell if KVM thinks DR6 needs to be reloaded. TDX vCPUs force KVM_DEBUGREG_AUTO_SWITCH and never clear the flag, i.e. should never observe KVM_RUN_LOAD_GUEST_DR6. Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20250610232010.162191-4-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-20KVM: x86: Convert vcpu_run()'s immediate exit param into a generic bitmapSean Christopherson
Convert kvm_x86_ops.vcpu_run()'s "force_immediate_exit" boolean parameter into an a generic bitmap so that similar "take action" information can be passed to vendor code without creating a pile of boolean parameters. This will allow dropping kvm_x86_ops.set_dr6() in favor of a new flag, and will also allow for adding similar functionality for re-loading debugctl in the active VMCS. Opportunistically massage the TDX WARN and comment to prepare for adding more run_flags, all of which are expected to be mutually exclusive with TDX, i.e. should be WARNed on. No functional change intended. Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20250610232010.162191-3-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-20KVM: TDX: Exit to userspace for SetupEventNotifyInterruptPaolo Bonzini
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-06-20KVM: TDX: Add new TDVMCALL status code for unsupported subfuncsBinbin Wu
Add the new TDVMCALL status code TDVMCALL_STATUS_SUBFUNC_UNSUPPORTED and return it for unimplemented TDVMCALL subfunctions. Returning TDVMCALL_STATUS_INVALID_OPERAND when a subfunction is not implemented is vague because TDX guests can't tell the error is due to the subfunction is not supported or an invalid input of the subfunction. New GHCI spec adds TDVMCALL_STATUS_SUBFUNC_UNSUPPORTED to avoid the ambiguity. Use it instead of TDVMCALL_STATUS_INVALID_OPERAND. Before the change, for common guest implementations, when a TDX guest receives TDVMCALL_STATUS_INVALID_OPERAND, it has two cases: 1. Some operand is invalid. It could change the operand to another value retry. 2. The subfunction is not supported. For case 1, an invalid operand usually means the guest implementation bug. Since the TDX guest can't tell which case is, the best practice for handling TDVMCALL_STATUS_INVALID_OPERAND is stopping calling such leaf, treating the failure as fatal if the TDVMCALL is essential or ignoring it if the TDVMCALL is optional. With this change, TDVMCALL_STATUS_SUBFUNC_UNSUPPORTED could be sent to old TDX guest that do not know about it, but it is expected that the guest will make the same action as TDVMCALL_STATUS_INVALID_OPERAND. Currently, no known TDX guest checks TDVMCALL_STATUS_INVALID_OPERAND specifically; for example Linux just checks for success. Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com> [Return it for untrapped KVM_HC_MAP_GPA_RANGE. - Paolo] Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-06-18virt: sev-guest: Contain snp_guest_request_ioctl in sev-guestAlexey Kardashevskiy
SNP Guest Request uses only exitinfo2 which is a return value from GHCB, has meaning beyond ioctl and therefore belongs to struct snp_guest_req. Move exitinfo2 there and remove snp_guest_request_ioctl from the SEV platform code. No functional change intended. Signed-off-by: Alexey Kardashevskiy <aik@amd.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com> Reviewed-by: Dionna Glaze <dionnaglaze@google.com> Link: https://lore.kernel.org/20250611040842.2667262-2-aik@amd.com
2025-06-17x86/process: Move the buffer clearing before MONITORBorislav Petkov (AMD)
Move the VERW clearing before the MONITOR so that VERW doesn't disarm it and the machine never enters C1. Original idea by Kim Phillips <kim.phillips@amd.com>. Suggested-by: Andrew Cooper <andrew.cooper3@citrix.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
2025-06-17KVM: SVM: Advertise TSA CPUID bits to guestsBorislav Petkov (AMD)
Synthesize the TSA CPUID feature bits for guests. Set TSA_{SQ,L1}_NO on unaffected machines. Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
2025-06-17x86/bugs: Add a Transient Scheduler Attacks mitigationBorislav Petkov (AMD)
Add the required features detection glue to bugs.c et all in order to support the TSA mitigation. Co-developed-by: Kim Phillips <kim.phillips@amd.com> Signed-off-by: Kim Phillips <kim.phillips@amd.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
2025-06-16Merge tag 'x86_urgent_for_6.16-rc3' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 fixes from Dave Hansen: "This is a pretty scattered set of fixes. The majority of them are further fixups around the recent ITS mitigations. The rest don't really have a coherent story: - Some flavors of Xen PV guests don't support large pages, but the set_memory.c code assumes all CPUs support them. Avoid problems with a quick CPU feature check. - The TDX code has some wrappers to help retry calls to the TDX module. They use function pointers to assembly functions and the compiler usually generates direct CALLs. But some new compilers, plus -Os turned them in to indirect CALLs and the assembly code was not annotated for indirect calls. Force inlining of the helper to fix it up. - Last, a FRED issue showed up when single-stepping. It's fine when using an external debugger, but was getting stuck returning from a SIGTRAP handler otherwise. Clear the FRED 'swevent' bit to ensure that forward progress is made" * tag 'x86_urgent_for_6.16-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: Revert "mm/execmem: Unify early execmem_cache behaviour" x86/its: explicitly manage permissions for ITS pages x86/its: move its_pages array to struct mod_arch_specific x86/Kconfig: only enable ROX cache in execmem when STRICT_MODULE_RWX is set x86/mm/pat: don't collapse pages without PSE set x86/virt/tdx: Avoid indirect calls to TDX assembly functions selftests/x86: Add a test to detect infinite SIGTRAP handler loop x86/fred/signal: Prevent immediate repeat of single step trap on return from SIGTRAP handler
2025-06-16x86/bugs: Rename MDS machinery to something more genericBorislav Petkov (AMD)
It will be used by other x86 mitigations. No functional changes. Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
2025-06-11mm: pgtable: fix pte_swp_exclusiveMagnus Lindholm
Make pte_swp_exclusive return bool instead of int. This will better reflect how pte_swp_exclusive is actually used in the code. This fixes swap/swapoff problems on Alpha due pte_swp_exclusive not returning correct values when _PAGE_SWP_EXCLUSIVE bit resides in upper 32-bits of PTE (like on alpha). Suggested-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Magnus Lindholm <linmag7@gmail.com> Cc: Sam James <sam@gentoo.org> Link: https://lore.kernel.org/lkml/20250218175735.19882-2-linmag7@gmail.com/ Link: https://lore.kernel.org/lkml/20250602041118.GA2675383@ZenIV/ [ Applied as the 'sed' script Al suggested - Linus ] Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2025-06-11x86/its: move its_pages array to struct mod_arch_specificMike Rapoport (Microsoft)
The of pages with ITS thunks allocated for modules are tracked by an array in 'struct module'. Since this is very architecture specific data structure, move it to 'struct mod_arch_specific'. No functional changes. Fixes: 872df34d7c51 ("x86/its: Use dynamic thunks for indirect branches") Suggested-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: stable@vger.kernel.org Link: https://lkml.kernel.org/r/20250603111446.2609381-4-rppt@kernel.org