summaryrefslogtreecommitdiff
AgeCommit message (Collapse)Author
2025-06-20KVM: SVM: Delete IRTE link from previous vCPU irrespective of new routingSean Christopherson
Delete the IRTE link from the previous vCPU irrespective of the new routing state, i.e. even if the IRTE won't be configured to post IRQs to a vCPU. Whether or not the new route is postable as no bearing on the *old* route. Failure to delete the link can result in KVM incorrectly updating the IRTE, e.g. if the "old" vCPU is scheduled in/out. Fixes: 411b44ba80ab ("svm: Implements update_pi_irte hook to setup posted interrupt") Tested-by: Sairaj Kodilkar <sarunkod@amd.com> Link: https://lore.kernel.org/r/20250611224604.313496-9-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-20iommu/amd: KVM: SVM: Delete now-unused cached/previous GA tag fieldsSean Christopherson
Delete the amd_ir_data.prev_ga_tag field now that all usage is superfluous. Reviewed-by: Vasant Hegde <vasant.hegde@amd.com> Tested-by: Sairaj Kodilkar <sarunkod@amd.com> Link: https://lore.kernel.org/r/20250611224604.313496-8-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-20KVM: SVM: Delete IRTE link from previous vCPU before setting new IRTESean Christopherson
Delete the previous per-vCPU IRTE link prior to modifying the IRTE. If forcing the IRTE back to remapped mode fails, the IRQ is already broken; keeping stale metadata won't change that, and the IOMMU should be sufficiently paranoid to sanitize the IRTE when the IRQ is freed and reallocated. This will allow hoisting the vCPU tracking to common x86, which in turn will allow most of the IRTE update code to be deduplicated. Tested-by: Sairaj Kodilkar <sarunkod@amd.com> Link: https://lore.kernel.org/r/20250611224604.313496-7-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-20KVM: SVM: Track per-vCPU IRTEs using kvm_kernel_irqfd structureSean Christopherson
Track the IRTEs that are posting to an SVM vCPU via the associated irqfd structure and GSI routing instead of dynamically allocating a separate data structure. In addition to eliminating an atomic allocation, this will allow hoisting much of the IRTE update logic to common x86. Cc: Sairaj Kodilkar <sarunkod@amd.com> Link: https://lore.kernel.org/r/20250611224604.313496-6-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-20KVM: Pass new routing entries and irqfd when updating IRTEsSean Christopherson
When updating IRTEs in response to a GSI routing or IRQ bypass change, pass the new/current routing information along with the associated irqfd. This will allow KVM x86 to harden, simplify, and deduplicate its code. Since adding/removing a bypass producer is now conveniently protected with irqfds.lock, i.e. can't run concurrently with kvm_irq_routing_update(), use the routing information cached in the irqfd instead of looking up the information in the current GSI routing tables. Opportunistically convert an existing printk() to pr_info() and put its string onto a single line (old code that strictly adhered to 80 chars). Link: https://lore.kernel.org/r/20250611224604.313496-5-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-20KVM: x86: Fold irq_comm.c into irq.cSean Christopherson
Drop irq_comm.c, a.k.a. common IRQ APIs, as there has been no non-x86 user since commit 003f7de62589 ("KVM: ia64: remove") (at the time, irq_comm.c lived in virt/kvm, not arch/x86/kvm). Suggested-by: Paolo Bonzini <pbonzini@redhat.com> Acked-by: Kai Huang <kai.huang@intel.com> Link: https://lore.kernel.org/r/20250611213557.294358-19-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-20KVM: x86: Move IRQ mask notifier infrastructure to I/O APIC emulationSean Christopherson
Move the IRQ mask logic to ioapic.c as KVM's only user is its in-kernel I/O APIC emulation. In addition to encapsulating more I/O APIC specific code, trimming down irq_comm.c helps pave the way for removing it entirely. Acked-by: Kai Huang <kai.huang@intel.com> Link: https://lore.kernel.org/r/20250611213557.294358-18-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-20KVM: selftests: Fall back to split IRQ chip if full in-kernel chip is ↵Sean Christopherson
unsupported Now that KVM x86 allows compiling out support for in-kernel I/O APIC (and PIC and PIT) emulation, i.e. allows disabling KVM_CREATE_IRQCHIP for all intents and purposes, fall back to a split IRQ chip for x86 if creating the full in-kernel version fails with ENOTTY. Acked-by: Kai Huang <kai.huang@intel.com> Link: https://lore.kernel.org/r/20250611213557.294358-17-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-20KVM: Squash two CONFIG_HAVE_KVM_IRQCHIP #ifdefs into oneSean Christopherson
Squash two #idef CONFIG_HAVE_KVM_IRQCHIP regions in KVM's trace events, as the only code outside of the #idefs depends on CONFIG_KVM_IOAPIC, and that Kconfig only exists for x86, which unconditionally selects HAVE_KVM_IRQCHIP. No functional change intended. Acked-by: Kai Huang <kai.huang@intel.com> Link: https://lore.kernel.org/r/20250611213557.294358-16-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-20KVM: x86: Add CONFIG_KVM_IOAPIC to allow disabling in-kernel I/O APICSean Christopherson
Add a Kconfig to allow building KVM without support for emulating a I/O APIC, PIC, and PIT, which is desirable for deployments that effectively don't support a fully in-kernel IRQ chip, i.e. never expect any VMM to create an in-kernel I/O APIC. E.g. compiling out support eliminates a few thousand lines of guest-facing code and gives security folks warm fuzzies. As a bonus, wrapping relevant paths with CONFIG_KVM_IOAPIC #ifdefs makes it much easier for readers to understand which bits and pieces exist specifically for fully in-kernel IRQ chips. Opportunistically convert all two in-kernel uses of __KVM_HAVE_IOAPIC to CONFIG_KVM_IOAPIC, e.g. rather than add a second #ifdef to generate a stub for kvm_arch_post_irq_routing_update(). Acked-by: Kai Huang <kai.huang@intel.com> Link: https://lore.kernel.org/r/20250611213557.294358-15-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-20KVM: Move x86-only tracepoints to x86's trace.hSean Christopherson
Move the I/O APIC tracepoints and trace_kvm_msi_set_irq() to x86, as __KVM_HAVE_IOAPIC is just code for "x86", and trace_kvm_msi_set_irq() isn't unique to I/O APIC emulation. Opportunistically clean up the absurdly messy #includes in ioapic.c. No functional change intended. Acked-by: Kai Huang <kai.huang@intel.com> Link: https://lore.kernel.org/r/20250611213557.294358-14-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-20KVM: x86: Explicitly check for in-kernel PIC when getting ExtINTSean Christopherson
Explicitly check for an in-kernel PIC when checking for a pending ExtINT in the PIC. Effectively swapping the split vs. full irqchip logic will allow guarding the in-kernel I/O APIC (and PIC) emulation with a Kconfig, and also makes it more obvious that kvm_pic_read_irq() won't result in a NULL pointer dereference. Opportunistically add WARNs in the fallthrough path, mostly to document that the userspace ExtINT logic is only relevant to split IRQ chips. Acked-by: Kai Huang <kai.huang@intel.com> Link: https://lore.kernel.org/r/20250611213557.294358-13-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-20KVM: x86: Don't clear PIT's IRQ line status when destroying PITSean Christopherson
Don't bother clearing the PIT's IRQ line status when destroying the PIT, as userspace can't possibly rely on KVM to lower the IRQ line in any sane use case, and it's not at all obvious that clearing the PIT's IRQ line is correct/desirable in kvm_create_pit()'s error path. When called from kvm_arch_pre_destroy_vm(), the entire VM is being torn down and thus {kvm_pic,kvm_ioapic}.irq_states are unreachable. As for the error path in kvm_create_pit(), the only way the PIT's bit in irq_states can be set is if userspace raises the associated IRQ before KVM_CREATE_PIT{2} completes. Forcefully clearing the bit would clobber userspace's input, nonsensical though that input may be. Not to mention that no known VMM will continue on if PIT creation fails. Acked-by: Kai Huang <kai.huang@intel.com> Link: https://lore.kernel.org/r/20250611213557.294358-12-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-20KVM: x86: Hardcode the PIT IRQ source ID to '2'Sean Christopherson
Hardcode the PIT's source IRQ ID to '2' instead of "finding" that bit 2 is always the first available bit in irq_sources_bitmap. Bits 0 and 1 are set/reserved by kvm_arch_init_vm(), i.e. long before kvm_create_pit() can be invoked, and KVM allows at most one in-kernel PIT instance, i.e. it's impossible for the PIT to find a different free bit (there are no other users of kvm_request_irq_source_id(). Delete the now-defunct irq_sources_bitmap and all its associated code. Acked-by: Kai Huang <kai.huang@intel.com> Link: https://lore.kernel.org/r/20250611213557.294358-11-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-20KVM: x86: Move kvm_{request,free}_irq_source_id() to i8254.c (PIT)Sean Christopherson
Move kvm_{request,free}_irq_source_id() to i8254.c, i.e. the dedicated PIT emulation file, in anticipation of removing them entirely in favor of hardcoding the PIT's "requested" source ID (the source ID can only ever be '2', and the request can never fail). No functional change intended. Acked-by: Kai Huang <kai.huang@intel.com> Link: https://lore.kernel.org/r/20250611213557.294358-10-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-20KVM: x86: Move kvm_setup_default_irq_routing() into irq.cSean Christopherson
Move the default IRQ routing table used for in-kernel I/O APIC and PIC routing to irq.c, and tweak the name to make it explicitly clear what routing is being initialized. In addition to making it more obvious that the so called "default" routing only applies to an in-kernel I/O APIC, getting it out of irq_comm.c will allow removing irq_comm.c entirely. And placing the function alongside other I/O APIC and PIC code will allow for guarding KVM's in-kernel I/O APIC and PIC emulation with a Kconfig with minimal #ifdefs. No functional change intended. Cc: Kai Huang <kai.huang@intel.com> Acked-by: Kai Huang <kai.huang@intel.com> Link: https://lore.kernel.org/r/20250611213557.294358-9-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-20KVM: x86: Rename irqchip_kernel() to irqchip_full()Sean Christopherson
Rename irqchip_kernel() to irqchip_full(), as "kernel" is very ambiguous due to the existence of split IRQ chip support, where only some of the "irqchip" is in emulated by the kernel/KVM. E.g. irqchip_kernel() often gets confused with irqchip_in_kernel(). Opportunistically hoist the definition up in irq.h so that it's co-located with other "full" irqchip code in anticipation of wrapping it all with a Kconfig/#ifdef. No functional change intended. Suggested-by: Paolo Bonzini <pbonzini@redhat.com> Acked-by: Kai Huang <kai.huang@intel.com> Link: https://lore.kernel.org/r/20250611213557.294358-8-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-20KVM: x86: Move KVM_{GET,SET}_IRQCHIP ioctl helpers to irq.cSean Christopherson
Move the ioctl helpers for getting/setting fully in-kernel IRQ chip state to irq.c, partly to trim down x86.c, but mostly in preparation for adding a Kconfig to control support for in-kernel I/O APIC, PIC, and PIT emulation. No functional change intended. Acked-by: Kai Huang <kai.huang@intel.com> Link: https://lore.kernel.org/r/20250611213557.294358-7-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-20KVM: x86: Move PIT ioctl helpers to i8254.cSean Christopherson
Move the PIT ioctl helpers to i8254.c, i.e. to the file that implements PIT emulation. Eliminating PIT code in x86.c will allow adding a Kconfig to control support for in-kernel I/O APIC, PIC, and PIT emulation with minimal #ifdefs. Opportunistically make kvm_pit_set_reinject() and kvm_pit_load_count() local to i8254.c as they were only publicly visible to make them available to the ioctl helpers. No functional change intended. Acked-by: Kai Huang <kai.huang@intel.com> Link: https://lore.kernel.org/r/20250611213557.294358-6-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-20KVM: x86: Drop superfluous kvm_hv_set_sint() => kvm_hv_synic_set_irq() wrapperSean Christopherson
Drop the superfluous kvm_hv_set_sint() and instead wire up ->set() directly to its final destination, kvm_hv_synic_set_irq(). Keep hv_synic_set_irq() instead of kvm_hv_set_sint() to provide some amount of consistency in the ->set() helpers, e.g. to match kvm_pic_set_irq() and kvm_ioapic_set_irq(). kvm_set_msi() is arguably the oddball, e.g. kvm_set_msi_irq() should be something like kvm_msi_to_lapic_irq() so that kvm_set_msi() can instead be kvm_set_msi_irq(), but that's a future problem to solve. No functional change intended. Cc: Vitaly Kuznetsov <vkuznets@redhat.com> Cc: Kai Huang <kai.huang@intel.com> Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com> Acked-by: Kai Huang <kai.huang@intel.com> Link: https://lore.kernel.org/r/20250611213557.294358-5-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-20KVM: x86: Drop superfluous kvm_set_ioapic_irq() => kvm_ioapic_set_irq() wrapperSean Christopherson
Drop the superfluous and confusing kvm_set_ioapic_irq() and instead wire up ->set() directly to its final destination. No functional change intended. Acked-by: Kai Huang <kai.huang@intel.com> Link: https://lore.kernel.org/r/20250611213557.294358-4-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-20KVM: x86: Drop superfluous kvm_set_pic_irq() => kvm_pic_set_irq() wrapperSean Christopherson
Drop the superfluous and confusing kvm_set_pic_irq() => kvm_pic_set_irq() wrapper, and instead wire up ->set() directly to its final destination. Opportunistically move the declaration kvm_pic_set_irq() to irq.h to start gathering more of the in-kernel APIC/IO-APIC logic in irq.{c,h}. No functional change intended. Acked-by: Kai Huang <kai.huang@intel.com> Link: https://lore.kernel.org/r/20250611213557.294358-3-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-20KVM: x86: Trigger I/O APIC route rescan in kvm_arch_irq_routing_update()Sean Christopherson
Trigger the I/O APIC route rescan that's performed for a split IRQ chip after userspace updates IRQ routes in kvm_arch_irq_routing_update(), i.e. before dropping kvm->irq_lock. Calling kvm_make_all_cpus_request() under a mutex is perfectly safe, and the smp_wmb()+smp_mb__after_atomic() pair in __kvm_make_request()+kvm_check_request() ensures the new routing is visible to vCPUs prior to the request being visible to vCPUs. In all likelihood, commit b053b2aef25d ("KVM: x86: Add EOI exit bitmap inference") somewhat arbitrarily made the request outside of irq_lock to avoid holding irq_lock any longer than is strictly necessary. And then commit abdb080f7ac8 ("kvm/irqchip: kvm_arch_irq_routing_update renaming split") took the easy route of adding another arch hook instead of risking a functional change. Note, the call to synchronize_srcu_expedited() does NOT provide ordering guarantees with respect to vCPUs scanning the new routing; as above, the request infrastructure provides the necessary ordering. I.e. there's no need to wait for kvm_scan_ioapic_routes() to complete if it's actively running, because regardless of whether it grabs the old or new table, the vCPU will have another KVM_REQ_SCAN_IOAPIC pending, i.e. will rescan again and see the new mappings. Acked-by: Kai Huang <kai.huang@intel.com> Link: https://lore.kernel.org/r/20250611213557.294358-2-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-20irqbypass: Require producers to pass in Linux IRQ number during registrationSean Christopherson
Pass in the Linux IRQ associated with an IRQ bypass producer instead of relying on the caller to set the field prior to registration, as there's no benefit to relying on callers to do the right thing. Take care to set producer->irq before __connect(), as KVM expects the IRQ to be valid as soon as a connection is possible. Reviewed-by: Kevin Tian <kevin.tian@intel.com> Reviewed-by: Alex Williamson <alex.williamson@redhat.com> Acked-by: Michael S. Tsirkin <mst@redhat.com> Link: https://lore.kernel.org/r/20250516230734.2564775-9-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-20irqbypass: Use xarray to track producers and consumersSean Christopherson
Track IRQ bypass producers and consumers using an xarray to avoid the O(2n) insertion time associated with walking a list to check for duplicate entries, and to search for an partner. At low (tens or few hundreds) total producer/consumer counts, using a list is faster due to the need to allocate backing storage for xarray. But as count creeps into the thousands, xarray wins easily, and can provide several orders of magnitude better latency at high counts. E.g. hundreds of nanoseconds vs. hundreds of milliseconds. Cc: Oliver Upton <oliver.upton@linux.dev> Cc: David Matlack <dmatlack@google.com> Cc: Like Xu <like.xu.linux@gmail.com> Cc: Binbin Wu <binbin.wu@linux.intel.com> Reported-by: Yong He <alexyonghe@tencent.com> Closes: https://bugzilla.kernel.org/show_bug.cgi?id=217379 Link: https://lore.kernel.org/all/20230801115646.33990-1-likexu@tencent.com Reviewed-by: Kevin Tian <kevin.tian@intel.com> Acked-by: Michael S. Tsirkin <mst@redhat.com> Reviewed-by: Alex Williamson <alex.williamson@redhat.com> Link: https://lore.kernel.org/r/20250516230734.2564775-8-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-20irqbypass: Use guard(mutex) in lieu of manual lock+unlockSean Christopherson
Use guard(mutex) to clean up irqbypass's error handling. Reviewed-by: Kevin Tian <kevin.tian@intel.com> Acked-by: Michael S. Tsirkin <mst@redhat.com> Reviewed-by: Alex Williamson <alex.williamson@redhat.com> Link: https://lore.kernel.org/r/20250516230734.2564775-7-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-20irqbypass: Use paired consumer/producer to disconnect during unregisterSean Christopherson
Use the paired consumer/producer information to disconnect IRQ bypass producers/consumers in O(1) time (ignoring the cost of __disconnect()). Reviewed-by: Kevin Tian <kevin.tian@intel.com> Acked-by: Michael S. Tsirkin <mst@redhat.com> Reviewed-by: Alex Williamson <alex.williamson@redhat.com> Link: https://lore.kernel.org/r/20250516230734.2564775-6-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-20irqbypass: Explicitly track producer and consumer bindingsSean Christopherson
Explicitly track IRQ bypass producer:consumer bindings. This will allow making removal an O(1) operation; searching through the list to find information that is trivially tracked (and useful for debug) is wasteful. Reviewed-by: Kevin Tian <kevin.tian@intel.com> Acked-by: Michael S. Tsirkin <mst@redhat.com> Reviewed-by: Alex Williamson <alex.williamson@redhat.com> Link: https://lore.kernel.org/r/20250516230734.2564775-5-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-20irqbypass: Take ownership of producer/consumer token trackingSean Christopherson
Move ownership of IRQ bypass token tracking into irqbypass.ko, and explicitly require callers to pass an eventfd_ctx structure instead of a completely opaque token. Relying on producers and consumers to set the token appropriately is error prone, and hiding the fact that the token must be an eventfd_ctx pointer (for all intents and purposes) unnecessarily obfuscates the code and makes it more brittle. Reviewed-by: Kevin Tian <kevin.tian@intel.com> Acked-by: Michael S. Tsirkin <mst@redhat.com> Reviewed-by: Alex Williamson <alex.williamson@redhat.com> Link: https://lore.kernel.org/r/20250516230734.2564775-4-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-20irqbypass: Drop superfluous might_sleep() annotationsSean Christopherson
Drop superfluous might_sleep() annotations from irqbypass, mutex_lock() provides all of the necessary tracking. Reviewed-by: Kevin Tian <kevin.tian@intel.com> Acked-by: Michael S. Tsirkin <mst@redhat.com> Reviewed-by: Alex Williamson <alex.williamson@redhat.com> Link: https://lore.kernel.org/r/20250516230734.2564775-3-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-20irqbypass: Drop pointless and misleading THIS_MODULE get/putSean Christopherson
Drop irqbypass.ko's superfluous and misleading get/put calls on THIS_MODULE. A module taking a reference to itself is useless; no amount of checks will prevent doom and destruction if the caller hasn't already guaranteed the liveliness of the module (this goes for any module). E.g. if try_module_get() fails because irqbypass.ko is being unloaded, then the kernel has already hit a use-after-free by virtue of executing code whose lifecycle is tied to irqbypass.ko. Reviewed-by: Kevin Tian <kevin.tian@intel.com> Acked-by: Michael S. Tsirkin <mst@redhat.com> Reviewed-by: Alex Williamson <alex.williamson@redhat.com> Link: https://lore.kernel.org/r/20250516230734.2564775-2-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-20KVM: arm64: WARN if unmapping a vLPI fails in any pathSean Christopherson
When unmapping a vLPI, WARN if nullifying vCPU affinity fails, not just if failure occurs when freeing an ITE. If undoing vCPU affinity fails, then odds are very good that vLPI state tracking has has gotten out of whack, i.e. that KVM and the GIC disagree on the state of an IRQ/vLPI. At best, inconsistent state means there is a lurking bug/flaw somewhere. At worst, the inconsistency could eventually be fatal to the host, e.g. if an ITS command fails because KVM's view of things doesn't match reality/hardware. Note, only the call from kvm_arch_irq_bypass_del_producer() by way of kvm_vgic_v4_unset_forwarding() doesn't already WARN. Common KVM's kvm_irq_routing_update() WARNs if kvm_arch_update_irqfd_routing() fails. For that path, if its_unmap_vlpi() fails in kvm_vgic_v4_unset_forwarding(), the only possible causes are that the GIC doesn't have a v4 ITS (from its_irq_set_vcpu_affinity()): /* Need a v4 ITS */ if (!is_v4(its_dev->its)) return -EINVAL; guard(raw_spinlock)(&its_dev->event_map.vlpi_lock); /* Unmap request? */ if (!info) return its_vlpi_unmap(d); or that KVM has gotten out of sync with the GIC/ITS (from its_vlpi_unmap()): if (!its_dev->event_map.vm || !irqd_is_forwarded_to_vcpu(d)) return -EINVAL; All of the above failure scenarios are warnable offences, as they should never occur absent a kernel/KVM bug. Acked-by: Oliver Upton <oliver.upton@linux.dev> Link: https://lore.kernel.org/all/aFWY2LTVIxz5rfhh@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-20KVM: TDX: Report supported optional TDVMCALLs in TDX capabilitiesPaolo Bonzini
Allow userspace to advertise TDG.VP.VMCALL subfunctions that the kernel also supports. For each output register of GetTdVmCallInfo's leaf 1, add two fields to KVM_TDX_CAPABILITIES: one for kernel-supported TDVMCALLs (userspace can set those blindly) and one for user-supported TDVMCALLs (userspace can set those if it knows how to handle them). Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-06-20KVM: TDX: Exit to userspace for SetupEventNotifyInterruptPaolo Bonzini
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-06-20KVM: TDX: Exit to userspace for GetTdVmCallInfoBinbin Wu
Exit to userspace for TDG.VP.VMCALL<GetTdVmCallInfo> via KVM_EXIT_TDX, to allow userspace to provide information about the support of TDVMCALLs when r12 is 1 for the TDVMCALLs beyond the GHCI base API. GHCI spec defines the GHCI base TDVMCALLs: <GetTdVmCallInfo>, <MapGPA>, <ReportFatalError>, <Instruction.CPUID>, <#VE.RequestMMIO>, <Instruction.HLT>, <Instruction.IO>, <Instruction.RDMSR> and <Instruction.WRMSR>. They must be supported by VMM to support TDX guests. For GetTdVmCallInfo - When leaf (r12) to enumerate TDVMCALL functionality is set to 0, successful execution indicates all GHCI base TDVMCALLs listed above are supported. Update the KVM TDX document with the set of the GHCI base APIs. - When leaf (r12) to enumerate TDVMCALL functionality is set to 1, it indicates the TDX guest is querying the supported TDVMCALLs beyond the GHCI base TDVMCALLs. Exit to userspace to let userspace set the TDVMCALL sub-function bit(s) accordingly to the leaf outputs. KVM could set the TDVMCALL bit(s) supported by itself when the TDVMCALLs don't need support from userspace after returning from userspace and before entering guest. Currently, no such TDVMCALLs implemented, KVM just sets the values returned from userspace. Suggested-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com> [Adjust userspace API. - Paolo] Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-06-20KVM: TDX: Handle TDG.VP.VMCALL<GetQuote>Binbin Wu
Handle TDVMCALL for GetQuote to generate a TD-Quote. GetQuote is a doorbell-like interface used by TDX guests to request VMM to generate a TD-Quote signed by a service hosting TD-Quoting Enclave operating on the host. A TDX guest passes a TD Report (TDREPORT_STRUCT) in a shared-memory area as parameter. Host VMM can access it and queue the operation for a service hosting TD-Quoting enclave. When completed, the Quote is returned via the same shared-memory area. KVM only checks the GPA from the TDX guest has the shared-bit set and drops the shared-bit before exiting to userspace to avoid bleeding the shared-bit into KVM's exit ABI. KVM forwards the request to userspace VMM (e.g. QEMU) and userspace VMM queues the operation asynchronously. KVM sets the return code according to the 'ret' field set by userspace to notify the TDX guest whether the request has been queued successfully or not. When the request has been queued successfully, the TDX guest can poll the status field in the shared-memory area to check whether the Quote generation is completed or not. When completed, the generated Quote is returned via the same buffer. Add KVM_EXIT_TDX as a new exit reason to userspace. Userspace is required to handle the KVM exit reason as the initial support for TDX, by reentering KVM to ensure that the TDVMCALL is complete. While at it, add a note that KVM_EXIT_HYPERCALL also requires reentry with KVM_RUN. Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com> Tested-by: Mikko Ylinen <mikko.ylinen@linux.intel.com> Acked-by: Kai Huang <kai.huang@intel.com> [Adjust userspace API. - Paolo] Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-06-20KVM: TDX: Add new TDVMCALL status code for unsupported subfuncsBinbin Wu
Add the new TDVMCALL status code TDVMCALL_STATUS_SUBFUNC_UNSUPPORTED and return it for unimplemented TDVMCALL subfunctions. Returning TDVMCALL_STATUS_INVALID_OPERAND when a subfunction is not implemented is vague because TDX guests can't tell the error is due to the subfunction is not supported or an invalid input of the subfunction. New GHCI spec adds TDVMCALL_STATUS_SUBFUNC_UNSUPPORTED to avoid the ambiguity. Use it instead of TDVMCALL_STATUS_INVALID_OPERAND. Before the change, for common guest implementations, when a TDX guest receives TDVMCALL_STATUS_INVALID_OPERAND, it has two cases: 1. Some operand is invalid. It could change the operand to another value retry. 2. The subfunction is not supported. For case 1, an invalid operand usually means the guest implementation bug. Since the TDX guest can't tell which case is, the best practice for handling TDVMCALL_STATUS_INVALID_OPERAND is stopping calling such leaf, treating the failure as fatal if the TDVMCALL is essential or ignoring it if the TDVMCALL is optional. With this change, TDVMCALL_STATUS_SUBFUNC_UNSUPPORTED could be sent to old TDX guest that do not know about it, but it is expected that the guest will make the same action as TDVMCALL_STATUS_INVALID_OPERAND. Currently, no known TDX guest checks TDVMCALL_STATUS_INVALID_OPERAND specifically; for example Linux just checks for success. Signed-off-by: Binbin Wu <binbin.wu@linux.intel.com> [Return it for untrapped KVM_HC_MAP_GPA_RANGE. - Paolo] Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2025-06-20Merge tag 'kvm-riscv-fixes-6.16-1' of https://github.com/kvm-riscv/linux ↵Paolo Bonzini
into HEAD KVM/riscv fixes for 6.16, take #1 - Fix the size parameter check in SBI SFENCE calls - Don't treat SBI HFENCE calls as NOPs
2025-06-20Merge tag 'kvmarm-fixes-6.16-3' of ↵Paolo Bonzini
git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD KVM/arm64 fixes for 6.16, take #3 - Fix another set of FP/SIMD/SVE bugs affecting NV, and plugging some missing synchronisation - A small fix for the irqbypass hook fixes, tightening the check and ensuring that we only deal with MSI for both the old and the new route entry - Rework the way the shadow LRs are addressed in a nesting configuration, plugging an embarrassing bug as well as simplifying the whole process - Add yet another fix for the dreaded arch_timer_edge_cases selftest
2025-06-19KVM: arm64: VHE: Centralize ISBs when returning to hostMark Rutland
The VHE hyp code has recently gained a few ISBs. Simplify this to one unconditional ISB in __kvm_vcpu_run_vhe(), and remove the unnecessary ISB from the kvm_call_hyp_ret() macro. While kvm_call_hyp_ret() is also used to invoke __vgic_v3_get_gic_config(), but no ISB is necessary in that case either. For the moment, an ISB is left in kvm_call_hyp(), as there are many more users, and removing the ISB would require a more thorough audit. Suggested-by: Marc Zyngier <maz@kernel.org> Signed-off-by: Mark Rutland <mark.rutland@arm.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Fuad Tabba <tabba@google.com> Cc: Marc Zyngier <maz@kernel.org> Cc: Mark Brown <broonie@kernel.org> Cc: Oliver Upton <oliver.upton@linux.dev> Cc: Will Deacon <will@kernel.org> Link: https://lore.kernel.org/r/20250617133718.4014181-8-mark.rutland@arm.com Signed-off-by: Marc Zyngier <maz@kernel.org>
2025-06-19KVM: arm64: Remove cpacr_clear_set()Mark Rutland
We no longer use cpacr_clear_set(). Remove cpacr_clear_set() and its helper functions. Signed-off-by: Mark Rutland <mark.rutland@arm.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Fuad Tabba <tabba@google.com> Cc: Marc Zyngier <maz@kernel.org> Cc: Mark Brown <broonie@kernel.org> Cc: Oliver Upton <oliver.upton@linux.dev> Cc: Will Deacon <will@kernel.org> Link: https://lore.kernel.org/r/20250617133718.4014181-7-mark.rutland@arm.com Signed-off-by: Marc Zyngier <maz@kernel.org>
2025-06-19KVM: arm64: Remove ad-hoc CPTR manipulation from kvm_hyp_handle_fpsimd()Mark Rutland
The hyp code FPSIMD/SVE/SME trap handling logic has some rather messy open-coded manipulation of CPTR/CPACR. This is benign for non-nested guests, but broken for nested guests, as the guest hypervisor's CPTR configuration is not taken into account. Consider the case where L0 provides FPSIMD+SVE to an L1 guest hypervisor, and the L1 guest hypervisor only provides FPSIMD to an L2 guest (with L1 configuring CPTR/CPACR to trap SVE usage from L2). If the L2 guest triggers an FPSIMD trap to the L0 hypervisor, kvm_hyp_handle_fpsimd() will see that the vCPU supports FPSIMD+SVE, and will configure CPTR/CPACR to NOT trap FPSIMD+SVE before returning to the L2 guest. Consequently the L2 guest would be able to manipulate SVE state even though the L1 hypervisor had configured CPTR/CPACR to forbid this. Clean this up, and fix the nested virt issue by always using __deactivate_cptr_traps() and __activate_cptr_traps() to manage the CPTR traps. This removes the need for the ad-hoc fixup in kvm_hyp_save_fpsimd_host(), and ensures that any guest hypervisor configuration of CPTR/CPACR is taken into account. Signed-off-by: Mark Rutland <mark.rutland@arm.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Fuad Tabba <tabba@google.com> Cc: Marc Zyngier <maz@kernel.org> Cc: Mark Brown <broonie@kernel.org> Cc: Oliver Upton <oliver.upton@linux.dev> Cc: Will Deacon <will@kernel.org> Link: https://lore.kernel.org/r/20250617133718.4014181-6-mark.rutland@arm.com Signed-off-by: Marc Zyngier <maz@kernel.org>
2025-06-19KVM: arm64: Remove ad-hoc CPTR manipulation from fpsimd_sve_sync()Mark Rutland
There's no need for fpsimd_sve_sync() to write to CPTR/CPACR. All relevant traps are always disabled earlier within __kvm_vcpu_run(), when __deactivate_cptr_traps() configures CPTR/CPACR. With irrelevant details elided, the flow is: handle___kvm_vcpu_run(...) { flush_hyp_vcpu(...) { fpsimd_sve_flush(...); } __kvm_vcpu_run(...) { __activate_traps(...) { __activate_cptr_traps(...); } do { __guest_enter(...); } while (...); __deactivate_traps(....) { __deactivate_cptr_traps(...); } } sync_hyp_vcpu(...) { fpsimd_sve_sync(...); } } Remove the unnecessary write to CPTR/CPACR. An ISB is still necessary, so a comment is added to describe this requirement. Signed-off-by: Mark Rutland <mark.rutland@arm.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Fuad Tabba <tabba@google.com> Cc: Marc Zyngier <maz@kernel.org> Cc: Mark Brown <broonie@kernel.org> Cc: Oliver Upton <oliver.upton@linux.dev> Cc: Will Deacon <will@kernel.org> Link: https://lore.kernel.org/r/20250617133718.4014181-5-mark.rutland@arm.com Signed-off-by: Marc Zyngier <maz@kernel.org>
2025-06-19KVM: arm64: Reorganise CPTR trap manipulationMark Rutland
The NVHE/HVHE and VHE modes have separate implementations of __activate_cptr_traps() and __deactivate_cptr_traps() in their respective switch.c files. There's some duplication of logic, and it's not currently possible to reuse this logic elsewhere. Move the logic into the common switch.h header so that it can be reused, and de-duplicate the common logic. This rework changes the way SVE traps are deactivated in VHE mode, aligning it with NVHE/HVHE modes: * Before this patch, VHE's __deactivate_cptr_traps() would unconditionally enable SVE for host EL2 (but not EL0), regardless of whether the ARM64_SVE cpucap was set. * After this patch, VHE's __deactivate_cptr_traps() will take the ARM64_SVE cpucap into account. When ARM64_SVE is not set, SVE will be trapped from EL2 and below. The old and new behaviour are both benign: * When ARM64_SVE is not set, the host will not touch SVE state, and will not reconfigure SVE traps. Host EL0 access to SVE will be trapped as expected. * When ARM64_SVE is set, the host will configure EL0 SVE traps before returning to EL0 as part of reloading the EL0 FPSIMD/SVE/SME state. Signed-off-by: Mark Rutland <mark.rutland@arm.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Fuad Tabba <tabba@google.com> Cc: Marc Zyngier <maz@kernel.org> Cc: Mark Brown <broonie@kernel.org> Cc: Oliver Upton <oliver.upton@linux.dev> Cc: Will Deacon <will@kernel.org> Link: https://lore.kernel.org/r/20250617133718.4014181-4-mark.rutland@arm.com Signed-off-by: Marc Zyngier <maz@kernel.org>
2025-06-19KVM: arm64: VHE: Synchronize CPTR trap deactivationMark Rutland
Currently there is no ISB between __deactivate_cptr_traps() disabling traps that affect EL2 and fpsimd_lazy_switch_to_host() manipulating registers potentially affected by CPTR traps. When NV is not in use, this is safe because the relevant registers are only accessed when guest_owns_fp_regs() && vcpu_has_sve(vcpu), and this also implies that SVE traps affecting EL2 have been deactivated prior to __guest_entry(). When NV is in use, a guest hypervisor may have configured SVE traps for a nested context, and so it is necessary to have an ISB between __deactivate_cptr_traps() and fpsimd_lazy_switch_to_host(). Due to the current lack of an ISB, when a guest hypervisor enables SVE traps in CPTR, the host can take an unexpected SVE trap from within fpsimd_lazy_switch_to_host(), e.g. | Unhandled 64-bit el1h sync exception on CPU1, ESR 0x0000000066000000 -- SVE | CPU: 1 UID: 0 PID: 164 Comm: kvm-vcpu-0 Not tainted 6.15.0-rc4-00138-ga05e0f012c05 #3 PREEMPT | Hardware name: FVP Base RevC (DT) | pstate: 604023c9 (nZCv DAIF +PAN -UAO -TCO -DIT -SSBS BTYPE=--) | pc : __kvm_vcpu_run+0x6f4/0x844 | lr : __kvm_vcpu_run+0x150/0x844 | sp : ffff800083903a60 | x29: ffff800083903a90 x28: ffff000801f4a300 x27: 0000000000000000 | x26: 0000000000000000 x25: ffff000801f90000 x24: ffff000801f900f0 | x23: ffff800081ff7720 x22: 0002433c807d623f x21: ffff000801f90000 | x20: ffff00087f730730 x19: 0000000000000000 x18: 0000000000000000 | x17: 0000000000000000 x16: 0000000000000000 x15: 0000000000000000 | x14: 0000000000000000 x13: 0000000000000000 x12: 0000000000000000 | x11: 0000000000000000 x10: 0000000000000000 x9 : 0000000000000000 | x8 : 0000000000000000 x7 : 0000000000000000 x6 : ffff000801f90d70 | x5 : 0000000000001000 x4 : ffff8007fd739000 x3 : ffff000801f90000 | x2 : 0000000000000000 x1 : 00000000000003cc x0 : ffff800082f9d000 | Kernel panic - not syncing: Unhandled exception | CPU: 1 UID: 0 PID: 164 Comm: kvm-vcpu-0 Not tainted 6.15.0-rc4-00138-ga05e0f012c05 #3 PREEMPT | Hardware name: FVP Base RevC (DT) | Call trace: | show_stack+0x18/0x24 (C) | dump_stack_lvl+0x60/0x80 | dump_stack+0x18/0x24 | panic+0x168/0x360 | __panic_unhandled+0x68/0x74 | el1h_64_irq_handler+0x0/0x24 | el1h_64_sync+0x6c/0x70 | __kvm_vcpu_run+0x6f4/0x844 (P) | kvm_arm_vcpu_enter_exit+0x64/0xa0 | kvm_arch_vcpu_ioctl_run+0x21c/0x870 | kvm_vcpu_ioctl+0x1a8/0x9d0 | __arm64_sys_ioctl+0xb4/0xf4 | invoke_syscall+0x48/0x104 | el0_svc_common.constprop.0+0x40/0xe0 | do_el0_svc+0x1c/0x28 | el0_svc+0x30/0xcc | el0t_64_sync_handler+0x10c/0x138 | el0t_64_sync+0x198/0x19c | SMP: stopping secondary CPUs | Kernel Offset: disabled | CPU features: 0x0000,000002c0,02df4fb9,97ee773f | Memory Limit: none | ---[ end Kernel panic - not syncing: Unhandled exception ]--- Fix this by adding an ISB between __deactivate_traps() and fpsimd_lazy_switch_to_host(). Signed-off-by: Mark Rutland <mark.rutland@arm.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Fuad Tabba <tabba@google.com> Cc: Marc Zyngier <maz@kernel.org> Cc: Mark Brown <broonie@kernel.org> Cc: Oliver Upton <oliver.upton@linux.dev> Cc: Will Deacon <will@kernel.org> Link: https://lore.kernel.org/r/20250617133718.4014181-3-mark.rutland@arm.com Signed-off-by: Marc Zyngier <maz@kernel.org>
2025-06-19KVM: arm64: VHE: Synchronize restore of host debug registersMark Rutland
When KVM runs in non-protected VHE mode, there's no context synchronization event between __debug_switch_to_host() restoring the host debug registers and __kvm_vcpu_run() unmasking debug exceptions. Due to this, it's theoretically possible for the host to take an unexpected debug exception due to the stale guest configuration. This cannot happen in NVHE/HVHE mode as debug exceptions are masked in the hyp code, and the exception return to the host will provide the necessary context synchronization before debug exceptions can be taken. For now, avoid the problem by adding an ISB after VHE hyp code restores the host debug registers. Signed-off-by: Mark Rutland <mark.rutland@arm.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Fuad Tabba <tabba@google.com> Cc: Marc Zyngier <maz@kernel.org> Cc: Mark Brown <broonie@kernel.org> Cc: Oliver Upton <oliver.upton@linux.dev> Cc: Will Deacon <will@kernel.org> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20250617133718.4014181-2-mark.rutland@arm.com Signed-off-by: Marc Zyngier <maz@kernel.org>
2025-06-19KVM: arm64: selftests: Close the GIC FD in arch_timer_edge_casesZenghui Yu
Close the GIC FD to free the reference it holds to the VM so that we can correctly clean up the VM. This also gets rid of the "KVM: debugfs: duplicate directory 395722-4" warning when running arch_timer_edge_cases. Signed-off-by: Zenghui Yu <yuzenghui@huawei.com> Reviewed-by: Miguel Luis <miguel.luis@oracle.com> Reviewed-by: Sebastian Ott <sebott@redhat.com> Link: https://lore.kernel.org/r/20250608095402.1131-1-yuzenghui@huawei.com Signed-off-by: Marc Zyngier <maz@kernel.org>
2025-06-19KVM: arm64: Explicitly treat routing entry type changes as changesSean Christopherson
Explicitly treat type differences as GSI routing changes, as comparing MSI data between two entries could get a false negative, e.g. if userspace changed the type but left the type-specific data as- Note, the same bug was fixed in x86 by commit bcda70c56f3e ("KVM: x86: Explicitly treat routing entry type changes as changes"). Fixes: 4bf3693d36af ("KVM: arm64: Unmap vLPIs affected by changes to GSI routing information") Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Oliver Upton <oliver.upton@linux.dev> Link: https://lore.kernel.org/r/20250611224604.313496-3-seanjc@google.com Signed-off-by: Marc Zyngier <maz@kernel.org>
2025-06-19KVM: arm64: nv: Fix tracking of shadow list registersMarc Zyngier
Wei-Lin reports that the tracking of shadow list registers is majorly broken when resync'ing the L2 state after a run, as we confuse the guest's LR index with the host's, potentially losing the interrupt state. While this could be fixed by adding yet another side index to track it (Wei-Lin's fix), it may be better to refactor this code to avoid having a side index altogether, limiting the risk to introduce this class of bugs. A key observation is that the shadow index is always the number of bits in the lr_map bitmap. With that, the parallel indexing scheme can be completely dropped. While doing this, introduce a couple of helpers that abstract the index conversion and some of the LR repainting, making the whole exercise much simpler. Reported-by: Wei-Lin Chang <r09922117@csie.ntu.edu.tw> Reviewed-by: Wei-Lin Chang <r09922117@csie.ntu.edu.tw> Reviewed-by: Oliver Upton <oliver.upton@linux.dev> Signed-off-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20250614145721.2504524-1-r09922117@csie.ntu.edu.tw Link: https://lore.kernel.org/r/86qzzkc5xa.wl-maz@kernel.org
2025-06-17RISC-V: KVM: Don't treat SBI HFENCE calls as NOPsAnup Patel
The SBI specification clearly states that SBI HFENCE calls should return SBI_ERR_NOT_SUPPORTED when one of the target hart doesn’t support hypervisor extension (aka nested virtualization in-case of KVM RISC-V). Fixes: c7fa3c48de86 ("RISC-V: KVM: Treat SBI HFENCE calls as NOPs") Reviewed-by: Atish Patra <atishp@rivosinc.com> Signed-off-by: Anup Patel <apatel@ventanamicro.com> Link: https://lore.kernel.org/r/20250605061458.196003-3-apatel@ventanamicro.com Signed-off-by: Anup Patel <anup@brainfault.org>