summaryrefslogtreecommitdiff
path: root/arch/x86/kvm
AgeCommit message (Collapse)Author
2025-06-24x86/traps: Initialize DR7 by writing its architectural reset valueXin Li (Intel)
Initialize DR7 by writing its architectural reset value to always set bit 10, which is reserved to '1', when "clearing" DR7 so as not to trigger unanticipated behavior if said bit is ever unreserved, e.g. as a feature enabling flag with inverted polarity. Signed-off-by: Xin Li (Intel) <xin@zytor.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: H. Peter Anvin (Intel) <hpa@zytor.com> Reviewed-by: Sohil Mehta <sohil.mehta@intel.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Sean Christopherson <seanjc@google.com> Tested-by: Sohil Mehta <sohil.mehta@intel.com> Cc:stable@vger.kernel.org Link: https://lore.kernel.org/all/20250620231504.2676902-3-xin%40zytor.com
2025-06-24KVM: x86/mmu: Defer allocation of shadow MMU's hashed page listSean Christopherson
When the TDP MMU is enabled, i.e. when the shadow MMU isn't used until a nested TDP VM is run, defer allocation of the array of hashed lists used to track shadow MMU pages until the first shadow root is allocated. Setting the list outside of mmu_lock is safe, as concurrent readers must hold mmu_lock in some capacity, shadow pages can only be added (or removed) from the list when mmu_lock is held for write, and tasks that are creating a shadow root are serialized by slots_arch_lock. I.e. it's impossible for the list to become non-empty until all readers go away, and so readers are guaranteed to see an empty list even if they make multiple calls to kvm_get_mmu_page_hash() in a single mmu_lock critical section. Use smp_store_release() and smp_load_acquire() to access the hash table pointer to ensure the stores to zero the lists are retired before readers start to walk the list. E.g. if the compiler hoisted the store before the zeroing of memory, for_each_gfn_valid_sp_with_gptes() could consume stale kernel data. Cc: James Houghton <jthoughton@google.com> Link: https://lore.kernel.org/r/20250523001138.3182794-5-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-24KVM: x86: Use kvzalloc() to allocate VM structSean Christopherson
Allocate VM structs via kvzalloc(), i.e. try to use a contiguous physical allocation before falling back to __vmalloc(), to avoid the overhead of establishing the virtual mappings. For non-debug builds, The SVM and VMX (and TDX) structures are now just below 7000 bytes in the worst case scenario (see below), i.e. are order-1 allocations, and will likely remain that way for quite some time. Add compile-time assertions in vendor code to ensure the size of the structures, sans the memslot hash tables, are order-0 allocations, i.e. are less than 4KiB. There's nothing fundamentally wrong with a larger kvm_{svm,vmx,tdx} size, but given that the size of the structure (without the memslots hash tables) is below 2KiB after 18+ years of existence, more than doubling the size would be quite notable. Add sanity checks on the memslot hash table sizes, partly to ensure they aren't resized without accounting for the impact on VM structure size, and partly to document that the majority of the size of VM structures comes from the memslots. Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com> Link: https://lore.kernel.org/r/20250523001138.3182794-4-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-24KVM: x86/mmu: Dynamically allocate shadow MMU's hashed page listSean Christopherson
Dynamically allocate the (massive) array of hashed lists used to track shadow pages, as the array itself is 32KiB, i.e. is an order-3 allocation all on its own, and is *exactly* an order-3 allocation. Dynamically allocating the array will allow allocating "struct kvm" using kvmalloc(), and will also allow deferring allocation of the array until it's actually needed, i.e. until the first shadow root is allocated. Opportunistically use kvmalloc() for the hashed lists, as an order-3 allocation is (stating the obvious) less likely to fail than an order-4 allocation, and the overhead of vmalloc() is undesirable given that the size of the allocation is fixed. Cc: Vipin Sharma <vipinsh@google.com> Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com> Link: https://lore.kernel.org/r/20250523001138.3182794-3-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-24KVM: x86/xen: Allow 'out of range' event channel ports in IRQ routing table.David Woodhouse
To avoid imposing an ordering constraint on userspace, allow 'invalid' event channel targets to be configured in the IRQ routing table. This is the same as accepting interrupts targeted at vCPUs which don't exist yet, which is already the case for both Xen event channels *and* for MSIs (which don't do any filtering of permitted APIC ID targets at all). If userspace actually *triggers* an IRQ with an invalid target, that will fail cleanly, as kvm_xen_set_evtchn_fast() also does the same range check. If KVM enforced that the IRQ target must be valid at the time it is *configured*, that would force userspace to create all vCPUs and do various other parts of setup (in this case, setting the Xen long_mode) before restoring the IRQ table. Cc: stable@vger.kernel.org Signed-off-by: David Woodhouse <dwmw@amazon.co.uk> Reviewed-by: Paul Durrant <paul@xen.org> Link: https://lore.kernel.org/r/e489252745ac4b53f1f7f50570b03fb416aa2065.camel@infradead.org [sean: massage comment] Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-24KVM: x86/hyper-v: Use preallocated per-vCPU buffer for de-sparsified vCPU masksSean Christopherson
Use a preallocated per-vCPU bitmap for tracking the unpacked set of vCPUs being targeted for Hyper-V's paravirt TLB flushing. If KVM_MAX_NR_VCPUS is set to 4096 (which is allowed even for MAXSMP=n builds), putting the vCPU mask on-stack pushes kvm_hv_flush_tlb() past the default FRAME_WARN limit. arch/x86/kvm/hyperv.c:2001:12: error: stack frame size (1288) exceeds limit (1024) in 'kvm_hv_flush_tlb' [-Werror,-Wframe-larger-than] 2001 | static u64 kvm_hv_flush_tlb(struct kvm_vcpu *vcpu, struct kvm_hv_hcall *hc) | ^ 1 error generated. Note, sparse_banks was given the same treatment by commit 7d5e88d301f8 ("KVM: x86: hyper-v: Use preallocated buffer in 'struct kvm_vcpu_hv' instead of on-stack 'sparse_banks'"), for the exact same reason. Reported-by: Abinash Lalotra <abinashsinghlalotra@gmail.com> Closes: https://lore.kernel.org/all/20250613111023.786265-1-abinashsinghlalotra@gmail.com Link: https://lore.kernel.org/all/aEylI-O8kFnFHrOH@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-24KVM: SVM: Initialize vmsa_pa in VMCB to INVALID_PAGE if VMSA page is NULLSean Christopherson
When creating an SEV-ES vCPU for intra-host migration, set its vmsa_pa to INVALID_PAGE to harden against doing VMRUN with a bogus VMSA (KVM checks for a valid VMSA page in pre_sev_run()). Cc: Tom Lendacky <thomas.lendacky@amd.com> Reviewed-by: Liam Merwick <liam.merwick@oracle.com> Tested-by: Liam Merwick <liam.merwick@oracle.com> Link: https://lore.kernel.org/r/20250602224459.41505-3-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-24KVM: SVM: Reject SEV{-ES} intra host migration if vCPU creation is in-flightSean Christopherson
Reject migration of SEV{-ES} state if either the source or destination VM is actively creating a vCPU, i.e. if kvm_vm_ioctl_create_vcpu() is in the section between incrementing created_vcpus and online_vcpus. The bulk of vCPU creation runs _outside_ of kvm->lock to allow creating multiple vCPUs in parallel, and so sev_info.es_active can get toggled from false=>true in the destination VM after (or during) svm_vcpu_create(), resulting in an SEV{-ES} VM effectively having a non-SEV{-ES} vCPU. The issue manifests most visibly as a crash when trying to free a vCPU's NULL VMSA page in an SEV-ES VM, but any number of things can go wrong. BUG: unable to handle page fault for address: ffffebde00000000 #PF: supervisor read access in kernel mode #PF: error_code(0x0000) - not-present page PGD 0 P4D 0 Oops: Oops: 0000 [#1] SMP KASAN NOPTI CPU: 227 UID: 0 PID: 64063 Comm: syz.5.60023 Tainted: G U O 6.15.0-smp-DEV #2 NONE Tainted: [U]=USER, [O]=OOT_MODULE Hardware name: Google, Inc. Arcadia_IT_80/Arcadia_IT_80, BIOS 12.52.0-0 10/28/2024 RIP: 0010:constant_test_bit arch/x86/include/asm/bitops.h:206 [inline] RIP: 0010:arch_test_bit arch/x86/include/asm/bitops.h:238 [inline] RIP: 0010:_test_bit include/asm-generic/bitops/instrumented-non-atomic.h:142 [inline] RIP: 0010:PageHead include/linux/page-flags.h:866 [inline] RIP: 0010:___free_pages+0x3e/0x120 mm/page_alloc.c:5067 Code: <49> f7 06 40 00 00 00 75 05 45 31 ff eb 0c 66 90 4c 89 f0 4c 39 f0 RSP: 0018:ffff8984551978d0 EFLAGS: 00010246 RAX: 0000777f80000001 RBX: 0000000000000000 RCX: ffffffff918aeb98 RDX: 0000000000000000 RSI: 0000000000000008 RDI: ffffebde00000000 RBP: 0000000000000000 R08: ffffebde00000007 R09: 1ffffd7bc0000000 R10: dffffc0000000000 R11: fffff97bc0000001 R12: dffffc0000000000 R13: ffff8983e19751a8 R14: ffffebde00000000 R15: 1ffffd7bc0000000 FS: 0000000000000000(0000) GS:ffff89ee661d3000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: ffffebde00000000 CR3: 000000793ceaa000 CR4: 0000000000350ef0 DR0: 0000000000000000 DR1: 0000000000000b5f DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 Call Trace: <TASK> sev_free_vcpu+0x413/0x630 arch/x86/kvm/svm/sev.c:3169 svm_vcpu_free+0x13a/0x2a0 arch/x86/kvm/svm/svm.c:1515 kvm_arch_vcpu_destroy+0x6a/0x1d0 arch/x86/kvm/x86.c:12396 kvm_vcpu_destroy virt/kvm/kvm_main.c:470 [inline] kvm_destroy_vcpus+0xd1/0x300 virt/kvm/kvm_main.c:490 kvm_arch_destroy_vm+0x636/0x820 arch/x86/kvm/x86.c:12895 kvm_put_kvm+0xb8e/0xfb0 virt/kvm/kvm_main.c:1310 kvm_vm_release+0x48/0x60 virt/kvm/kvm_main.c:1369 __fput+0x3e4/0x9e0 fs/file_table.c:465 task_work_run+0x1a9/0x220 kernel/task_work.c:227 exit_task_work include/linux/task_work.h:40 [inline] do_exit+0x7f0/0x25b0 kernel/exit.c:953 do_group_exit+0x203/0x2d0 kernel/exit.c:1102 get_signal+0x1357/0x1480 kernel/signal.c:3034 arch_do_signal_or_restart+0x40/0x690 arch/x86/kernel/signal.c:337 exit_to_user_mode_loop kernel/entry/common.c:111 [inline] exit_to_user_mode_prepare include/linux/entry-common.h:329 [inline] __syscall_exit_to_user_mode_work kernel/entry/common.c:207 [inline] syscall_exit_to_user_mode+0x67/0xb0 kernel/entry/common.c:218 do_syscall_64+0x7c/0x150 arch/x86/entry/syscall_64.c:100 entry_SYSCALL_64_after_hwframe+0x76/0x7e RIP: 0033:0x7f87a898e969 </TASK> Modules linked in: gq(O) gsmi: Log Shutdown Reason 0x03 CR2: ffffebde00000000 ---[ end trace 0000000000000000 ]--- Deliberately don't check for a NULL VMSA when freeing the vCPU, as crashing the host is likely desirable due to the VMSA being consumed by hardware. E.g. if KVM manages to allow VMRUN on the vCPU, hardware may read/write a bogus VMSA page. Accessing PFN 0 is "fine"-ish now that it's sequestered away thanks to L1TF, but panicking in this scenario is preferable to potentially running with corrupted state. Reported-by: Alexander Potapenko <glider@google.com> Tested-by: Alexander Potapenko <glider@google.com> Fixes: 0b020f5af092 ("KVM: SEV: Add support for SEV-ES intra host migration") Fixes: b56639318bb2 ("KVM: SEV: Add support for SEV intra host migration") Cc: stable@vger.kernel.org Cc: James Houghton <jthoughton@google.com> Cc: Peter Gonda <pgonda@google.com> Reviewed-by: Liam Merwick <liam.merwick@oracle.com> Tested-by: Liam Merwick <liam.merwick@oracle.com> Reviewed-by: James Houghton <jthoughton@google.com> Link: https://lore.kernel.org/r/20250602224459.41505-2-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-23KVM: x86: Rename kvm_set_msi_irq() => kvm_msi_to_lapic_irq()Sean Christopherson
Rename kvm_set_msi_irq() to kvm_msi_to_lapic_irq() to better capture what it actually does, e.g. it's _really_ easy to conflate kvm_set_msi_irq() with kvm_set_msi(). Opportunistically delete the public declaration and export, as they are no longer used/needed. No functional change intended. Link: https://lore.kernel.org/r/20250611224604.313496-64-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-23KVM: SVM: Generate GA log IRQs only if the associated vCPUs is blockingSean Christopherson
Configure IRTEs to GA log interrupts for device posted IRQs that hit non-running vCPUs if and only if the target vCPU is blocking, i.e. actually needs a wake event. If the vCPU has exited to userspace or was preempted, generating GA log entries and interrupts is wasteful and unnecessary, as the vCPU will be re-loaded and/or scheduled back in irrespective of the GA log notification (avic_ga_log_notifier() is just a fancy wrapper for kvm_vcpu_wake_up()). Use a should-be-zero bit in the vCPU's Physical APIC ID Table Entry to track whether or not the vCPU's associated IRTEs are configured to generate GA logs, but only set the synthetic bit in KVM's "cache", i.e. never set the should-be-zero bit in tables that are used by hardware. Use a synthetic bit instead of a dedicated boolean to minimize the odds of messing up the locking, i.e. so that all the existing rules that apply to avic_physical_id_entry for IS_RUNNING are reused verbatim for GA_LOG_INTR. Note, because KVM (by design) "puts" AVIC state in a "pre-blocking" phase, using kvm_vcpu_is_blocking() to track the need for notifications isn't a viable option. Link: https://lore.kernel.org/r/20250611224604.313496-63-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-23iommu/amd: KVM: SVM: Allow KVM to control need for GA log interruptsSean Christopherson
Add plumbing to the AMD IOMMU driver to allow KVM to control whether or not an IRTE is configured to generate GA log interrupts. KVM only needs a notification if the target vCPU is blocking, so the vCPU can be awakened. If a vCPU is preempted or exits to userspace, KVM clears is_run, but will set the vCPU back to running when userspace does KVM_RUN and/or the vCPU task is scheduled back in, i.e. KVM doesn't need a notification. Unconditionally pass "true" in all KVM paths to isolate the IOMMU changes from the KVM changes insofar as possible. Opportunistically swap the ordering of parameters for amd_iommu_update_ga() so that the match amd_iommu_activate_guest_mode(). Note, as of this writing, the AMD IOMMU manual doesn't list GALogIntr as a non-cached field, but per AMD hardware architects, it's not cached and can be safely updated without an invalidation. Link: https://lore.kernel.org/all/b29b8c22-2fd4-4b5e-b755-9198874157c7@amd.com Cc: Vasant Hegde <vasant.hegde@amd.com> Cc: Joao Martins <joao.m.martins@oracle.com> Link: https://lore.kernel.org/r/20250611224604.313496-62-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-23KVM: SVM: Consolidate IRTE update when toggling AVIC on/offSean Christopherson
Fold the IRTE modification logic in avic_refresh_apicv_exec_ctrl() into __avic_vcpu_{load,put}(), and add a param to the helpers to communicate whether or not AVIC is being toggled, i.e. if IRTE needs a "full" update, or just a quick update to set the CPU and IsRun. Link: https://lore.kernel.org/r/20250611224604.313496-61-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-23KVM: SVM: Don't check vCPU's blocking status when toggling AVIC on/offSean Christopherson
Don't query a vCPU's blocking status when toggling AVIC on/off; barring KVM bugs, the vCPU can't be blocking when refreshing AVIC controls. And if there are KVM bugs, ensuring the vCPU and its associated IRTEs are in the correct state is desirable, i.e. well worth any overhead in a buggy scenario. Isolating the "real" load/put flows will allow moving the IOMMU IRTE (de)activation logic from avic_refresh_apicv_exec_ctrl() to avic_update_iommu_vcpu_affinity(), i.e. will allow updating the vCPU's physical ID entry and its IRTEs in a common path, under a single critical section of ir_list_lock. Tested-by: Sairaj Kodilkar <sarunkod@amd.com> Link: https://lore.kernel.org/r/20250611224604.313496-60-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-23KVM: SVM: Fold avic_set_pi_irte_mode() into its sole callerSean Christopherson
Fold avic_set_pi_irte_mode() into avic_refresh_apicv_exec_ctrl() in anticipation of moving the __avic_vcpu_{load,put}() calls into the critical section, and because having a one-off helper with a name that's easily confused with avic_pi_update_irte() is unnecessary. No functional change intended. Tested-by: Sairaj Kodilkar <sarunkod@amd.com> Link: https://lore.kernel.org/r/20250611224604.313496-59-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-23KVM: SVM: Use vcpu_idx, not vcpu_id, for GA log tag/metadataSean Christopherson
Use a vCPU's index, not its ID, for the GA log tag/metadata that's used to find and kick vCPUs when a device posted interrupt serves as a wake event. Lookups on a vCPU index are O(fast) (not sure what xa_load() actually provides), whereas a vCPU ID lookup is O(n) if a vCPU's ID doesn't match its index. Unlike the Physical APIC Table, which is accessed by hardware when virtualizing IPIs, hardware doesn't consume the GA tag, i.e. KVM _must_ use APIC IDs to fill the Physical APIC Table, but KVM has free rein over the format/meaning of the GA tag. Tested-by: Sairaj Kodilkar <sarunkod@amd.com> Link: https://lore.kernel.org/r/20250611224604.313496-57-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-23KVM: VMX: WARN if VT-d Posted IRQs aren't possible when starting IRQ bypassSean Christopherson
WARN if KVM attempts to "start" IRQ bypass when VT-d Posted IRQs are disabled, to make it obvious that the logic is a sanity check, and so that a bug related to nr_possible_bypass_irqs is more like to cause noisy failures, e.g. so that KVM doesn't silently fail to wake blocking vCPUs. Link: https://lore.kernel.org/r/20250611224604.313496-56-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-23KVM: x86: Decouple device assignment from IRQ bypassSean Christopherson
Use a dedicated counter to track the number of IRQs that can utilize IRQ bypass instead of piggybacking the assigned device count. As evidenced by commit 2edd9cb79fb3 ("kvm: detect assigned device via irqbypass manager"), it's possible for a device to be able to post IRQs to a vCPU without said device being assigned to a VM. Leave the calls to kvm_arch_{start,end}_assignment() alone for the moment to avoid regressing the MMIO stale data mitigation. KVM is abusing the assigned device count when applying mmio_stale_data_clear, and it's not at all clear if vDPA devices rely on this behavior. This will hopefully be cleaned up in the future, as the number of assigned devices is a terrible heuristic for detecting if a VM has access to host MMIO. Link: https://lore.kernel.org/r/20250611224604.313496-55-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-23KVM: SVM: WARN if ir_list is non-empty at vCPU freeSean Christopherson
Now that AVIC IRTE tracking is in a mostly sane state, WARN if a vCPU is freed with ir_list entries, i.e. if KVM leaves a dangling IRTE. Initialize the per-vCPU interrupt remapping list and its lock even if AVIC is disabled so that the WARN doesn't hit false positives (and so that KVM doesn't need to call into AVIC code for a simple sanity check). Link: https://lore.kernel.org/r/20250611224604.313496-54-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-23KVM: x86: WARN if IRQ bypass routing is updated without in-kernel local APICSean Christopherson
Yell if kvm_pi_update_irte() is reached without an in-kernel local APIC, as kvm_arch_irqfd_allowed() should prevent attaching an irqfd and thus any and all postable IRQs to an APIC-less VM. Link: https://lore.kernel.org/r/20250611224604.313496-53-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-23KVM: x86: WARN if IRQ bypass isn't supported in kvm_pi_update_irte()Sean Christopherson
WARN if kvm_pi_update_irte() is reached without IRQ bypass support, as the code is only reachable if the VM already has an IRQ bypass producer (see kvm_irq_routing_update()), or from kvm_arch_irq_bypass_{add,del}_producer(), which, stating the obvious, are called if and only if KVM enables its IRQ bypass hooks. Cc: David Matlack <dmatlack@google.com> Link: https://lore.kernel.org/r/20250611224604.313496-52-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-23KVM: x86: Drop superfluous "has assigned device" check in kvm_pi_update_irte()Sean Christopherson
Don't bother checking if the VM has an assigned device when updating IRTE entries. kvm_arch_irq_bypass_add_producer() explicitly increments the assigned device count, kvm_arch_irq_bypass_del_producer() explicitly decrements the count before invoking kvm_pi_update_irte(), and kvm_irq_routing_update() only updates IRTE entries if there's an active IRQ bypass producer. Link: https://lore.kernel.org/r/20250611224604.313496-51-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-23KVM: SVM: WARN if updating IRTE GA fields in IOMMU failsSean Christopherson
WARN if updating GA information for an IRTE entry fails as modifying an IRTE should only fail if KVM is buggy, e.g. has stale metadata, and because returning an error that is always ignored is pointless. Link: https://lore.kernel.org/r/20250611224604.313496-50-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-23KVM: SVM: Process all IRTEs on affinity change even if one update failsSean Christopherson
When updating IRTE GA fields, keep processing all other IRTEs if an update fails, as not updating later entries risks making a bad situation worse. Link: https://lore.kernel.org/r/20250611224604.313496-49-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-23KVM: SVM: WARN if (de)activating guest mode in IOMMU failsSean Christopherson
WARN if (de)activating "guest mode" for an IRTE entry fails as modifying an IRTE should only fail if KVM is buggy, e.g. has stale metadata. Link: https://lore.kernel.org/r/20250611224604.313496-48-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-23KVM: SVM: Don't check for assigned device(s) when activating AVICSean Christopherson
Don't short-circuit IRTE updating when (de)activating AVIC based on the VM having assigned devices, as nothing prevents AVIC (de)activation from racing with device (de)assignment. And from a performance perspective, bailing early when there is no assigned device doesn't add much, as ir_list_lock will never be contended if there's no assigned device. Link: https://lore.kernel.org/r/20250611224604.313496-47-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-23KVM: SVM: Don't check for assigned device(s) when updating affinitySean Christopherson
Don't bother checking if a VM has an assigned device when updating AVIC vCPU affinity, querying ir_list is just as cheap and nothing prevents racing with changes in device assignment. Link: https://lore.kernel.org/r/20250611224604.313496-46-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-23iommu/amd: KVM: SVM: Add IRTE metadata to affined vCPU's list if AVIC is ↵Sean Christopherson
inhibited If an IRQ can be posted to a vCPU, but AVIC is currently inhibited on the vCPU, go through the dance of "affining" the IRTE to the vCPU, but leave the actual IRTE in remapped mode. KVM already handles the case where AVIC is inhibited => uninhibited with posted IRQs (see avic_set_pi_irte_mode()), but doesn't handle the scenario where a postable IRQ comes along while AVIC is inhibited. Link: https://lore.kernel.org/r/20250611224604.313496-45-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-23iommu/amd: KVM: SVM: Set pCPU info in IRTE when setting vCPU affinitySean Christopherson
Now that setting vCPU affinity is guarded with ir_list_lock, i.e. now that avic_physical_id_entry can be safely accessed, set the pCPU info straight-away when setting vCPU affinity. Putting the IRTE into posted mode, and then immediately updating the IRTE a second time if the target vCPU is running is wasteful and confusing. This also fixes a flaw where a posted IRQ that arrives between putting the IRTE into guest_mode and setting the correct destination could cause the IOMMU to ring the doorbell on the wrong pCPU. Link: https://lore.kernel.org/r/20250611224604.313496-44-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-23iommu/amd: KVM: SVM: Infer IsRun from validity of pCPU destinationSean Christopherson
Infer whether or not a vCPU should be marked running from the validity of the pCPU on which it is running. amd_iommu_update_ga() already skips the IRTE update if the pCPU is invalid, i.e. passing %true for is_run with an invalid pCPU would be a blatant and egregrious KVM bug. Tested-by: Sairaj Kodilkar <sarunkod@amd.com> Link: https://lore.kernel.org/r/20250611224604.313496-42-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-23KVM: SVM: Take and hold ir_list_lock across IRTE updates in IOMMUSean Christopherson
Now that svm_ir_list_add() isn't overloaded with all manner of weird things, fold it into avic_pi_update_irte(), and more importantly take ir_list_lock across the irq_set_vcpu_affinity() calls to ensure the info that's shoved into the IRTE is fresh. While preemption (and IRQs) is disabled on the task performing the IRTE update, thanks to irqfds.lock, that task doesn't hold the vCPU's mutex, i.e. preemption being disabled is irrelevant. Link: https://lore.kernel.org/r/20250611224604.313496-40-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-23KVM: SVM: Revert IRTE to legacy mode if IOMMU doesn't provide IR metadataSean Christopherson
Revert the IRTE back to remapping mode if the AMD IOMMU driver mucks up and doesn't provide the necessary metadata. Returning an error up the stack without actually handling the error is useless and confusing. Link: https://lore.kernel.org/r/20250611224604.313496-39-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-23KVM: x86: Don't update IRTE entries when old and new routes were !MSISean Christopherson
Skip the entirety of IRTE updates on a GSI routing change if neither the old nor the new routing is for an MSI, i.e. if the neither routing setup allows for posting to a vCPU. If the IRTE isn't already host controlled, KVM has bigger problems. Link: https://lore.kernel.org/r/20250611224604.313496-38-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-23KVM: x86: Skip IOMMU IRTE updates if there's no old or new vCPU being targetedSean Christopherson
Don't "reconfigure" an IRTE into host controlled mode when it's already in the state, i.e. if KVM's GSI routing changes but the IRQ wasn't and still isn't being posted to a vCPU. Link: https://lore.kernel.org/r/20250611224604.313496-37-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-23KVM: x86: Track irq_bypass_vcpu in common x86 codeSean Christopherson
Track the vCPU that is being targeted for IRQ bypass, a.k.a. for a posted IRQ, in common x86 code. This will allow for additional consolidation of the SVM and VMX code. Tested-by: Sairaj Kodilkar <sarunkod@amd.com> Link: https://lore.kernel.org/r/20250611224604.313496-36-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-23KVM: Fold kvm_arch_irqfd_route_changed() into kvm_arch_update_irqfd_routing()Sean Christopherson
Fold kvm_arch_irqfd_route_changed() into kvm_arch_update_irqfd_routing(). Calling arch code to know whether or not to call arch code is absurd. Reviewed-by: Oliver Upton <oliver.upton@linux.dev> Link: https://lore.kernel.org/r/20250611224604.313496-35-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-23KVM: Don't WARN if updating IRQ bypass route failsSean Christopherson
Don't bother WARNing if updating an IRTE route fails now that vendor code provides much more precise WARNs. The generic WARN doesn't provide enough information to actually debug the problem, and has obviously done nothing to surface the myriad bugs in KVM x86's implementation. Drop all of the associated return code plumbing that existed just so that common KVM could WARN. Link: https://lore.kernel.org/r/20250611224604.313496-34-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-23iommu: KVM: Split "struct vcpu_data" into separate AMD vs. Intel structsSean Christopherson
Split the vcpu_data structure that serves as a handoff from KVM to IOMMU drivers into vendor specific structures. Overloading a single structure makes the code hard to read and maintain, is *very* misleading as it suggests that mixing vendors is actually supported, and bastardizing Intel's posted interrupt descriptor address when AMD's IOMMU already has its own structure is quite unnecessary. Tested-by: Sairaj Kodilkar <sarunkod@amd.com> Link: https://lore.kernel.org/r/20250611224604.313496-33-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-23KVM: SVM: Clean up return handling in avic_pi_update_irte()Sean Christopherson
Clean up the return paths for avic_pi_update_irte() now that the refactoring dust has settled. Opportunistically drop the pr_err() on IRTE update failures. Logging that a failure occurred without _any_ context is quite useless. Tested-by: Sairaj Kodilkar <sarunkod@amd.com> Link: https://lore.kernel.org/r/20250611224604.313496-32-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-23KVM: x86: Move posted interrupt tracepoint to common codeSean Christopherson
Move the pi_irte_update tracepoint to common x86, and call it whenever the IRTE is modified. Tracing only the modifications that result in an IRQ being posted to a vCPU makes the tracepoint useless for debugging. Drop the vendor specific address; plumbing that into common code isn't worth the trouble, as the address is meaningless without a whole pile of other information that isn't provided in any tracepoint. Link: https://lore.kernel.org/r/20250611224604.313496-31-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-23KVM: x86: Dedup AVIC vs. PI code for identifying target vCPUSean Christopherson
Hoist the logic for identifying the target vCPU for a posted interrupt into common x86. The code is functionally identical between Intel and AMD. Tested-by: Sairaj Kodilkar <sarunkod@amd.com> Link: https://lore.kernel.org/r/20250611224604.313496-30-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-23KVM: x86: Nullify irqfd->producer after updating IRTEsSean Christopherson
Nullify irqfd->producer (when it's going away) _after_ updating IRTEs so that the producer can be queried during the update. Link: https://lore.kernel.org/r/20250611224604.313496-29-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-23KVM: x86: Move IRQ routing/delivery APIs from x86.c => irq.cSean Christopherson
Move a bunch of IRQ routing and delivery APIs from x86.c to irq.c. x86.c has grown quite fat, and irq.c is the perfect landing spot. Opportunistically rewrite kvm_arch_irq_bypass_del_producer()'s comment, as the existing comment has several typos and is rather confusing. Suggested-by: Paolo Bonzini <pbonzini@redhat.com> Link: https://lore.kernel.org/r/20250611224604.313496-28-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-23KVM: SVM: Extract SVM specific code out of get_pi_vcpu_info()Sean Christopherson
Genericize SVM's get_pi_vcpu_info() so that it can be shared with VMX. The only SVM specific information it provides is the AVIC back page, and that can be trivially retrieved by its sole caller. No functional change intended. Cc: Francesco Lavra <francescolavra.fl@gmail.com> Tested-by: Sairaj Kodilkar <sarunkod@amd.com> Link: https://lore.kernel.org/r/20250611224604.313496-27-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-23KVM: VMX: Stop walking list of routing table entries when updating IRTESean Christopherson
Now that KVM provides the to-be-updated routing entry, stop walking the routing table to find that entry. KVM, via setup_routing_entry() and sanity checked by kvm_get_msi_route(), disallows having a GSI configured to trigger multiple MSIs, i.e. the for-loop can only process one entry. Link: https://lore.kernel.org/r/20250611224604.313496-26-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-23KVM: SVM: Stop walking list of routing table entries when updating IRTESean Christopherson
Now that KVM explicitly passes the new/current GSI routing to pi_update_irte(), simply use the provided routing entry and stop walking the routing table to find that entry. KVM, via setup_routing_entry() and sanity checked by kvm_get_msi_route(), disallows having a GSI configured to trigger multiple MSIs. I.e. this is subtly a glorified nop, as KVM allows at most one MSI per GSI, the for-loop can only ever process one entry, and that entry is the new/current entry (see the WARN_ON_ONCE() added by "KVM: x86: Pass new routing entries and irqfd when updating IRTEs" to ensure @new matches the entry found in the routing table). Tested-by: Sairaj Kodilkar <sarunkod@amd.com> Link: https://lore.kernel.org/r/20250611224604.313496-25-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-23iommu/amd: KVM: SVM: Pass NULL @vcpu_info to indicate "not guest mode"Sean Christopherson
Pass NULL to amd_ir_set_vcpu_affinity() to communicate "don't post to a vCPU" now that there's no need to communicate information back to KVM about the previous vCPU (KVM does its own tracking). Tested-by: Sairaj Kodilkar <sarunkod@amd.com> Link: https://lore.kernel.org/r/20250611224604.313496-24-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-23iommu/amd: KVM: SVM: Use pi_desc_addr to derive ga_root_ptrSean Christopherson
Use vcpu_data.pi_desc_addr instead of amd_iommu_pi_data.base to get the GA root pointer. KVM is the only source of amd_iommu_pi_data.base, and KVM's one and only path for writing amd_iommu_pi_data.base computes the exact same value for vcpu_data.pi_desc_addr and amd_iommu_pi_data.base, and fills amd_iommu_pi_data.base if and only if vcpu_data.pi_desc_addr is valid, i.e. amd_iommu_pi_data.base is fully redundant. Cc: Maxim Levitsky <mlevitsk@redhat.com> Reviewed-by: Joao Martins <joao.m.martins@oracle.com> Reviewed-by: Vasant Hegde <vasant.hegde@amd.com> Tested-by: Sairaj Kodilkar <sarunkod@amd.com> Link: https://lore.kernel.org/r/20250611224604.313496-23-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-23KVM: SVM: Add a comment to explain why avic_vcpu_blocking() ignores IRQ blockingSean Christopherson
Add a comment to explain why KVM clears IsRunning when putting a vCPU, even though leaving IsRunning=1 would be ok from a functional perspective. Per Maxim's experiments, a misbehaving VM could spam the AVIC doorbell so fast as to induce a 50%+ loss in performance. Link: https://lore.kernel.org/all/8d7e0d0391df4efc7cb28557297eb2ec9904f1e5.camel@redhat.com Cc: Maxim Levitsky <mlevitsk@redhat.com> Acked-by: Naveen N Rao (AMD) <naveen@kernel.org> Link: https://lore.kernel.org/r/20250611224604.313496-22-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-23KVM: VMX: Suppress PI notifications whenever the vCPU is putSean Christopherson
Suppress posted interrupt notifications (set PID.SN=1) whenever the vCPU is put, i.e. unloaded, not just when the vCPU is preempted, as KVM doesn't do anything in response to a notification IRQ that arrives in the host, nor does KVM rely on the Outstanding Notification (PID.ON) flag when the vCPU is unloaded. And, the cost of scanning the PIR to manually set PID.ON when loading the vCPU is quite small, especially relative to the cost of loading (and unloading) a vCPU. On the flip side, leaving SN clear means a notification for the vCPU will result in a spurious IRQ for the pCPU, even if vCPU task is scheduled out, running in userspace, etc. Even worse, if the pCPU is running a different vCPU, the spurious IRQ could trigger posted interrupt processing for the wrong vCPU, which is technically a violation of the architecture, as setting bits in PIR aren't supposed to be propagated to the vIRR until a notification IRQ is received. The saving grace of the current behavior is that hardware sends notification interrupts if and only if PID.ON=0, i.e. only the first posted interrupt for a vCPU will trigger a spurious IRQ (for each window where the vCPU is unloaded). Ideally, KVM would suppress notifications before enabling IRQs in the VM-Exit, but KVM relies on PID.ON as an indicator that there is a posted interrupt pending in PIR, e.g. in vmx_sync_pir_to_irr(), and sadly there is no way to ask hardware to set PID.ON, but not generate an interrupt. That could be solved by using pi_has_pending_interrupt() instead of checking only PID.ON, but it's not at all clear that would be a performance win, as KVM would end up scanning the entire PIR whenever an interrupt isn't pending. And long term, the spurious IRQ window, i.e. where a vCPU is loaded with IRQs enabled, can effectively be made smaller for hot paths by moving performance critical VM-Exit handlers into the fastpath, i.e. by never enabling IRQs for hot path VM-Exits. Link: https://lore.kernel.org/r/20250611224604.313496-21-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-23KVM: SVM: Disable (x2)AVIC IPI virtualization if CPU has erratum #1235Maxim Levitsky
Disable IPI virtualization on AMD Family 17h CPUs (Zen2 and Zen1), as hardware doesn't reliably detect changes to the 'IsRunning' bit during ICR write emulation, and might fail to VM-Exit on the sending vCPU, if IsRunning was recently cleared. The absence of the VM-Exit leads to KVM not waking (or triggering nested VM-Exit of) the target vCPU(s) of the IPI, which can lead to hung vCPUs, unbounded delays in L2 execution, etc. To workaround the erratum, simply disable IPI virtualization, which prevents KVM from setting IsRunning and thus eliminates the race where hardware sees a stale IsRunning=1. As a result, all ICR writes (except when "Self" shorthand is used) will VM-Exit and therefore be correctly emulated by KVM. Disabling IPI virtualization does carry a performance penalty, but benchmarkng shows that enabling AVIC without IPI virtualization is still much better than not using AVIC at all, because AVIC still accelerates posted interrupts and the receiving end of the IPIs. Note, when virtualizing Self-IPIs, the CPU skips reading the physical ID table and updates the vIRR directly (because the vCPU is by definition actively running), i.e. Self-IPI isn't susceptible to the erratum *and* is still accelerated by hardware. Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com> [sean: rebase, massage changelog, disallow user override] Acked-by: Naveen N Rao (AMD) <naveen@kernel.org> Link: https://lore.kernel.org/r/20250611224604.313496-20-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>