linux.git - Linus' kernel tree

Age	Commit message (Collapse)	Author
2025-06-25	Revert "kvm: detect assigned device via irqbypass manager"	Sean Christopherson
	Now that KVM explicitly tracks the number of possible bypass IRQs, and doesn't conflate IRQ bypass with host MMIO access, stop bumping the assigned device count when adding an IRQ bypass producer. This reverts commit 2edd9cb79fb31b0907c6e0cdce2824780cf9b153. Link: https://lore.kernel.org/r/20250523011756.3243624-5-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-25	Merge branch 'kvm-x86 mmio'	Sean Christopherson
	Merge the MMIO stale data branch with the device posted IRQs branch to provide a common base for removing KVM's tracking of "assigned" devices. Link: https://lore.kernel.org/all/20250523011756.3243624-1-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-25	KVM: SVM: Simplify MSR interception logic for IA32_XSS MSR	Chao Gao
	Use svm_set_intercept_for_msr() directly to configure IA32_XSS MSR interception, ensuring consistency with other cases where MSRs are intercepted depending on guest caps and CPUIDs. No functional change intended. Signed-off-by: Chao Gao <chao.gao@intel.com> Reviewed-by: Dapeng Mi <dapeng1.mi@linux.intel.com> Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com> Link: https://lore.kernel.org/r/20250612081947.94081-3-chao.gao@intel.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-25	KVM: x86/hyper-v: Skip non-canonical addresses during PV TLB flush	Manuel Andreas
	In KVM guests with Hyper-V hypercalls enabled, the hypercalls HVCALL_FLUSH_VIRTUAL_ADDRESS_LIST and HVCALL_FLUSH_VIRTUAL_ADDRESS_LIST_EX allow a guest to request invalidation of portions of a virtual TLB. For this, the hypercall parameter includes a list of GVAs that are supposed to be invalidated. However, when non-canonical GVAs are passed, there is currently no filtering in place and they are eventually passed to checked invocations of INVVPID on Intel / INVLPGA on AMD. While AMD's INVLPGA silently ignores non-canonical addresses (effectively a no-op), Intel's INVVPID explicitly signals VM-Fail and ultimately triggers the WARN_ONCE in invvpid_error(): invvpid failed: ext=0x0 vpid=1 gva=0xaaaaaaaaaaaaa000 WARNING: CPU: 6 PID: 326 at arch/x86/kvm/vmx/vmx.c:482 invvpid_error+0x91/0xa0 [kvm_intel] Modules linked in: kvm_intel kvm 9pnet_virtio irqbypass fuse CPU: 6 UID: 0 PID: 326 Comm: kvm-vm Not tainted 6.15.0 #14 PREEMPT(voluntary) RIP: 0010:invvpid_error+0x91/0xa0 [kvm_intel] Call Trace: vmx_flush_tlb_gva+0x320/0x490 [kvm_intel] kvm_hv_vcpu_flush_tlb+0x24f/0x4f0 [kvm] kvm_arch_vcpu_ioctl_run+0x3013/0x5810 [kvm] Hyper-V documents that invalid GVAs (those that are beyond a partition's GVA space) are to be ignored. While not completely clear whether this ruling also applies to non-canonical GVAs, it is likely fine to make that assumption, and manual testing on Azure confirms "real" Hyper-V interprets the specification in the same way. Skip non-canonical GVAs when processing the list of address to avoid tripping the INVVPID failure. Alternatively, KVM could filter out "bad" GVAs before inserting into the FIFO, but practically speaking the only downside of pushing validation to the final processing is that doing so is suboptimal for the guest, and no well-behaved guest will request TLB flushes for non-canonical addresses. Fixes: 260970862c88 ("KVM: x86: hyper-v: Handle HVCALL_FLUSH_VIRTUAL_ADDRESS_LIST{,EX} calls gently") Cc: stable@vger.kernel.org Signed-off-by: Manuel Andreas <manuel.andreas@tum.de> Suggested-by: Vitaly Kuznetsov <vkuznets@redhat.com> Link: https://lore.kernel.org/r/c090efb3-ef82-499f-a5e0-360fc8420fb7@tum.de Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-25	KVM: VMX: Apply MMIO Stale Data mitigation if KVM maps MMIO into the guest	Sean Christopherson
	Enforce the MMIO State Data mitigation if KVM has ever mapped host MMIO into the VM, not if the VM has an assigned device. VFIO is but one of many ways to map host MMIO into a KVM guest, and even within VFIO, formally attaching a device to a VM via KVM_DEV_VFIO_FILE_ADD is entirely optional. Track whether or not the guest can access host MMIO on a per-MMU basis, i.e. based on whether or not the vCPU has a mapping to host MMIO. For simplicity, track MMIO mappings in "special" rools (those without a kvm_mmu_page) at the VM level, as only Intel CPUs are vulnerable, and so only legacy 32-bit shadow paging is affected, i.e. lack of precise tracking is a complete non-issue. Make the per-MMU and per-VM flags sticky. Detecting when all MMIO mappings have been removed would be absurdly complex. And in practice, removing MMIO from a guest will be done by deleting the associated memslot, which by default will force KVM to re-allocate all roots. Special roots will forever be mitigated, but as above, the affected scenarios are not expected to be performance sensitive. Use a VMX_RUN flag to communicate the need for a buffers flush to vmx_vcpu_enter_exit() so that kvm_vcpu_can_access_host_mmio() and all its dependencies don't need to be marked __always_inline, e.g. so that KASAN doesn't trigger a noinstr violation. Cc: Pawan Gupta <pawan.kumar.gupta@linux.intel.com> Cc: Borislav Petkov <bp@alien8.de> Fixes: 8cb861e9e3c9 ("x86/speculation/mmio: Add mitigation for Processor MMIO Stale Data") Tested-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com> Link: https://lore.kernel.org/r/20250523011756.3243624-4-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-24	KVM: x86: Deduplicate MSR interception enabling and disabling	Chao Gao
	Extract a common function from MSR interception disabling logic and create disabling and enabling functions based on it. This removes most of the duplicated code for MSR interception disabling/enabling. No functional change intended. Signed-off-by: Chao Gao <chao.gao@intel.com> Reviewed-by: Dapeng Mi <dapeng1.mi@linux.intel.com> Link: https://lore.kernel.org/r/20250612081947.94081-2-chao.gao@intel.com [sean: s/enable/set, inline the wrappers] Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-24	KVM: x86/mmu: Locally cache whether a PFN is host MMIO when making a SPTE	Sean Christopherson
	When making a SPTE, cache whether or not the target PFN is host MMIO in order to avoid multiple rounds of the slow path of kvm_is_mmio_pfn(), e.g. hitting pat_pfn_immune_to_uc_mtrr() in particular can be problematic. KVM currently avoids multiple calls by virtue of the two users being mutually exclusive (.get_mt_mask() is Intel-only, shadow_me_value is AMD-only), but that won't hold true if/when KVM needs to detect host MMIO mappings for other reasons, e.g. for mitigating the MMIO Stale Data vulnerability. No functional change intended. Tested-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com> Link: https://lore.kernel.org/r/20250523011756.3243624-3-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-24	KVM: x86: Avoid calling kvm_is_mmio_pfn() when kvm_x86_ops.get_mt_mask is NULL	Sean Christopherson
	Guard the call to kvm_x86_call(get_mt_mask) with an explicit check on kvm_x86_ops.get_mt_mask so as to avoid unnecessarily calling kvm_is_mmio_pfn(), which is moderately expensive for some backing types. E.g. lookup_memtype() conditionally takes a system-wide spinlock if KVM ends up being call pat_pfn_immune_to_uc_mtrr(), e.g. for DAX memory. While the call to kvm_x86_ops.get_mt_mask() itself is elided, the compiler still needs to compute all parameters, as it can't know at build time that the call will be squashed. <+243>: call 0xffffffff812ad880 <kvm_is_mmio_pfn> <+248>: mov %r13,%rsi <+251>: mov %rbx,%rdi <+254>: movzbl %al,%edx <+257>: call 0xffffffff81c26af0 <__SCT__kvm_x86_get_mt_mask> Fixes: 3fee4837ef40 ("KVM: x86: remove shadow_memtype_mask") Tested-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com> Link: https://lore.kernel.org/r/20250523011756.3243624-2-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-24	x86/traps: Initialize DR7 by writing its architectural reset value	Xin Li (Intel)
	Initialize DR7 by writing its architectural reset value to always set bit 10, which is reserved to '1', when "clearing" DR7 so as not to trigger unanticipated behavior if said bit is ever unreserved, e.g. as a feature enabling flag with inverted polarity. Signed-off-by: Xin Li (Intel) <xin@zytor.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: H. Peter Anvin (Intel) <hpa@zytor.com> Reviewed-by: Sohil Mehta <sohil.mehta@intel.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Sean Christopherson <seanjc@google.com> Tested-by: Sohil Mehta <sohil.mehta@intel.com> Cc:stable@vger.kernel.org Link: https://lore.kernel.org/all/20250620231504.2676902-3-xin%40zytor.com
2025-06-24	KVM: x86/mmu: Defer allocation of shadow MMU's hashed page list	Sean Christopherson
	When the TDP MMU is enabled, i.e. when the shadow MMU isn't used until a nested TDP VM is run, defer allocation of the array of hashed lists used to track shadow MMU pages until the first shadow root is allocated. Setting the list outside of mmu_lock is safe, as concurrent readers must hold mmu_lock in some capacity, shadow pages can only be added (or removed) from the list when mmu_lock is held for write, and tasks that are creating a shadow root are serialized by slots_arch_lock. I.e. it's impossible for the list to become non-empty until all readers go away, and so readers are guaranteed to see an empty list even if they make multiple calls to kvm_get_mmu_page_hash() in a single mmu_lock critical section. Use smp_store_release() and smp_load_acquire() to access the hash table pointer to ensure the stores to zero the lists are retired before readers start to walk the list. E.g. if the compiler hoisted the store before the zeroing of memory, for_each_gfn_valid_sp_with_gptes() could consume stale kernel data. Cc: James Houghton <jthoughton@google.com> Link: https://lore.kernel.org/r/20250523001138.3182794-5-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-24	KVM: x86: Use kvzalloc() to allocate VM struct	Sean Christopherson
	Allocate VM structs via kvzalloc(), i.e. try to use a contiguous physical allocation before falling back to __vmalloc(), to avoid the overhead of establishing the virtual mappings. For non-debug builds, The SVM and VMX (and TDX) structures are now just below 7000 bytes in the worst case scenario (see below), i.e. are order-1 allocations, and will likely remain that way for quite some time. Add compile-time assertions in vendor code to ensure the size of the structures, sans the memslot hash tables, are order-0 allocations, i.e. are less than 4KiB. There's nothing fundamentally wrong with a larger kvm_{svm,vmx,tdx} size, but given that the size of the structure (without the memslots hash tables) is below 2KiB after 18+ years of existence, more than doubling the size would be quite notable. Add sanity checks on the memslot hash table sizes, partly to ensure they aren't resized without accounting for the impact on VM structure size, and partly to document that the majority of the size of VM structures comes from the memslots. Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com> Link: https://lore.kernel.org/r/20250523001138.3182794-4-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-24	KVM: x86/mmu: Dynamically allocate shadow MMU's hashed page list	Sean Christopherson
	Dynamically allocate the (massive) array of hashed lists used to track shadow pages, as the array itself is 32KiB, i.e. is an order-3 allocation all on its own, and is exactly an order-3 allocation. Dynamically allocating the array will allow allocating "struct kvm" using kvmalloc(), and will also allow deferring allocation of the array until it's actually needed, i.e. until the first shadow root is allocated. Opportunistically use kvmalloc() for the hashed lists, as an order-3 allocation is (stating the obvious) less likely to fail than an order-4 allocation, and the overhead of vmalloc() is undesirable given that the size of the allocation is fixed. Cc: Vipin Sharma <vipinsh@google.com> Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com> Link: https://lore.kernel.org/r/20250523001138.3182794-3-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-24	KVM: x86/xen: Allow 'out of range' event channel ports in IRQ routing table.	David Woodhouse
	To avoid imposing an ordering constraint on userspace, allow 'invalid' event channel targets to be configured in the IRQ routing table. This is the same as accepting interrupts targeted at vCPUs which don't exist yet, which is already the case for both Xen event channels and for MSIs (which don't do any filtering of permitted APIC ID targets at all). If userspace actually triggers an IRQ with an invalid target, that will fail cleanly, as kvm_xen_set_evtchn_fast() also does the same range check. If KVM enforced that the IRQ target must be valid at the time it is configured, that would force userspace to create all vCPUs and do various other parts of setup (in this case, setting the Xen long_mode) before restoring the IRQ table. Cc: stable@vger.kernel.org Signed-off-by: David Woodhouse <dwmw@amazon.co.uk> Reviewed-by: Paul Durrant <paul@xen.org> Link: https://lore.kernel.org/r/e489252745ac4b53f1f7f50570b03fb416aa2065.camel@infradead.org [sean: massage comment] Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-24	KVM: x86/hyper-v: Use preallocated per-vCPU buffer for de-sparsified vCPU masks	Sean Christopherson
	Use a preallocated per-vCPU bitmap for tracking the unpacked set of vCPUs being targeted for Hyper-V's paravirt TLB flushing. If KVM_MAX_NR_VCPUS is set to 4096 (which is allowed even for MAXSMP=n builds), putting the vCPU mask on-stack pushes kvm_hv_flush_tlb() past the default FRAME_WARN limit. arch/x86/kvm/hyperv.c:2001:12: error: stack frame size (1288) exceeds limit (1024) in 'kvm_hv_flush_tlb' [-Werror,-Wframe-larger-than] 2001 \| static u64 kvm_hv_flush_tlb(struct kvm_vcpu vcpu, struct kvm_hv_hcall hc) \| ^ 1 error generated. Note, sparse_banks was given the same treatment by commit 7d5e88d301f8 ("KVM: x86: hyper-v: Use preallocated buffer in 'struct kvm_vcpu_hv' instead of on-stack 'sparse_banks'"), for the exact same reason. Reported-by: Abinash Lalotra <abinashsinghlalotra@gmail.com> Closes: https://lore.kernel.org/all/20250613111023.786265-1-abinashsinghlalotra@gmail.com Link: https://lore.kernel.org/all/aEylI-O8kFnFHrOH@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-24	KVM: SVM: Initialize vmsa_pa in VMCB to INVALID_PAGE if VMSA page is NULL	Sean Christopherson
	When creating an SEV-ES vCPU for intra-host migration, set its vmsa_pa to INVALID_PAGE to harden against doing VMRUN with a bogus VMSA (KVM checks for a valid VMSA page in pre_sev_run()). Cc: Tom Lendacky <thomas.lendacky@amd.com> Reviewed-by: Liam Merwick <liam.merwick@oracle.com> Tested-by: Liam Merwick <liam.merwick@oracle.com> Link: https://lore.kernel.org/r/20250602224459.41505-3-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-24	KVM: SVM: Reject SEV{-ES} intra host migration if vCPU creation is in-flight	Sean Christopherson
	Reject migration of SEV{-ES} state if either the source or destination VM is actively creating a vCPU, i.e. if kvm_vm_ioctl_create_vcpu() is in the section between incrementing created_vcpus and online_vcpus. The bulk of vCPU creation runs _outside_ of kvm->lock to allow creating multiple vCPUs in parallel, and so sev_info.es_active can get toggled from false=>true in the destination VM after (or during) svm_vcpu_create(), resulting in an SEV{-ES} VM effectively having a non-SEV{-ES} vCPU. The issue manifests most visibly as a crash when trying to free a vCPU's NULL VMSA page in an SEV-ES VM, but any number of things can go wrong. BUG: unable to handle page fault for address: ffffebde00000000 #PF: supervisor read access in kernel mode #PF: error_code(0x0000) - not-present page PGD 0 P4D 0 Oops: Oops: 0000 [#1] SMP KASAN NOPTI CPU: 227 UID: 0 PID: 64063 Comm: syz.5.60023 Tainted: G U O 6.15.0-smp-DEV #2 NONE Tainted: [U]=USER, [O]=OOT_MODULE Hardware name: Google, Inc. Arcadia_IT_80/Arcadia_IT_80, BIOS 12.52.0-0 10/28/2024 RIP: 0010:constant_test_bit arch/x86/include/asm/bitops.h:206 [inline] RIP: 0010:arch_test_bit arch/x86/include/asm/bitops.h:238 [inline] RIP: 0010:_test_bit include/asm-generic/bitops/instrumented-non-atomic.h:142 [inline] RIP: 0010:PageHead include/linux/page-flags.h:866 [inline] RIP: 0010:___free_pages+0x3e/0x120 mm/page_alloc.c:5067 Code: <49> f7 06 40 00 00 00 75 05 45 31 ff eb 0c 66 90 4c 89 f0 4c 39 f0 RSP: 0018:ffff8984551978d0 EFLAGS: 00010246 RAX: 0000777f80000001 RBX: 0000000000000000 RCX: ffffffff918aeb98 RDX: 0000000000000000 RSI: 0000000000000008 RDI: ffffebde00000000 RBP: 0000000000000000 R08: ffffebde00000007 R09: 1ffffd7bc0000000 R10: dffffc0000000000 R11: fffff97bc0000001 R12: dffffc0000000000 R13: ffff8983e19751a8 R14: ffffebde00000000 R15: 1ffffd7bc0000000 FS: 0000000000000000(0000) GS:ffff89ee661d3000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: ffffebde00000000 CR3: 000000793ceaa000 CR4: 0000000000350ef0 DR0: 0000000000000000 DR1: 0000000000000b5f DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 Call Trace: <TASK> sev_free_vcpu+0x413/0x630 arch/x86/kvm/svm/sev.c:3169 svm_vcpu_free+0x13a/0x2a0 arch/x86/kvm/svm/svm.c:1515 kvm_arch_vcpu_destroy+0x6a/0x1d0 arch/x86/kvm/x86.c:12396 kvm_vcpu_destroy virt/kvm/kvm_main.c:470 [inline] kvm_destroy_vcpus+0xd1/0x300 virt/kvm/kvm_main.c:490 kvm_arch_destroy_vm+0x636/0x820 arch/x86/kvm/x86.c:12895 kvm_put_kvm+0xb8e/0xfb0 virt/kvm/kvm_main.c:1310 kvm_vm_release+0x48/0x60 virt/kvm/kvm_main.c:1369 __fput+0x3e4/0x9e0 fs/file_table.c:465 task_work_run+0x1a9/0x220 kernel/task_work.c:227 exit_task_work include/linux/task_work.h:40 [inline] do_exit+0x7f0/0x25b0 kernel/exit.c:953 do_group_exit+0x203/0x2d0 kernel/exit.c:1102 get_signal+0x1357/0x1480 kernel/signal.c:3034 arch_do_signal_or_restart+0x40/0x690 arch/x86/kernel/signal.c:337 exit_to_user_mode_loop kernel/entry/common.c:111 [inline] exit_to_user_mode_prepare include/linux/entry-common.h:329 [inline] __syscall_exit_to_user_mode_work kernel/entry/common.c:207 [inline] syscall_exit_to_user_mode+0x67/0xb0 kernel/entry/common.c:218 do_syscall_64+0x7c/0x150 arch/x86/entry/syscall_64.c:100 entry_SYSCALL_64_after_hwframe+0x76/0x7e RIP: 0033:0x7f87a898e969 </TASK> Modules linked in: gq(O) gsmi: Log Shutdown Reason 0x03 CR2: ffffebde00000000 ---[ end trace 0000000000000000 ]--- Deliberately don't check for a NULL VMSA when freeing the vCPU, as crashing the host is likely desirable due to the VMSA being consumed by hardware. E.g. if KVM manages to allow VMRUN on the vCPU, hardware may read/write a bogus VMSA page. Accessing PFN 0 is "fine"-ish now that it's sequestered away thanks to L1TF, but panicking in this scenario is preferable to potentially running with corrupted state. Reported-by: Alexander Potapenko <glider@google.com> Tested-by: Alexander Potapenko <glider@google.com> Fixes: 0b020f5af092 ("KVM: SEV: Add support for SEV-ES intra host migration") Fixes: b56639318bb2 ("KVM: SEV: Add support for SEV intra host migration") Cc: stable@vger.kernel.org Cc: James Houghton <jthoughton@google.com> Cc: Peter Gonda <pgonda@google.com> Reviewed-by: Liam Merwick <liam.merwick@oracle.com> Tested-by: Liam Merwick <liam.merwick@oracle.com> Reviewed-by: James Houghton <jthoughton@google.com> Link: https://lore.kernel.org/r/20250602224459.41505-2-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-23	KVM: x86: Rename kvm_set_msi_irq() => kvm_msi_to_lapic_irq()	Sean Christopherson
	Rename kvm_set_msi_irq() to kvm_msi_to_lapic_irq() to better capture what it actually does, e.g. it's _really_ easy to conflate kvm_set_msi_irq() with kvm_set_msi(). Opportunistically delete the public declaration and export, as they are no longer used/needed. No functional change intended. Link: https://lore.kernel.org/r/20250611224604.313496-64-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-23	KVM: SVM: Generate GA log IRQs only if the associated vCPUs is blocking	Sean Christopherson
	Configure IRTEs to GA log interrupts for device posted IRQs that hit non-running vCPUs if and only if the target vCPU is blocking, i.e. actually needs a wake event. If the vCPU has exited to userspace or was preempted, generating GA log entries and interrupts is wasteful and unnecessary, as the vCPU will be re-loaded and/or scheduled back in irrespective of the GA log notification (avic_ga_log_notifier() is just a fancy wrapper for kvm_vcpu_wake_up()). Use a should-be-zero bit in the vCPU's Physical APIC ID Table Entry to track whether or not the vCPU's associated IRTEs are configured to generate GA logs, but only set the synthetic bit in KVM's "cache", i.e. never set the should-be-zero bit in tables that are used by hardware. Use a synthetic bit instead of a dedicated boolean to minimize the odds of messing up the locking, i.e. so that all the existing rules that apply to avic_physical_id_entry for IS_RUNNING are reused verbatim for GA_LOG_INTR. Note, because KVM (by design) "puts" AVIC state in a "pre-blocking" phase, using kvm_vcpu_is_blocking() to track the need for notifications isn't a viable option. Link: https://lore.kernel.org/r/20250611224604.313496-63-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-23	iommu/amd: KVM: SVM: Allow KVM to control need for GA log interrupts	Sean Christopherson
	Add plumbing to the AMD IOMMU driver to allow KVM to control whether or not an IRTE is configured to generate GA log interrupts. KVM only needs a notification if the target vCPU is blocking, so the vCPU can be awakened. If a vCPU is preempted or exits to userspace, KVM clears is_run, but will set the vCPU back to running when userspace does KVM_RUN and/or the vCPU task is scheduled back in, i.e. KVM doesn't need a notification. Unconditionally pass "true" in all KVM paths to isolate the IOMMU changes from the KVM changes insofar as possible. Opportunistically swap the ordering of parameters for amd_iommu_update_ga() so that the match amd_iommu_activate_guest_mode(). Note, as of this writing, the AMD IOMMU manual doesn't list GALogIntr as a non-cached field, but per AMD hardware architects, it's not cached and can be safely updated without an invalidation. Link: https://lore.kernel.org/all/b29b8c22-2fd4-4b5e-b755-9198874157c7@amd.com Cc: Vasant Hegde <vasant.hegde@amd.com> Cc: Joao Martins <joao.m.martins@oracle.com> Link: https://lore.kernel.org/r/20250611224604.313496-62-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-23	KVM: SVM: Consolidate IRTE update when toggling AVIC on/off	Sean Christopherson
	Fold the IRTE modification logic in avic_refresh_apicv_exec_ctrl() into __avic_vcpu_{load,put}(), and add a param to the helpers to communicate whether or not AVIC is being toggled, i.e. if IRTE needs a "full" update, or just a quick update to set the CPU and IsRun. Link: https://lore.kernel.org/r/20250611224604.313496-61-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-23	KVM: SVM: Don't check vCPU's blocking status when toggling AVIC on/off	Sean Christopherson
	Don't query a vCPU's blocking status when toggling AVIC on/off; barring KVM bugs, the vCPU can't be blocking when refreshing AVIC controls. And if there are KVM bugs, ensuring the vCPU and its associated IRTEs are in the correct state is desirable, i.e. well worth any overhead in a buggy scenario. Isolating the "real" load/put flows will allow moving the IOMMU IRTE (de)activation logic from avic_refresh_apicv_exec_ctrl() to avic_update_iommu_vcpu_affinity(), i.e. will allow updating the vCPU's physical ID entry and its IRTEs in a common path, under a single critical section of ir_list_lock. Tested-by: Sairaj Kodilkar <sarunkod@amd.com> Link: https://lore.kernel.org/r/20250611224604.313496-60-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-23	KVM: SVM: Fold avic_set_pi_irte_mode() into its sole caller	Sean Christopherson
	Fold avic_set_pi_irte_mode() into avic_refresh_apicv_exec_ctrl() in anticipation of moving the __avic_vcpu_{load,put}() calls into the critical section, and because having a one-off helper with a name that's easily confused with avic_pi_update_irte() is unnecessary. No functional change intended. Tested-by: Sairaj Kodilkar <sarunkod@amd.com> Link: https://lore.kernel.org/r/20250611224604.313496-59-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-23	KVM: SVM: Use vcpu_idx, not vcpu_id, for GA log tag/metadata	Sean Christopherson
	Use a vCPU's index, not its ID, for the GA log tag/metadata that's used to find and kick vCPUs when a device posted interrupt serves as a wake event. Lookups on a vCPU index are O(fast) (not sure what xa_load() actually provides), whereas a vCPU ID lookup is O(n) if a vCPU's ID doesn't match its index. Unlike the Physical APIC Table, which is accessed by hardware when virtualizing IPIs, hardware doesn't consume the GA tag, i.e. KVM _must_ use APIC IDs to fill the Physical APIC Table, but KVM has free rein over the format/meaning of the GA tag. Tested-by: Sairaj Kodilkar <sarunkod@amd.com> Link: https://lore.kernel.org/r/20250611224604.313496-57-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-23	KVM: VMX: WARN if VT-d Posted IRQs aren't possible when starting IRQ bypass	Sean Christopherson
	WARN if KVM attempts to "start" IRQ bypass when VT-d Posted IRQs are disabled, to make it obvious that the logic is a sanity check, and so that a bug related to nr_possible_bypass_irqs is more like to cause noisy failures, e.g. so that KVM doesn't silently fail to wake blocking vCPUs. Link: https://lore.kernel.org/r/20250611224604.313496-56-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-23	KVM: x86: Decouple device assignment from IRQ bypass	Sean Christopherson
	Use a dedicated counter to track the number of IRQs that can utilize IRQ bypass instead of piggybacking the assigned device count. As evidenced by commit 2edd9cb79fb3 ("kvm: detect assigned device via irqbypass manager"), it's possible for a device to be able to post IRQs to a vCPU without said device being assigned to a VM. Leave the calls to kvm_arch_{start,end}_assignment() alone for the moment to avoid regressing the MMIO stale data mitigation. KVM is abusing the assigned device count when applying mmio_stale_data_clear, and it's not at all clear if vDPA devices rely on this behavior. This will hopefully be cleaned up in the future, as the number of assigned devices is a terrible heuristic for detecting if a VM has access to host MMIO. Link: https://lore.kernel.org/r/20250611224604.313496-55-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-23	KVM: SVM: WARN if ir_list is non-empty at vCPU free	Sean Christopherson
	Now that AVIC IRTE tracking is in a mostly sane state, WARN if a vCPU is freed with ir_list entries, i.e. if KVM leaves a dangling IRTE. Initialize the per-vCPU interrupt remapping list and its lock even if AVIC is disabled so that the WARN doesn't hit false positives (and so that KVM doesn't need to call into AVIC code for a simple sanity check). Link: https://lore.kernel.org/r/20250611224604.313496-54-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-23	KVM: x86: WARN if IRQ bypass routing is updated without in-kernel local APIC	Sean Christopherson
	Yell if kvm_pi_update_irte() is reached without an in-kernel local APIC, as kvm_arch_irqfd_allowed() should prevent attaching an irqfd and thus any and all postable IRQs to an APIC-less VM. Link: https://lore.kernel.org/r/20250611224604.313496-53-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-23	KVM: x86: WARN if IRQ bypass isn't supported in kvm_pi_update_irte()	Sean Christopherson
	WARN if kvm_pi_update_irte() is reached without IRQ bypass support, as the code is only reachable if the VM already has an IRQ bypass producer (see kvm_irq_routing_update()), or from kvm_arch_irq_bypass_{add,del}_producer(), which, stating the obvious, are called if and only if KVM enables its IRQ bypass hooks. Cc: David Matlack <dmatlack@google.com> Link: https://lore.kernel.org/r/20250611224604.313496-52-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-23	KVM: x86: Drop superfluous "has assigned device" check in kvm_pi_update_irte()	Sean Christopherson
	Don't bother checking if the VM has an assigned device when updating IRTE entries. kvm_arch_irq_bypass_add_producer() explicitly increments the assigned device count, kvm_arch_irq_bypass_del_producer() explicitly decrements the count before invoking kvm_pi_update_irte(), and kvm_irq_routing_update() only updates IRTE entries if there's an active IRQ bypass producer. Link: https://lore.kernel.org/r/20250611224604.313496-51-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-23	KVM: SVM: WARN if updating IRTE GA fields in IOMMU fails	Sean Christopherson
	WARN if updating GA information for an IRTE entry fails as modifying an IRTE should only fail if KVM is buggy, e.g. has stale metadata, and because returning an error that is always ignored is pointless. Link: https://lore.kernel.org/r/20250611224604.313496-50-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-23	KVM: SVM: Process all IRTEs on affinity change even if one update fails	Sean Christopherson
	When updating IRTE GA fields, keep processing all other IRTEs if an update fails, as not updating later entries risks making a bad situation worse. Link: https://lore.kernel.org/r/20250611224604.313496-49-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-23	KVM: SVM: WARN if (de)activating guest mode in IOMMU fails	Sean Christopherson
	WARN if (de)activating "guest mode" for an IRTE entry fails as modifying an IRTE should only fail if KVM is buggy, e.g. has stale metadata. Link: https://lore.kernel.org/r/20250611224604.313496-48-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-23	KVM: SVM: Don't check for assigned device(s) when activating AVIC	Sean Christopherson
	Don't short-circuit IRTE updating when (de)activating AVIC based on the VM having assigned devices, as nothing prevents AVIC (de)activation from racing with device (de)assignment. And from a performance perspective, bailing early when there is no assigned device doesn't add much, as ir_list_lock will never be contended if there's no assigned device. Link: https://lore.kernel.org/r/20250611224604.313496-47-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-23	KVM: SVM: Don't check for assigned device(s) when updating affinity	Sean Christopherson
	Don't bother checking if a VM has an assigned device when updating AVIC vCPU affinity, querying ir_list is just as cheap and nothing prevents racing with changes in device assignment. Link: https://lore.kernel.org/r/20250611224604.313496-46-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-23	iommu/amd: KVM: SVM: Add IRTE metadata to affined vCPU's list if AVIC is ↵	Sean Christopherson
	inhibited If an IRQ can be posted to a vCPU, but AVIC is currently inhibited on the vCPU, go through the dance of "affining" the IRTE to the vCPU, but leave the actual IRTE in remapped mode. KVM already handles the case where AVIC is inhibited => uninhibited with posted IRQs (see avic_set_pi_irte_mode()), but doesn't handle the scenario where a postable IRQ comes along while AVIC is inhibited. Link: https://lore.kernel.org/r/20250611224604.313496-45-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-23	iommu/amd: KVM: SVM: Set pCPU info in IRTE when setting vCPU affinity	Sean Christopherson
	Now that setting vCPU affinity is guarded with ir_list_lock, i.e. now that avic_physical_id_entry can be safely accessed, set the pCPU info straight-away when setting vCPU affinity. Putting the IRTE into posted mode, and then immediately updating the IRTE a second time if the target vCPU is running is wasteful and confusing. This also fixes a flaw where a posted IRQ that arrives between putting the IRTE into guest_mode and setting the correct destination could cause the IOMMU to ring the doorbell on the wrong pCPU. Link: https://lore.kernel.org/r/20250611224604.313496-44-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-23	iommu/amd: KVM: SVM: Infer IsRun from validity of pCPU destination	Sean Christopherson
	Infer whether or not a vCPU should be marked running from the validity of the pCPU on which it is running. amd_iommu_update_ga() already skips the IRTE update if the pCPU is invalid, i.e. passing %true for is_run with an invalid pCPU would be a blatant and egregrious KVM bug. Tested-by: Sairaj Kodilkar <sarunkod@amd.com> Link: https://lore.kernel.org/r/20250611224604.313496-42-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-23	KVM: SVM: Take and hold ir_list_lock across IRTE updates in IOMMU	Sean Christopherson
	Now that svm_ir_list_add() isn't overloaded with all manner of weird things, fold it into avic_pi_update_irte(), and more importantly take ir_list_lock across the irq_set_vcpu_affinity() calls to ensure the info that's shoved into the IRTE is fresh. While preemption (and IRQs) is disabled on the task performing the IRTE update, thanks to irqfds.lock, that task doesn't hold the vCPU's mutex, i.e. preemption being disabled is irrelevant. Link: https://lore.kernel.org/r/20250611224604.313496-40-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-23	KVM: SVM: Revert IRTE to legacy mode if IOMMU doesn't provide IR metadata	Sean Christopherson
	Revert the IRTE back to remapping mode if the AMD IOMMU driver mucks up and doesn't provide the necessary metadata. Returning an error up the stack without actually handling the error is useless and confusing. Link: https://lore.kernel.org/r/20250611224604.313496-39-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-23	KVM: x86: Don't update IRTE entries when old and new routes were !MSI	Sean Christopherson
	Skip the entirety of IRTE updates on a GSI routing change if neither the old nor the new routing is for an MSI, i.e. if the neither routing setup allows for posting to a vCPU. If the IRTE isn't already host controlled, KVM has bigger problems. Link: https://lore.kernel.org/r/20250611224604.313496-38-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-23	KVM: x86: Skip IOMMU IRTE updates if there's no old or new vCPU being targeted	Sean Christopherson
	Don't "reconfigure" an IRTE into host controlled mode when it's already in the state, i.e. if KVM's GSI routing changes but the IRQ wasn't and still isn't being posted to a vCPU. Link: https://lore.kernel.org/r/20250611224604.313496-37-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-23	KVM: x86: Track irq_bypass_vcpu in common x86 code	Sean Christopherson
	Track the vCPU that is being targeted for IRQ bypass, a.k.a. for a posted IRQ, in common x86 code. This will allow for additional consolidation of the SVM and VMX code. Tested-by: Sairaj Kodilkar <sarunkod@amd.com> Link: https://lore.kernel.org/r/20250611224604.313496-36-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-23	KVM: Fold kvm_arch_irqfd_route_changed() into kvm_arch_update_irqfd_routing()	Sean Christopherson
	Fold kvm_arch_irqfd_route_changed() into kvm_arch_update_irqfd_routing(). Calling arch code to know whether or not to call arch code is absurd. Reviewed-by: Oliver Upton <oliver.upton@linux.dev> Link: https://lore.kernel.org/r/20250611224604.313496-35-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-23	KVM: Don't WARN if updating IRQ bypass route fails	Sean Christopherson
	Don't bother WARNing if updating an IRTE route fails now that vendor code provides much more precise WARNs. The generic WARN doesn't provide enough information to actually debug the problem, and has obviously done nothing to surface the myriad bugs in KVM x86's implementation. Drop all of the associated return code plumbing that existed just so that common KVM could WARN. Link: https://lore.kernel.org/r/20250611224604.313496-34-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-23	iommu: KVM: Split "struct vcpu_data" into separate AMD vs. Intel structs	Sean Christopherson
	Split the vcpu_data structure that serves as a handoff from KVM to IOMMU drivers into vendor specific structures. Overloading a single structure makes the code hard to read and maintain, is very misleading as it suggests that mixing vendors is actually supported, and bastardizing Intel's posted interrupt descriptor address when AMD's IOMMU already has its own structure is quite unnecessary. Tested-by: Sairaj Kodilkar <sarunkod@amd.com> Link: https://lore.kernel.org/r/20250611224604.313496-33-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-23	KVM: SVM: Clean up return handling in avic_pi_update_irte()	Sean Christopherson
	Clean up the return paths for avic_pi_update_irte() now that the refactoring dust has settled. Opportunistically drop the pr_err() on IRTE update failures. Logging that a failure occurred without _any_ context is quite useless. Tested-by: Sairaj Kodilkar <sarunkod@amd.com> Link: https://lore.kernel.org/r/20250611224604.313496-32-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-23	KVM: x86: Move posted interrupt tracepoint to common code	Sean Christopherson
	Move the pi_irte_update tracepoint to common x86, and call it whenever the IRTE is modified. Tracing only the modifications that result in an IRQ being posted to a vCPU makes the tracepoint useless for debugging. Drop the vendor specific address; plumbing that into common code isn't worth the trouble, as the address is meaningless without a whole pile of other information that isn't provided in any tracepoint. Link: https://lore.kernel.org/r/20250611224604.313496-31-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-23	KVM: x86: Dedup AVIC vs. PI code for identifying target vCPU	Sean Christopherson
	Hoist the logic for identifying the target vCPU for a posted interrupt into common x86. The code is functionally identical between Intel and AMD. Tested-by: Sairaj Kodilkar <sarunkod@amd.com> Link: https://lore.kernel.org/r/20250611224604.313496-30-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-23	KVM: x86: Nullify irqfd->producer after updating IRTEs	Sean Christopherson
	Nullify irqfd->producer (when it's going away) _after_ updating IRTEs so that the producer can be queried during the update. Link: https://lore.kernel.org/r/20250611224604.313496-29-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-23	KVM: x86: Move IRQ routing/delivery APIs from x86.c => irq.c	Sean Christopherson
	Move a bunch of IRQ routing and delivery APIs from x86.c to irq.c. x86.c has grown quite fat, and irq.c is the perfect landing spot. Opportunistically rewrite kvm_arch_irq_bypass_del_producer()'s comment, as the existing comment has several typos and is rather confusing. Suggested-by: Paolo Bonzini <pbonzini@redhat.com> Link: https://lore.kernel.org/r/20250611224604.313496-28-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>