linux.git - Linus' kernel tree

Age	Commit message	Author
2025-06-23	KVM: SVM: Fold avic_set_pi_irte_mode() into its sole caller	Sean Christopherson
	Fold avic_set_pi_irte_mode() into avic_refresh_apicv_exec_ctrl() in anticipation of moving the __avic_vcpu_{load,put}() calls into the critical section, and because having a one-off helper with a name that's easily confused with avic_pi_update_irte() is unnecessary. No functional change intended. Tested-by: Sairaj Kodilkar <sarunkod@amd.com> Link: https://lore.kernel.org/r/20250611224604.313496-59-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-23	iommu/amd: WARN if KVM calls GA IRTE helpers without virtual APIC support	Sean Christopherson
	WARN if KVM attempts to update IRTE entries when virtual APIC isn't fully supported, as KVM should guard all such calls on IRQ posting being enabled. Link: https://lore.kernel.org/r/20250611224604.313496-58-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-23	KVM: SVM: Use vcpu_idx, not vcpu_id, for GA log tag/metadata	Sean Christopherson
	Use a vCPU's index, not its ID, for the GA log tag/metadata that's used to find and kick vCPUs when a device posted interrupt serves as a wake event. Lookups on a vCPU index are O(fast) (not sure what xa_load() actually provides), whereas a vCPU ID lookup is O(n) if a vCPU's ID doesn't match its index. Unlike the Physical APIC Table, which is accessed by hardware when virtualizing IPIs, hardware doesn't consume the GA tag, i.e. KVM _must_ use APIC IDs to fill the Physical APIC Table, but KVM has free rein over the format/meaning of the GA tag. Tested-by: Sairaj Kodilkar <sarunkod@amd.com> Link: https://lore.kernel.org/r/20250611224604.313496-57-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
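To illustrate the lookup-cost argument in the commit above, here is a minimal userspace sketch, not kernel code. The `struct vm`, `struct vcpu`, `MAX_VCPUS`, and both helper names are hypothetical stand-ins; the real KVM code stores vCPUs in an xarray rather than a plain array.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_VCPUS 8	/* hypothetical limit for this sketch */

struct vcpu {
	uint32_t id;	/* userspace-chosen ID, may be sparse */
	uint32_t idx;	/* dense index assigned at creation */
};

struct vm {
	struct vcpu *vcpus[MAX_VCPUS];	/* slot i holds the vCPU with idx == i */
	size_t nr_vcpus;
};

/* Index lookup: one bounds check and one array access. */
static struct vcpu *get_vcpu_by_idx(const struct vm *vm, uint32_t idx)
{
	assert(vm->nr_vcpus <= MAX_VCPUS);
	return idx < vm->nr_vcpus ? vm->vcpus[idx] : NULL;
}

/* ID lookup: scans every vCPU, i.e. O(n) when IDs do not match indices. */
static struct vcpu *get_vcpu_by_id(const struct vm *vm, uint32_t id)
{
	for (size_t i = 0; i < vm->nr_vcpus; i++) {
		if (vm->vcpus[i]->id == id)
			return vm->vcpus[i];
	}
	return NULL;
}
```

Because hardware never interprets the GA tag, storing the index in the tag lets the wake path use the cheap lookup regardless of how userspace numbered its vCPUs.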
2025-06-23	KVM: VMX: WARN if VT-d Posted IRQs aren't possible when starting IRQ bypass	Sean Christopherson
	WARN if KVM attempts to "start" IRQ bypass when VT-d Posted IRQs are disabled, to make it obvious that the logic is a sanity check, and so that a bug related to nr_possible_bypass_irqs is more likely to cause noisy failures, e.g. so that KVM doesn't silently fail to wake blocking vCPUs. Link: https://lore.kernel.org/r/20250611224604.313496-56-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-23	KVM: x86: Decouple device assignment from IRQ bypass	Sean Christopherson
	Use a dedicated counter to track the number of IRQs that can utilize IRQ bypass instead of piggybacking the assigned device count. As evidenced by commit 2edd9cb79fb3 ("kvm: detect assigned device via irqbypass manager"), it's possible for a device to be able to post IRQs to a vCPU without said device being assigned to a VM. Leave the calls to kvm_arch_{start,end}_assignment() alone for the moment to avoid regressing the MMIO stale data mitigation. KVM is abusing the assigned device count when applying mmio_stale_data_clear, and it's not at all clear if vDPA devices rely on this behavior. This will hopefully be cleaned up in the future, as the number of assigned devices is a terrible heuristic for detecting if a VM has access to host MMIO. Link: https://lore.kernel.org/r/20250611224604.313496-55-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-23	KVM: SVM: WARN if ir_list is non-empty at vCPU free	Sean Christopherson
	Now that AVIC IRTE tracking is in a mostly sane state, WARN if a vCPU is freed with ir_list entries, i.e. if KVM leaves a dangling IRTE. Initialize the per-vCPU interrupt remapping list and its lock even if AVIC is disabled so that the WARN doesn't hit false positives (and so that KVM doesn't need to call into AVIC code for a simple sanity check). Link: https://lore.kernel.org/r/20250611224604.313496-54-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-23	KVM: x86: WARN if IRQ bypass routing is updated without in-kernel local APIC	Sean Christopherson
	Yell if kvm_pi_update_irte() is reached without an in-kernel local APIC, as kvm_arch_irqfd_allowed() should prevent attaching an irqfd and thus any and all postable IRQs to an APIC-less VM. Link: https://lore.kernel.org/r/20250611224604.313496-53-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-23	KVM: x86: WARN if IRQ bypass isn't supported in kvm_pi_update_irte()	Sean Christopherson
	WARN if kvm_pi_update_irte() is reached without IRQ bypass support, as the code is only reachable if the VM already has an IRQ bypass producer (see kvm_irq_routing_update()), or from kvm_arch_irq_bypass_{add,del}_producer(), which, stating the obvious, are called if and only if KVM enables its IRQ bypass hooks. Cc: David Matlack <dmatlack@google.com> Link: https://lore.kernel.org/r/20250611224604.313496-52-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-23	KVM: x86: Drop superfluous "has assigned device" check in kvm_pi_update_irte()	Sean Christopherson
	Don't bother checking if the VM has an assigned device when updating IRTE entries. kvm_arch_irq_bypass_add_producer() explicitly increments the assigned device count, kvm_arch_irq_bypass_del_producer() explicitly decrements the count before invoking kvm_pi_update_irte(), and kvm_irq_routing_update() only updates IRTE entries if there's an active IRQ bypass producer. Link: https://lore.kernel.org/r/20250611224604.313496-51-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-23	KVM: SVM: WARN if updating IRTE GA fields in IOMMU fails	Sean Christopherson
	WARN if updating GA information for an IRTE entry fails as modifying an IRTE should only fail if KVM is buggy, e.g. has stale metadata, and because returning an error that is always ignored is pointless. Link: https://lore.kernel.org/r/20250611224604.313496-50-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-23	KVM: SVM: Process all IRTEs on affinity change even if one update fails	Sean Christopherson
	When updating IRTE GA fields, keep processing all other IRTEs if an update fails, as not updating later entries risks making a bad situation worse. Link: https://lore.kernel.org/r/20250611224604.313496-49-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
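The "keep going on failure" pattern described in the commit above can be sketched in plain C as follows. This is only an illustration under assumed names: `struct irte_meta`, `irte_update()`, and `update_all_irtes()` are hypothetical and do not correspond to the kernel's actual data structures or helpers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-IRTE bookkeeping for this sketch. */
struct irte_meta {
	bool stale;	/* simulates metadata that makes an update fail */
	int cpu;	/* last CPU successfully written, -1 if none */
};

/* Returns 0 on success, -1 if the entry cannot be updated. */
static int irte_update(struct irte_meta *e, int cpu)
{
	if (e->stale)
		return -1;
	e->cpu = cpu;
	return 0;
}

/*
 * Update every entry even if some fail, and report how many failed.
 * Bailing on the first error would leave all later entries pointing
 * at the old CPU, which is worse than one bad entry.
 */
static size_t update_all_irtes(struct irte_meta *entries, size_t n, int cpu)
{
	size_t failures = 0;

	assert(entries || n == 0);
	for (size_t i = 0; i < n; i++) {
		if (irte_update(&entries[i], cpu))
			failures++;
	}
	return failures;
}
```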
2025-06-23	KVM: SVM: WARN if (de)activating guest mode in IOMMU fails	Sean Christopherson
	WARN if (de)activating "guest mode" for an IRTE entry fails as modifying an IRTE should only fail if KVM is buggy, e.g. has stale metadata. Link: https://lore.kernel.org/r/20250611224604.313496-48-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-23	KVM: SVM: Don't check for assigned device(s) when activating AVIC	Sean Christopherson
	Don't short-circuit IRTE updating when (de)activating AVIC based on the VM having assigned devices, as nothing prevents AVIC (de)activation from racing with device (de)assignment. And from a performance perspective, bailing early when there is no assigned device doesn't add much, as ir_list_lock will never be contended if there's no assigned device. Link: https://lore.kernel.org/r/20250611224604.313496-47-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-23	KVM: SVM: Don't check for assigned device(s) when updating affinity	Sean Christopherson
	Don't bother checking if a VM has an assigned device when updating AVIC vCPU affinity, querying ir_list is just as cheap and nothing prevents racing with changes in device assignment. Link: https://lore.kernel.org/r/20250611224604.313496-46-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-23	iommu/amd: KVM: SVM: Add IRTE metadata to affined vCPU's list if AVIC is inhibited	Sean Christopherson
	If an IRQ can be posted to a vCPU, but AVIC is currently inhibited on the vCPU, go through the dance of "affining" the IRTE to the vCPU, but leave the actual IRTE in remapped mode. KVM already handles the case where AVIC is inhibited => uninhibited with posted IRQs (see avic_set_pi_irte_mode()), but doesn't handle the scenario where a postable IRQ comes along while AVIC is inhibited. Link: https://lore.kernel.org/r/20250611224604.313496-45-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-23	iommu/amd: KVM: SVM: Set pCPU info in IRTE when setting vCPU affinity	Sean Christopherson
	Now that setting vCPU affinity is guarded with ir_list_lock, i.e. now that avic_physical_id_entry can be safely accessed, set the pCPU info straight-away when setting vCPU affinity. Putting the IRTE into posted mode, and then immediately updating the IRTE a second time if the target vCPU is running is wasteful and confusing. This also fixes a flaw where a posted IRQ that arrives between putting the IRTE into guest_mode and setting the correct destination could cause the IOMMU to ring the doorbell on the wrong pCPU. Link: https://lore.kernel.org/r/20250611224604.313496-44-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-23	iommu/amd: Factor out helper for manipulating IRTE GA/CPU info	Sean Christopherson
	Split the guts of amd_iommu_update_ga() to a dedicated helper so that the logic can be shared with flows that put the IRTE into posted mode. Opportunistically move amd_iommu_update_ga() and its new helper above amd_iommu_activate_guest_mode() so that it's all co-located. Link: https://lore.kernel.org/r/20250611224604.313496-43-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-23	iommu/amd: KVM: SVM: Infer IsRun from validity of pCPU destination	Sean Christopherson
	Infer whether or not a vCPU should be marked running from the validity of the pCPU on which it is running. amd_iommu_update_ga() already skips the IRTE update if the pCPU is invalid, i.e. passing %true for is_run with an invalid pCPU would be a blatant and egregious KVM bug. Tested-by: Sairaj Kodilkar <sarunkod@amd.com> Link: https://lore.kernel.org/r/20250611224604.313496-42-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
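A rough sketch of the idea in the commit above: rather than taking a separate is_run flag, derive it from whether the caller supplied a valid CPU. The `struct ga_fields` type and `set_ga_cpu()` helper are hypothetical, and treating a negative CPU number as "invalid" is an assumption of this sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical subset of the guest-mode IRTE fields. */
struct ga_fields {
	bool is_run;
	uint32_t destination;
};

/*
 * A valid (non-negative) CPU means the vCPU is running there; an
 * invalid CPU means it is not running, so is_run is cleared and the
 * old destination is left alone.
 */
static void set_ga_cpu(struct ga_fields *ga, int cpu)
{
	assert(ga);
	if (cpu >= 0) {
		ga->destination = (uint32_t)cpu;
		ga->is_run = true;
	} else {
		ga->is_run = false;
	}
}
```

With a single input there is no way to express the contradictory combination "running, but on no CPU".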
2025-06-23	iommu/amd: Document which IRTE fields amd_iommu_update_ga() can modify	Sean Christopherson
	Add a comment to amd_iommu_update_ga() to document what fields it can safely modify without issuing an invalidation of the IRTE, and to explain its role in keeping GA IRTEs up-to-date. Per page 93 of the IOMMU spec dated Feb 2025: When virtual interrupts are enabled by setting MMIO Offset 0018h[GAEn] and IRTE[GuestMode=1], IRTE[IsRun], IRTE[Destination], and if present IRTE[GATag], are not cached by the IOMMU. Modifications to these fields do not require an invalidation of the Interrupt Remapping Table. Link: https://lore.kernel.org/all/9b7ceea3-8c47-4383-ad9c-1a9bbdc9044a@oracle.com Cc: Joao Martins <joao.m.martins@oracle.com> Link: https://lore.kernel.org/r/20250611224604.313496-41-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-23	KVM: SVM: Take and hold ir_list_lock across IRTE updates in IOMMU	Sean Christopherson
	Now that svm_ir_list_add() isn't overloaded with all manner of weird things, fold it into avic_pi_update_irte(), and more importantly take ir_list_lock across the irq_set_vcpu_affinity() calls to ensure the info that's shoved into the IRTE is fresh. While preemption (and IRQs) is disabled on the task performing the IRTE update, thanks to irqfds.lock, that task doesn't hold the vCPU's mutex, i.e. preemption being disabled is irrelevant. Link: https://lore.kernel.org/r/20250611224604.313496-40-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-23	KVM: SVM: Revert IRTE to legacy mode if IOMMU doesn't provide IR metadata	Sean Christopherson
	Revert the IRTE back to remapping mode if the AMD IOMMU driver mucks up and doesn't provide the necessary metadata. Returning an error up the stack without actually handling the error is useless and confusing. Link: https://lore.kernel.org/r/20250611224604.313496-39-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-23	KVM: x86: Don't update IRTE entries when old and new routes were !MSI	Sean Christopherson
	Skip the entirety of IRTE updates on a GSI routing change if neither the old nor the new routing is for an MSI, i.e. if neither routing setup allows for posting to a vCPU. If the IRTE isn't already host controlled, KVM has bigger problems. Link: https://lore.kernel.org/r/20250611224604.313496-38-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
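The early-out described above amounts to a simple predicate on the old and new routes. The following is a minimal sketch with hypothetical types (`enum route_type`, `struct route`) and a hypothetical helper name; it models only the MSI check from this commit, not the rest of KVM's routing-update logic.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical route kinds; only MSI routes can be posted to a vCPU. */
enum route_type { ROUTE_IRQCHIP, ROUTE_MSI };

struct route {
	enum route_type type;
};

/* Returns true if the routing change could affect a posted IRQ. */
static bool route_change_needs_irte_update(const struct route *old_rt,
					   const struct route *new_rt)
{
	assert(old_rt && new_rt);
	return old_rt->type == ROUTE_MSI || new_rt->type == ROUTE_MSI;
}
```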
2025-06-23	KVM: x86: Skip IOMMU IRTE updates if there's no old or new vCPU being targeted	Sean Christopherson
	Don't "reconfigure" an IRTE into host controlled mode when it's already in that state, i.e. if KVM's GSI routing changes but the IRQ wasn't and still isn't being posted to a vCPU. Link: https://lore.kernel.org/r/20250611224604.313496-37-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-23	KVM: x86: Track irq_bypass_vcpu in common x86 code	Sean Christopherson
	Track the vCPU that is being targeted for IRQ bypass, a.k.a. for a posted IRQ, in common x86 code. This will allow for additional consolidation of the SVM and VMX code. Tested-by: Sairaj Kodilkar <sarunkod@amd.com> Link: https://lore.kernel.org/r/20250611224604.313496-36-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-23	KVM: Fold kvm_arch_irqfd_route_changed() into kvm_arch_update_irqfd_routing()	Sean Christopherson
	Fold kvm_arch_irqfd_route_changed() into kvm_arch_update_irqfd_routing(). Calling arch code to know whether or not to call arch code is absurd. Reviewed-by: Oliver Upton <oliver.upton@linux.dev> Link: https://lore.kernel.org/r/20250611224604.313496-35-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-23	KVM: Don't WARN if updating IRQ bypass route fails	Sean Christopherson
	Don't bother WARNing if updating an IRTE route fails now that vendor code provides much more precise WARNs. The generic WARN doesn't provide enough information to actually debug the problem, and has obviously done nothing to surface the myriad bugs in KVM x86's implementation. Drop all of the associated return code plumbing that existed just so that common KVM could WARN. Link: https://lore.kernel.org/r/20250611224604.313496-34-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-23	iommu: KVM: Split "struct vcpu_data" into separate AMD vs. Intel structs	Sean Christopherson
	Split the vcpu_data structure that serves as a handoff from KVM to IOMMU drivers into vendor specific structures. Overloading a single structure makes the code hard to read and maintain, is very misleading as it suggests that mixing vendors is actually supported, and bastardizing Intel's posted interrupt descriptor address when AMD's IOMMU already has its own structure is quite unnecessary. Tested-by: Sairaj Kodilkar <sarunkod@amd.com> Link: https://lore.kernel.org/r/20250611224604.313496-33-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-23	KVM: SVM: Clean up return handling in avic_pi_update_irte()	Sean Christopherson
	Clean up the return paths for avic_pi_update_irte() now that the refactoring dust has settled. Opportunistically drop the pr_err() on IRTE update failures. Logging that a failure occurred without _any_ context is quite useless. Tested-by: Sairaj Kodilkar <sarunkod@amd.com> Link: https://lore.kernel.org/r/20250611224604.313496-32-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-23	KVM: x86: Move posted interrupt tracepoint to common code	Sean Christopherson
	Move the pi_irte_update tracepoint to common x86, and call it whenever the IRTE is modified. Tracing only the modifications that result in an IRQ being posted to a vCPU makes the tracepoint useless for debugging. Drop the vendor specific address; plumbing that into common code isn't worth the trouble, as the address is meaningless without a whole pile of other information that isn't provided in any tracepoint. Link: https://lore.kernel.org/r/20250611224604.313496-31-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-23	KVM: x86: Dedup AVIC vs. PI code for identifying target vCPU	Sean Christopherson
	Hoist the logic for identifying the target vCPU for a posted interrupt into common x86. The code is functionally identical between Intel and AMD. Tested-by: Sairaj Kodilkar <sarunkod@amd.com> Link: https://lore.kernel.org/r/20250611224604.313496-30-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-23	KVM: x86: Nullify irqfd->producer after updating IRTEs	Sean Christopherson
	Nullify irqfd->producer (when it's going away) _after_ updating IRTEs so that the producer can be queried during the update. Link: https://lore.kernel.org/r/20250611224604.313496-29-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-23	KVM: x86: Move IRQ routing/delivery APIs from x86.c => irq.c	Sean Christopherson
	Move a bunch of IRQ routing and delivery APIs from x86.c to irq.c. x86.c has grown quite fat, and irq.c is the perfect landing spot. Opportunistically rewrite kvm_arch_irq_bypass_del_producer()'s comment, as the existing comment has several typos and is rather confusing. Suggested-by: Paolo Bonzini <pbonzini@redhat.com> Link: https://lore.kernel.org/r/20250611224604.313496-28-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-23	KVM: SVM: Extract SVM specific code out of get_pi_vcpu_info()	Sean Christopherson
	Genericize SVM's get_pi_vcpu_info() so that it can be shared with VMX. The only SVM specific information it provides is the AVIC back page, and that can be trivially retrieved by its sole caller. No functional change intended. Cc: Francesco Lavra <francescolavra.fl@gmail.com> Tested-by: Sairaj Kodilkar <sarunkod@amd.com> Link: https://lore.kernel.org/r/20250611224604.313496-27-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-23	KVM: VMX: Stop walking list of routing table entries when updating IRTE	Sean Christopherson
	Now that KVM provides the to-be-updated routing entry, stop walking the routing table to find that entry. KVM, via setup_routing_entry() and sanity checked by kvm_get_msi_route(), disallows having a GSI configured to trigger multiple MSIs, i.e. the for-loop can only process one entry. Link: https://lore.kernel.org/r/20250611224604.313496-26-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-23	KVM: SVM: Stop walking list of routing table entries when updating IRTE	Sean Christopherson
	Now that KVM explicitly passes the new/current GSI routing to pi_update_irte(), simply use the provided routing entry and stop walking the routing table to find that entry. KVM, via setup_routing_entry() and sanity checked by kvm_get_msi_route(), disallows having a GSI configured to trigger multiple MSIs. I.e. this is subtly a glorified nop, as KVM allows at most one MSI per GSI, the for-loop can only ever process one entry, and that entry is the new/current entry (see the WARN_ON_ONCE() added by "KVM: x86: Pass new routing entries and irqfd when updating IRTEs" to ensure @new matches the entry found in the routing table). Tested-by: Sairaj Kodilkar <sarunkod@amd.com> Link: https://lore.kernel.org/r/20250611224604.313496-25-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-23	iommu/amd: KVM: SVM: Pass NULL @vcpu_info to indicate "not guest mode"	Sean Christopherson
	Pass NULL to amd_ir_set_vcpu_affinity() to communicate "don't post to a vCPU" now that there's no need to communicate information back to KVM about the previous vCPU (KVM does its own tracking). Tested-by: Sairaj Kodilkar <sarunkod@amd.com> Link: https://lore.kernel.org/r/20250611224604.313496-24-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-23	iommu/amd: KVM: SVM: Use pi_desc_addr to derive ga_root_ptr	Sean Christopherson
	Use vcpu_data.pi_desc_addr instead of amd_iommu_pi_data.base to get the GA root pointer. KVM is the only source of amd_iommu_pi_data.base, and KVM's one and only path for writing amd_iommu_pi_data.base computes the exact same value for vcpu_data.pi_desc_addr and amd_iommu_pi_data.base, and fills amd_iommu_pi_data.base if and only if vcpu_data.pi_desc_addr is valid, i.e. amd_iommu_pi_data.base is fully redundant. Cc: Maxim Levitsky <mlevitsk@redhat.com> Reviewed-by: Joao Martins <joao.m.martins@oracle.com> Reviewed-by: Vasant Hegde <vasant.hegde@amd.com> Tested-by: Sairaj Kodilkar <sarunkod@amd.com> Link: https://lore.kernel.org/r/20250611224604.313496-23-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-23	KVM: SVM: Add a comment to explain why avic_vcpu_blocking() ignores IRQ blocking	Sean Christopherson
	Add a comment to explain why KVM clears IsRunning when putting a vCPU, even though leaving IsRunning=1 would be ok from a functional perspective. Per Maxim's experiments, a misbehaving VM could spam the AVIC doorbell so fast as to induce a 50%+ loss in performance. Link: https://lore.kernel.org/all/8d7e0d0391df4efc7cb28557297eb2ec9904f1e5.camel@redhat.com Cc: Maxim Levitsky <mlevitsk@redhat.com> Acked-by: Naveen N Rao (AMD) <naveen@kernel.org> Link: https://lore.kernel.org/r/20250611224604.313496-22-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-23	KVM: VMX: Suppress PI notifications whenever the vCPU is put	Sean Christopherson
	Suppress posted interrupt notifications (set PID.SN=1) whenever the vCPU is put, i.e. unloaded, not just when the vCPU is preempted, as KVM doesn't do anything in response to a notification IRQ that arrives in the host, nor does KVM rely on the Outstanding Notification (PID.ON) flag when the vCPU is unloaded. And, the cost of scanning the PIR to manually set PID.ON when loading the vCPU is quite small, especially relative to the cost of loading (and unloading) a vCPU. On the flip side, leaving SN clear means a notification for the vCPU will result in a spurious IRQ for the pCPU, even if the vCPU task is scheduled out, running in userspace, etc. Even worse, if the pCPU is running a different vCPU, the spurious IRQ could trigger posted interrupt processing for the wrong vCPU, which is technically a violation of the architecture, as bits set in PIR aren't supposed to be propagated to the vIRR until a notification IRQ is received. The saving grace of the current behavior is that hardware sends notification interrupts if and only if PID.ON=0, i.e. only the first posted interrupt for a vCPU will trigger a spurious IRQ (for each window where the vCPU is unloaded). Ideally, KVM would suppress notifications before enabling IRQs in the VM-Exit, but KVM relies on PID.ON as an indicator that there is a posted interrupt pending in PIR, e.g. in vmx_sync_pir_to_irr(), and sadly there is no way to ask hardware to set PID.ON, but not generate an interrupt. That could be solved by using pi_has_pending_interrupt() instead of checking only PID.ON, but it's not at all clear that would be a performance win, as KVM would end up scanning the entire PIR whenever an interrupt isn't pending. And long term, the spurious IRQ window, i.e. where a vCPU is loaded with IRQs enabled, can effectively be made smaller for hot paths by moving performance critical VM-Exit handlers into the fastpath, i.e. by never enabling IRQs for hot path VM-Exits. Link: https://lore.kernel.org/r/20250611224604.313496-21-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
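The interplay of PIR, PID.ON, and PID.SN described in the commit above can be modeled with a small single-threaded sketch. Everything here is a simplification: the `struct pi_desc` layout and the three helper names are hypothetical, real descriptors are updated atomically by hardware and software, and the model assumes that a post with SN=1 records the vector in PIR without setting ON (which is why the load path scans PIR).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Simplified posted interrupt descriptor: 256 request bits plus ON/SN. */
struct pi_desc {
	uint64_t pir[4];
	bool on;	/* Outstanding Notification */
	bool sn;	/* Suppress Notification */
};

/* Models a device posting @vector; returns true if a notification IRQ fires. */
static bool pi_post(struct pi_desc *pd, uint8_t vector)
{
	assert(pd);
	pd->pir[vector / 64] |= 1ULL << (vector % 64);
	if (pd->sn || pd->on)
		return false;	/* suppressed, or one is already outstanding */
	pd->on = true;
	return true;
}

/* vCPU is unloaded: suppress notifications for the whole window. */
static void pi_vcpu_put(struct pi_desc *pd)
{
	pd->sn = true;
}

/* vCPU is loaded: re-enable notifications and set ON if anything is pending. */
static void pi_vcpu_load(struct pi_desc *pd)
{
	pd->sn = false;
	if (pd->pir[0] | pd->pir[1] | pd->pir[2] | pd->pir[3])
		pd->on = true;
}
```

In this model, interrupts posted while the vCPU is put never raise a host IRQ, yet they are not lost, because the load path recomputes ON from PIR.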
2025-06-23	KVM: SVM: Disable (x2)AVIC IPI virtualization if CPU has erratum #1235	Maxim Levitsky
	Disable IPI virtualization on AMD Family 17h CPUs (Zen2 and Zen1), as hardware doesn't reliably detect changes to the 'IsRunning' bit during ICR write emulation, and might fail to VM-Exit on the sending vCPU, if IsRunning was recently cleared. The absence of the VM-Exit leads to KVM not waking (or triggering nested VM-Exit of) the target vCPU(s) of the IPI, which can lead to hung vCPUs, unbounded delays in L2 execution, etc. To work around the erratum, simply disable IPI virtualization, which prevents KVM from setting IsRunning and thus eliminates the race where hardware sees a stale IsRunning=1. As a result, all ICR writes (except when "Self" shorthand is used) will VM-Exit and therefore be correctly emulated by KVM. Disabling IPI virtualization does carry a performance penalty, but benchmarking shows that enabling AVIC without IPI virtualization is still much better than not using AVIC at all, because AVIC still accelerates posted interrupts and the receiving end of the IPIs. Note, when virtualizing Self-IPIs, the CPU skips reading the physical ID table and updates the vIRR directly (because the vCPU is by definition actively running), i.e. Self-IPI isn't susceptible to the erratum and is still accelerated by hardware. Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com> [sean: rebase, massage changelog, disallow user override] Acked-by: Naveen N Rao (AMD) <naveen@kernel.org> Link: https://lore.kernel.org/r/20250611224604.313496-20-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-20	KVM: SVM: Add enable_ipiv param, never set IsRunning if disabled	Maxim Levitsky
	Let userspace "disable" IPI virtualization for AVIC via the enable_ipiv module param, by never setting IsRunning. SVM doesn't provide a way to disable IPI virtualization in hardware, but by ensuring CPUs never see IsRunning=1, every IPI in the guest (except for self-IPIs) will generate a VM-Exit. To avoid setting the real IsRunning bit, while still allowing KVM to use each vCPU's entry to update GA log entries, simply maintain a shadow of the entry, without propagating IsRunning updates to the real table when IPI virtualization is disabled. Providing a way to effectively disable IPI virtualization will allow KVM to safely enable AVIC on hardware that is susceptible to erratum #1235, which causes hardware to sometimes fail to detect that the IsRunning bit has been cleared by software. Note, the table _must_ be fully populated, as broadcast IPIs skip invalid entries, i.e. won't generate VM-Exit if every entry is invalid, and so simply pointing the VMCB at a common dummy table won't work. Alternatively, KVM could allocate a shadow of the entire table, but that'd be a waste of 4KiB since the per-vCPU entry doesn't actually consume an additional 8 bytes of memory (vCPU structures are large enough that they are backed by order-N pages). Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com> [sean: keep "entry" variables, reuse enable_ipiv, split from erratum] Link: https://lore.kernel.org/r/20250611224604.313496-19-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
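The shadow-entry approach described above might look roughly like the following userspace sketch. The bit positions, the `struct vcpu_avic` type, and the `avic_mark_running()` helper are assumptions made for illustration; only the idea (always update the shadow, propagate to the hardware-visible table only when IPI virtualization is enabled) comes from the commit message.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Illustrative entry layout; treat the exact bit positions as assumptions. */
#define ENTRY_HOST_ID_MASK	0xFFFULL
#define ENTRY_IS_RUNNING	(1ULL << 62)
#define ENTRY_VALID		(1ULL << 63)

struct vcpu_avic {
	uint64_t shadow_entry;	/* always current; usable for GA log updates */
	uint64_t *table_entry;	/* the entry that hardware actually reads */
};

/* Record that the vCPU is now running on @cpu. */
static void avic_mark_running(struct vcpu_avic *v, unsigned int cpu,
			      bool enable_ipiv)
{
	uint64_t entry = v->shadow_entry;

	assert(cpu <= ENTRY_HOST_ID_MASK);
	entry &= ~ENTRY_HOST_ID_MASK;
	entry |= cpu;
	entry |= ENTRY_IS_RUNNING;
	v->shadow_entry = entry;

	/* With IPI virtualization "disabled", hardware never sees IsRunning=1. */
	if (enable_ipiv)
		*v->table_entry = entry;
}
```

The hardware-visible entry stays valid in both modes, which matters because, per the commit message, broadcast IPIs skip invalid entries.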
2025-06-20	KVM: VMX: Move enable_ipiv knob to common x86	Sean Christopherson
	Move enable_ipiv to common x86 so that it can be reused by SVM to control IPI virtualization when AVIC is enabled. SVM doesn't actually provide a way to truly disable IPI virtualization, but KVM can get close enough by skipping the necessary table programming. Link: https://lore.kernel.org/r/20250611224604.313496-18-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-20	KVM: SVM: Drop superfluous "cache" of AVIC Physical ID entry pointer	Sean Christopherson
	Drop the vCPU's pointer to its AVIC Physical ID entry, and simply index the table directly. Caching a pointer address is completely unnecessary for performance, and while the field technically caches the result of the pointer calculation, it's all too easy to misinterpret the name and think that the field somehow caches the _data_ in the table. No functional change intended. Suggested-by: Maxim Levitsky <mlevitsk@redhat.com> Tested-by: Sairaj Kodilkar <sarunkod@amd.com> Reviewed-by: Naveen N Rao (AMD) <naveen@kernel.org> Link: https://lore.kernel.org/r/20250611224604.313496-17-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-20	KVM: SVM: Track AVIC tables as natively sized pointers, not "struct pages"	Sean Christopherson
	Allocate and track AVIC's logical and physical tables as u32 and u64 pointers respectively, as managing the pages as "struct page" pointers adds an almost absurd amount of boilerplate and complexity. E.g. with page_address() out of the way, svm->avic_physical_id_cache becomes completely superfluous, and will be removed in a future cleanup. No functional change intended. Tested-by: Sairaj Kodilkar <sarunkod@amd.com> Acked-by: Naveen N Rao (AMD) <naveen@kernel.org> Link: https://lore.kernel.org/r/20250611224604.313496-16-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-20	KVM: SVM: Drop redundant check in AVIC code on ID during vCPU creation	Sean Christopherson
	Drop avic_get_physical_id_entry()'s compatibility check on the incoming ID, as its sole caller, avic_init_backing_page(), performs the exact same check. Drop avic_get_physical_id_entry() entirely as the only remaining functionality is getting the address of the Physical ID table, and accessing the array without an immediate bounds check is kludgy. Opportunistically add a compile-time assertion to ensure the vcpu_id can't result in a bounds overflow, e.g. if KVM (really) messed up a maximum physical ID #define, as well as run-time assertions so that a NULL pointer dereference is morphed into a safer WARN(). No functional change intended. Tested-by: Sairaj Kodilkar <sarunkod@amd.com> Reviewed-by: Naveen N Rao (AMD) <naveen@kernel.org> Link: https://lore.kernel.org/r/20250611224604.313496-15-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-20	KVM: SVM: Inhibit AVIC if ID is too big instead of rejecting vCPU creation	Sean Christopherson
	Inhibit AVIC with a new "ID too big" flag if userspace creates a vCPU with an ID that is too big, but otherwise allow vCPU creation to succeed. Rejecting KVM_CREATE_VCPU with EINVAL violates KVM's ABI as KVM advertises that the max vCPU ID is 4095, but disallows creating vCPUs with IDs bigger than 254 (AVIC) or 511 (x2AVIC). Alternatively, KVM could advertise an accurate value depending on which AVIC mode is in use, but that wouldn't really solve the underlying problem, e.g. would be a breaking change if KVM were to ever try and enable AVIC or x2AVIC by default. Cc: Maxim Levitsky <mlevitsk@redhat.com> Tested-by: Sairaj Kodilkar <sarunkod@amd.com> Link: https://lore.kernel.org/r/20250611224604.313496-14-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
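The behavioral change above (inhibit instead of reject) can be sketched as follows. The limits 254, 511, and 4095 come from the commit message; the `struct vm_sketch` type, the `INHIBIT_ID_TOO_BIG` bit value, and `create_vcpu()` are hypothetical names for this illustration only.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

#define MAX_VCPU_ID		4095u	/* what KVM advertises, per the commit */
#define AVIC_MAX_ID		254u
#define X2AVIC_MAX_ID		511u
#define INHIBIT_ID_TOO_BIG	(1UL << 0)	/* hypothetical inhibit bit */

struct vm_sketch {
	bool x2avic;
	unsigned long avic_inhibits;	/* non-zero means AVIC is inhibited */
};

/* Returns 0 on success, -EINVAL only if the ID exceeds the advertised max. */
static int create_vcpu(struct vm_sketch *vm, uint32_t id)
{
	uint32_t avic_limit = vm->x2avic ? X2AVIC_MAX_ID : AVIC_MAX_ID;

	assert(vm);
	if (id > MAX_VCPU_ID)
		return -EINVAL;

	/* Too big for AVIC: fall back to non-AVIC operation, do not fail. */
	if (id > avic_limit)
		vm->avic_inhibits |= INHIBIT_ID_TOO_BIG;
	return 0;
}
```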
2025-06-20	KVM: SVM: Drop vcpu_svm's pointless avic_backing_page field	Sean Christopherson
	Drop vcpu_svm's avic_backing_page pointer and instead grab the physical address of KVM's vAPIC page directly from the source. Getting a physical address from a kernel virtual address is not an expensive operation, and getting the physical address from a struct page is more expensive for CONFIG_SPARSEMEM=y kernels. Regardless, none of the paths that consume the address are hot paths, i.e. shaving cycles is not a priority. Eliminating the "cache" means KVM doesn't have to worry about the cache being invalid, which will simplify a future fix when dealing with vCPU IDs that are too big. WARN if KVM attempts to allocate a vCPU's AVIC backing page without an in-kernel local APIC. avic_init_vcpu() bails early if the APIC is not in-kernel, and KVM disallows enabling an in-kernel APIC after vCPUs have been created, i.e. it should be impossible to reach avic_init_backing_page() without the vAPIC being allocated. Tested-by: Sairaj Kodilkar <sarunkod@amd.com> Reviewed-by: Naveen N Rao (AMD) <naveen@kernel.org> Link: https://lore.kernel.org/r/20250611224604.313496-13-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-20	KVM: SVM: Add helper to deduplicate code for getting AVIC backing page	Sean Christopherson
	Add a helper to get the physical address of the AVIC backing page, both to deduplicate code and to prepare for getting the address directly from apic->regs, at which point it won't be all that obvious that the address in question is what SVM calls the AVIC backing page. No functional change intended. Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com> Tested-by: Sairaj Kodilkar <sarunkod@amd.com> Reviewed-by: Naveen N Rao (AMD) <naveen@kernel.org> Link: https://lore.kernel.org/r/20250611224604.313496-12-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2025-06-20	KVM: SVM: Drop pointless masking of kernel page pa's with AVIC HPA masks	Sean Christopherson
	Drop AVIC_HPA_MASK and all its users, the mask is just the 4KiB-aligned maximum theoretical physical address for x86-64 CPUs, as x86-64 is currently defined (going beyond PA52 would require an entirely new paging mode, which would arguably create a new, different architecture). All usage in KVM masks the result of page_to_phys(), which on x86-64 is guaranteed to be 4KiB aligned and a legal physical address; if either of those requirements doesn't hold true, KVM has far bigger problems. Drop masking the avic_backing_page with AVIC_PHYSICAL_ID_ENTRY_BACKING_PAGE_MASK for all the same reasons, but keep the macro even though it's unused in functional code. It's a distinct architectural define, and having the definition in software helps visualize the layout of an entry. And to be hyper-paranoid about MAXPA going beyond 52, add a compile-time assert to ensure the kernel's maximum supported physical address stays in bounds. The unnecessary masking in avic_init_vmcb() also incorrectly assumes that SME's C-bit resides between bits 51:11; that holds true for current CPUs, but isn't required by AMD's architecture: "In some implementations, the bit used may be a physical address bit". Key word being "may". Opportunistically use the GENMASK_ULL() version for AVIC_PHYSICAL_ID_ENTRY_BACKING_PAGE_MASK, which is far more readable than a set of repeating Fs. Tested-by: Sairaj Kodilkar <sarunkod@amd.com> Reviewed-by: Naveen N Rao (AMD) <naveen@kernel.org> Link: https://lore.kernel.org/r/20250611224604.313496-11-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
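To show why a GENMASK_ULL()-style definition reads better than "a set of repeating Fs", here is a userspace sketch. The `GENMASK_ULL_SKETCH` macro is a local re-creation for illustration, not the kernel header; `BACKING_PAGE_MASK`, `MAX_PHYS_BITS`, and the assumption that the mask covers bits 51:12 are likewise this sketch's own.

```c
#include <assert.h>
#include <stdint.h>

/* Local stand-in for the kernel's GENMASK_ULL(h, l): bits h..l inclusive. */
#define GENMASK_ULL_SKETCH(h, l) \
	(((~0ULL) << (l)) & (~0ULL >> (63 - (h))))

/* Assumed layout: 4KiB-aligned physical address in bits 51:12. */
#define BACKING_PAGE_MASK	GENMASK_ULL_SKETCH(51, 12)

/* Hypothetical stand-in for the kernel's maximum physical address width. */
#define MAX_PHYS_BITS		52

/* Both spellings describe the same 40-bit field. */
static_assert(BACKING_PAGE_MASK == (0xFFFFFFFFFFULL << 12),
	      "mask must cover bits 51:12");

/* Compile-time guard in the spirit of the commit's MAXPA assertion. */
static_assert(MAX_PHYS_BITS <= 52, "physical address must fit in the field");

/* Masking a page-aligned, legal physical address is a no-op. */
static uint64_t mask_backing_page(uint64_t pa)
{
	return pa & BACKING_PAGE_MASK;
}
```

The last helper makes the commit's point concrete: for any address that is already 4KiB aligned and below 2^52, the mask changes nothing.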
2025-06-20	KVM: SVM: Drop pointless masking of default APIC base when setting V_APIC_BAR	Sean Christopherson
	Drop VMCB_AVIC_APIC_BAR_MASK, it's just a regurgitation of the maximum theoretical 4KiB-aligned physical address, i.e. is not novel in any way, and its only usage is to mask the default APIC base, which is 4KiB aligned and (obviously) a legal physical address. No functional change intended. Tested-by: Sairaj Kodilkar <sarunkod@amd.com> Reviewed-by: Naveen N Rao (AMD) <naveen@kernel.org> Link: https://lore.kernel.org/r/20250611224604.313496-10-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>