Age | Commit message (Collapse) | Author |
|
Use vcpu_data.pi_desc_addr instead of amd_iommu_pi_data.base to get the
GA root pointer. KVM is the only source of amd_iommu_pi_data.base, and
KVM's one and only path for writing amd_iommu_pi_data.base computes the
exact same value for vcpu_data.pi_desc_addr and amd_iommu_pi_data.base,
and fills amd_iommu_pi_data.base if and only if vcpu_data.pi_desc_addr is
valid, i.e. amd_iommu_pi_data.base is fully redundant.
Cc: Maxim Levitsky <mlevitsk@redhat.com>
Reviewed-by: Joao Martins <joao.m.martins@oracle.com>
Reviewed-by: Vasant Hegde <vasant.hegde@amd.com>
Tested-by: Sairaj Kodilkar <sarunkod@amd.com>
Link: https://lore.kernel.org/r/20250611224604.313496-23-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Add a comment to explain why KVM clears IsRunning when putting a vCPU,
even though leaving IsRunning=1 would be ok from a functional perspective.
Per Maxim's experiments, a misbehaving VM could spam the AVIC doorbell so
fast as to induce a 50%+ loss in performance.
Link: https://lore.kernel.org/all/8d7e0d0391df4efc7cb28557297eb2ec9904f1e5.camel@redhat.com
Cc: Maxim Levitsky <mlevitsk@redhat.com>
Acked-by: Naveen N Rao (AMD) <naveen@kernel.org>
Link: https://lore.kernel.org/r/20250611224604.313496-22-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Suppress posted interrupt notifications (set PID.SN=1) whenever the vCPU
is put, i.e. unloaded, not just when the vCPU is preempted, as KVM doesn't
do anything in response to a notification IRQ that arrives in the host,
nor does KVM rely on the Outstanding Notification (PID.ON) flag when the
vCPU is unloaded. And, the cost of scanning the PIR to manually set PID.ON
when loading the vCPU is quite small, especially relative to the cost of
loading (and unloading) a vCPU.
On the flip side, leaving SN clear means a notification for the vCPU will
result in a spurious IRQ for the pCPU, even if vCPU task is scheduled out,
running in userspace, etc. Even worse, if the pCPU is running a different
vCPU, the spurious IRQ could trigger posted interrupt processing for the
wrong vCPU, which is technically a violation of the architecture, as
setting bits in PIR aren't supposed to be propagated to the vIRR until a
notification IRQ is received.
The saving grace of the current behavior is that hardware sends
notification interrupts if and only if PID.ON=0, i.e. only the first
posted interrupt for a vCPU will trigger a spurious IRQ (for each window
where the vCPU is unloaded).
Ideally, KVM would suppress notifications before enabling IRQs in the
VM-Exit, but KVM relies on PID.ON as an indicator that there is a posted
interrupt pending in PIR, e.g. in vmx_sync_pir_to_irr(), and sadly there
is no way to ask hardware to set PID.ON, but not generate an interrupt.
That could be solved by using pi_has_pending_interrupt() instead of
checking only PID.ON, but it's not at all clear that would be a performance
win, as KVM would end up scanning the entire PIR whenever an interrupt
isn't pending.
And long term, the spurious IRQ window, i.e. where a vCPU is loaded with
IRQs enabled, can effectively be made smaller for hot paths by moving
performance critical VM-Exit handlers into the fastpath, i.e. by never
enabling IRQs for hot path VM-Exits.
Link: https://lore.kernel.org/r/20250611224604.313496-21-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Disable IPI virtualization on AMD Family 17h CPUs (Zen2 and Zen1), as
hardware doesn't reliably detect changes to the 'IsRunning' bit during ICR
write emulation, and might fail to VM-Exit on the sending vCPU, if
IsRunning was recently cleared.
The absence of the VM-Exit leads to KVM not waking (or triggering nested
VM-Exit of) the target vCPU(s) of the IPI, which can lead to hung vCPUs,
unbounded delays in L2 execution, etc.
To workaround the erratum, simply disable IPI virtualization, which
prevents KVM from setting IsRunning and thus eliminates the race where
hardware sees a stale IsRunning=1. As a result, all ICR writes (except
when "Self" shorthand is used) will VM-Exit and therefore be correctly
emulated by KVM.
Disabling IPI virtualization does carry a performance penalty, but
benchmarkng shows that enabling AVIC without IPI virtualization is still
much better than not using AVIC at all, because AVIC still accelerates
posted interrupts and the receiving end of the IPIs.
Note, when virtualizing Self-IPIs, the CPU skips reading the physical ID
table and updates the vIRR directly (because the vCPU is by definition
actively running), i.e. Self-IPI isn't susceptible to the erratum *and*
is still accelerated by hardware.
Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
[sean: rebase, massage changelog, disallow user override]
Acked-by: Naveen N Rao (AMD) <naveen@kernel.org>
Link: https://lore.kernel.org/r/20250611224604.313496-20-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Allow ITS to enable stuffing independent of retbleed. The dependency is only
on retpoline. It is a valid case for retbleed to be mitigated by eIBRS while
ITS deploys stuffing at the same time.
Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/20250611-eibrs-fix-v4-6-5ff86cac6c61@linux.intel.com
|
|
In preparation to allow ITS to also enable stuffing aka Call Depth
Tracking (CDT) independently of retbleed, introduce a helper
cdt_possible().
Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/20250611-eibrs-fix-v4-5-5ff86cac6c61@linux.intel.com
|
|
Prepare to apply stuffing mitigation in its_apply_mitigation(). This is
currently only done via retbleed mitigation. Also using switch/case
makes it evident that mitigation mode like VMEXIT_ONLY doesn't need any
special handling.
Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Nikolay Borisov <nik.borisov@suse.com>
Link: https://lore.kernel.org/20250611-eibrs-fix-v4-4-5ff86cac6c61@linux.intel.com
|
|
The purpose of the warning is to prevent an unexpected change to the return
thunk mitigation. However, there are legitimate cases where the return
thunk is intentionally set more than once. For example, ITS and SRSO both
can set the return thunk after retbleed has set it. In both the cases
retbleed is still mitigated.
Replace the warning with an info about the active return thunk.
Suggested-by: Borislav Petkov <bp@alien8.de>
Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/20250611-eibrs-fix-v4-3-5ff86cac6c61@linux.intel.com
|
|
Simplify the nested checks, remove redundant print and comment.
Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Nikolay Borisov <nik.borisov@suse.com>
Acked-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/20250611-eibrs-fix-v4-2-5ff86cac6c61@linux.intel.com
|
|
The retbleed select function leaves the mitigation to AUTO in some cases.
Moreover, the update function can also set the mitigation to AUTO. This
is inconsistent with other mitigations and requires explicit handling of
AUTO at the end of update step.
Make sure a mitigation gets selected in the select step, and do not change
it to AUTO in the update step. When no mitigation can be selected leave it
to NONE, which is what AUTO was getting changed to in the end.
Suggested-by: Borislav Petkov <bp@alien8.de>
Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Acked-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/20250611-eibrs-fix-v4-1-5ff86cac6c61@linux.intel.com
|
|
I got some build warnings with W=1:
arch/x86/coco/sev/core.c:
arch/x86/crypto/aria_aesni_avx2_glue.c:
warning: EXPORT_SYMBOL() is used,
but #include <linux/export.h> is missing
arch/x86/crypto/aria_aesni_avx_glue.c:
warning: EXPORT_SYMBOL() is used,
but #include <linux/export.h> is missing
arch/x86/crypto/camellia_aesni_avx_glue.c:
warning: EXPORT_SYMBOL() is used,
but #include <linux/export.h> is missing
arch/x86/crypto/camellia_glue.c: warning:
EXPORT_SYMBOL() is used,
but #include <linux/export.h> is missing
arch/x86/crypto/curve25519-x86_64.c:
warning: EXPORT_SYMBOL() is used,
but #include <linux/export.h> is missing
arch/x86/crypto/serpent_avx_glue.c:
warning: EXPORT_SYMBOL() is used,
but #include <linux/export.h> is missing
arch/x86/crypto/sm4_aesni_avx_glue.c:
warning: EXPORT_SYMBOL() is used,
but #include <linux/export.h> is missing
arch/x86/crypto/twofish_glue.c:
warning: EXPORT_SYMBOL() is used,
but #include <linux/export.h> is missing
arch/x86/crypto/twofish_glue_3way.c:
warning: EXPORT_SYMBOL() is used,
but #include <linux/export.h> is missing
so I fixed these build warnings for x86_64.
Signed-off-by: ChengZhenghan <chengzhenghan@uniontech.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
|
|
We need the driver-core fixes that are in 6.16-rc3 into here as well
to build on top of.
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 fixes from Borislav Petkov:
- Make sure the array tracking which kernel text positions need to be
alternatives-patched doesn't get mishandled by out-of-order
modifications, leading to it overflowing and causing page faults when
patching
- Avoid an infinite loop when early code does a ranged TLB invalidation
before the broadcast TLB invalidation count of how many pages it can
flush, has been read from CPUID
- Fix a CONFIG_MODULES typo
- Disable broadcast TLB invalidation when PTI is enabled to avoid an
overflow of the bitmap tracking dynamic ASIDs which need to be
flushed when the kernel switches between the user and kernel address
space
- Handle the case of a CPU going offline and thus reporting zeroes when
reading top-level events in the resctrl code
* tag 'x86_urgent_for_v6.16_rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/alternatives: Fix int3 handling failure from broken text_poke array
x86/mm: Fix early boot use of INVPLGB
x86/its: Fix an ifdef typo in its_alloc()
x86/mm: Disable INVLPGB when PTI is enabled
x86,fs/resctrl: Remove inappropriate references to cacheinfo in the resctrl subsystem
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf fixes from Borislav Petkov:
- Avoid a crash on a heterogeneous machine where not all cores support
the same hw events features
- Avoid a deadlock when throttling events
- Document the perf event states more
- Make sure a number of perf paths switching off or rescheduling events
call perf_cgroup_event_disable()
- Make sure perf does task sampling before its userspace mapping is
torn down, and not after
* tag 'perf_urgent_for_v6.16_rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
perf/x86/intel: Fix crash in icl_update_topdown_event()
perf: Fix the throttle error of some clock events
perf: Add comment to enum perf_event_state
perf/core: Fix WARN in perf_cgroup_switch()
perf: Fix dangling cgroup pointer in cpuctx
perf: Fix cgroup state vs ERROR
perf: Fix sample vs do_exit()
|
|
Pull kvm fixes from Paolo Bonzini:
"ARM:
- Fix another set of FP/SIMD/SVE bugs affecting NV, and plugging some
missing synchronisation
- A small fix for the irqbypass hook fixes, tightening the check and
ensuring that we only deal with MSI for both the old and the new
route entry
- Rework the way the shadow LRs are addressed in a nesting
configuration, plugging an embarrassing bug as well as simplifying
the whole process
- Add yet another fix for the dreaded arch_timer_edge_cases selftest
RISC-V:
- Fix the size parameter check in SBI SFENCE calls
- Don't treat SBI HFENCE calls as NOPs
x86 TDX:
- Complete API for handling complex TDVMCALLs in userspace.
This was delayed because the spec lacked a way for userspace to
deny supporting these calls; the new exit code is now approved"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
KVM: TDX: Exit to userspace for GetTdVmCallInfo
KVM: TDX: Handle TDG.VP.VMCALL<GetQuote>
KVM: TDX: Add new TDVMCALL status code for unsupported subfuncs
KVM: arm64: VHE: Centralize ISBs when returning to host
KVM: arm64: Remove cpacr_clear_set()
KVM: arm64: Remove ad-hoc CPTR manipulation from kvm_hyp_handle_fpsimd()
KVM: arm64: Remove ad-hoc CPTR manipulation from fpsimd_sve_sync()
KVM: arm64: Reorganise CPTR trap manipulation
KVM: arm64: VHE: Synchronize CPTR trap deactivation
KVM: arm64: VHE: Synchronize restore of host debug registers
KVM: arm64: selftests: Close the GIC FD in arch_timer_edge_cases
KVM: arm64: Explicitly treat routing entry type changes as changes
KVM: arm64: nv: Fix tracking of shadow list registers
RISC-V: KVM: Don't treat SBI HFENCE calls as NOPs
RISC-V: KVM: Fix the size parameter check in SBI SFENCE calls
|
|
Add CPUID faulting support on AMD using the same user interface.
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/20250528213105.1149-1-bp@kernel.org
|
|
Similar to zboot architectures, implement support for embedding SBAT data
for x86. Put '.sbat' section in between '.data' and '.text' as the former
also covers '.bss' and '.pgtable' and thus must be the last one in the
file.
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Ard Biesheuvel <ardb@kernel.org>
Link: https://lore.kernel.org/20250603091951.57775-1-vkuznets@redhat.com
|
|
Let userspace "disable" IPI virtualization for AVIC via the enable_ipiv
module param, by never setting IsRunning. SVM doesn't provide a way to
disable IPI virtualization in hardware, but by ensuring CPUs never see
IsRunning=1, every IPI in the guest (except for self-IPIs) will generate a
VM-Exit.
To avoid setting the real IsRunning bit, while still allowing KVM to use
each vCPU's entry to update GA log entries, simply maintain a shadow of
the entry, without propagating IsRunning updates to the real table when
IPI virtualization is disabled.
Providing a way to effectively disable IPI virtualization will allow KVM
to safely enable AVIC on hardware that is susceptible to erratum #1235,
which causes hardware to sometimes fail to detect that the IsRunning bit
has been cleared by software.
Note, the table _must_ be fully populated, as broadcast IPIs skip invalid
entries, i.e. won't generate VM-Exit if every entry is invalid, and so
simply pointing the VMCB at a common dummy table won't work.
Alternatively, KVM could allocate a shadow of the entire table, but that'd
be a waste of 4KiB since the per-vCPU entry doesn't actually consume an
additional 8 bytes of memory (vCPU structures are large enough that they
are backed by order-N pages).
Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
[sean: keep "entry" variables, reuse enable_ipiv, split from erratum]
Link: https://lore.kernel.org/r/20250611224604.313496-19-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Move enable_ipiv to common x86 so that it can be reused by SVM to control
IPI virtualization when AVIC is enabled. SVM doesn't actually provide a
way to truly disable IPI virtualization, but KVM can get close enough by
skipping the necessary table programming.
Link: https://lore.kernel.org/r/20250611224604.313496-18-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Drop the vCPU's pointer to its AVIC Physical ID entry, and simply index
the table directly. Caching a pointer address is completely unnecessary
for performance, and while the field technically caches the result of the
pointer calculation, it's all too easy to misinterpret the name and think
that the field somehow caches the _data_ in the table.
No functional change intended.
Suggested-by: Maxim Levitsky <mlevitsk@redhat.com>
Tested-by: Sairaj Kodilkar <sarunkod@amd.com>
Reviewed-by: Naveen N Rao (AMD) <naveen@kernel.org>
Link: https://lore.kernel.org/r/20250611224604.313496-17-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Allocate and track AVIC's logical and physical tables as u32 and u64
pointers respectively, as managing the pages as "struct page" pointers
adds an almost absurd amount of boilerplate and complexity. E.g. with
page_address() out of the way, svm->avic_physical_id_cache becomes
completely superfluous, and will be removed in a future cleanup.
No functional change intended.
Tested-by: Sairaj Kodilkar <sarunkod@amd.com>
Acked-by: Naveen N Rao (AMD) <naveen@kernel.org>
Link: https://lore.kernel.org/r/20250611224604.313496-16-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Drop avic_get_physical_id_entry()'s compatibility check on the incoming
ID, as its sole caller, avic_init_backing_page(), performs the exact same
check. Drop avic_get_physical_id_entry() entirely as the only remaining
functionality is getting the address of the Physical ID table, and
accessing the array without an immediate bounds check is kludgy.
Opportunistically add a compile-time assertion to ensure the vcpu_id can't
result in a bounds overflow, e.g. if KVM (really) messed up a maximum
physical ID #define, as well as run-time assertions so that a NULL pointer
dereference is morphed into a safer WARN().
No functional change intended.
Tested-by: Sairaj Kodilkar <sarunkod@amd.com>
Reviewed-by: Naveen N Rao (AMD) <naveen@kernel.org>
Link: https://lore.kernel.org/r/20250611224604.313496-15-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Inhibit AVIC with a new "ID too big" flag if userspace creates a vCPU with
an ID that is too big, but otherwise allow vCPU creation to succeed.
Rejecting KVM_CREATE_VCPU with EINVAL violates KVM's ABI as KVM advertises
that the max vCPU ID is 4095, but disallows creating vCPUs with IDs bigger
than 254 (AVIC) or 511 (x2AVIC).
Alternatively, KVM could advertise an accurate value depending on which
AVIC mode is in use, but that wouldn't really solve the underlying problem,
e.g. would be a breaking change if KVM were to ever try and enable AVIC or
x2AVIC by default.
Cc: Maxim Levitsky <mlevitsk@redhat.com>
Tested-by: Sairaj Kodilkar <sarunkod@amd.com>
Link: https://lore.kernel.org/r/20250611224604.313496-14-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Drop vcpu_svm's avic_backing_page pointer and instead grab the physical
address of KVM's vAPIC page directly from the source. Getting a physical
address from a kernel virtual address is not an expensive operation, and
getting the physical address from a struct page is *more* expensive for
CONFIG_SPARSEMEM=y kernels. Regardless, none of the paths that consume
the address are hot paths, i.e. shaving cycles is not a priority.
Eliminating the "cache" means KVM doesn't have to worry about the cache
being invalid, which will simplify a future fix when dealing with vCPU IDs
that are too big.
WARN if KVM attempts to allocate a vCPU's AVIC backing page without an
in-kernel local APIC. avic_init_vcpu() bails early if the APIC is not
in-kernel, and KVM disallows enabling an in-kernel APIC after vCPUs have
been created, i.e. it should be impossible to reach
avic_init_backing_page() without the vAPIC being allocated.
Tested-by: Sairaj Kodilkar <sarunkod@amd.com>
Reviewed-by: Naveen N Rao (AMD) <naveen@kernel.org>
Link: https://lore.kernel.org/r/20250611224604.313496-13-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Add a helper to get the physical address of the AVIC backing page, both
to deduplicate code and to prepare for getting the address directly from
apic->regs, at which point it won't be all that obvious that the address
in question is what SVM calls the AVIC backing page.
No functional change intended.
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Tested-by: Sairaj Kodilkar <sarunkod@amd.com>
Reviewed-by: Naveen N Rao (AMD) <naveen@kernel.org>
Link: https://lore.kernel.org/r/20250611224604.313496-12-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Drop AVIC_HPA_MASK and all its users, the mask is just the 4KiB-aligned
maximum theoretical physical address for x86-64 CPUs, as x86-64 is
currently defined (going beyond PA52 would require an entirely new paging
mode, which would arguably create a new, different architecture).
All usage in KVM masks the result of page_to_phys(), which on x86-64 is
guaranteed to be 4KiB aligned and a legal physical address; if either of
those requirements doesn't hold true, KVM has far bigger problems.
Drop masking the avic_backing_page with
AVIC_PHYSICAL_ID_ENTRY_BACKING_PAGE_MASK for all the same reasons, but
keep the macro even though it's unused in functional code. It's a
distinct architectural define, and having the definition in software
helps visualize the layout of an entry. And to be hyper-paranoid about
MAXPA going beyond 52, add a compile-time assert to ensure the kernel's
maximum supported physical address stays in bounds.
The unnecessary masking in avic_init_vmcb() also incorrectly assumes that
SME's C-bit resides between bits 51:11; that holds true for current CPUs,
but isn't required by AMD's architecture:
In some implementations, the bit used may be a physical address bit
Key word being "may".
Opportunistically use the GENMASK_ULL() version for
AVIC_PHYSICAL_ID_ENTRY_BACKING_PAGE_MASK, which is far more readable
than a set of repeating Fs.
Tested-by: Sairaj Kodilkar <sarunkod@amd.com>
Reviewed-by: Naveen N Rao (AMD) <naveen@kernel.org>
Link: https://lore.kernel.org/r/20250611224604.313496-11-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Drop VMCB_AVIC_APIC_BAR_MASK, it's just a regurgitation of the maximum
theoretical 4KiB-aligned physical address, i.e. is not novel in any way,
and its only usage is to mask the default APIC base, which is 4KiB aligned
and (obviously) a legal physical address.
No functional change intended.
Tested-by: Sairaj Kodilkar <sarunkod@amd.com>
Reviewed-by: Naveen N Rao (AMD) <naveen@kernel.org>
Link: https://lore.kernel.org/r/20250611224604.313496-10-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Delete the IRTE link from the previous vCPU irrespective of the new
routing state, i.e. even if the IRTE won't be configured to post IRQs to a
vCPU. Whether or not the new route is postable as no bearing on the *old*
route. Failure to delete the link can result in KVM incorrectly updating
the IRTE, e.g. if the "old" vCPU is scheduled in/out.
Fixes: 411b44ba80ab ("svm: Implements update_pi_irte hook to setup posted interrupt")
Tested-by: Sairaj Kodilkar <sarunkod@amd.com>
Link: https://lore.kernel.org/r/20250611224604.313496-9-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Delete the amd_ir_data.prev_ga_tag field now that all usage is
superfluous.
Reviewed-by: Vasant Hegde <vasant.hegde@amd.com>
Tested-by: Sairaj Kodilkar <sarunkod@amd.com>
Link: https://lore.kernel.org/r/20250611224604.313496-8-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Delete the previous per-vCPU IRTE link prior to modifying the IRTE. If
forcing the IRTE back to remapped mode fails, the IRQ is already broken;
keeping stale metadata won't change that, and the IOMMU should be
sufficiently paranoid to sanitize the IRTE when the IRQ is freed and
reallocated.
This will allow hoisting the vCPU tracking to common x86, which in turn
will allow most of the IRTE update code to be deduplicated.
Tested-by: Sairaj Kodilkar <sarunkod@amd.com>
Link: https://lore.kernel.org/r/20250611224604.313496-7-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Track the IRTEs that are posting to an SVM vCPU via the associated irqfd
structure and GSI routing instead of dynamically allocating a separate
data structure. In addition to eliminating an atomic allocation, this
will allow hoisting much of the IRTE update logic to common x86.
Cc: Sairaj Kodilkar <sarunkod@amd.com>
Link: https://lore.kernel.org/r/20250611224604.313496-6-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
When updating IRTEs in response to a GSI routing or IRQ bypass change,
pass the new/current routing information along with the associated irqfd.
This will allow KVM x86 to harden, simplify, and deduplicate its code.
Since adding/removing a bypass producer is now conveniently protected with
irqfds.lock, i.e. can't run concurrently with kvm_irq_routing_update(),
use the routing information cached in the irqfd instead of looking up
the information in the current GSI routing tables.
Opportunistically convert an existing printk() to pr_info() and put its
string onto a single line (old code that strictly adhered to 80 chars).
Link: https://lore.kernel.org/r/20250611224604.313496-5-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Drop irq_comm.c, a.k.a. common IRQ APIs, as there has been no non-x86 user
since commit 003f7de62589 ("KVM: ia64: remove") (at the time, irq_comm.c
lived in virt/kvm, not arch/x86/kvm).
Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
Acked-by: Kai Huang <kai.huang@intel.com>
Link: https://lore.kernel.org/r/20250611213557.294358-19-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Move the IRQ mask logic to ioapic.c as KVM's only user is its in-kernel
I/O APIC emulation. In addition to encapsulating more I/O APIC specific
code, trimming down irq_comm.c helps pave the way for removing it entirely.
Acked-by: Kai Huang <kai.huang@intel.com>
Link: https://lore.kernel.org/r/20250611213557.294358-18-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Add a Kconfig to allow building KVM without support for emulating a I/O
APIC, PIC, and PIT, which is desirable for deployments that effectively
don't support a fully in-kernel IRQ chip, i.e. never expect any VMM to
create an in-kernel I/O APIC. E.g. compiling out support eliminates a few
thousand lines of guest-facing code and gives security folks warm fuzzies.
As a bonus, wrapping relevant paths with CONFIG_KVM_IOAPIC #ifdefs makes
it much easier for readers to understand which bits and pieces exist
specifically for fully in-kernel IRQ chips.
Opportunistically convert all two in-kernel uses of __KVM_HAVE_IOAPIC to
CONFIG_KVM_IOAPIC, e.g. rather than add a second #ifdef to generate a stub
for kvm_arch_post_irq_routing_update().
Acked-by: Kai Huang <kai.huang@intel.com>
Link: https://lore.kernel.org/r/20250611213557.294358-15-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Move the I/O APIC tracepoints and trace_kvm_msi_set_irq() to x86, as
__KVM_HAVE_IOAPIC is just code for "x86", and trace_kvm_msi_set_irq()
isn't unique to I/O APIC emulation.
Opportunistically clean up the absurdly messy #includes in ioapic.c.
No functional change intended.
Acked-by: Kai Huang <kai.huang@intel.com>
Link: https://lore.kernel.org/r/20250611213557.294358-14-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Explicitly check for an in-kernel PIC when checking for a pending ExtINT
in the PIC. Effectively swapping the split vs. full irqchip logic will
allow guarding the in-kernel I/O APIC (and PIC) emulation with a Kconfig,
and also makes it more obvious that kvm_pic_read_irq() won't result in a
NULL pointer dereference.
Opportunistically add WARNs in the fallthrough path, mostly to document
that the userspace ExtINT logic is only relevant to split IRQ chips.
Acked-by: Kai Huang <kai.huang@intel.com>
Link: https://lore.kernel.org/r/20250611213557.294358-13-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Don't bother clearing the PIT's IRQ line status when destroying the PIT,
as userspace can't possibly rely on KVM to lower the IRQ line in any sane
use case, and it's not at all obvious that clearing the PIT's IRQ line is
correct/desirable in kvm_create_pit()'s error path.
When called from kvm_arch_pre_destroy_vm(), the entire VM is being torn
down and thus {kvm_pic,kvm_ioapic}.irq_states are unreachable.
As for the error path in kvm_create_pit(), the only way the PIT's bit in
irq_states can be set is if userspace raises the associated IRQ before
KVM_CREATE_PIT{2} completes. Forcefully clearing the bit would clobber
userspace's input, nonsensical though that input may be. Not to mention
that no known VMM will continue on if PIT creation fails.
Acked-by: Kai Huang <kai.huang@intel.com>
Link: https://lore.kernel.org/r/20250611213557.294358-12-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Hardcode the PIT's source IRQ ID to '2' instead of "finding" that bit 2
is always the first available bit in irq_sources_bitmap. Bits 0 and 1 are
set/reserved by kvm_arch_init_vm(), i.e. long before kvm_create_pit() can
be invoked, and KVM allows at most one in-kernel PIT instance, i.e. it's
impossible for the PIT to find a different free bit (there are no other
users of kvm_request_irq_source_id().
Delete the now-defunct irq_sources_bitmap and all its associated code.
Acked-by: Kai Huang <kai.huang@intel.com>
Link: https://lore.kernel.org/r/20250611213557.294358-11-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Move kvm_{request,free}_irq_source_id() to i8254.c, i.e. the dedicated PIT
emulation file, in anticipation of removing them entirely in favor of
hardcoding the PIT's "requested" source ID (the source ID can only ever be
'2', and the request can never fail).
No functional change intended.
Acked-by: Kai Huang <kai.huang@intel.com>
Link: https://lore.kernel.org/r/20250611213557.294358-10-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Move the default IRQ routing table used for in-kernel I/O APIC and PIC
routing to irq.c, and tweak the name to make it explicitly clear what
routing is being initialized.
In addition to making it more obvious that the so called "default" routing
only applies to an in-kernel I/O APIC, getting it out of irq_comm.c will
allow removing irq_comm.c entirely. And placing the function alongside
other I/O APIC and PIC code will allow for guarding KVM's in-kernel I/O
APIC and PIC emulation with a Kconfig with minimal #ifdefs.
No functional change intended.
Cc: Kai Huang <kai.huang@intel.com>
Acked-by: Kai Huang <kai.huang@intel.com>
Link: https://lore.kernel.org/r/20250611213557.294358-9-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Rename irqchip_kernel() to irqchip_full(), as "kernel" is very ambiguous
due to the existence of split IRQ chip support, where only some of the
"irqchip" is in emulated by the kernel/KVM. E.g. irqchip_kernel() often
gets confused with irqchip_in_kernel().
Opportunistically hoist the definition up in irq.h so that it's
co-located with other "full" irqchip code in anticipation of wrapping it
all with a Kconfig/#ifdef.
No functional change intended.
Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
Acked-by: Kai Huang <kai.huang@intel.com>
Link: https://lore.kernel.org/r/20250611213557.294358-8-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Move the ioctl helpers for getting/setting fully in-kernel IRQ chip state
to irq.c, partly to trim down x86.c, but mostly in preparation for adding
a Kconfig to control support for in-kernel I/O APIC, PIC, and PIT
emulation.
No functional change intended.
Acked-by: Kai Huang <kai.huang@intel.com>
Link: https://lore.kernel.org/r/20250611213557.294358-7-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Move the PIT ioctl helpers to i8254.c, i.e. to the file that implements
PIT emulation. Eliminating PIT code in x86.c will allow adding a Kconfig
to control support for in-kernel I/O APIC, PIC, and PIT emulation with
minimal #ifdefs.
Opportunistically make kvm_pit_set_reinject() and kvm_pit_load_count()
local to i8254.c as they were only publicly visible to make them available
to the ioctl helpers.
No functional change intended.
Acked-by: Kai Huang <kai.huang@intel.com>
Link: https://lore.kernel.org/r/20250611213557.294358-6-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Drop the superfluous kvm_hv_set_sint() and instead wire up ->set() directly
to its final destination, kvm_hv_synic_set_irq(). Keep hv_synic_set_irq()
instead of kvm_hv_set_sint() to provide some amount of consistency in the
->set() helpers, e.g. to match kvm_pic_set_irq() and kvm_ioapic_set_irq().
kvm_set_msi() is arguably the oddball, e.g. kvm_set_msi_irq() should be
something like kvm_msi_to_lapic_irq() so that kvm_set_msi() can instead be
kvm_set_msi_irq(), but that's a future problem to solve.
No functional change intended.
Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
Cc: Kai Huang <kai.huang@intel.com>
Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Acked-by: Kai Huang <kai.huang@intel.com>
Link: https://lore.kernel.org/r/20250611213557.294358-5-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Drop the superfluous and confusing kvm_set_ioapic_irq() and instead wire
up ->set() directly to its final destination.
No functional change intended.
Acked-by: Kai Huang <kai.huang@intel.com>
Link: https://lore.kernel.org/r/20250611213557.294358-4-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Drop the superfluous and confusing kvm_set_pic_irq() => kvm_pic_set_irq()
wrapper, and instead wire up ->set() directly to its final destination.
Opportunistically move the declaration kvm_pic_set_irq() to irq.h to
start gathering more of the in-kernel APIC/IO-APIC logic in irq.{c,h}.
No functional change intended.
Acked-by: Kai Huang <kai.huang@intel.com>
Link: https://lore.kernel.org/r/20250611213557.294358-3-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Trigger the I/O APIC route rescan that's performed for a split IRQ chip
after userspace updates IRQ routes in kvm_arch_irq_routing_update(), i.e.
before dropping kvm->irq_lock. Calling kvm_make_all_cpus_request() under
a mutex is perfectly safe, and the smp_wmb()+smp_mb__after_atomic() pair
in __kvm_make_request()+kvm_check_request() ensures the new routing is
visible to vCPUs prior to the request being visible to vCPUs.
In all likelihood, commit b053b2aef25d ("KVM: x86: Add EOI exit bitmap
inference") somewhat arbitrarily made the request outside of irq_lock to
avoid holding irq_lock any longer than is strictly necessary. And then
commit abdb080f7ac8 ("kvm/irqchip: kvm_arch_irq_routing_update renaming
split") took the easy route of adding another arch hook instead of risking
a functional change.
Note, the call to synchronize_srcu_expedited() does NOT provide ordering
guarantees with respect to vCPUs scanning the new routing; as above, the
request infrastructure provides the necessary ordering. I.e. there's no
need to wait for kvm_scan_ioapic_routes() to complete if it's actively
running, because regardless of whether it grabs the old or new table, the
vCPU will have another KVM_REQ_SCAN_IOAPIC pending, i.e. will rescan again
and see the new mappings.
Acked-by: Kai Huang <kai.huang@intel.com>
Link: https://lore.kernel.org/r/20250611213557.294358-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Move ownership of IRQ bypass token tracking into irqbypass.ko, and
explicitly require callers to pass an eventfd_ctx structure instead of a
completely opaque token. Relying on producers and consumers to set the
token appropriately is error prone, and hiding the fact that the token must
be an eventfd_ctx pointer (for all intents and purposes) unnecessarily
obfuscates the code and makes it more brittle.
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Reviewed-by: Alex Williamson <alex.williamson@redhat.com>
Link: https://lore.kernel.org/r/20250516230734.2564775-4-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
KVM currently returns -EINVAL when it attempts to create an SNP guest if
the SINGLE_SOCKET guest policy bit is set. The reason for this action is
that KVM would need specific support (SNP_ACTIVATE_EX command support) to
achieve this when running on a system with more than one socket. However,
the SEV firmware will make the proper check and return POLICY_FAILURE
during SNP_ACTIVATE if the single socket guest policy bit is set and the
system has more than one socket:
- System with one socket
- Guest policy SINGLE_SOCKET == 0 ==> SNP_ACTIVATE succeeds
- Guest policy SINGLE_SOCKET == 1 ==> SNP_ACTIVATE succeeds
- System with more than one socket
- Guest policy SINGLE_SOCKET == 0 ==> SNP_ACTIVATE succeeds
- Guest policy SINGLE_SOCKET == 1 ==> SNP_ACTIVATE fails with
POLICY_FAILURE
Remove the check for the SINGLE_SOCKET policy bit from snp_launch_start()
and allow the firmware to perform the proper checking.
This does have the effect of allowing an SNP guest with the SINGLE_SOCKET
policy bit set to run on a single socket system, but fail when run on a
system with more than one socket. However, this should not affect existing
SNP guests as setting the SINGLE_SOCKET policy bit is not allowed today.
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Link: https://lore.kernel.org/r/4c51018dd3e4f2c543935134d2c4f47076f109f6.1748553480.git.thomas.lendacky@amd.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|