summaryrefslogtreecommitdiff
AgeCommit message (Collapse)Author
2024-11-13Merge branch 'kvm-docs-6.13' into HEADPaolo Bonzini
- Drop obsolete references to PPC970 KVM, which was removed 10 years ago. - Fix incorrect references to non-existing ioctls - List registers supported by KVM_GET/SET_ONE_REG on s390 - Use rST internal links - Reorganize the introduction to the API document
2024-11-13Merge tag 'kvm-x86-misc-6.13' of https://github.com/kvm-x86/linux into HEADPaolo Bonzini
KVM x86 misc changes for 6.13 - Clean up and optimize KVM's handling of writes to MSR_IA32_APICBASE. - Quirk KVM's misguided behavior of initialized certain feature MSRs to their maximum supported feature set, which can result in KVM creating invalid vCPU state. E.g. initializing PERF_CAPABILITIES to a non-zero value results in the vCPU having invalid state if userspace hides PDCM from the guest, which can lead to save/restore failures. - Fix KVM's handling of non-canonical checks for vCPUs that support LA57 to better follow the "architecture", in quotes because the actual behavior is poorly documented. E.g. most MSR writes and descriptor table loads ignore CR4.LA57 and operate purely on whether the CPU supports LA57. - Bypass the register cache when querying CPL from kvm_sched_out(), as filling the cache from IRQ context is generally unsafe, and harden the cache accessors to try to prevent similar issues from occuring in the future. - Advertise AMD_IBPB_RET to userspace, and fix a related bug where KVM over-advertises SPEC_CTRL when trying to support cross-vendor VMs. - Minor cleanups
2024-11-13Merge tag 'kvm-x86-vmx-6.13' of https://github.com/kvm-x86/linux into HEADPaolo Bonzini
KVM VMX change for 6.13 - Remove __invept()'s unused @gpa param, which was left behind when KVM dropped code for invalidating a specific GPA (Intel never officially documented support for single-address INVEPT; presumably pre-production CPUs supported it at some point).
2024-11-13Merge tag 'kvm-x86-selftests-6.13' of https://github.com/kvm-x86/linux into HEADPaolo Bonzini
KVM selftests changes for 6.13 - Enable XFAM-based features by default for all selftests VMs, which will allow removing the "no AVX" restriction.
2024-11-13Merge tag 'kvm-x86-mmu-6.13' of https://github.com/kvm-x86/linux into HEADPaolo Bonzini
KVM x86 MMU changes for 6.13 - Cleanup KVM's handling of Accessed and Dirty bits to dedup code, improve documentation, harden against unexpected changes, and to simplify A/D-disabled MMUs by using the hardware-defined A/D bits to track if a PFN is Accessed and/or Dirty. - Elide TLB flushes when aging SPTEs, as has been done in x86's primary MMU for over 10 years. - Batch TLB flushes when zapping collapsible TDP MMU SPTEs, i.e. when dirty logging is toggled off, which reduces the time it takes to disable dirty logging by ~3x. - Recover huge pages in-place in the TDP MMU instead of zapping the SP and waiting until the page is re-accessed to create a huge mapping. Proactively installing huge pages can reduce vCPU jitter in extreme scenarios. - Remove support for (poorly) reclaiming page tables in shadow MMUs via the primary MMU's shrinker interface.
2024-11-13Merge tag 'kvm-x86-generic-6.13' of https://github.com/kvm-x86/linux into HEADPaolo Bonzini
KVM generic changes for 6.13 - Rework kvm_vcpu_on_spin() to use a single for-loop instead of making two partial poasses over "all" vCPUs. Opportunistically expand the comment to better explain the motivation and logic. - Protect vcpu->pid accesses outside of vcpu->mutex with a rwlock instead of RCU, so that running a vCPU on a different task doesn't encounter long stalls due to having to wait for all CPUs become quiescent.
2024-11-12Merge tag 'kvm-s390-next-6.13-1' of ↵Paolo Bonzini
https://git.kernel.org/pub/scm/linux/kernel/git/kvms390/linux into HEAD - second part of the ucontrol selftest - cpumodel sanity check selftest - gen17 cpumodel changes
2024-11-11KVM: s390: selftests: Add regression tests for PFCR subfunctionsHendrik Brueckner
Check if the PFCR query reported in userspace coincides with the kernel reported function list. Right now we don't mask the functions in the kernel so they have to be the same. Signed-off-by: Hendrik Brueckner <brueckner@linux.ibm.com> Reviewed-by: Hariharan Mari <hari55@linux.ibm.com> Link: https://lore.kernel.org/r/20241107152319.77816-5-brueckner@linux.ibm.com [frankja@linux.ibm.com: Added commit description] Signed-off-by: Janosch Frank <frankja@linux.ibm.com> Message-ID: <20241107152319.77816-5-brueckner@linux.ibm.com>
2024-11-11KVM: s390: add gen17 facilities to CPU modelHendrik Brueckner
Add gen17 facilities and let KVM_CAP_S390_VECTOR_REGISTERS handle the enablement of the vector extension facilities. Signed-off-by: Hendrik Brueckner <brueckner@linux.ibm.com> Reviewed-by: Christian Borntraeger <borntraeger@linux.ibm.com> Link: https://lore.kernel.org/r/20241107152319.77816-4-brueckner@linux.ibm.com Signed-off-by: Janosch Frank <frankja@linux.ibm.com> Message-ID: <20241107152319.77816-4-brueckner@linux.ibm.com>
2024-11-11KVM: s390: add msa11 to cpu modelHendrik Brueckner
Message-security-assist 11 introduces pckmo subfunctions to encrypt hmac keys. Signed-off-by: Hendrik Brueckner <brueckner@linux.ibm.com> Reviewed-by: Janosch Frank <frankja@linux.ibm.com> Reviewed-by: Christian Borntraeger <borntraeger@linux.ibm.com> Link: https://lore.kernel.org/r/20241107152319.77816-3-brueckner@linux.ibm.com Signed-off-by: Janosch Frank <frankja@linux.ibm.com> Message-ID: <20241107152319.77816-3-brueckner@linux.ibm.com>
2024-11-11KVM: s390: add concurrent-function facility to cpu modelHendrik Brueckner
Adding support for concurrent-functions facility which provides additional subfunctions. Signed-off-by: Hendrik Brueckner <brueckner@linux.ibm.com> Reviewed-by: Christian Borntraeger <borntraeger@linux.ibm.com> Link: https://lore.kernel.org/r/20241107152319.77816-2-brueckner@linux.ibm.com Signed-off-by: Janosch Frank <frankja@linux.ibm.com> Message-ID: <20241107152319.77816-2-brueckner@linux.ibm.com>
2024-11-11KVM: s390: selftests: correct IP.b length in uc_handle_sieic debug outputChristoph Schlameuss
The length of the interrupt parameters (IP) are: a: 2 bytes b: 4 bytes Signed-off-by: Christoph Schlameuss <schlameuss@linux.ibm.com> Link: https://lore.kernel.org/r/20241107141024.238916-6-schlameuss@linux.ibm.com [frankja@linux.ibm.com: Fixed patch prefix] Signed-off-by: Janosch Frank <frankja@linux.ibm.com> Message-ID: <20241107141024.238916-6-schlameuss@linux.ibm.com>
2024-11-11KVM: s390: selftests: Fix whitespace confusion in ucontrol testChristoph Schlameuss
Checkpatch thinks that we're doing a multiplication but we're obviously not. Fix 4 instances where we adhered to wrong checkpatch advice. Signed-off-by: Christoph Schlameuss <schlameuss@linux.ibm.com> Reviewed-by: Janosch Frank <frankja@linux.ibm.com> Link: https://lore.kernel.org/r/20241107141024.238916-5-schlameuss@linux.ibm.com [frankja@linux.ibm.com: Fixed patch prefix] Signed-off-by: Janosch Frank <frankja@linux.ibm.com> Message-ID: <20241107141024.238916-5-schlameuss@linux.ibm.com>
2024-11-11KVM: s390: selftests: Verify reject memory region operations for ucontrol VMsChristoph Schlameuss
Add a test case verifying KVM_SET_USER_MEMORY_REGION and KVM_SET_USER_MEMORY_REGION2 cannot be executed on ucontrol VMs. Executing this test case on not patched kernels will cause a null pointer dereference in the host kernel. This is fixed with commit: commit 7816e58967d0 ("kvm: s390: Reject memory region operations for ucontrol VMs") Signed-off-by: Christoph Schlameuss <schlameuss@linux.ibm.com> Reviewed-by: Janosch Frank <frankja@linux.ibm.com> Link: https://lore.kernel.org/r/20241107141024.238916-4-schlameuss@linux.ibm.com [frankja@linux.ibm.com: Fixed patch prefix] Signed-off-by: Janosch Frank <frankja@linux.ibm.com> Message-ID: <20241107141024.238916-4-schlameuss@linux.ibm.com>
2024-11-11KVM: s390: selftests: Add uc_skey VM test caseChristoph Schlameuss
Add a test case manipulating s390 storage keys from within the ucontrol VM. Storage key instruction (ISKE, SSKE and RRBE) intercepts and Keyless-subset facility are disabled on first use, where the skeys are setup by KVM in non ucontrol VMs. Signed-off-by: Christoph Schlameuss <schlameuss@linux.ibm.com> Link: https://lore.kernel.org/r/20241108091620.289406-1-schlameuss@linux.ibm.com Acked-by: Janosch Frank <frankja@linux.ibm.com> [frankja@linux.ibm.com: Fixed patch prefix] Signed-off-by: Janosch Frank <frankja@linux.ibm.com> Message-ID: <20241108091620.289406-1-schlameuss@linux.ibm.com>
2024-11-11KVM: s390: selftests: Add uc_map_unmap VM test caseChristoph Schlameuss
Add a test case verifying basic running and interaction of ucontrol VMs. Fill the segment and page tables for allocated memory and map memory on first access. * uc_map_unmap Store and load data to mapped and unmapped memory and use pic segment translation handling to map memory on access. Signed-off-by: Christoph Schlameuss <schlameuss@linux.ibm.com> Reviewed-by: Janosch Frank <frankja@linux.ibm.com> Link: https://lore.kernel.org/r/20241107141024.238916-2-schlameuss@linux.ibm.com [frankja@linux.ibm.com: Fixed patch prefix] Signed-off-by: Janosch Frank <frankja@linux.ibm.com> Message-ID: <20241107141024.238916-2-schlameuss@linux.ibm.com>
2024-11-08Merge tag 'kvm-riscv-6.13-1' of https://github.com/kvm-riscv/linux into HEADPaolo Bonzini
KVM/riscv changes for 6.13 - Accelerate KVM RISC-V when running as a guest - Perf support to collect KVM guest statistics from host side
2024-11-08Documentation: kvm: reorganize introductionPaolo Bonzini
Reorganize the text to mention file descriptors as early as possible. Also mention capabilities early as they are a central part of KVM's API. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-ID: <20241023124507.280382-5-pbonzini@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-11-08Documentation: kvm: replace section numbers with linksPaolo Bonzini
In order to simplify further introduction of hyperlinks, replace explicit section numbers with rST hyperlinks. The section numbers could actually be removed now, but I'm not going to do a huge change throughout the file for an RFC... Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-ID: <20241023124507.280382-4-pbonzini@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-11-08Documentation: kvm: fix a few mistakesPaolo Bonzini
The only occurrence "Capability: none" actually meant the same as "basic". Fix that and a few more aesthetic or content issues in the document. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-ID: <20241023124507.280382-3-pbonzini@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-11-08KVM: powerpc: remove remaining traces of KVM_CAP_PPC_RMAPaolo Bonzini
This was only needed for PPC970 support, which is long gone: the implementation was removed in 2014. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-ID: <20241023124507.280382-2-pbonzini@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-11-08KVM: VMX: Bury Intel PT virtualization (guest/host mode) behind CONFIG_BROKENSean Christopherson
Hide KVM's pt_mode module param behind CONFIG_BROKEN, i.e. disable support for virtualizing Intel PT via guest/host mode unless BROKEN=y. There are myriad bugs in the implementation, some of which are fatal to the guest, and others which put the stability and health of the host at risk. For guest fatalities, the most glaring issue is that KVM fails to ensure tracing is disabled, and *stays* disabled prior to VM-Enter, which is necessary as hardware disallows loading (the guest's) RTIT_CTL if tracing is enabled (enforced via a VMX consistency check). Per the SDM: If the logical processor is operating with Intel PT enabled (if IA32_RTIT_CTL.TraceEn = 1) at the time of VM entry, the "load IA32_RTIT_CTL" VM-entry control must be 0. On the host side, KVM doesn't validate the guest CPUID configuration provided by userspace, and even worse, uses the guest configuration to decide what MSRs to save/load at VM-Enter and VM-Exit. E.g. configuring guest CPUID to enumerate more address ranges than are supported in hardware will result in KVM trying to passthrough, save, and load non-existent MSRs, which generates a variety of WARNs, ToPA ERRORs in the host, a potential deadlock, etc. Fixes: f99e3daf94ff ("KVM: x86: Add Intel PT virtualization work mode") Cc: stable@vger.kernel.org Cc: Adrian Hunter <adrian.hunter@intel.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com> Tested-by: Adrian Hunter <adrian.hunter@intel.com> Message-ID: <20241101185031.1799556-2-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-11-08KVM: x86: Unconditionally set irr_pending when updating APICv stateSean Christopherson
Always set irr_pending (to true) when updating APICv status to fix a bug where KVM fails to set irr_pending when userspace sets APIC state and APICv is disabled, which ultimate results in KVM failing to inject the pending interrupt(s) that userspace stuffed into the vIRR, until another interrupt happens to be emulated by KVM. Only the APICv-disabled case is flawed, as KVM forces apic->irr_pending to be true if APICv is enabled, because not all vIRR updates will be visible to KVM. Hit the bug with a big hammer, even though strictly speaking KVM can scan the vIRR and set/clear irr_pending as appropriate for this specific case. The bug was introduced by commit 755c2bf87860 ("KVM: x86: lapic: don't touch irr_pending in kvm_apic_update_apicv when inhibiting it"), which as the shortlog suggests, deleted code that updated irr_pending. Before that commit, kvm_apic_update_apicv() did indeed scan the vIRR, with with the crucial difference that kvm_apic_update_apicv() did the scan even when APICv was being *disabled*, e.g. due to an AVIC inhibition. struct kvm_lapic *apic = vcpu->arch.apic; if (vcpu->arch.apicv_active) { /* irr_pending is always true when apicv is activated. */ apic->irr_pending = true; apic->isr_count = 1; } else { apic->irr_pending = (apic_search_irr(apic) != -1); apic->isr_count = count_vectors(apic->regs + APIC_ISR); } And _that_ bug (clearing irr_pending) was introduced by commit b26a695a1d78 ("kvm: lapic: Introduce APICv update helper function"), prior to which KVM unconditionally set irr_pending to true in kvm_apic_set_state(), i.e. assumed that the new virtual APIC state could have a pending IRQ. Furthermore, in addition to introducing this issue, commit 755c2bf87860 also papered over the underlying bug: KVM doesn't ensure CPUs and devices see APICv as disabled prior to searching the IRR. Waiting until KVM emulates an EOI to update irr_pending "works", but only because KVM won't emulate EOI until after refresh_apicv_exec_ctrl(), and there are plenty of memory barriers in between. I.e. leaving irr_pending set is basically hacking around bad ordering. So, effectively revert to the pre-b26a695a1d78 behavior for state restore, even though it's sub-optimal if no IRQs are pending, in order to provide a minimal fix, but leave behind a FIXME to document the ugliness. With luck, the ordering issue will be fixed and the mess will be cleaned up in the not-too-distant future. Fixes: 755c2bf87860 ("KVM: x86: lapic: don't touch irr_pending in kvm_apic_update_apicv when inhibiting it") Cc: stable@vger.kernel.org Cc: Maxim Levitsky <mlevitsk@redhat.com> Reported-by: Yong He <zhuangel570@gmail.com> Closes: https://lkml.kernel.org/r/20241023124527.1092810-1-alexyonghe%40tencent.com Signed-off-by: Sean Christopherson <seanjc@google.com> Message-ID: <20241106015135.2462147-1-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-11-08kvm: svm: Fix gctx page leak on invalid inputsDionna Glaze
Ensure that snp gctx page allocation is adequately deallocated on failure during snp_launch_start. Fixes: 136d8bc931c8 ("KVM: SEV: Add KVM_SEV_SNP_LAUNCH_START command") CC: Sean Christopherson <seanjc@google.com> CC: Paolo Bonzini <pbonzini@redhat.com> CC: Thomas Gleixner <tglx@linutronix.de> CC: Ingo Molnar <mingo@redhat.com> CC: Borislav Petkov <bp@alien8.de> CC: Dave Hansen <dave.hansen@linux.intel.com> CC: Ashish Kalra <ashish.kalra@amd.com> CC: Tom Lendacky <thomas.lendacky@amd.com> CC: John Allen <john.allen@amd.com> CC: Herbert Xu <herbert@gondor.apana.org.au> CC: "David S. Miller" <davem@davemloft.net> CC: Michael Roth <michael.roth@amd.com> CC: Luis Chamberlain <mcgrof@kernel.org> CC: Russ Weight <russ.weight@linux.dev> CC: Danilo Krummrich <dakr@redhat.com> CC: Greg Kroah-Hartman <gregkh@linuxfoundation.org> CC: "Rafael J. Wysocki" <rafael@kernel.org> CC: Tianfei zhang <tianfei.zhang@intel.com> CC: Alexey Kardashevskiy <aik@amd.com> Signed-off-by: Dionna Glaze <dionnaglaze@google.com> Message-ID: <20241105010558.1266699-2-dionnaglaze@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-11-08KVM: selftests: use X86_MEMTYPE_WB instead of VMX_BASIC_MEM_TYPE_WBJohn Sperbeck
In 08a7d2525511 ("tools arch x86: Sync the msr-index.h copy with the kernel sources"), VMX_BASIC_MEM_TYPE_WB was removed. Use X86_MEMTYPE_WB instead. Fixes: 08a7d2525511 ("tools arch x86: Sync the msr-index.h copy with the kernel sources") Signed-off-by: John Sperbeck <jsperbeck@google.com> Message-ID: <20241106034031.503291-1-jsperbeck@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-11-08Merge tag 'kvm-x86-fixes-6.12-rcN' of https://github.com/kvm-x86/linux into HEADPaolo Bonzini
KVM x86 and selftests fixes for 6.12: - Increase the timeout for the memslot performance selftest to avoid false failures on arm64 and nested x86 platforms. - Fix a goof in the guest_memfd selftest where a for-loop initialized a bit mask to zero instead of BIT(0). - Disable strict aliasing when building KVM selftests to prevent the compiler from treating things like "u64 *" to "uint64_t *" cases as undefined behavior, which can lead to nasty, hard to debug failures. - Force -march=x86-64-v2 for KVM x86 selftests if and only if the uarch is supported by the compiler. - When emulating a guest TLB flush for a nested guest, flush vpid01, not vpid02, if L2 is active but VPID is disabled in vmcs12, i.e. if L2 and L1 are sharing VPID '0' (from L1's perspective). - Fix a bug in the SNP initialization flow where KVM would return '0' to userspace instead of -errno on failure.
2024-11-05riscv: kvm: Fix out-of-bounds array accessBjörn Töpel
In kvm_riscv_vcpu_sbi_init() the entry->ext_idx can contain an out-of-bound index. This is used as a special marker for the base extensions, that cannot be disabled. However, when traversing the extensions, that special marker is not checked prior indexing the array. Add an out-of-bounds check to the function. Fixes: 56d8a385b605 ("RISC-V: KVM: Allow some SBI extensions to be disabled by default") Signed-off-by: Björn Töpel <bjorn@rivosinc.com> Reviewed-by: Anup Patel <anup@brainfault.org> Link: https://lore.kernel.org/r/20241104191503.74725-1-bjorn@kernel.org Signed-off-by: Anup Patel <anup@brainfault.org>
2024-11-05RISC-V: KVM: Fix APLIC in_clrip and clripnum write emulationYong-Xuan Wang
In the section "4.7 Precise effects on interrupt-pending bits" of the RISC-V AIA specification defines that: "If the source mode is Level1 or Level0 and the interrupt domain is configured in MSI delivery mode (domaincfg.DM = 1): The pending bit is cleared whenever the rectified input value is low, when the interrupt is forwarded by MSI, or by a relevant write to an in_clrip register or to clripnum." Update the aplic_write_pending() to match the spec. Fixes: d8dd9f113e16 ("RISC-V: KVM: Fix APLIC setipnum_le/be write emulation") Signed-off-by: Yong-Xuan Wang <yongxuan.wang@sifive.com> Reviewed-by: Vincent Chen <vincent.chen@sifive.com> Reviewed-by: Anup Patel <anup@brainfault.org> Link: https://lore.kernel.org/r/20241029085542.30541-1-yongxuan.wang@sifive.com Signed-off-by: Anup Patel <anup@brainfault.org>
2024-11-04KVM: SVM: Propagate error from snp_guest_req_init() to userspaceSean Christopherson
If snp_guest_req_init() fails, return the provided error code up the stack to userspace, e.g. so that userspace can log that KVM_SEV_INIT2 failed, as opposed to some random operation later in VM setup failing because SNP wasn't actually enabled for the VM. Note, KVM itself doesn't consult the return value from __sev_guest_init(), i.e. the fallout is purely that userspace may be confused. Fixes: 88caf544c930 ("KVM: SEV: Provide support for SNP_GUEST_REQUEST NAE event") Reported-by: kernel test robot <lkp@intel.com> Reported-by: Dan Carpenter <dan.carpenter@linaro.org> Closes: https://lore.kernel.org/r/202410192220.MeTyHPxI-lkp@intel.com Link: https://lore.kernel.org/r/20241031203214.1585751-1-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-11-04KVM: nVMX: Treat vpid01 as current if L2 is active, but with VPID disabledSean Christopherson
When getting the current VPID, e.g. to emulate a guest TLB flush, return vpid01 if L2 is running but with VPID disabled, i.e. if VPID is disabled in vmcs12. Architecturally, if VPID is disabled, then the guest and host effectively share VPID=0. KVM emulates this behavior by using vpid01 when running an L2 with VPID disabled (see prepare_vmcs02_early_rare()), and so KVM must also treat vpid01 as the current VPID while L2 is active. Unconditionally treating vpid02 as the current VPID when L2 is active causes KVM to flush TLB entries for vpid02 instead of vpid01, which results in TLB entries from L1 being incorrectly preserved across nested VM-Enter to L2 (L2=>L1 isn't problematic, because the TLB flush after nested VM-Exit flushes vpid01). The bug manifests as failures in the vmx_apicv_test KVM-Unit-Test, as KVM incorrectly retains TLB entries for the APIC-access page across a nested VM-Enter. Opportunisticaly add comments at various touchpoints to explain the architectural requirements, and also why KVM uses vpid01 instead of vpid02. All credit goes to Chao, who root caused the issue and identified the fix. Link: https://lore.kernel.org/all/ZwzczkIlYGX+QXJz@intel.com Fixes: 2b4a5a5d5688 ("KVM: nVMX: Flush current VPID (L1 vs. L2) for KVM_REQ_TLB_FLUSH_GUEST") Cc: stable@vger.kernel.org Cc: Like Xu <like.xu.linux@gmail.com> Debugged-by: Chao Gao <chao.gao@intel.com> Reviewed-by: Chao Gao <chao.gao@intel.com> Tested-by: Chao Gao <chao.gao@intel.com> Link: https://lore.kernel.org/r/20241031202011.1580522-1-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-11-04KVM: selftests: Don't force -march=x86-64-v2 if it's unsupportedSean Christopherson
Force -march=x86-64-v2 to avoid SSE/AVX instructions if and only if the uarch definition is supported by the compiler, e.g. gcc 7.5 only supports x86-64. Fixes: 9a400068a158 ("KVM: selftests: x86: Avoid using SSE/AVX instructions") Cc: Vitaly Kuznetsov <vkuznets@redhat.com> Reviewed-and-tested-by: Vitaly Kuznetsov <vkuznets@redhat.com> Link: https://lore.kernel.org/r/20241031045333.1209195-1-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-11-04KVM: selftests: Disable strict aliasingSean Christopherson
Disable strict aliasing, as has been done in the kernel proper for decades (literally since before git history) to fix issues where gcc will optimize away loads in code that looks 100% correct, but is _technically_ undefined behavior, and thus can be thrown away by the compiler. E.g. arm64's vPMU counter access test casts a uint64_t (unsigned long) pointer to a u64 (unsigned long long) pointer when setting PMCR.N via u64p_replace_bits(), which gcc-13 detects and optimizes away, i.e. ignores the result and uses the original PMCR. The issue is most easily observed by making set_pmcr_n() noinline and wrapping the call with printf(), e.g. sans comments, for this code: printf("orig = %lx, next = %lx, want = %lu\n", pmcr_orig, pmcr, pmcr_n); set_pmcr_n(&pmcr, pmcr_n); printf("orig = %lx, next = %lx, want = %lu\n", pmcr_orig, pmcr, pmcr_n); gcc-13 generates: 0000000000401c90 <set_pmcr_n>: 401c90: f9400002 ldr x2, [x0] 401c94: b3751022 bfi x2, x1, #11, #5 401c98: f9000002 str x2, [x0] 401c9c: d65f03c0 ret 0000000000402660 <test_create_vpmu_vm_with_pmcr_n>: 402724: aa1403e3 mov x3, x20 402728: aa1503e2 mov x2, x21 40272c: aa1603e0 mov x0, x22 402730: aa1503e1 mov x1, x21 402734: 940060ff bl 41ab30 <_IO_printf> 402738: aa1403e1 mov x1, x20 40273c: 910183e0 add x0, sp, #0x60 402740: 97fffd54 bl 401c90 <set_pmcr_n> 402744: aa1403e3 mov x3, x20 402748: aa1503e2 mov x2, x21 40274c: aa1503e1 mov x1, x21 402750: aa1603e0 mov x0, x22 402754: 940060f7 bl 41ab30 <_IO_printf> with the value stored in [sp + 0x60] ignored by both printf() above and in the test proper, resulting in a false failure due to vcpu_set_reg() simply storing the original value, not the intended value. $ ./vpmu_counter_access Random seed: 0x6b8b4567 orig = 3040, next = 3040, want = 0 orig = 3040, next = 3040, want = 0 ==== Test Assertion Failure ==== aarch64/vpmu_counter_access.c:505: pmcr_n == get_pmcr_n(pmcr) pid=71578 tid=71578 errno=9 - Bad file descriptor 1 0x400673: run_access_test at vpmu_counter_access.c:522 2 (inlined by) main at vpmu_counter_access.c:643 3 0x4132d7: __libc_start_call_main at libc-start.o:0 4 0x413653: __libc_start_main at ??:0 5 0x40106f: _start at ??:0 Failed to update PMCR.N to 0 (received: 6) Somewhat bizarrely, gcc-11 also exhibits the same behavior, but only if set_pmcr_n() is marked noinline, whereas gcc-13 fails even if set_pmcr_n() is inlined in its sole caller. Cc: stable@vger.kernel.org Link: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116912 Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-11-04KVM: selftests: fix unintentional noop test in guest_memfd_test.cPatrick Roy
The loop in test_create_guest_memfd_invalid() that is supposed to test that nothing is accepted as a valid flag to KVM_CREATE_GUEST_MEMFD was initializing `flag` as 0 instead of BIT(0). This caused the loop to immediately exit instead of iterating over BIT(0), BIT(1), ... . Fixes: 8a89efd43423 ("KVM: selftests: Add basic selftest for guest_memfd()") Signed-off-by: Patrick Roy <roypat@amazon.co.uk> Reviewed-by: James Gowans <jgowans@amazon.com> Reviewed-by: Muhammad Usama Anjum <usama.anjum@collabora.com> Link: https://lore.kernel.org/r/20241024095956.3668818-1-roypat@amazon.co.uk Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-11-04KVM: selftests: memslot_perf_test: increase guest sync timeoutMaxim Levitsky
When memslot_perf_test is run nested, first iteration of test_memslot_rw_loop testcase, sometimes takes more than 2 seconds due to build of shadow page tables. Following iterations are fast. To be on the safe side, bump the timeout to 10 seconds. Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com> Tested-by: Liam Merwick <liam.merwick@oracle.com> Reviewed-by: Liam Merwick <liam.merwick@oracle.com> Link: https://lore.kernel.org/r/20241004220153.287459-1-mlevitsk@redhat.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-11-04KVM: x86: Short-circuit all of kvm_apic_set_base() if MSR value is unchangedSean Christopherson
Do nothing in all of kvm_apic_set_base(), not just __kvm_apic_set_base(), if the incoming MSR value is the same as the current value. Validating the mode transitions is obviously unnecessary, and rejecting the write is pointless if the vCPU already has an invalid value, e.g. if userspace is doing weird things and modified guest CPUID after setting MSR_IA32_APICBASE. Bailing early avoids kvm_recalculate_apic_map()'s slow path in the rare scenario where the map is DIRTY due to some other vCPU dirtying the map, in which case it's the other vCPU/task's responsibility to recalculate the map. Note, kvm_lapic_reset() calls __kvm_apic_set_base() only when emulating RESET, in which case the old value is guaranteed to be zero, and the new value is guaranteed to be non-zero. I.e. all callers of __kvm_apic_set_base() effectively pre-check for the MSR value actually changing. Don't bother keeping the check in __kvm_apic_set_base(), as no additional callers are expected, and implying that the MSR might already be non-zero at the time of kvm_lapic_reset() could confuse readers. Link: https://lore.kernel.org/r/20241101183555.1794700-10-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-11-04KVM: x86: Unpack msr_data structure prior to calling kvm_apic_set_base()Sean Christopherson
Pass in the new value and "host initiated" as separate parameters to kvm_apic_set_base(), as forcing the KVM_SET_SREGS path to declare and fill an msr_data structure is awkward and kludgy, e.g. __set_sregs_common() doesn't even bother to set the proper MSR index. No functional change intended. Suggested-by: Kai Huang <kai.huang@intel.com> Reviewed-by: Kai Huang <kai.huang@intel.com> Link: https://lore.kernel.org/r/20241101183555.1794700-9-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-11-04KVM: x86: Make kvm_recalculate_apic_map() local to lapic.cSean Christopherson
Make kvm_recalculate_apic_map() local to lapic.c now that all external callers are gone. No functional change intended. Reviewed-by: Kai Huang <kai.huang@intel.com> Reviewed-by: Paolo Bonzini <pbonzini@redhat.com> Link: https://lore.kernel.org/r/20241009181742.1128779-8-seanjc@google.com Link: https://lore.kernel.org/r/20241101183555.1794700-8-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-11-04KVM: x86: Rename APIC base setters to better capture their relationshipSean Christopherson
Rename kvm_set_apic_base() and kvm_lapic_set_base() to kvm_apic_set_base() and __kvm_apic_set_base() respectively to capture that the underscores version is a "special" variant (it exists purely to avoid recalculating the optimized map multiple times when stuffing the RESET value). Opportunistically add a comment explaining why kvm_lapic_reset() uses the inner helper. Note, KVM deliberately invokes kvm_arch_vcpu_create() while kvm->lock is NOT held so that vCPU setup isn't serialized if userspace is creating multiple/all vCPUs in parallel. I.e. triggering an extra recalculation is not limited to theoretical/rare edge cases, and so is worth avoiding. No functional change intended. Reviewed-by: Kai Huang <kai.huang@intel.com> Reviewed-by: Paolo Bonzini <pbonzini@redhat.com> Link: https://lore.kernel.org/r/20241009181742.1128779-7-seanjc@google.com Link: https://lore.kernel.org/r/20241101183555.1794700-7-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-11-04KVM: x86: Move kvm_set_apic_base() implementation to lapic.c (from x86.c)Sean Christopherson
Move kvm_set_apic_base() to lapic.c so that the bulk of KVM's local APIC code resides in lapic.c, regardless of whether or not KVM is emulating the local APIC in-kernel. This will also allow making various helpers visible only to lapic.c. No functional change intended. Reviewed-by: Kai Huang <kai.huang@intel.com> Reviewed-by: Paolo Bonzini <pbonzini@redhat.com> Link: https://lore.kernel.org/r/20241009181742.1128779-6-seanjc@google.com Link: https://lore.kernel.org/r/20241101183555.1794700-6-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-11-04KVM: x86: Inline kvm_get_apic_mode() in lapic.hSean Christopherson
Inline kvm_get_apic_mode() in lapic.h to avoid a CALL+RET as well as an export. The underlying kvm_apic_mode() helper is public information, i.e. there is no state/information that needs to be hidden from vendor modules. No functional change intended. Reviewed-by: Kai Huang <kai.huang@intel.com> Reviewed-by: Paolo Bonzini <pbonzini@redhat.com> Link: https://lore.kernel.org/r/20241009181742.1128779-5-seanjc@google.com Link: https://lore.kernel.org/r/20241101183555.1794700-5-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-11-04KVM: x86: Get vcpu->arch.apic_base directly and drop kvm_get_apic_base()Sean Christopherson
Access KVM's emulated APIC base MSR value directly instead of bouncing through a helper, as there is no reason to add a layer of indirection, and there are other MSRs with a "set" but no "get", e.g. EFER. No functional change intended. Reviewed-by: Kai Huang <kai.huang@intel.com> Reviewed-by: Paolo Bonzini <pbonzini@redhat.com> Link: https://lore.kernel.org/r/20241009181742.1128779-4-seanjc@google.com Link: https://lore.kernel.org/r/20241101183555.1794700-4-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-11-04KVM: x86: Drop superfluous kvm_lapic_set_base() call when setting APIC stateSean Christopherson
Now that kvm_lapic_set_base() does nothing if the "new" APIC base MSR is the same as the current value, drop the kvm_lapic_set_base() call in the KVM_SET_LAPIC flow that passes in the current value, as it too does nothing. Note, the purpose of invoking kvm_lapic_set_base() was purely to set apic->base_address (see commit 5dbc8f3fed0b ("KVM: use kvm_lapic_set_base() to change apic_base")). And there is no evidence that explicitly setting apic->base_address in KVM_SET_LAPIC ever had any functional impact; even in the original commit 96ad2cc61324 ("KVM: in-kernel LAPIC save and restore support"), all flows that set apic_base also set apic->base_address to the same address. E.g. svm_create_vcpu() did open code a write to apic_base, svm->vcpu.apic_base = 0xfee00000 | MSR_IA32_APICBASE_ENABLE; but it also called kvm_create_lapic() when irqchip_in_kernel() is true. Reviewed-by: Kai Huang <kai.huang@intel.com> Reviewed-by: Paolo Bonzini <pbonzini@redhat.com> Link: https://lore.kernel.org/r/20241009181742.1128779-3-seanjc@google.com Link: https://lore.kernel.org/r/20241101183555.1794700-3-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-11-04KVM: x86: Short-circuit all kvm_lapic_set_base() if MSR value isn't changingSean Christopherson
Do nothing in kvm_lapic_set_base() if the APIC base MSR value is the same as the current value. All flows except the handling of the base address explicitly take effect if and only if relevant bits are changing. For the base address, invoking kvm_lapic_set_base() before KVM initializes the base to APIC_DEFAULT_PHYS_BASE during vCPU RESET would be a KVM bug, i.e. KVM _must_ initialize apic->base_address before exposing the vCPU (to userspace or KVM at-large). Note, the inhibit is intended to be set if the base address is _changed_ from the default, i.e. is also covered by the RESET behavior. Reviewed-by: Kai Huang <kai.huang@intel.com> Reviewed-by: Paolo Bonzini <pbonzini@redhat.com> Link: https://lore.kernel.org/r/20241009181742.1128779-2-seanjc@google.com Link: https://lore.kernel.org/r/20241101183555.1794700-2-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-11-04KVM: x86/mmu: Drop per-VM zapped_obsolete_pages listVipin Sharma
Drop the per-VM zapped_obsolete_pages list now that the usage from the defunct mmu_shrinker is gone, and instead use a local list to track pages in kvm_zap_obsolete_pages(), the sole remaining user of zapped_obsolete_pages. Opportunistically add an assertion to verify and document that slots_lock must be held, i.e. that there can only be one active instance of kvm_zap_obsolete_pages() at any given time, and by doing so also prove that using a local list instead of a per-VM list doesn't change any functionality (beyond trivialities like list initialization). Signed-off-by: Vipin Sharma <vipinsh@google.com> Link: https://lore.kernel.org/r/20241101201437.1604321-2-vipinsh@google.com [sean: split to separate patch, write changelog] Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-11-04KVM: x86/mmu: Remove KVM's MMU shrinkerVipin Sharma
Remove KVM's MMU shrinker and (almost) all of its related code, as the current implementation is very disruptive to VMs (if it ever runs), without providing any meaningful benefit[1]. Alternatively, KVM could repurpose its shrinker, e.g. to reclaim pages from the per-vCPU caches[2], but given that no one has complained about lack of TDP MMU support for the shrinker in the 3+ years since the TDP MMU was enabled by default, it's safe to say that there is likely no real use case for initiating reclaim of KVM's page tables from the shrinker. And while clever/cute, reclaiming the per-vCPU caches doesn't scale the same way that reclaiming in-use page table pages does. E.g. the amount of memory being used by a VM doesn't always directly correlate with the number vCPUs, and even when it does, reclaiming a few pages from per-vCPU caches likely won't make much of a dent in the VM's total memory usage, especially for VMs with huge amounts of memory. Lastly, if it turns out that there is a strong use case for dropping the per-vCPU caches, re-introducing the shrinker registration is trivial compared to the complexity of actually reclaiming pages from the caches. [1] https://lore.kernel.org/lkml/Y45dldZnI6OIf+a5@google.com [2] https://lore.kernel.org/kvm/20241004195540.210396-3-vipinsh@google.com Suggested-by: Sean Christopherson <seanjc@google.com> Suggested-by: David Matlack <dmatlack@google.com> Signed-off-by: Vipin Sharma <vipinsh@google.com> Link: https://lore.kernel.org/r/20241101201437.1604321-2-vipinsh@google.com [sean: keep zapped_obsolete_pages for now, massage changelog] Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-11-04KVM: x86/mmu: WARN if huge page recovery triggered during dirty loggingDavid Matlack
WARN and bail out of recover_huge_pages_range() if dirty logging is enabled. KVM shouldn't be recovering huge pages during dirty logging anyway, since KVM needs to track writes at 4KiB. However it's not out of the possibility that that changes in the future. If KVM wants to recover huge pages during dirty logging, make_huge_spte() must be updated to write-protect the new huge page mapping. Otherwise, writes through the newly recovered huge page mapping will not be tracked. Note that this potential risk did not exist back when KVM zapped to recover huge page mappings, since subsequent accesses would just be faulted in at PG_LEVEL_4K if dirty logging was enabled. Signed-off-by: David Matlack <dmatlack@google.com> Link: https://lore.kernel.org/r/20240823235648.3236880-7-dmatlack@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-11-04KVM: x86/mmu: Rename make_huge_page_split_spte() to make_small_spte()David Matlack
Rename make_huge_page_split_spte() to make_small_spte(). This ensures that the usage of "small_spte" and "huge_spte" are consistent between make_huge_spte() and make_small_spte(). This should also reduce some confusion as make_huge_page_split_spte() almost reads like it will create a huge SPTE, when in fact it is creating a small SPTE to split the huge SPTE. No functional change intended. Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: David Matlack <dmatlack@google.com> Link: https://lore.kernel.org/r/20240823235648.3236880-6-dmatlack@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-11-04KVM: x86/mmu: Recover TDP MMU huge page mappings in-place instead of zappingDavid Matlack
Recover TDP MMU huge page mappings in-place instead of zapping them when dirty logging is disabled, and rename functions that recover huge page mappings when dirty logging is disabled to move away from the "zap collapsible spte" terminology. Before KVM flushes TLBs, guest accesses may be translated through either the (stale) small SPTE or the (new) huge SPTE. This is already possible when KVM is doing eager page splitting (where TLB flushes are also batched), and when vCPUs are faulting in huge mappings (where TLBs are flushed after the new huge SPTE is installed). Recovering huge pages reduces the number of page faults when dirty logging is disabled: $ perf stat -e kvm:kvm_page_fault -- ./dirty_log_perf_test -s anonymous_hugetlb_2mb -v 64 -e -b 4g Before: 393,599 kvm:kvm_page_fault After: 262,575 kvm:kvm_page_fault vCPU throughput and the latency of disabling dirty-logging are about equal compared to zapping, but avoiding faults can be beneficial to remove vCPU jitter in extreme scenarios. Signed-off-by: David Matlack <dmatlack@google.com> Link: https://lore.kernel.org/r/20240823235648.3236880-5-dmatlack@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-11-04KVM: x86/mmu: Refactor TDP MMU iter need resched checkSean Christopherson
Refactor the TDP MMU iterator "need resched" checks into a helper function so they can be called from a different code path in a subsequent commit. No functional change intended. Signed-off-by: David Matlack <dmatlack@google.com> Link: https://lore.kernel.org/r/20240823235648.3236880-4-dmatlack@google.com [sean: rebase on a swapped order of checks] Signed-off-by: Sean Christopherson <seanjc@google.com>
2024-11-04KVM: x86/mmu: Demote the WARN on yielded in xxx_cond_resched() to ↵Sean Christopherson
KVM_MMU_WARN_ON Convert the WARN in tdp_mmu_iter_cond_resched() that the iterator hasn't already yielded to a KVM_MMU_WARN_ON() so the code is compiled out for production kernels (assuming production kernels disable KVM_PROVE_MMU). Checking for a needed reschedule is a hot path, and KVM sanity checks iter->yielded in several other less-hot paths, i.e. the odds of KVM not flagging that something went sideways are quite low. Furthermore, the odds of KVM not noticing *and* the WARN detecting something worth investigating are even lower. Link: https://lore.kernel.org/r/20241031170633.1502783-3-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>