linux.git - Linus' kernel tree

Age	Commit message	Author
2024-10-25	KVM: x86/mmu: Drop kvm_page_fault.hva, i.e. don't track intermediate hva	Sean Christopherson
	Remove kvm_page_fault.hva as it is never read, only written. This will allow removing the @hva param from __gfn_to_pfn_memslot(). Tested-by: Alex Bennée <alex.bennee@linaro.org> Signed-off-by: Sean Christopherson <seanjc@google.com> Tested-by: Dmitry Osipenko <dmitry.osipenko@collabora.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-ID: <20241010182427.1434605-18-seanjc@google.com>
2024-10-25	KVM: Replace "async" pointer in gfn=>pfn with "no_wait" and error code	David Stevens
	Add a pfn error code to communicate that hva_to_pfn() failed because I/O was needed and disallowed, and convert @async to a constant @no_wait boolean. This will allow eliminating the @no_wait param by having callers pass in FOLL_NOWAIT along with other FOLL_* flags. Tested-by: Alex Bennée <alex.bennee@linaro.org> Signed-off-by: David Stevens <stevensd@chromium.org> Co-developed-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Tested-by: Dmitry Osipenko <dmitry.osipenko@collabora.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-ID: <20241010182427.1434605-17-seanjc@google.com>
2024-10-25	KVM: Drop extra GUP (via check_user_page_hwpoison()) to detect poisoned page	Sean Christopherson
	Remove check_user_page_hwpoison() as it's effectively dead code. Prior to commit 234b239bea39 ("kvm: Faults which trigger IO release the mmap_sem"), hva_to_pfn_slow() wasn't actually a slow path in all cases, i.e. would do get_user_pages_fast() without ever doing slow GUP with FOLL_HWPOISON. Now that hva_to_pfn_slow() is a straight shot to get_user_pages_unlocked(), and unconditionally passes FOLL_HWPOISON, it is impossible for hva_to_pfn() to get an -errno that needs to be morphed to -EHWPOISON. There are essentially four cases in KVM: - npages == 0, then FOLL_NOWAIT, a.k.a. @async, must be true, and thus check_user_page_hwpoison() will not be called - npages == 1 || npages == -EHWPOISON, all good - npages == -EINTR || npages == -EAGAIN, bail early, all good - everything else, including -EFAULT, can go down the vma_lookup() path, as npages < 0 means KVM went through hva_to_pfn_slow() which passes FOLL_HWPOISON Suggested-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Tested-by: Dmitry Osipenko <dmitry.osipenko@collabora.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-ID: <20241010182427.1434605-16-seanjc@google.com>
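The four cases listed in this changelog can be sketched as a small classifier. This is only an illustration of the case analysis, not KVM's code: the enum and the function name classify_npages() are hypothetical, and the fallback definition of EHWPOISON is an assumption for platforms whose <errno.h> does not provide it.

```c
#include <errno.h>

/* Assumed fallback for platforms whose <errno.h> lacks this Linux value. */
#ifndef EHWPOISON
#define EHWPOISON 133
#endif

/* Hypothetical outcome names, used only for this illustration. */
enum gup_outcome {
	GUP_NEEDS_IO,     /* npages == 0: only possible with FOLL_NOWAIT */
	GUP_DONE,         /* npages == 1 or -EHWPOISON: nothing to morph */
	GUP_SIG_PENDING,  /* -EINTR or -EAGAIN: bail early */
	GUP_VMA_LOOKUP,   /* everything else, e.g. -EFAULT */
};

/* Map the return value of a single-page GUP call onto the four cases. */
static enum gup_outcome classify_npages(long npages)
{
	if (npages == 0)
		return GUP_NEEDS_IO;
	if (npages == 1 || npages == -EHWPOISON)
		return GUP_DONE;
	if (npages == -EINTR || npages == -EAGAIN)
		return GUP_SIG_PENDING;
	return GUP_VMA_LOOKUP;
}
```

None of the four branches needs a second GUP call to detect a poisoned page, which is the point the changelog makes.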
2024-10-25	KVM: Return ERR_SIGPENDING from hva_to_pfn() if GUP returns -EAGAIN	Sean Christopherson
	Treat an -EAGAIN return from GUP the same as -EINTR and immediately report to the caller that a signal is pending. GUP only returns -EAGAIN if the _initial_ mmap_read_lock_killable() fails, which in turn only fails if a signal is pending. Note, rwsem_down_read_slowpath() actually returns -EINTR, so GUP is really just making life harder than it needs to be. And the call to mmap_read_lock_killable() in the retry path returns its -errno verbatim, i.e. GUP (and thus KVM) is already handling locking failure this way, but only some of the time. Suggested-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Tested-by: Dmitry Osipenko <dmitry.osipenko@collabora.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-ID: <20241010182427.1434605-15-seanjc@google.com>
2024-10-25	KVM: Annotate that all paths in hva_to_pfn() might sleep	Sean Christopherson
	Now that hva_to_pfn() no longer supports being called in atomic context, move the might_sleep() annotation from hva_to_pfn_slow() to hva_to_pfn(). Reviewed-by: Alex Bennée <alex.bennee@linaro.org> Tested-by: Alex Bennée <alex.bennee@linaro.org> Signed-off-by: Sean Christopherson <seanjc@google.com> Tested-by: Dmitry Osipenko <dmitry.osipenko@collabora.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-ID: <20241010182427.1434605-14-seanjc@google.com>
2024-10-25	KVM: Drop @atomic param from gfn=>pfn and hva=>pfn APIs	Sean Christopherson
	Drop @atomic from the myriad "to_pfn" APIs now that all callers pass "false", and remove a comment blurb about KVM running only the "GUP fast" part in atomic context. No functional change intended. Reviewed-by: Alex Bennée <alex.bennee@linaro.org> Tested-by: Alex Bennée <alex.bennee@linaro.org> Signed-off-by: Sean Christopherson <seanjc@google.com> Tested-by: Dmitry Osipenko <dmitry.osipenko@collabora.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-ID: <20241010182427.1434605-13-seanjc@google.com>
2024-10-25	KVM: Rename gfn_to_page_many_atomic() to kvm_prefetch_pages()	Sean Christopherson
	Rename gfn_to_page_many_atomic() to kvm_prefetch_pages() to try and communicate its true purpose, as the "atomic" aspect is essentially a side effect of the fact that x86 uses the API while holding mmu_lock. E.g. even if mmu_lock weren't held, KVM wouldn't want to fault-in pages, as the goal is to opportunistically grab surrounding pages that have already been accessed and/or dirtied by the host, and to do so quickly. Tested-by: Alex Bennée <alex.bennee@linaro.org> Signed-off-by: Sean Christopherson <seanjc@google.com> Tested-by: Dmitry Osipenko <dmitry.osipenko@collabora.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-ID: <20241010182427.1434605-12-seanjc@google.com>
2024-10-25	KVM: x86/mmu: Use gfn_to_page_many_atomic() when prefetching indirect PTEs	Sean Christopherson
	Use gfn_to_page_many_atomic() instead of gfn_to_pfn_memslot_atomic() when prefetching indirect PTEs (direct_pte_prefetch_many() already uses the "to page" APIs). Functionally, the two are subtly equivalent, as the "to pfn" API short-circuits hva_to_pfn() if hva_to_pfn_fast() fails, i.e. is just a wrapper for get_user_page_fast_only()/get_user_pages_fast_only(). Switching to the "to page" API will allow dropping the @atomic parameter from the entire hva_to_pfn() callchain. Tested-by: Alex Bennée <alex.bennee@linaro.org> Signed-off-by: Sean Christopherson <seanjc@google.com> Tested-by: Dmitry Osipenko <dmitry.osipenko@collabora.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-ID: <20241010182427.1434605-11-seanjc@google.com>
2024-10-25	KVM: x86/mmu: Mark page/folio accessed only when zapping leaf SPTEs	Sean Christopherson
	Now that KVM doesn't clobber Accessed bits of shadow-present SPTEs, e.g. when prefetching, mark folios as accessed only when zapping leaf SPTEs, which is a rough heuristic for "only in response to an mmu_notifier invalidation". Page aging and LRUs are tolerant of false negatives, i.e. KVM doesn't need to be precise for correctness, and re-marking folios as accessed when zapping entire roots or when zapping collapsible SPTEs is expensive and adds very little value. E.g. when a VM is dying, all of its memory is being freed; marking folios accessed at that time provides no known value. Similarly, because KVM marks folios as accessed when creating SPTEs, marking all folios as accessed when userspace happens to delete a memslot doesn't add value. The folio was marked accessed when the old SPTE was created, and will be marked accessed yet again if a vCPU accesses the pfn again after reloading a new root. Zapping collapsible SPTEs is a similar story; marking folios accessed just because userspace disables dirty logging is a side effect of KVM behavior, not a deliberate goal. As an intermediate step, a.k.a. bisection point, towards never marking folios accessed when dropping SPTEs, mark folios accessed when the primary MMU might be invalidating mappings, as such zappings are not KVM initiated, i.e. might actually be related to page aging and LRU activity. Note, x86 is the only KVM architecture that "double dips"; every other arch marks pfns as accessed only when mapping into the guest, not when mapping into the guest _and_ when removing from the guest. Tested-by: Alex Bennée <alex.bennee@linaro.org> Signed-off-by: Sean Christopherson <seanjc@google.com> Tested-by: Dmitry Osipenko <dmitry.osipenko@collabora.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-ID: <20241010182427.1434605-10-seanjc@google.com>
2024-10-25	KVM: x86/mmu: Mark folio dirty when creating SPTE, not when zapping/modifying	Sean Christopherson
	Mark pages/folios dirty when creating SPTEs to map PFNs into the guest, not when zapping or modifying SPTEs, as marking folios dirty when zapping or modifying SPTEs can be extremely inefficient. E.g. when KVM is zapping collapsible SPTEs to reconstitute a hugepage after disabling dirty logging, KVM will mark every 4KiB pfn as dirty, even though _at least_ 512 pfns are guaranteed to be in a single folio (the SPTE couldn't potentially be huge if that weren't the case). The problem only becomes worse for 1GiB HugeTLB pages, as KVM can mark a single folio dirty 512*512 times. Marking a folio dirty when mapping is functionally safe as KVM drops all relevant SPTEs in response to an mmu_notifier invalidation, i.e. ensures that the guest can't dirty a folio after access has been removed. And because KVM already marks folios dirty when zapping/modifying SPTEs for KVM reasons, i.e. not in response to an mmu_notifier invalidation, there is no danger of "prematurely" marking a folio dirty. E.g. if a filesystem cleans a folio without first removing write access, then there already exist races where KVM could mark a folio dirty before remote TLBs are flushed, i.e. before guest writes are guaranteed to stop. Furthermore, x86 is literally the only architecture that marks folios dirty on the backend; every other KVM architecture marks folios dirty at map time. x86's unique behavior likely stems from the fact that x86's MMU predates mmu_notifiers. Long, long ago, before mmu_notifiers were added, marking pages dirty when zapping SPTEs was logical, and perhaps even necessary, as KVM held references to pages, i.e. kept a page's refcount elevated while the page was mapped into the guest. At the time, KVM's rmap_remove() simply did: if (is_writeble_pte(spte)) kvm_release_pfn_dirty(pfn); else kvm_release_pfn_clean(pfn); i.e. dropped the refcount and marked the page dirty at the same time. 
After mmu_notifiers were introduced, commit acb66dd051d0 ("KVM: MMU: don't hold pagecount reference for mapped sptes pages") removed the refcount logic, but kept the dirty logic, i.e. converted the above to: if (is_writeble_pte(spte)) kvm_release_pfn_dirty(pfn); And for KVM x86, that's essentially how things have stayed over the last ~15 years, without anyone revisiting *why* KVM marks pages/folios dirty at zap/modification time, e.g. the behavior was blindly carried forward to the TDP MMU. Practically speaking, the only downside to marking a folio dirty during mapping is that KVM could trigger writeback of memory that was never actually written. Except that can't actually happen if KVM marks folios dirty if and only if a writable SPTE is created (as done here), because KVM always marks writable SPTEs as dirty during make_spte(). See commit 9b51a63024bd ("KVM: MMU: Explicitly set D-bit for writable spte."), circa 2015. Note, KVM's access tracking logic for prefetched SPTEs is a bit odd. If a guest PTE is dirty and writable, KVM will create a writable SPTE, but then mark the SPTE for access tracking. Which isn't wrong, just a bit odd, as it results in _more_ precise dirty tracking for MMUs _without_ A/D bits. To keep things simple, mark the folio dirty before access tracking comes into play, as an access-tracked SPTE can be restored in the fast page fault path, i.e. without holding mmu_lock. While writing SPTEs and accessing memslots outside of mmu_lock is safe, marking a folio dirty is not. E.g. if the fast path gets interrupted _just_ after setting a SPTE, the primary MMU could theoretically invalidate and free a folio before KVM marks it dirty. Unlike the shadow MMU, which waits for CPUs to respond to an IPI, the TDP MMU only guarantees the page tables themselves won't be freed (via RCU). Opportunistically update a few stale comments. 
Cc: David Matlack <dmatlack@google.com> Tested-by: Alex Bennée <alex.bennee@linaro.org> Signed-off-by: Sean Christopherson <seanjc@google.com> Tested-by: Dmitry Osipenko <dmitry.osipenko@collabora.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-ID: <20241010182427.1434605-9-seanjc@google.com>
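The per-pfn cost this changelog describes follows from simple arithmetic: a 2MiB mapping spans 512 pfns of 4KiB each, and a 1GiB mapping spans 512 * 512 = 262144 of them. A minimal sketch of that calculation, where the macro names and pfns_per_mapping() are hypothetical helpers and not kernel symbols:

```c
#include <stdint.h>

/* Illustrative size constants; names chosen for this sketch only. */
#define SKETCH_SZ_4K (UINT64_C(4) << 10)
#define SKETCH_SZ_2M (UINT64_C(2) << 20)
#define SKETCH_SZ_1G (UINT64_C(1) << 30)

/* Number of 4KiB pfns covered by one mapping of the given size. */
static uint64_t pfns_per_mapping(uint64_t mapping_bytes)
{
	return mapping_bytes / SKETCH_SZ_4K;
}
```

Marking the folio dirty once at map time avoids repeating the operation for every one of those pfns when the SPTEs are later zapped.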
2024-10-25	KVM: x86/mmu: Mark new SPTE as Accessed when synchronizing existing SPTE	Sean Christopherson
	Set the Accessed bit when making a "new" SPTE during SPTE synchronization, as _clearing_ the Accessed bit is counter-productive, and even if the Accessed bit wasn't set in the old SPTE, odds are very good the guest will access the page in the near future, as the most common case where KVM synchronizes a shadow-present SPTE is when the guest is making the gPTE read-only for Copy-on-Write (CoW). Preserving the Accessed bit will allow dropping the logic that propagates the Accessed bit to the underlying struct page when overwriting an existing SPTE, without undue risk of regressing page aging. Note, KVM's current behavior is very deliberate, as SPTE synchronization was the only "speculative" access type as of commit 947da5383069 ("KVM: MMU: Set the accessed bit on non-speculative shadow ptes"). But, much has changed since 2008, and more changes are on the horizon. Spurious clearing of the Accessed (and Dirty) was mitigated by commit e6722d9211b2 ("KVM: x86/mmu: Reduce the update to the spte in FNAME(sync_spte)"), which changed FNAME(sync_spte) to only overwrite SPTEs if the protections are actually changing. I.e. KVM is already preserving Accessed information for SPTEs that aren't dropping protections. And with the aforementioned future change to NOT mark the page/folio as accessed, KVM's SPTEs will become the "source of truth" so to speak, in which case clearing the Accessed bit outside of page aging becomes very undesirable. Suggested-by: Yan Zhao <yan.y.zhao@intel.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Tested-by: Dmitry Osipenko <dmitry.osipenko@collabora.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-ID: <20241010182427.1434605-8-seanjc@google.com>
2024-10-25	KVM: x86/mmu: Invert @can_unsync and rename to @synchronizing	Sean Christopherson
	Invert the polarity of "can_unsync" and rename the parameter to "synchronizing" to allow a future change to set the Accessed bit if KVM is synchronizing an existing SPTE. Querying "can_unsync" in that case is nonsensical, as the fact that KVM can't unsync SPTEs doesn't provide any justification for setting the Accessed bit. Signed-off-by: Sean Christopherson <seanjc@google.com> Tested-by: Dmitry Osipenko <dmitry.osipenko@collabora.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-ID: <20241010182427.1434605-7-seanjc@google.com>
2024-10-25	KVM: x86/mmu: Don't overwrite shadow-present MMU SPTEs when prefaulting	Sean Christopherson
	Treat attempts to prefetch/prefault MMU SPTEs as spurious if there's an existing shadow-present SPTE, as overwriting a SPTE that may have been created by a "real" fault is at best confusing, and at worst potentially harmful. E.g. mmu_try_to_unsync_pages() doesn't unsync when prefetching, which creates a scenario where KVM could try to replace a Writable SPTE with a !Writable SPTE, as sp->unsync is checked prior to acquiring mmu_unsync_pages_lock. Note, this applies to three of the four flavors of "prefetch" in KVM: - KVM_PRE_FAULT_MEMORY - Async #PF (host or PV) - Prefetching The fourth flavor, SPTE synchronization, i.e. FNAME(sync_spte), _only_ overwrites shadow-present SPTEs when calling make_spte(). But SPTE synchronization specifically uses mmu_spte_update(), and so naturally avoids the @prefetch check in mmu_set_spte(). Signed-off-by: Sean Christopherson <seanjc@google.com> Tested-by: Dmitry Osipenko <dmitry.osipenko@collabora.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-ID: <20241010182427.1434605-6-seanjc@google.com>
2024-10-25	KVM: x86/mmu: Skip the "try unsync" path iff the old SPTE was a leaf SPTE	Sean Christopherson
	Apply make_spte()'s optimization to skip trying to unsync shadow pages if and only if the old SPTE was a leaf SPTE, as non-leaf SPTEs in direct MMUs are always writable, i.e. could trigger a false positive and incorrectly lead to KVM creating a SPTE without write-protecting or marking shadow pages unsync. This bug only affects the TDP MMU, as the shadow MMU only overwrites a shadow-present SPTE when synchronizing SPTEs (and only 4KiB SPTEs can be unsync). Specifically, mmu_set_spte() drops any non-leaf SPTEs before calling make_spte(), whereas the TDP MMU can do a direct replacement of a page table with the leaf SPTE. Opportunistically update the comment to explain why skipping the unsync stuff is safe, as opposed to simply saying "it's someone else's problem". Cc: stable@vger.kernel.org Tested-by: Alex Bennée <alex.bennee@linaro.org> Signed-off-by: Sean Christopherson <seanjc@google.com> Tested-by: Dmitry Osipenko <dmitry.osipenko@collabora.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-ID: <20241010182427.1434605-5-seanjc@google.com>
2024-10-25	KVM: Add kvm_release_page_unused() API to put pages that KVM never consumes	Sean Christopherson
	Add an API to release an unused page, i.e. to put a page without marking it accessed or dirty. The API will be used when KVM faults-in a page but bails before installing the guest mapping (and other similar flows). Reviewed-by: Alex Bennée <alex.bennee@linaro.org> Tested-by: Alex Bennée <alex.bennee@linaro.org> Signed-off-by: Sean Christopherson <seanjc@google.com> Tested-by: Dmitry Osipenko <dmitry.osipenko@collabora.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-ID: <20241010182427.1434605-4-seanjc@google.com>
2024-10-25	KVM: Allow calling kvm_release_page_{clean,dirty}() on a NULL page pointer	Sean Christopherson
	Allow passing a NULL @page to kvm_release_page_{clean,dirty}(), as there's no tangible benefit to forcing the callers to pre-check @page, and it ends up generating a lot of duplicate boilerplate code. Reviewed-by: Alex Bennée <alex.bennee@linaro.org> Tested-by: Alex Bennée <alex.bennee@linaro.org> Signed-off-by: Sean Christopherson <seanjc@google.com> Tested-by: Dmitry Osipenko <dmitry.osipenko@collabora.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-ID: <20241010182427.1434605-3-seanjc@google.com>
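The pattern this change adopts, a release helper that quietly accepts NULL the way free() does, can be sketched in userspace as follows. struct fake_page and both functions are hypothetical stand-ins and do not reflect the real struct page or the bodies of the KVM helpers.

```c
#include <stddef.h>

/* Hypothetical stand-in for struct page; tracks only what the sketch needs. */
struct fake_page {
	int refcount;
	int dirty;
};

/* Drop one reference; a NULL page is a no-op, so callers need no pre-check. */
static void release_page_clean(struct fake_page *page)
{
	if (!page)
		return;
	page->refcount--;
}

/* Mark the page dirty, then drop one reference; NULL is again a no-op. */
static void release_page_dirty(struct fake_page *page)
{
	if (!page)
		return;
	page->dirty = 1;
	release_page_clean(page);
}
```

With the check inside the helper, error paths can call the release function unconditionally instead of repeating an "if (page)" guard at every call site.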
2024-10-25	KVM: Drop KVM_ERR_PTR_BAD_PAGE and instead return NULL to indicate an error	Sean Christopherson
	Remove KVM_ERR_PTR_BAD_PAGE and instead return NULL, as "bad page" is just a leftover bit of weirdness from days of old when KVM stuffed a "bad" page into the guest instead of actually handling missing pages. See commit cea7bb21280e ("KVM: MMU: Make gfn_to_page() always safe"). Reviewed-by: Alex Bennée <alex.bennee@linaro.org> Tested-by: Alex Bennée <alex.bennee@linaro.org> Signed-off-by: Sean Christopherson <seanjc@google.com> Tested-by: Dmitry Osipenko <dmitry.osipenko@collabora.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-ID: <20241010182427.1434605-2-seanjc@google.com>
2024-10-20	Merge tag 'kvmarm-fixes-6.12-3' of git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD	Paolo Bonzini
	KVM/arm64 fixes for 6.12, take #3 - Stop wasting space in the HYP idmap, as we are dangerously close to the 4kB limit, and this has already exploded in -next - Fix another race in vgic_init() - Fix a UBSAN error when faking the cache topology with MTE enabled
2024-10-20	Merge tag 'kvmarm-fixes-6.12-2' of git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD	Paolo Bonzini
	KVM/arm64 fixes for 6.12, take #2 - Fix the guest view of the ID registers, making the relevant fields writable from userspace (affecting ID_AA64DFR0_EL1 and ID_AA64PFR1_EL1) - Correctly expose S1PIE to guests, fixing a regression introduced in 6.12-rc1 with the S1POE support - Fix the recycling of stage-2 shadow MMUs by tracking the context (are we allowed to block or not) as well as the recycling state - Address a couple of issues with the vgic when userspace misconfigures the emulation, resulting in various splats. Headaches courtesy of our Syzkaller friends
2024-10-20	RISCV: KVM: use raw_spinlock for critical section in imsic	Cyan Yang
	For the external interrupt updating procedure in imsic, there was a spinlock to protect it already. But since it should not be preempted in any case, we should use raw_spinlock instead to prevent any preemption when PREEMPT_RT is enabled. Signed-off-by: Cyan Yang <cyan.yang@sifive.com> Reviewed-by: Yong-Xuan Wang <yongxuan.wang@sifive.com> Reviewed-by: Anup Patel <anup@brainfault.org> Message-ID: <20240919160126.44487-1-cyan.yang@sifive.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-10-20	KVM: selftests: Fix out-of-bounds reads in CPUID test's array lookups	Sean Christopherson
	When looking for a "mangled", i.e. dynamic, CPUID entry, terminate the walk based on the number of array _entries_, not the size in bytes of the array. Iterating based on the total size of the array can result in false passes, e.g. if the random data beyond the array happens to match a CPUID entry's function and index. Fixes: fb18d053b7f8 ("selftest: kvm: x86: test KVM_GET_CPUID2 and guest visible CPUIDs against KVM_GET_SUPPORTED_CPUID") Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com> Message-ID: <20241003234337.273364-2-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
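The bug class described here, bounding a loop by the array's size in bytes rather than by its number of entries, can be illustrated with a short sketch. The struct layout, the table contents, and is_mangled() are hypothetical and are not copied from the selftest.

```c
#include <stddef.h>
#include <stdint.h>

#define ARRAY_SIZE(a) (sizeof(a) / sizeof((a)[0]))

/* Hypothetical entry layout, loosely modeled on a function/index pair. */
struct cpuid_key {
	uint32_t function;
	uint32_t index;
};

/* Illustrative table; the values are placeholders for this sketch. */
static const struct cpuid_key mangled[] = {
	{ .function = 0xd, .index = 0 },
	{ .function = 0xd, .index = 1 },
	{ .function = 0x7, .index = 0 },
};

/* Return 1 if (function, index) is in the table, 0 otherwise. */
static int is_mangled(uint32_t function, uint32_t index)
{
	/*
	 * Bound the walk by the entry count. Using sizeof(mangled) as the
	 * bound would count bytes and read past the end of the array.
	 */
	for (size_t i = 0; i < ARRAY_SIZE(mangled); i++) {
		if (mangled[i].function == function &&
		    mangled[i].index == index)
			return 1;
	}
	return 0;
}
```

A byte-count bound overshoots by a factor of sizeof(struct cpuid_key), which is how stray data beyond the array could produce the false passes the changelog mentions.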
2024-10-20	KVM: selftests: x86: Avoid using SSE/AVX instructions	Vitaly Kuznetsov
	Some distros switched gcc to '-march=x86-64-v3' by default and while it's hard to find a CPU which doesn't support it today, many KVM selftests fail with ==== Test Assertion Failure ==== lib/x86_64/processor.c:570: Unhandled exception in guest pid=72747 tid=72747 errno=4 - Interrupted system call Unhandled exception '0x6' at guest RIP '0x4104f7' The failure is easy to reproduce elsewhere with $ make clean && CFLAGS='-march=x86-64-v3' make -j && ./x86_64/kvm_pv_test The root cause of the problem seems to be that with '-march=x86-64-v3' GCC uses AVX* instructions (VMOVQ in the example above) and without prior XSETBV() in the guest this results in #UD. It is certainly possible to add it there, e.g. the following saves the day as well: Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Message-ID: <20240920154422.2890096-1-vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-10-20	KVM: nSVM: Ignore nCR3[4:0] when loading PDPTEs from memory	Sean Christopherson
	Ignore nCR3[4:0] when loading PDPTEs from memory for nested SVM, as bits 4:0 of CR3 are ignored when PAE paging is used, and thus VMRUN doesn't enforce 32-byte alignment of nCR3. In the absolute worst case scenario, failure to ignore bits 4:0 can result in an out-of-bounds read, e.g. if the target page is at the end of a memslot, and the VMM isn't using guard pages. Per the APM: The CR3 register points to the base address of the page-directory-pointer table. The page-directory-pointer table is aligned on a 32-byte boundary, with the low 5 address bits 4:0 assumed to be 0. And the SDM's much more explicit: 4:0 Ignored Note, KVM gets this right when loading PDPTRs, it's only the nSVM flow that is broken. Fixes: e4e517b4be01 ("KVM: MMU: Do not unconditionally read PDPTE from guest memory") Reported-by: Kirk Swidowski <swidowski@google.com> Cc: Andy Nguyen <theflow@google.com> Cc: 3pvd <3pvd@google.com> Cc: stable@vger.kernel.org Signed-off-by: Sean Christopherson <seanjc@google.com> Message-ID: <20241009140838.1036226-1-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
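A minimal sketch of the masking described above, assuming a 4KiB page and four 8-byte PDPTEs per table. pdpte_page_offset() and the macro names are hypothetical; the real fix lives in KVM's nested SVM code and is not reproduced here.

```c
#include <stdint.h>

#define SKETCH_PAGE_SIZE  UINT64_C(4096)
#define SKETCH_PDPT_ALIGN UINT64_C(32)   /* four 8-byte PDPTEs */

/* Byte offset, within its 4KiB page, of PDPTE 'index' (0..3) for a PAE CR3. */
static uint64_t pdpte_page_offset(uint64_t cr3, unsigned int index)
{
	/* Keep bits 11:5 of CR3; bits 4:0 are ignored under PAE paging. */
	uint64_t table = cr3 & (SKETCH_PAGE_SIZE - 1) & ~(SKETCH_PDPT_ALIGN - 1);

	return table + (uint64_t)index * 8;
}
```

With the low five bits cleared, the last PDPTE always ends at or before the page boundary. Without the mask, a CR3 value such as 0x1fff would place the read past the end of the page, which is the out-of-bounds scenario the changelog warns about.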
2024-10-20	KVM: VMX: reset the segment cache after segment init in vmx_vcpu_reset()	Maxim Levitsky
	Reset the segment cache after segment initialization in vmx_vcpu_reset() to harden KVM against caching stale/uninitialized data. Without the recent fix to bypass the cache in kvm_arch_vcpu_put(), the following scenario is possible: - vCPU is just created, and the vCPU thread is preempted before SS.AR_BYTES is written in vmx_vcpu_reset(). - When scheduling out the vCPU task, kvm_arch_vcpu_in_kernel() => vmx_get_cpl() reads and caches '0' for SS.AR_BYTES. - vmx_vcpu_reset() => seg_setup() configures SS.AR_BYTES, but doesn't invoke vmx_segment_cache_clear() to invalidate the cache. As a result, KVM retains a stale value in the cache, which can be read, e.g. via KVM_GET_SREGS. Usually this is not a problem because the VMX segment cache is reset on each VM-Exit, but if the userspace VMM (e.g. KVM selftests) reads and writes system registers just after the vCPU was created, _without_ modifying SS.AR_BYTES, userspace will write back the stale '0' value and ultimately will trigger a VM-Entry failure due to incorrect SS segment type. Invalidating the cache after writing the VMCS doesn't address the general issue of cache accesses from IRQ context being unsafe, but it does prevent KVM from clobbering the VMCS, i.e. mitigates the harm done _if_ KVM has a bug that results in an unsafe cache access. Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com> Fixes: 2fb92db1ec08 ("KVM: VMX: Cache vmcs segment fields") [sean: rework changelog to account for previous patch] Signed-off-by: Sean Christopherson <seanjc@google.com> Message-ID: <20241009175002.1118178-3-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
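The stale-cache scenario above can be modeled with a one-field read cache in front of a backing register. This is a userspace illustration only: struct seg_cache, seg_read(), and seg_write() are hypothetical and much simpler than the VMX segment cache.

```c
#include <stdint.h>

/* Hypothetical one-field cache in front of a backing register. */
struct seg_cache {
	uint32_t backing;  /* stands in for the VMCS field */
	uint32_t cached;   /* last value read from 'backing' */
	int valid;         /* nonzero if 'cached' may be used */
};

/* Read through the cache, filling it on a miss. */
static uint32_t seg_read(struct seg_cache *c)
{
	if (!c->valid) {
		c->cached = c->backing;
		c->valid = 1;
	}
	return c->cached;
}

/* Write the backing field and invalidate so the next read refetches. */
static void seg_write(struct seg_cache *c, uint32_t val)
{
	c->backing = val;
	c->valid = 0;
}
```

If seg_write() skipped the invalidation, a read that happened before the write would leave the old value cached, and later readers would see it instead of the newly written field.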
2024-10-20	KVM: x86: Clean up documentation for KVM_X86_QUIRK_SLOT_ZAP_ALL	Sean Christopherson
	Massage the documentation for KVM_X86_QUIRK_SLOT_ZAP_ALL to call out that it applies to moved memslots as well as deleted memslots, to avoid KVM's "fast zap" terminology (which has no meaning for userspace), and to reword the documented targeted zap behavior to specifically say that KVM _may_ zap a subset of all SPTEs. As evidenced by the fix to zap non-leaf SPTEs with gPTEs, formally documenting KVM's exact internal behavior is risky and unnecessary. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-ID: <20241009192345.1148353-4-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-10-20	KVM: x86/mmu: Add lockdep assert to enforce safe usage of kvm_unmap_gfn_range()	Sean Christopherson
	Add a lockdep assertion in kvm_unmap_gfn_range() to ensure that either mmu_invalidate_in_progress is elevated, or that the range is being zapped due to memslot removal (loosely detected by slots_lock being held). Zapping SPTEs without mmu_invalidate_{in_progress,seq} protection is unsafe as KVM's page fault path snapshots state before acquiring mmu_lock, and thus can create SPTEs with stale information if vCPUs aren't forced to retry faults (due to seeing an in-progress or past MMU invalidation). Memslot removal is a special case, as the memslot is retrieved outside of mmu_invalidate_seq, i.e. doesn't use the "standard" protections, and instead relies on SRCU synchronization to ensure any in-flight page faults are fully resolved before zapping SPTEs. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-ID: <20241009192345.1148353-3-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-10-20	KVM: x86/mmu: Zap only SPs that shadow gPTEs when deleting memslot	Sean Christopherson
	When performing a targeted zap on memslot removal, zap only MMU pages that shadow guest PTEs, as zapping all SPs that "match" the gfn is inexact and unnecessary. Furthermore, for_each_gfn_valid_sp() arguably shouldn't exist, because it doesn't do what most people would expect it to do. The "round gfn for level" adjustment that is done for direct SPs (no gPTE) means that the exact gfn comparison will not get a match, even when a SP does "cover" a gfn, or was even created specifically for a gfn. For memslot deletion specifically, KVM's behavior will vary significantly based on the size and alignment of a memslot, and in weird ways. E.g. for a 4KiB memslot, KVM will zap more SPs if the slot is 1GiB aligned than if it's only 4KiB aligned. And as described below, zapping SPs in the aligned case overzaps for direct MMUs, as odds are good the upper-level SPs are serving other memslots. To iterate over all potentially-relevant gfns, KVM would need to make a pass over the hash table for each level, with the gfn used for lookup rounded for said level. And then check that the SP is of the correct level, too, e.g. to avoid over-zapping. But even then, KVM would massively overzap, as processing every level is all but guaranteed to zap SPs that serve other memslots, especially if the memslot being removed is relatively small. KVM could mitigate that issue by processing only levels that can be possible guest huge pages, i.e. are less likely to be re-used for other memslots, but while somewhat logical, that's quite arbitrary and would be a bit of a mess to implement. So, zap only SPs with gPTEs, as the resulting behavior is easy to describe, is predictable, and is explicitly minimal, i.e. KVM only zaps SPs that absolutely must be zapped. 
Cc: Yan Zhao <yan.y.zhao@intel.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Yan Zhao <yan.y.zhao@intel.com> Tested-by: Yan Zhao <yan.y.zhao@intel.com> Message-ID: <20241009192345.1148353-2-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
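The "round gfn for level" effect mentioned in this changelog can be sketched as below, assuming x86's 9 address bits per paging level. The function names are hypothetical and the code only illustrates the alignment arithmetic, not the shadow page hash lookup itself.

```c
#include <stdint.h>

typedef uint64_t gfn_t;

/* 4KiB pages covered by one entry at 'level' (1 = 4KiB, 2 = 2MiB, 3 = 1GiB). */
static gfn_t pages_per_level(int level)
{
	return (gfn_t)1 << ((level - 1) * 9);
}

/* Round 'gfn' down to the first gfn covered by its entry at 'level'. */
static gfn_t round_gfn_for_level(gfn_t gfn, int level)
{
	return gfn & ~(pages_per_level(level) - 1);
}
```

A gfn that is only 4KiB aligned rounds to a different value at levels 2 and 3, so an exact-match lookup on the original gfn misses the upper-level SPs that cover it. A 1GiB-aligned gfn rounds to itself at every level, so the same lookup matches more SPs, which is the alignment-dependent behavior described above.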
2024-10-20	x86/kvm: Override default caching mode for SEV-SNP and TDX	Kirill A. Shutemov
	AMD SEV-SNP and Intel TDX have limited access to MTRR: either it is not advertised in CPUID or it cannot be programmed (on TDX, due to #VE on CR0.CD clear). This results in guests using uncached mappings where they shouldn't and pmd/pud_set_huge() failures due to non-uniform memory type reported by mtrr_type_lookup(). Override MTRR state, making it WB by default as the kernel does for Hyper-V guests. Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Suggested-by: Binbin Wu <binbin.wu@intel.com> Cc: Juergen Gross <jgross@suse.com> Cc: Tom Lendacky <thomas.lendacky@amd.com> Reviewed-by: Juergen Gross <jgross@suse.com> Message-ID: <20241015095818.357915-1-kirill.shutemov@linux.intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-10-20	KVM: Remove unused kvm_vcpu_gfn_to_pfn_atomic	Dr. David Alan Gilbert
	The last use of kvm_vcpu_gfn_to_pfn_atomic was removed by commit 1bbc60d0c7e5 ("KVM: x86/mmu: Remove MMU auditing"). Remove it. Signed-off-by: Dr. David Alan Gilbert <linux@treblig.org> Message-ID: <20241001141354.18009-3-linux@treblig.org> [Adjust Documentation/virt/kvm/locking.rst. - Paolo] Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-10-20	KVM: Remove unused kvm_vcpu_gfn_to_pfn	Dr. David Alan Gilbert
	The last use of kvm_vcpu_gfn_to_pfn was removed by commit b1624f99aa8f ("KVM: Remove kvm_vcpu_gfn_to_page() and kvm_vcpu_gpa_to_page()"). Remove it. Signed-off-by: Dr. David Alan Gilbert <linux@treblig.org> Message-ID: <20241001141354.18009-2-linux@treblig.org> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-10-17	KVM: arm64: Ensure vgic_ready() is ordered against MMIO registration	Oliver Upton
	kvm_vgic_map_resources() prematurely marks the distributor as 'ready', potentially allowing vCPUs to enter the guest before the distributor's MMIO registration has been made visible. Plug the race by marking the distributor as ready only after MMIO registration is completed. Rely on the implied ordering of synchronize_srcu() to ensure the MMIO registration is visible before vgic_dist::ready. This also means that writers to vgic_dist::ready are now serialized by the slots_lock, which was effectively the case already as all writers held the slots_lock in addition to the config_lock. Fixes: 59112e9c390b ("KVM: arm64: vgic: Fix a circular locking issue") Signed-off-by: Oliver Upton <oliver.upton@linux.dev> Link: https://lore.kernel.org/r/20241017001947.2707312-3-oliver.upton@linux.dev Signed-off-by: Marc Zyngier <maz@kernel.org>
2024-10-17	KVM: arm64: vgic: Don't check for vgic_ready() when setting NR_IRQS	Oliver Upton
	KVM commits to a particular sizing of SPIs when the vgic is initialized, which is before the point a vgic becomes ready. On top of that, KVM supplies a default amount of SPIs should userspace not explicitly configure this. As such, the check for vgic_ready() in the handling of KVM_DEV_ARM_VGIC_GRP_NR_IRQS is completely wrong, and testing if nr_spis is nonzero is sufficient for preventing userspace from playing games with us. Signed-off-by: Oliver Upton <oliver.upton@linux.dev> Link: https://lore.kernel.org/r/20241017001947.2707312-2-oliver.upton@linux.dev Signed-off-by: Marc Zyngier <maz@kernel.org>
2024-10-17	KVM: arm64: Fix shift-out-of-bounds bug	Ilkka Koskinen
	Fix a shift-out-of-bounds bug reported by UBSAN when running a VM with an MTE-enabled host kernel. UBSAN: shift-out-of-bounds in arch/arm64/kvm/sys_regs.c:1988:14 shift exponent 33 is too large for 32-bit type 'int' CPU: 26 UID: 0 PID: 7629 Comm: qemu-kvm Not tainted 6.12.0-rc2 #34 Hardware name: IEI NF5280R7/Mitchell MB, BIOS 00.00. 2024-10-12 09:28:54 10/14/2024 Call trace: dump_backtrace+0xa0/0x128 show_stack+0x20/0x38 dump_stack_lvl+0x74/0x90 dump_stack+0x18/0x28 __ubsan_handle_shift_out_of_bounds+0xf8/0x1e0 reset_clidr+0x10c/0x1c8 kvm_reset_sys_regs+0x50/0x1c8 kvm_reset_vcpu+0xec/0x2b0 __kvm_vcpu_set_target+0x84/0x158 kvm_vcpu_set_target+0x138/0x168 kvm_arch_vcpu_ioctl_vcpu_init+0x40/0x2b0 kvm_arch_vcpu_ioctl+0x28c/0x4b8 kvm_vcpu_ioctl+0x4bc/0x7a8 __arm64_sys_ioctl+0xb4/0x100 invoke_syscall+0x70/0x100 el0_svc_common.constprop.0+0x48/0xf0 do_el0_svc+0x24/0x38 el0_svc+0x3c/0x158 el0t_64_sync_handler+0x120/0x130 el0t_64_sync+0x194/0x198 Fixes: 7af0c2534f4c ("KVM: arm64: Normalize cache configuration") Cc: stable@vger.kernel.org Reviewed-by: Gavin Shan <gshan@redhat.com> Signed-off-by: Ilkka Koskinen <ilkka@os.amperecomputing.com> Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com> Link: https://lore.kernel.org/r/20241017025701.67936-1-ilkka@os.amperecomputing.com Signed-off-by: Marc Zyngier <maz@kernel.org>
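For readers unfamiliar with this class of UBSAN report, the following is a minimal sketch of the general problem, not the code from the patch. Shifting a 32-bit `int` by 33 or more bits is undefined behaviour in C; widening the operand to 64 bits before the shift avoids it. The helper name `field_at` is hypothetical and does not exist in the kernel.

```c
#include <stdint.h>

/*
 * Hypothetical helper (not a kernel function): place a small field value
 * at an arbitrary bit offset of a 64-bit register image.
 *
 * Writing "value << shift" with an int-typed value would be undefined
 * behaviour once shift >= 32, which is what UBSAN flags as
 * "shift exponent 33 is too large for 32-bit type 'int'".
 */
static uint64_t field_at(unsigned int value, unsigned int shift)
{
	/* Widen first, so the shift is performed on a 64-bit operand. */
	return (uint64_t)value << shift;
}
```

With this shape, `field_at(2, 33)` yields `0x400000000`, whereas the expression `2 << 33` on a plain `int` has no defined result.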
2024-10-17	KVM: arm64: Shave a few bytes from the EL2 idmap code	Marc Zyngier
	Our idmap is becoming too big, to the point where it doesn't fit in a 4kB page anymore. There are some low-hanging fruits though, such as the el2_init_state horror that is expanded 3 times in the kernel. Let's at least limit ourselves to two copies, which makes the kernel link again. At some point, we'll have to have a better way of doing this. Reported-by: Nathan Chancellor <nathan@kernel.org> Signed-off-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20241009204903.GA3353168@thelio-3990X
2024-10-11	KVM: arm64: Don't eagerly teardown the vgic on init error	Marc Zyngier
	As there is very little ordering in the KVM API, userspace can instantiate a half-baked GIC (missing its memory map, for example) at almost any time. This means that, with the right timing, a thread running vcpu-0 can enter the kernel without a GIC configured and get a GIC created behind its back by another thread. Amusingly, it will pick up that GIC and start messing with the data structures without the GIC having been fully initialised. Similarly, a thread running vcpu-1 can enter the kernel, and try to init the GIC that was previously created. Since this GIC isn't properly configured (no memory map), it fails to correctly initialise. And that's the point where we decide to tear down the GIC, freeing all its resources. Behind vcpu-0's back. Things stop pretty abruptly, with a variety of symptoms. Clearly, this isn't good, we should be a bit more careful about this. It is obvious that this guest is not viable, as it is missing some important part of its configuration. So instead of trying to tear bits of it down, let's just mark it as dead. It means that any further interaction from userspace will result in -EIO. The memory will be released on the "normal" path, when userspace gives up. Cc: stable@vger.kernel.org Reported-by: Alexander Potapenko <glider@google.com> Reviewed-by: Oliver Upton <oliver.upton@linux.dev> Link: https://lore.kernel.org/r/20241009183603.3221824-1-maz@kernel.org Signed-off-by: Marc Zyngier <maz@kernel.org>
2024-10-08	KVM: arm64: Expose S1PIE to guests	Mark Brown
	Prior to commit 70ed7238297f ("KVM: arm64: Sanitise ID_AA64MMFR3_EL1") we just exposed the sanitised view of ID_AA64MMFR3_EL1 to guests, meaning that they saw both TCRX and S1PIE if present on the host machine. That commit added VMM control over the contents of the register and exposed S1POE but removed S1PIE, meaning that the extension is no longer visible to guests. Reenable support for S1PIE with VMM control. Fixes: 70ed7238297f ("KVM: arm64: Sanitise ID_AA64MMFR3_EL1") Signed-off-by: Mark Brown <broonie@kernel.org> Reviewed-by: Joey Gouly <joey.gouly@arm.com> Link: https://lore.kernel.org/r/20241005-kvm-arm64-fix-s1pie-v1-1-5901f02de749@kernel.org Signed-off-by: Marc Zyngier <maz@kernel.org>
2024-10-08	KVM: arm64: nv: Clarify safety of allowing TLBI unmaps to reschedule	Oliver Upton
	There's been a decent amount of attention around unmaps of nested MMUs, and TLBI handling is no exception to this. Add a comment clarifying why it is safe to reschedule during a TLBI unmap, even without a reference on the MMU in progress. Signed-off-by: Oliver Upton <oliver.upton@linux.dev> Link: https://lore.kernel.org/r/20241007233028.2236133-5-oliver.upton@linux.dev Signed-off-by: Marc Zyngier <maz@kernel.org>
2024-10-08	KVM: arm64: nv: Punt stage-2 recycling to a vCPU request	Oliver Upton
	Currently, when a nested MMU is repurposed for some other MMU context, KVM unmaps everything during vcpu_load() while holding the MMU lock for write. This is quite a performance bottleneck for large nested VMs, as all vCPU scheduling will spin until the unmap completes. Start punting the MMU cleanup to a vCPU request, where it is then possible to periodically release the MMU lock and CPU in the presence of contention. Ensure that no vCPU winds up using a stale MMU by tracking the pending unmap on the S2 MMU itself and requesting an unmap on every vCPU that finds it. Signed-off-by: Oliver Upton <oliver.upton@linux.dev> Link: https://lore.kernel.org/r/20241007233028.2236133-4-oliver.upton@linux.dev Signed-off-by: Marc Zyngier <maz@kernel.org>
2024-10-08	KVM: arm64: nv: Do not block when unmapping stage-2 if disallowed	Oliver Upton
	Right now the nested code allows unmap operations on a shadow stage-2 to block unconditionally. This is wrong in a couple places, such as a non-blocking MMU notifier or on the back of a sched_in() notifier as part of shadow MMU recycling. Carry through whether or not blocking is allowed to kvm_pgtable_stage2_unmap(). This 'fixes' an issue where stage-2 MMU reclaim would precipitate a stack overflow from a pile of kvm_sched_in() callbacks, all trying to recycle a stage-2 MMU. Signed-off-by: Oliver Upton <oliver.upton@linux.dev> Link: https://lore.kernel.org/r/20241007233028.2236133-3-oliver.upton@linux.dev Signed-off-by: Marc Zyngier <maz@kernel.org>
2024-10-08	KVM: arm64: nv: Keep reference on stage-2 MMU when scheduled out	Oliver Upton
	If a vCPU is scheduling out and not in WFI emulation, it is highly likely it will get scheduled again soon and reuse the MMU it had before. Dropping the MMU at vcpu_put() can have some unfortunate consequences, as the MMU could get reclaimed and used in a different context, forcing another 'cold start' on an otherwise active MMU. Avoid that altogether by keeping a reference on the MMU if the vCPU is scheduling out, ensuring that another vCPU cannot reclaim it while the current vCPU is away. Since there are more MMUs than vCPUs, this does not affect the guarantee that an unused MMU is available at any time. Furthermore, this makes the vcpu->arch.hw_mmu ~stable in preemptible code, at least for where it matters in the stage-2 abort path. Yes, the MMU can change across WFI emulation, but there isn't even a use case where this would matter. Signed-off-by: Oliver Upton <oliver.upton@linux.dev> Link: https://lore.kernel.org/r/20241007233028.2236133-2-oliver.upton@linux.dev Signed-off-by: Marc Zyngier <maz@kernel.org>
2024-10-08	KVM: arm64: Unregister redistributor for failed vCPU creation	Oliver Upton
	Alex reports that syzkaller has managed to trigger a use-after-free when tearing down a VM: BUG: KASAN: slab-use-after-free in kvm_put_kvm+0x300/0xe68 virt/kvm/kvm_main.c:5769 Read of size 8 at addr ffffff801c6890d0 by task syz.3.2219/10758 CPU: 3 UID: 0 PID: 10758 Comm: syz.3.2219 Not tainted 6.11.0-rc6-dirty #64 Hardware name: linux,dummy-virt (DT) Call trace: dump_backtrace+0x17c/0x1a8 arch/arm64/kernel/stacktrace.c:317 show_stack+0x2c/0x3c arch/arm64/kernel/stacktrace.c:324 __dump_stack lib/dump_stack.c:93 [inline] dump_stack_lvl+0x94/0xc0 lib/dump_stack.c:119 print_report+0x144/0x7a4 mm/kasan/report.c:377 kasan_report+0xcc/0x128 mm/kasan/report.c:601 __asan_report_load8_noabort+0x20/0x2c mm/kasan/report_generic.c:381 kvm_put_kvm+0x300/0xe68 virt/kvm/kvm_main.c:5769 kvm_vm_release+0x4c/0x60 virt/kvm/kvm_main.c:1409 __fput+0x198/0x71c fs/file_table.c:422 ____fput+0x20/0x30 fs/file_table.c:450 task_work_run+0x1cc/0x23c kernel/task_work.c:228 do_notify_resume+0x144/0x1a0 include/linux/resume_user_mode.h:50 el0_svc+0x64/0x68 arch/arm64/kernel/entry-common.c:169 el0t_64_sync_handler+0x90/0xfc arch/arm64/kernel/entry-common.c:730 el0t_64_sync+0x190/0x194 arch/arm64/kernel/entry.S:598 Upon closer inspection, it appears that we do not properly tear down the MMIO registration for a vCPU that fails creation late in the game, e.g. a vCPU w/ the same ID already exists in the VM. It is important to consider the context of the commit that introduced this bug by moving the unregistration out of __kvm_vgic_vcpu_destroy(). That change correctly sought to avoid an srcu v. config_lock inversion by breaking up the vCPU teardown into two parts, one guarded by the config_lock. Fix the use-after-free while avoiding lock inversion by adding a special-cased unregistration to __kvm_vgic_vcpu_destroy(). This is safe because failed vCPUs are torn down outside of the config_lock. Cc: stable@vger.kernel.org Fixes: f616506754d3 ("KVM: arm64: vgic: Don't hold config_lock while unregistering redistributors") Reported-by: Alexander Potapenko <glider@google.com> Signed-off-by: Oliver Upton <oliver.upton@linux.dev> Link: https://lore.kernel.org/r/20241007223909.2157336-1-oliver.upton@linux.dev Signed-off-by: Marc Zyngier <maz@kernel.org>
2024-10-08	Merge branch kvm-arm64/idregs-6.12 into kvmarm/fixes	Marc Zyngier
	* kvm-arm64/idregs-6.12: : . : Make some fields of ID_AA64DFR0_EL1 and ID_AA64PFR1_EL1 : writable from userspace, so that a VMM can influence the : set of guest-visible features. : : - for ID_AA64DFR0_EL1: DoubleLock, WRPs, PMUVer and DebugVer : are writable (courtesy of Shameer Kolothum) : : - for ID_AA64PFR1_EL1: BT, SSBS, CVS2_frac are writable : (courtesy of Shaoqin Huang) : . KVM: selftests: aarch64: Add writable test for ID_AA64PFR1_EL1 KVM: arm64: Allow userspace to change ID_AA64PFR1_EL1 KVM: arm64: Use kvm_has_feat() to check if FEAT_SSBS is advertised to the guest KVM: arm64: Disable fields that KVM doesn't know how to handle in ID_AA64PFR1_EL1 KVM: arm64: Make the exposed feature bits in AA64DFR0_EL1 writable from userspace Signed-off-by: Marc Zyngier <maz@kernel.org>
2024-10-06	Merge tag 'kvmarm-fixes-6.12-1' of git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD	Paolo Bonzini
	KVM/arm64 fixes for 6.12, take #1 - Fix pKVM error path on init, making sure we do not change critical system registers as we're about to fail - Make sure that the host's vector length is capped by a value common to all CPUs - Fix kvm_has_feat*() handling of "negative" features, as the current code is pretty broken - Promote Joey to the status of official reviewer, while James steps down -- hopefully only temporarily
2024-10-06	x86/reboot: emergency callbacks are now registered by common KVM code	Paolo Bonzini
	Guard them with CONFIG_KVM_X86_COMMON rather than the two vendor modules. In practice this has no functional change, because CONFIG_KVM_X86_COMMON is set if and only if at least one vendor-specific module is being built. However, it is cleaner to specify CONFIG_KVM_X86_COMMON for functions that are used in kvm.ko. Reported-by: Linus Torvalds <torvalds@linux-foundation.org> Fixes: 590b09b1d88e ("KVM: x86: Register "emergency disable" callbacks when virt is enabled") Fixes: 6d55a94222db ("x86/reboot: Unconditionally define cpu_emergency_virt_cb typedef") Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-10-06	KVM: x86: leave kvm.ko out of the build if no vendor module is requested	Paolo Bonzini
	kvm.ko is nothing but library code shared by kvm-intel.ko and kvm-amd.ko. It provides no functionality on its own and it is unnecessary unless one of the vendor-specific modules is compiled. In particular, /dev/kvm is not created until one of kvm-intel.ko or kvm-amd.ko is loaded. Use CONFIG_KVM to decide if it is built-in or a module, but use the vendor-specific modules for the actual decision on whether to build it. This also fixes a build failure when CONFIG_KVM_INTEL and CONFIG_KVM_AMD are both disabled. The cpu_emergency_register_virt_callback() function is called from kvm.ko, but it is only defined if at least one of CONFIG_KVM_INTEL and CONFIG_KVM_AMD is provided. Fixes: 590b09b1d88e ("KVM: x86: Register "emergency disable" callbacks when virt is enabled") Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-10-03	KVM: x86/mmu: fix KVM_X86_QUIRK_SLOT_ZAP_ALL for shadow MMU	Paolo Bonzini
	As was tried in commit 4e103134b862 ("KVM: x86/mmu: Zap only the relevant pages when removing a memslot"), all shadow pages, i.e. non-leaf SPTEs, need to be zapped. All of the accounting for a shadow page is tied to the memslot, i.e. the shadow page holds a reference to the memslot, for all intents and purposes. Deleting the memslot without removing all relevant shadow pages, as is done when KVM_X86_QUIRK_SLOT_ZAP_ALL is disabled, results in NULL pointer derefs when tearing down the VM. Reintroduce from that commit the code that walks the whole memslot when there are active shadow MMU pages. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-10-03	KVM: arm64: Fix kvm_has_feat*() handling of negative features	Marc Zyngier
	Oliver reports that the kvm_has_feat() helper is not behaving as expected for negative features. On investigation, the main issue seems to be caused by the following construct: #define get_idreg_field(kvm, id, fld) \ (id##_##fld##_SIGNED ? \ get_idreg_field_signed(kvm, id, fld) : \ get_idreg_field_unsigned(kvm, id, fld)) where one side of the expression evaluates as something signed, and the other as something unsigned. In retrospect, this is totally braindead, as the compiler converts this into an unsigned expression. When compared to something that is 0, the test is simply elided. Epic fail. A similar issue exists in the expand_field_sign() macro. The correct way to handle this is to choose between signed and unsigned comparisons, so that both sides of the ternary expression are of the same type (bool). In order to keep the code readable (sort of), we introduce new comparison primitives taking an operator as a parameter, and rewrite the kvm_has_feat*() helpers in terms of these primitives. Fixes: c62d7a23b947 ("KVM: arm64: Add feature checking helpers") Reported-by: Oliver Upton <oliver.upton@linux.dev> Tested-by: Oliver Upton <oliver.upton@linux.dev> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20241002204239.2051637-1-maz@kernel.org Signed-off-by: Marc Zyngier <maz@kernel.org>
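The pitfall described in this commit message can be reproduced outside the kernel. The sketch below is an illustration under stated assumptions, not the kernel's macros: the helper names (`field_signed`, `field_unsigned`, `broken_ge`, `fixed_ge`) and the fixed 4-bit field width are hypothetical. It shows that a ternary with one signed and one unsigned arm takes an unsigned type, so a "negative" field value compares as a huge positive number, and that moving the comparison inside each arm restores the intended result.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical: extract a 4-bit field as an unsigned value. */
static uint64_t field_unsigned(uint64_t reg, unsigned int shift)
{
	return (reg >> shift) & 0xf;
}

/* Hypothetical: extract the same 4-bit field, sign-extended. */
static int64_t field_signed(uint64_t reg, unsigned int shift)
{
	uint64_t v = (reg >> shift) & 0xf;

	return (v & 0x8) ? (int64_t)v - 16 : (int64_t)v;
}

/*
 * Broken shape: the arms are int64_t and uint64_t, so the usual
 * arithmetic conversions make the whole ternary uint64_t. A field
 * value of -1 becomes UINT64_MAX and passes any ">= min" test.
 */
static bool broken_ge(bool is_signed, uint64_t reg, unsigned int shift, int min)
{
	return (is_signed ? field_signed(reg, shift)
			  : field_unsigned(reg, shift)) >= min;
}

/*
 * Fixed shape: compare inside each arm, so each comparison uses the
 * right signedness and both arms have the same (boolean) type.
 */
static bool fixed_ge(bool is_signed, uint64_t reg, unsigned int shift, int min)
{
	return is_signed ? field_signed(reg, shift) >= min
			 : field_unsigned(reg, shift) >= (uint64_t)min;
}
```

For a signed field holding `0xf` (that is, -1), `broken_ge(true, 0xf, 0, 0)` reports true while `fixed_ge(true, 0xf, 0, 0)` reports false, which is the behaviour a "negative feature" check needs.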
2024-10-01	KVM: selftests: Fix build on architectures other than x86_64	Mark Brown
	The recent addition of support for testing with the x86 specific quirk KVM_X86_QUIRK_SLOT_ZAP_ALL disabled in the generic memslot tests broke the build of the KVM selftests for all other architectures: In file included from include/kvm_util.h:8, from include/memstress.h:13, from memslot_modification_stress_test.c:21: memslot_modification_stress_test.c: In function ‘main’: memslot_modification_stress_test.c:176:38: error: ‘KVM_X86_QUIRK_SLOT_ZAP_ALL’ undeclared (first use in this function) 176 | KVM_X86_QUIRK_SLOT_ZAP_ALL); | ^~~~~~~~~~~~~~~~~~~~~~~~~~ Add __x86_64__ guard defines to avoid building the relevant code on other architectures. Fixes: 61de4c34b51c ("KVM: selftests: Test memslot move in memslot_perf_test with quirk disabled") Fixes: 218f6415004a ("KVM: selftests: Allow slot modification stress test with quirk disabled") Reported-by: Aishwarya TCV <aishwarya.tcv@arm.com> Signed-off-by: Mark Brown <broonie@kernel.org> Message-ID: <20240930-kvm-build-breakage-v1-1-866fad3cc164@kernel.org> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-10-01	KVM: arm64: Another reviewer reshuffle	Marc Zyngier
	It has been a while since James had any significant bandwidth to review KVM/arm64 patches. But in the meantime, Joey has stepped up and done a really good job reviewing some terrifying patch series. Having talked with the interested parties, it appears that James is unlikely to have time for KVM in the near future, and that Joey is willing to take on more responsibilities. So let's appoint Joey as an official reviewer, and give James some breathing space, as well as my personal thanks. I'm sure he will be back one way or another! Cc: Suzuki K Poulose <suzuki.poulose@arm.com> Cc: Zenghui Yu <yuzenghui@huawei.com> Cc: Will Deacon <will@kernel.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Acked-by: Oliver Upton <oliver.upton@linux.dev> Acked-by: Joey Gouly <joey.gouly@arm.com> Acked-by: Zenghui Yu <yuzenghui@huawei.com> Link: https://lore.kernel.org/r/20240927104956.1223658-1-maz@kernel.org Signed-off-by: Marc Zyngier <maz@kernel.org>
2024-10-01	KVM: arm64: Constrain the host to the maximum shared SVE VL with pKVM	Mark Brown
	When pKVM saves and restores the host floating point state on a SVE system, it programs the vector length in ZCR_EL2.LEN to be whatever the maximum VL for the PE is. But it uses a buffer allocated with kvm_host_sve_max_vl, the maximum VL shared by all PEs in the system. This means that if we run on a system where the maximum VLs are not consistent, we will overflow the buffer on PEs which support larger VLs. Since the host will not currently attempt to make use of non-shared VLs, fix this by explicitly setting the EL2 VL to be the maximum shared VL when we save and restore. This will enforce the limit on host VL usage. Should we wish to support asymmetric VLs, this code will need to be updated along with the required changes for the host: https://lore.kernel.org/r/20240730-kvm-arm64-fix-pkvm-sve-vl-v6-0-cae8a2e0bd66@kernel.org Fixes: b5b9955617bc ("KVM: arm64: Eagerly restore host fpsimd/sve state in pKVM") Signed-off-by: Mark Brown <broonie@kernel.org> Tested-by: Fuad Tabba <tabba@google.com> Reviewed-by: Fuad Tabba <tabba@google.com> Link: https://lore.kernel.org/r/20240912-kvm-arm64-limit-guest-vl-v2-1-dd2c29cb2ac9@kernel.org [maz: added punctuation to the commit message] Signed-off-by: Marc Zyngier <maz@kernel.org>