path: root/arch/x86
Age    Commit message    Author
2024-10-25KVM: x86/mmu: Mark pages/folios dirty at the origin of make_spte()Sean Christopherson
Move the marking of folios dirty from make_spte() out to its callers, which have access to the _struct page_, not just the underlying pfn. Once all architectures follow suit, this will allow removing KVM's ugly hack where KVM elevates the refcount of VM_MIXEDMAP pfns that happen to be struct page memory. Tested-by: Alex Bennée <alex.bennee@linaro.org> Signed-off-by: Sean Christopherson <seanjc@google.com> Tested-by: Dmitry Osipenko <dmitry.osipenko@collabora.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-ID: <20241010182427.1434605-42-seanjc@google.com>
2024-10-25KVM: x86/mmu: Add helper to "finish" handling a guest page faultSean Christopherson
Add a helper to finish/complete the handling of a guest page fault, e.g. to mark the pages accessed and put any held references. In the near future, this will allow improving the logic without having to copy+paste changes into all page fault paths. And in the less near future, it will allow sharing the "finish" API across all architectures. No functional change intended. Tested-by: Alex Bennée <alex.bennee@linaro.org> Signed-off-by: Sean Christopherson <seanjc@google.com> Tested-by: Dmitry Osipenko <dmitry.osipenko@collabora.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-ID: <20241010182427.1434605-41-seanjc@google.com>
2024-10-25KVM: x86/mmu: Add common helper to handle prefetching SPTEsSean Christopherson
Deduplicate the prefetching code for indirect and direct MMUs. The core logic is the same, the only difference is that indirect MMUs need to prefetch SPTEs one-at-a-time, as contiguous guest virtual addresses aren't guaranteed to yield contiguous guest physical addresses. Tested-by: Alex Bennée <alex.bennee@linaro.org> Signed-off-by: Sean Christopherson <seanjc@google.com> Tested-by: Dmitry Osipenko <dmitry.osipenko@collabora.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-ID: <20241010182427.1434605-40-seanjc@google.com>
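As a rough illustration of the deduplication described above (a sketch only, with made-up names and types, not the KVM code): the shared helper owns the prefetch loop, and only the translation stride differs between direct and indirect MMUs.

    #include <stdbool.h>
    #include <stdint.h>

    typedef uint64_t gfn_t;

    /* Hypothetical callback: translate 'nr' guest frames starting at 'gfn'. */
    typedef int (*translate_fn)(gfn_t gfn, int nr);

    static int prefetch_sptes(gfn_t start, int nr, bool indirect, translate_fn translate)
    {
            /* Indirect MMUs: contiguous GVAs need not map to contiguous GPAs,
             * so translate one entry at a time; direct MMUs can batch. */
            int stride = indirect ? 1 : nr;

            for (int i = 0; i < nr; i += stride) {
                    if (translate(start + i, stride))
                            return -1;
                    /* ...install the SPTE(s) for the translated range... */
            }
            return 0;
    }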
2024-10-25KVM: x86/mmu: Put direct prefetched pages via kvm_release_page_clean()Sean Christopherson
Use kvm_release_page_clean() to put prefetched pages instead of calling put_page() directly. This will allow de-duplicating the prefetch code between indirect and direct MMUs. Note, there's a small functional change as kvm_release_page_clean() marks the page/folio as accessed. While it's not strictly guaranteed that the guest will access the page, KVM won't intercept guest accesses, i.e. won't mark the page accessed if it _is_ accessed by the guest (unless A/D bits are disabled, but running without A/D bits is effectively limited to pre-HSW Intel CPUs). Tested-by: Alex Bennée <alex.bennee@linaro.org> Signed-off-by: Sean Christopherson <seanjc@google.com> Tested-by: Dmitry Osipenko <dmitry.osipenko@collabora.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-ID: <20241010182427.1434605-39-seanjc@google.com>
2024-10-25KVM: x86/mmu: Add "mmu" prefix fault-in helpers to free up generic namesSean Christopherson
Prefix x86's faultin_pfn helpers with "mmu" so that the mmu-less names can be used by common KVM for similar APIs. No functional change intended. Tested-by: Alex Bennée <alex.bennee@linaro.org> Signed-off-by: Sean Christopherson <seanjc@google.com> Tested-by: Dmitry Osipenko <dmitry.osipenko@collabora.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-ID: <20241010182427.1434605-38-seanjc@google.com>
2024-10-25KVM: x86: Don't fault-in APIC access page during initial allocationSean Christopherson
Drop the gfn_to_page() lookup when installing KVM's internal memslot for the APIC access page, as KVM doesn't need to immediately fault-in the page now that the page isn't pinned. In the extremely unlikely event the kernel can't allocate a 4KiB page, KVM can just as easily return -EFAULT on the future page fault. Suggested-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Tested-by: Dmitry Osipenko <dmitry.osipenko@collabora.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-ID: <20241010182427.1434605-37-seanjc@google.com>
2024-10-25KVM: Pass in write/dirty to kvm_vcpu_map(), not kvm_vcpu_unmap()Sean Christopherson
Now that all kvm_vcpu_{,un}map() users pass "true" for @dirty, have them pass "true" as a @writable param to kvm_vcpu_map(), and thus create a read-only mapping when possible. Note, creating read-only mappings can be theoretically slower, as they don't play nice with fast GUP due to the need to break CoW before mapping the underlying PFN. But practically speaking, creating a mapping isn't a super hot path, and getting a writable mapping for reading is weird and confusing. Tested-by: Alex Bennée <alex.bennee@linaro.org> Signed-off-by: Sean Christopherson <seanjc@google.com> Tested-by: Dmitry Osipenko <dmitry.osipenko@collabora.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-ID: <20241010182427.1434605-34-seanjc@google.com>
2024-10-25KVM: nVMX: Mark vmcs12's APIC access page dirty when unmappingSean Christopherson
Mark the APIC access page as dirty when unmapping it from KVM. The fact that the page _shouldn't_ be written doesn't guarantee the page _won't_ be written. And while the contents are likely irrelevant, the values _are_ visible to the guest, i.e. dropping writes would be visible to the guest (though obviously highly unlikely to be problematic in practice). Marking the map dirty will allow specifying the write vs. read-only when *mapping* the memory, which in turn will allow creating read-only maps. Tested-by: Alex Bennée <alex.bennee@linaro.org> Signed-off-by: Sean Christopherson <seanjc@google.com> Tested-by: Dmitry Osipenko <dmitry.osipenko@collabora.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-ID: <20241010182427.1434605-33-seanjc@google.com>
2024-10-25KVM: nVMX: Add helper to put (unmap) vmcs12 pagesSean Christopherson
Add a helper to dedup unmapping the vmcs12 pages. This will reduce the amount of churn when a future patch refactors the kvm_vcpu_unmap() API. No functional change intended. Tested-by: Alex Bennée <alex.bennee@linaro.org> Signed-off-by: Sean Christopherson <seanjc@google.com> Tested-by: Dmitry Osipenko <dmitry.osipenko@collabora.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-ID: <20241010182427.1434605-26-seanjc@google.com>
2024-10-25KVM: nVMX: Drop pointless msr_bitmap_map field from struct nested_vmxSean Christopherson
Remove vcpu_vmx.msr_bitmap_map and instead use an on-stack structure in the one function that uses the map, nested_vmx_prepare_msr_bitmap(). Tested-by: Alex Bennée <alex.bennee@linaro.org> Signed-off-by: Sean Christopherson <seanjc@google.com> Tested-by: Dmitry Osipenko <dmitry.osipenko@collabora.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-ID: <20241010182427.1434605-25-seanjc@google.com>
2024-10-25KVM: nVMX: Rely on kvm_vcpu_unmap() to track validity of eVMCS mappingSean Christopherson
Remove the explicit evmptr12 validity check when deciding whether or not to unmap the eVMCS pointer, and instead rely on kvm_vcpu_unmap() to play nice with a NULL map->hva, i.e. to do nothing if the map is invalid. Note, vmx->nested.hv_evmcs_map is zero-allocated along with the rest of vcpu_vmx, i.e. the map starts out invalid/NULL. Tested-by: Alex Bennée <alex.bennee@linaro.org> Signed-off-by: Sean Christopherson <seanjc@google.com> Tested-by: Dmitry Osipenko <dmitry.osipenko@collabora.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-ID: <20241010182427.1434605-24-seanjc@google.com>
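The pattern being relied on here, sketched in isolation (illustrative types, not the actual kvm_vcpu_unmap() implementation): a zero-initialized map has a NULL hva, so the unmap helper can simply do nothing for a map that was never established.

    #include <stddef.h>
    #include <stdbool.h>

    struct host_map { void *hva; };   /* stand-in for struct kvm_host_map */

    static void vcpu_unmap_sketch(struct host_map *map, bool dirty)
    {
            (void)dirty;            /* the real helper also conveys dirty state */
            if (!map->hva)          /* never mapped (or already unmapped): nothing to do */
                    return;
            /* ...unmap, optionally mark the backing page dirty... */
            map->hva = NULL;
    }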
2024-10-25KVM: Drop unused "hva" pointer from __gfn_to_pfn_memslot()Sean Christopherson
Drop @hva from __gfn_to_pfn_memslot() now that all callers pass NULL. No functional change intended. Tested-by: Alex Bennée <alex.bennee@linaro.org> Signed-off-by: Sean Christopherson <seanjc@google.com> Tested-by: Dmitry Osipenko <dmitry.osipenko@collabora.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-ID: <20241010182427.1434605-19-seanjc@google.com>
2024-10-25KVM: x86/mmu: Drop kvm_page_fault.hva, i.e. don't track intermediate hvaSean Christopherson
Remove kvm_page_fault.hva as it is never read, only written. This will allow removing the @hva param from __gfn_to_pfn_memslot(). Tested-by: Alex Bennée <alex.bennee@linaro.org> Signed-off-by: Sean Christopherson <seanjc@google.com> Tested-by: Dmitry Osipenko <dmitry.osipenko@collabora.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-ID: <20241010182427.1434605-18-seanjc@google.com>
2024-10-25KVM: Replace "async" pointer in gfn=>pfn with "no_wait" and error codeDavid Stevens
Add a pfn error code to communicate that hva_to_pfn() failed because I/O was needed and disallowed, and convert @async to a constant @no_wait boolean. This will allow eliminating the @no_wait param by having callers pass in FOLL_NOWAIT along with other FOLL_* flags. Tested-by: Alex Bennée <alex.bennee@linaro.org> Signed-off-by: David Stevens <stevensd@chromium.org> Co-developed-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Tested-by: Dmitry Osipenko <dmitry.osipenko@collabora.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-ID: <20241010182427.1434605-17-seanjc@google.com>
2024-10-25KVM: Drop @atomic param from gfn=>pfn and hva=>pfn APIsSean Christopherson
Drop @atomic from the myriad "to_pfn" APIs now that all callers pass "false", and remove a comment blurb about KVM running only the "GUP fast" part in atomic context. No functional change intended. Reviewed-by: Alex Bennée <alex.bennee@linaro.org> Tested-by: Alex Bennée <alex.bennee@linaro.org> Signed-off-by: Sean Christopherson <seanjc@google.com> Tested-by: Dmitry Osipenko <dmitry.osipenko@collabora.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-ID: <20241010182427.1434605-13-seanjc@google.com>
2024-10-25KVM: Rename gfn_to_page_many_atomic() to kvm_prefetch_pages()Sean Christopherson
Rename gfn_to_page_many_atomic() to kvm_prefetch_pages() to try and communicate its true purpose, as the "atomic" aspect is essentially a side effect of the fact that x86 uses the API while holding mmu_lock. E.g. even if mmu_lock weren't held, KVM wouldn't want to fault-in pages, as the goal is to opportunistically grab surrounding pages that have already been accessed and/or dirtied by the host, and to do so quickly. Tested-by: Alex Bennée <alex.bennee@linaro.org> Signed-off-by: Sean Christopherson <seanjc@google.com> Tested-by: Dmitry Osipenko <dmitry.osipenko@collabora.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-ID: <20241010182427.1434605-12-seanjc@google.com>
2024-10-25KVM: x86/mmu: Use gfn_to_page_many_atomic() when prefetching indirect PTEsSean Christopherson
Use gfn_to_page_many_atomic() instead of gfn_to_pfn_memslot_atomic() when prefetching indirect PTEs (direct_pte_prefetch_many() already uses the "to page" APIs). Functionally, the two are subtly equivalent, as the "to pfn" API short-circuits hva_to_pfn() if hva_to_pfn_fast() fails, i.e. is just a wrapper for get_user_page_fast_only()/get_user_pages_fast_only(). Switching to the "to page" API will allow dropping the @atomic parameter from the entire hva_to_pfn() callchain. Tested-by: Alex Bennée <alex.bennee@linaro.org> Signed-off-by: Sean Christopherson <seanjc@google.com> Tested-by: Dmitry Osipenko <dmitry.osipenko@collabora.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-ID: <20241010182427.1434605-11-seanjc@google.com>
2024-10-25KVM: x86/mmu: Mark page/folio accessed only when zapping leaf SPTEsSean Christopherson
Now that KVM doesn't clobber Accessed bits of shadow-present SPTEs, e.g. when prefetching, mark folios as accessed only when zapping leaf SPTEs, which is a rough heuristic for "only in response to an mmu_notifier invalidation". Page aging and LRUs are tolerant of false negatives, i.e. KVM doesn't need to be precise for correctness, and re-marking folios as accessed when zapping entire roots or when zapping collapsible SPTEs is expensive and adds very little value. E.g. when a VM is dying, all of its memory is being freed; marking folios accessed at that time provides no known value. Similarly, because KVM marks folios as accessed when creating SPTEs, marking all folios as accessed when userspace happens to delete a memslot doesn't add value. The folio was marked accessed when the old SPTE was created, and will be marked accessed yet again if a vCPU accesses the pfn again after reloading a new root. Zapping collapsible SPTEs is a similar story; marking folios accessed just because userspace disables dirty logging is a side effect of KVM behavior, not a deliberate goal. As an intermediate step, a.k.a. bisection point, towards *never* marking folios accessed when dropping SPTEs, mark folios accessed when the primary MMU might be invalidating mappings, as such zappings are not KVM initiated, i.e. might actually be related to page aging and LRU activity. Note, x86 is the only KVM architecture that "double dips"; every other arch marks pfns as accessed only when mapping into the guest, not when mapping into the guest _and_ when removing from the guest. Tested-by: Alex Bennée <alex.bennee@linaro.org> Signed-off-by: Sean Christopherson <seanjc@google.com> Tested-by: Dmitry Osipenko <dmitry.osipenko@collabora.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-ID: <20241010182427.1434605-10-seanjc@google.com>
2024-10-25KVM: x86/mmu: Mark folio dirty when creating SPTE, not when zapping/modifyingSean Christopherson
Mark pages/folios dirty when creating SPTEs to map PFNs into the guest, not when zapping or modifying SPTEs, as marking folios dirty when zapping or modifying SPTEs can be extremely inefficient. E.g. when KVM is zapping collapsible SPTEs to reconstitute a hugepage after disabling dirty logging, KVM will mark every 4KiB pfn as dirty, even though _at least_ 512 pfns are guaranteed to be in a single folio (the SPTE couldn't potentially be huge if that weren't the case). The problem only becomes worse for 1GiB HugeTLB pages, as KVM can mark a single folio dirty 512*512 times. Marking a folio dirty when mapping is functionally safe as KVM drops all relevant SPTEs in response to an mmu_notifier invalidation, i.e. ensures that the guest can't dirty a folio after access has been removed. And because KVM already marks folios dirty when zapping/modifying SPTEs for KVM reasons, i.e. not in response to an mmu_notifier invalidation, there is no danger of "prematurely" marking a folio dirty. E.g. if a filesystem cleans a folio without first removing write access, then there already exists races where KVM could mark a folio dirty before remote TLBs are flushed, i.e. before guest writes are guaranteed to stop. Furthermore, x86 is literally the only architecture that marks folios dirty on the backend; every other KVM architecture marks folios dirty at map time. x86's unique behavior likely stems from the fact that x86's MMU predates mmu_notifiers. Long, long ago, before mmu_notifiers were added, marking pages dirty when zapping SPTEs was logical, and perhaps even necessary, as KVM held references to pages, i.e. kept a page's refcount elevated while the page was mapped into the guest. At the time, KVM's rmap_remove() simply did: if (is_writeble_pte(*spte)) kvm_release_pfn_dirty(pfn); else kvm_release_pfn_clean(pfn); i.e. dropped the refcount and marked the page dirty at the same time. After mmu_notifiers were introduced, commit acb66dd051d0 ("KVM: MMU: don't hold pagecount reference for mapped sptes pages") removed the refcount logic, but kept the dirty logic, i.e. converted the above to: if (is_writeble_pte(*spte)) kvm_release_pfn_dirty(pfn); And for KVM x86, that's essentially how things have stayed over the last ~15 years, without anyone revisiting *why* KVM marks pages/folios dirty at zap/modification time, e.g. the behavior was blindly carried forward to the TDP MMU. Practically speaking, the only downside to marking a folio dirty during mapping is that KVM could trigger writeback of memory that was never actually written. Except that can't actually happen if KVM marks folios dirty if and only if a writable SPTE is created (as done here), because KVM always marks writable SPTEs as dirty during make_spte(). See commit 9b51a63024bd ("KVM: MMU: Explicitly set D-bit for writable spte."), circa 2015. Note, KVM's access tracking logic for prefetched SPTEs is a bit odd. If a guest PTE is dirty and writable, KVM will create a writable SPTE, but then mark the SPTE for access tracking. Which isn't wrong, just a bit odd, as it results in _more_ precise dirty tracking for MMUs _without_ A/D bits. To keep things simple, mark the folio dirty before access tracking comes into play, as an access-tracked SPTE can be restored in the fast page fault path, i.e. without holding mmu_lock. While writing SPTEs and accessing memslots outside of mmu_lock is safe, marking a folio dirty is not. E.g.
if the fast path gets interrupted _just_ after setting a SPTE, the primary MMU could theoretically invalidate and free a folio before KVM marks it dirty. Unlike the shadow MMU, which waits for CPUs to respond to an IPI, the TDP MMU only guarantees the page tables themselves won't be freed (via RCU). Opportunistically update a few stale comments. Cc: David Matlack <dmatlack@google.com> Tested-by: Alex Bennée <alex.bennee@linaro.org> Signed-off-by: Sean Christopherson <seanjc@google.com> Tested-by: Dmitry Osipenko <dmitry.osipenko@collabora.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-ID: <20241010182427.1434605-9-seanjc@google.com>
2024-10-25KVM: x86/mmu: Mark new SPTE as Accessed when synchronizing existing SPTESean Christopherson
Set the Accessed bit when making a "new" SPTE during SPTE synchronization, as _clearing_ the Accessed bit is counter-productive, and even if the Accessed bit wasn't set in the old SPTE, odds are very good the guest will access the page in the near future, as the most common case where KVM synchronizes a shadow-present SPTE is when the guest is making the gPTE read-only for Copy-on-Write (CoW). Preserving the Accessed bit will allow dropping the logic that propagates the Accessed bit to the underlying struct page when overwriting an existing SPTE, without undue risk of regressing page aging. Note, KVM's current behavior is very deliberate, as SPTE synchronization was the only "speculative" access type as of commit 947da5383069 ("KVM: MMU: Set the accessed bit on non-speculative shadow ptes"). But, much has changed since 2008, and more changes are on the horizon. Spurious clearing of the Accessed (and Dirty) bits was mitigated by commit e6722d9211b2 ("KVM: x86/mmu: Reduce the update to the spte in FNAME(sync_spte)"), which changed FNAME(sync_spte) to only overwrite SPTEs if the protections are actually changing. I.e. KVM is already preserving Accessed information for SPTEs that aren't dropping protections. And with the aforementioned future change to NOT mark the page/folio as accessed, KVM's SPTEs will become the "source of truth" so to speak, in which case clearing the Accessed bit outside of page aging becomes very undesirable. Suggested-by: Yan Zhao <yan.y.zhao@intel.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Tested-by: Dmitry Osipenko <dmitry.osipenko@collabora.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-ID: <20241010182427.1434605-8-seanjc@google.com>
2024-10-25KVM: x86/mmu: Invert @can_unsync and renamed to @synchronizingSean Christopherson
Invert the polarity of "can_unsync" and rename the parameter to "synchronizing" to allow a future change to set the Accessed bit if KVM is synchronizing an existing SPTE. Querying "can_unsync" in that case is nonsensical, as the fact that KVM can't unsync SPTEs doesn't provide any justification for setting the Accessed bit. Signed-off-by: Sean Christopherson <seanjc@google.com> Tested-by: Dmitry Osipenko <dmitry.osipenko@collabora.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-ID: <20241010182427.1434605-7-seanjc@google.com>
2024-10-25KVM: x86/mmu: Don't overwrite shadow-present MMU SPTEs when prefaultingSean Christopherson
Treat attempts to prefetch/prefault MMU SPTEs as spurious if there's an existing shadow-present SPTE, as overwriting a SPTE that may have been created by a "real" fault is at best confusing, and at worst potentially harmful. E.g. mmu_try_to_unsync_pages() doesn't unsync when prefetching, which creates a scenario where KVM could try to replace a Writable SPTE with a !Writable SPTE, as sp->unsync is checked prior to acquiring mmu_unsync_pages_lock. Note, this applies to three of the four flavors of "prefetch" in KVM: - KVM_PRE_FAULT_MEMORY - Async #PF (host or PV) - Prefetching The fourth flavor, SPTE synchronization, i.e. FNAME(sync_spte), _only_ overwrites shadow-present SPTEs when calling make_spte(). But SPTE synchronization specifically uses mmu_spte_update(), and so naturally avoids the @prefetch check in mmu_set_spte(). Signed-off-by: Sean Christopherson <seanjc@google.com> Tested-by: Dmitry Osipenko <dmitry.osipenko@collabora.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-ID: <20241010182427.1434605-6-seanjc@google.com>
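The guard being added is conceptually tiny; a standalone sketch (placeholder bit layout and return codes, not KVM's) of the "treat as spurious" behavior:

    #include <stdbool.h>
    #include <stdint.h>

    #define SPTE_PRESENT (1ull << 11)   /* placeholder "shadow-present" bit */

    enum { RET_OK, RET_SPURIOUS };

    static int set_spte_sketch(uint64_t *sptep, uint64_t new_spte, bool prefetch)
    {
            /* Never let a prefetch/prefault clobber an SPTE that a real fault
             * (or sync) may have installed; just report it as spurious. */
            if (prefetch && (*sptep & SPTE_PRESENT))
                    return RET_SPURIOUS;

            *sptep = new_spte;
            return RET_OK;
    }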
2024-10-25KVM: x86/mmu: Skip the "try unsync" path iff the old SPTE was a leaf SPTESean Christopherson
Apply make_spte()'s optimization to skip trying to unsync shadow pages if and only if the old SPTE was a leaf SPTE, as non-leaf SPTEs in direct MMUs are always writable, i.e. could trigger a false positive and incorrectly lead to KVM creating a SPTE without write-protecting or marking shadow pages unsync. This bug only affects the TDP MMU, as the shadow MMU only overwrites a shadow-present SPTE when synchronizing SPTEs (and only 4KiB SPTEs can be unsync). Specifically, mmu_set_spte() drops any non-leaf SPTEs *before* calling make_spte(), whereas the TDP MMU can do a direct replacement of a page table with the leaf SPTE. Opportunistically update the comment to explain why skipping the unsync stuff is safe, as opposed to simply saying "it's someone else's problem". Cc: stable@vger.kernel.org Tested-by: Alex Bennée <alex.bennee@linaro.org> Signed-off-by: Sean Christopherson <seanjc@google.com> Tested-by: Dmitry Osipenko <dmitry.osipenko@collabora.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-ID: <20241010182427.1434605-5-seanjc@google.com>
2024-10-25KVM: Drop KVM_ERR_PTR_BAD_PAGE and instead return NULL to indicate an errorSean Christopherson
Remove KVM_ERR_PTR_BAD_PAGE and instead return NULL, as "bad page" is just a leftover bit of weirdness from days of old when KVM stuffed a "bad" page into the guest instead of actually handling missing pages. See commit cea7bb21280e ("KVM: MMU: Make gfn_to_page() always safe"). Reviewed-by: Alex Bennée <alex.bennee@linaro.org> Tested-by: Alex Bennée <alex.bennee@linaro.org> Signed-off-by: Sean Christopherson <seanjc@google.com> Tested-by: Dmitry Osipenko <dmitry.osipenko@collabora.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-ID: <20241010182427.1434605-2-seanjc@google.com>
2024-10-25x86: fix whitespace in runtime-const assembler outputLinus Torvalds
The x86 user pointer validation changes made me look at compiler output a lot, and the wrong indentation for the ".popsection" in the generated assembler triggered me. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2024-10-25x86: fix user address masking non-canonical speculation issueLinus Torvalds
It turns out that AMD has a "Meltdown Lite(tm)" issue with non-canonical accesses in kernel space. And so using just the high bit to decide whether an access is in user space or kernel space ends up with the good old "leak speculative data" if you have the right gadget using the result: CVE-2020-12965 “Transient Execution of Non-Canonical Accesses“ Now, the kernel surrounds the access with a STAC/CLAC pair, and those instructions end up serializing execution on older Zen architectures, which closes the speculation window. But that was true only up until Zen 5, which renames the AC bit [1]. That improves performance of STAC/CLAC a lot, but also means that the speculation window is now open. Note that this affects not just the new address masking, but also the regular valid_user_address() check used by access_ok(), and the asm version of the sign bit check in the get_user() helpers. It does not affect put_user() or clear_user() variants, since there's no speculative result to be used in a gadget for those operations. Reported-by: Andrew Cooper <andrew.cooper3@citrix.com> Link: https://lore.kernel.org/all/80d94591-1297-4afb-b510-c665efd37f10@citrix.com/ Link: https://lore.kernel.org/all/20241023094448.GAZxjFkEOOF_DM83TQ@fat_crate.local/ [1] Link: https://www.amd.com/en/resources/product-security/bulletin/amd-sb-1010.html Link: https://arxiv.org/pdf/2108.10771 Cc: Josh Poimboeuf <jpoimboe@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Tested-by: Maciej Wieczor-Retman <maciej.wieczor-retman@intel.com> # LAM case Fixes: 2865baf54077 ("x86: support user address masking instead of non-speculative conditional") Fixes: 6014bc27561f ("x86-64: make access_ok() independent of LAM") Fixes: b19b74bc99b1 ("x86/mm: Rework address range check in get_user() and put_user()") Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
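For readers unfamiliar with the check being discussed, here is the idea in isolation (a sketch, not the kernel's masking code): the "is this a user pointer?" test that relies only on the sign bit, which a non-canonical address can pass.

    #include <stdbool.h>
    #include <stdint.h>

    /* Sign-bit-only classification: "user half" if bit 63 is clear. */
    static bool looks_like_user_address(uint64_t addr)
    {
            return (int64_t)addr >= 0;
    }

    /* A non-canonical address such as 0x0080000000000000 passes this test.
     * Per the commit, once STAC/CLAC stop serializing execution (Zen 5),
     * the transient-access window of CVE-2020-12965 is open again. */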
2024-10-25x86/microcode/intel: Remove unnecessary cache writeback and invalidationChang S. Bae
Currently, an unconditional cache flush is performed during every microcode update. Although the original changelog did not mention a specific erratum, this measure was primarily intended to address a specific microcode bug, the load of which has already been blocked by is_blacklisted(). Therefore, this cache flush is no longer necessary. Additionally, the side effects of doing this have been overlooked. It increases CPU rendezvous time during late loading, where the cache flush takes between 1x and 3.5x longer than the actual microcode update. Remove native_wbinvd() and update the erratum name to align with the latest errata documentation, document ID 334163 Version 022US. [ bp: Zap the flaky documentation URL. ] Fixes: 91df9fdf5149 ("x86/microcode/intel: Writeback and invalidate caches before updating microcode") Reported-by: Yan Hua Wu <yanhua1.wu@intel.com> Reported-by: William Xie <william.xie@intel.com> Signed-off-by: Chang S. Bae <chang.seok.bae@intel.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Acked-by: Ashok Raj <ashok.raj@intel.com> Tested-by: Yan Hua Wu <yanhua1.wu@intel.com> Link: https://lore.kernel.org/r/20241001161042.465584-2-chang.seok.bae@intel.com
2024-10-23x86/sev: Ensure that RMP table fixups are reservedAshish Kalra
The BIOS reserves RMP table memory via e820 reservations. This can still lead to RMP page faults during kexec if the host tries to access memory within the same 2MB region. Commit 400fea4b9651 ("x86/sev: Add callback to apply RMP table fixups for kexec") adjusts the e820 reservations for the RMP table so that the entire 2MB range at the start/end of the RMP table is marked reserved. The e820 reservations are then passed to firmware via SNP_INIT where they get marked HV-Fixed. The RMP table fixups are done after the e820 ranges have been added to memblock, allowing the fixup ranges to still be allocated and used by the system. The problem is that this memory range is now marked reserved in the e820 tables and during SNP initialization these reserved ranges are marked as HV-Fixed. This means that the pages cannot be used by an SNP guest, only by the hypervisor. However, the memory management subsystem does not make this distinction and can allocate one of those pages to an SNP guest. This will ultimately result in RMPUPDATE failures associated with the guest, causing it to fail to start or terminate when accessing the HV-Fixed page. The issue is captured below with memblock=debug: [ 0.000000] SEV-SNP: *** DEBUG: snp_probe_rmptable_info:352 - rmp_base=0x280d4800000, rmp_end=0x28357efffff ... [ 0.000000] BIOS-provided physical RAM map: ... [ 0.000000] BIOS-e820: [mem 0x00000280d4800000-0x0000028357efffff] reserved [ 0.000000] BIOS-e820: [mem 0x0000028357f00000-0x0000028357ffffff] usable ... ... [ 0.183593] memblock add: [0x0000028357f00000-0x0000028357ffffff] e820__memblock_setup+0x74/0xb0 ... [ 0.203179] MEMBLOCK configuration: [ 0.207057] memory size = 0x0000027d0d194000 reserved size = 0x0000000009ed2c00 [ 0.215299] memory.cnt = 0xb ... [ 0.311192] memory[0x9] [0x0000028357f00000-0x0000028357ffffff], 0x0000000000100000 bytes flags: 0x0 ... ... [ 0.419110] SEV-SNP: Reserving start/end of RMP table on a 2MB boundary [0x0000028357e00000] [ 0.428514] e820: update [mem 0x28357e00000-0x28357ffffff] usable ==> reserved [ 0.428517] e820: update [mem 0x28357e00000-0x28357ffffff] usable ==> reserved [ 0.428520] e820: update [mem 0x28357e00000-0x28357ffffff] usable ==> reserved ... ... [ 5.604051] MEMBLOCK configuration: [ 5.607922] memory size = 0x0000027d0d194000 reserved size = 0x0000000011faae02 [ 5.616163] memory.cnt = 0xe ... [ 5.754525] memory[0xc] [0x0000028357f00000-0x0000028357ffffff], 0x0000000000100000 bytes on node 0 flags: 0x0 ... ... [ 10.080295] Early memory node ranges [ 10.168065] ... node 0: [mem 0x0000028357f00000-0x0000028357ffffff] ... ... [ 8149.348948] SEV-SNP: RMPUPDATE failed for PFN 28357f7c, pg_level: 1, ret: 2 As shown above, the memblock allocations show 1MB after the end of the RMP as available for allocation, which is what the RMP table fixups have reserved. This memory range subsequently gets allocated as SNP guest memory, resulting in an RMPUPDATE failure. This can potentially be fixed by not reserving the memory range in the e820 table, but that causes kexec failures when using the KEXEC_FILE_LOAD syscall. The solution is to use memblock_reserve() to mark the memory reserved for the system, ensuring that it cannot be allocated to an SNP guest. Since HV-Fixed memory is still readable/writable by the host, this only ends up being a problem if the memory in this range requires a page state change, which generally will only happen when allocating memory in this range to be used for running SNP guests, which is now possible with the SNP hypervisor support in kernel 6.11.
Backporter note: Fixes tag points to a 6.9 change but as the last paragraph above explains, this whole thing can happen after 6.11 received SNP HV support, therefore backporting to 6.9 is not really necessary. [ bp: Massage commit message. ] Fixes: 400fea4b9651 ("x86/sev: Add callback to apply RMP table fixups for kexec") Suggested-by: Thomas Lendacky <thomas.lendacky@amd.com> Signed-off-by: Ashish Kalra <ashish.kalra@amd.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com> Cc: <stable@kernel.org> # 6.11, see Backporter note above. Link: https://lore.kernel.org/r/20240815221630.131133-1-Ashish.Kalra@amd.com
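The essence of the fix, as a fragment (illustrative variable names; the actual patch decides the exact range): reserve the fixup range with memblock in addition to the e820 reservation, so the allocator can never hand it to an SNP guest.

    /* In the RMP table fixup path, roughly: */
    memblock_reserve(rmp_fixup_start, rmp_fixup_size);  /* keep MM from allocating the HV-Fixed range */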
2024-10-23um: switch to regset API and depend on XSTATEBenjamin Berg
The PTRACE_GETREGSET API has now existed since Linux 2.6.33. The XSAVE CPU feature should also be sufficiently common to be able to rely on it. With this, define our internal FP state to be the host's XSAVE data. Add discovery for the host's XSAVE size and place the FP registers at the end of task_struct so that we can adjust the size at runtime. Next we can implement the regset API on top and update the signal handling as well as ptrace APIs to use them. Also switch coredump creation to use the regset API and finally set HAVE_ARCH_TRACEHOOK. This considerably improves the signal frames. Previously they might not have contained all the registers (i386) and also did not have the sizes and magic values set to the correct values to permit userspace to decode the frame. As a side effect, this will permit UML to run on hosts with newer CPU extensions (such as AMX) that need even more register state. Signed-off-by: Benjamin Berg <benjamin.berg@intel.com> Link: https://patch.msgid.link/20241023094120.4083426-1-benjamin@sipsolutions.net Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2024-10-23um: vdso: Always reject undefined references during linkingThomas Weißschuh
Instead of using a custom script to detect and fail on undefined references, use --no-undefined for all VDSO linker invocations. Drop the now unused checkundef.sh script. Signed-off-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de> Link: https://patch.msgid.link/20241011-vdso-checkundef-v1-2-1a46e0352d20@linutronix.de Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2024-10-23um: make stub_exe _start() pure inline asmJohannes Berg
Since __attribute__((naked)) cannot be used with functions containing C statements, just generate the few instructions it needs in assembly directly. While at it, fix the stack usage ("1 + 2*x - 1" is odd) and document what it must do, and why it must adjust the stack. Fixes: 8508a5e0e9db ("um: Fix misaligned stack in stub_exe") Link: https://lore.kernel.org/linux-um/CABVgOSntH-uoOFMP5HwMXjx_f1osMnVdhgKRKm4uz6DFm2Lb8Q@mail.gmail.com/ Reviewed-by: David Gow <davidgow@google.com> Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2024-10-22x86/microcode/AMD: Split load_microcode_amd()Borislav Petkov (AMD)
This function should've been split a long time ago because it is used in two paths: 1) On the late loading path, when the microcode is loaded through the request_firmware interface 2) In the save_microcode_in_initrd() path which collects all the microcode patches which are relevant for the current system before the initrd with the microcode container has been jettisoned. In that path, it is not really necessary to iterate over the nodes on a system and match a patch; however, it didn't cause any trouble, so it was left for a later cleanup. However, that later cleanup was expedited by the fact that Jens was enabling "Use L3 as a NUMA node" in the BIOS setting in his machine and so this caused the NUMA CPU masks used in cpumask_of_node() to be generated *after* 2) above happened on the first node. Which means, all those masks were funky, wrong, uninitialized and whatnot, leading to explosions when dereffing c->microcode in load_microcode_amd(). So split that function and do only the necessary work needed at each stage. Fixes: 94838d230a6c ("x86/microcode/AMD: Use the family,model,stepping encoded in the patch ID") Reported-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Tested-by: Jens Axboe <axboe@kernel.dk> Link: https://lore.kernel.org/r/91194406-3fdf-4e38-9838-d334af538f74@kernel.dk
2024-10-22x86/microcode/AMD: Pay attention to the stepping dynamicallyBorislav Petkov (AMD)
Commit in Fixes changed how a microcode patch is loaded on Zen and newer but the patch matching needs to happen with different rigidity, depending on what is being done: 1) When the patch is added to the patches cache, the stepping must be ignored because the driver still supports different steppings per system 2) When the patch is matched for loading, then the stepping must be taken into account because each CPU needs the patch matching its exact stepping Take care of that by making the matching smarter. Fixes: 94838d230a6c ("x86/microcode/AMD: Use the family,model,stepping encoded in the patch ID") Reported-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Tested-by: Jens Axboe <axboe@kernel.dk> Link: https://lore.kernel.org/r/91194406-3fdf-4e38-9838-d334af538f74@kernel.dk
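A standalone sketch of the "different rigidity" idea (field names are illustrative, not the driver's): one matcher, with the stepping compared only when the caller asks for it — ignored when populating the patch cache, honored when picking the patch to load on a given CPU.

    #include <stdbool.h>
    #include <stdint.h>

    struct fms { uint8_t family, model, stepping; };

    static bool patch_matches(struct fms cpu, struct fms patch, bool match_stepping)
    {
            if (cpu.family != patch.family || cpu.model != patch.model)
                    return false;
            /* Caching path passes match_stepping=false, load path passes true. */
            return !match_stepping || cpu.stepping == patch.stepping;
    }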
2024-10-21x86/lam: Disable ADDRESS_MASKING in most casesPawan Gupta
Linear Address Masking (LAM) has a weakness related to transient execution as described in the SLAM paper[1]. Unless Linear Address Space Separation (LASS) is enabled, this weakness may be exploitable. Until the kernel adds support for LASS[2], only allow LAM for COMPILE_TEST, or when speculation mitigations have been disabled at compile time, otherwise keep LAM disabled. There are no processors on the market that support LAM yet, so currently nobody is affected by this issue. [1] SLAM: https://download.vusec.net/papers/slam_sp24.pdf [2] LASS: https://lore.kernel.org/lkml/20230609183632.48706-1-alexander.shishkin@linux.intel.com/ [ dhansen: update SPECULATION_MITIGATIONS -> CPU_MITIGATIONS ] Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Sohil Mehta <sohil.mehta@intel.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/all/5373262886f2783f054256babdf5a98545dc986b.1706068222.git.pawan.kumar.gupta%40linux.intel.com
2024-10-21Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvmLinus Torvalds
Pull kvm fixes from Paolo Bonzini: "ARM64: - Fix the guest view of the ID registers, making the relevant fields writable from userspace (affecting ID_AA64DFR0_EL1 and ID_AA64PFR1_EL1) - Correctly expose S1PIE to guests, fixing a regression introduced in 6.12-rc1 with the S1POE support - Fix the recycling of stage-2 shadow MMUs by tracking the context (are we allowed to block or not) as well as the recycling state - Address a couple of issues with the vgic when userspace misconfigures the emulation, resulting in various splats. Headaches courtesy of our Syzkaller friends - Stop wasting space in the HYP idmap, as we are dangerously close to the 4kB limit, and this has already exploded in -next - Fix another race in vgic_init() - Fix a UBSAN error when faking the cache topology with MTE enabled RISCV: - RISCV: KVM: use raw_spinlock for critical section in imsic x86: - A bandaid for lack of XCR0 setup in selftests, which causes trouble if the compiler is configured to have x86-64-v3 (with AVX) as the default ISA. Proper XCR0 setup will come in the next merge window. - Fix an issue where KVM would not ignore low bits of the nested CR3 and potentially leak up to 31 bytes out of the guest memory's bounds - Fix case in which an out-of-date cached value for the segments could be returned by KVM_GET_SREGS. - More cleanups for KVM_X86_QUIRK_SLOT_ZAP_ALL - Override MTRR state for KVM confidential guests, making it WB by default as is already the case for Hyper-V guests. Generic: - Remove a couple of unused functions" * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (27 commits) RISCV: KVM: use raw_spinlock for critical section in imsic KVM: selftests: Fix out-of-bounds reads in CPUID test's array lookups KVM: selftests: x86: Avoid using SSE/AVX instructions KVM: nSVM: Ignore nCR3[4:0] when loading PDPTEs from memory KVM: VMX: reset the segment cache after segment init in vmx_vcpu_reset() KVM: x86: Clean up documentation for KVM_X86_QUIRK_SLOT_ZAP_ALL KVM: x86/mmu: Add lockdep assert to enforce safe usage of kvm_unmap_gfn_range() KVM: x86/mmu: Zap only SPs that shadow gPTEs when deleting memslot x86/kvm: Override default caching mode for SEV-SNP and TDX KVM: Remove unused kvm_vcpu_gfn_to_pfn_atomic KVM: Remove unused kvm_vcpu_gfn_to_pfn KVM: arm64: Ensure vgic_ready() is ordered against MMIO registration KVM: arm64: vgic: Don't check for vgic_ready() when setting NR_IRQS KVM: arm64: Fix shift-out-of-bounds bug KVM: arm64: Shave a few bytes from the EL2 idmap code KVM: arm64: Don't eagerly teardown the vgic on init error KVM: arm64: Expose S1PIE to guests KVM: arm64: nv: Clarify safety of allowing TLBI unmaps to reschedule KVM: arm64: nv: Punt stage-2 recycling to a vCPU request KVM: arm64: nv: Do not block when unmapping stage-2 if disallowed ...
2024-10-21x86/platform: Switch back to struct platform_driver::remove()Uwe Kleine-König
After 0edb555a65d1 ("platform: Make platform_driver::remove() return void") .remove() is (again) the right callback to implement for platform drivers. Convert all platform drivers below arch/x86 to use .remove(), with the eventual goal to drop struct platform_driver::remove_new(). As .remove() and .remove_new() have the same prototypes, conversion is done by just changing the structure member name in the driver initializer. Signed-off-by: Uwe Kleine-König <u.kleine-koenig@baylibre.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/r/20241021103954.403577-2-u.kleine-koenig@baylibre.com
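The conversion is mechanical; for a hypothetical driver "foo" it amounts to renaming the initializer member (a sketch, not one of the converted drivers):

    #include <linux/platform_device.h>

    static int foo_probe(struct platform_device *pdev) { return 0; }
    static void foo_remove(struct platform_device *pdev) { }

    static struct platform_driver foo_driver = {
            .driver = { .name = "foo" },
            .probe  = foo_probe,
            .remove = foo_remove,   /* previously spelled .remove_new */
    };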
2024-10-20Merge tag 'x86_urgent_for_v6.12_rc4' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 fixes from Borislav Petkov: - Explicitly disable the TSC deadline timer when going idle to address some CPU errata in that area - Do not apply the Zenbleed fix on anything except AMD Zen2 on the late microcode loading path - Clear CPU buffers later in the NMI exit path on 32-bit to avoid register clearing while they still contain sensitive data, for the RFDS mitigation - Do not clobber EFLAGS.ZF with VERW on the opportunistic SYSRET exit path on 32-bit - Fix parsing issues of memory bandwidth specification in sysfs for resctrl's memory bandwidth allocation feature - Other small cleanups and improvements * tag 'x86_urgent_for_v6.12_rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/apic: Always explicitly disarm TSC-deadline timer x86/CPU/AMD: Only apply Zenbleed fix for Zen2 during late microcode load x86/bugs: Use code segment selector for VERW operand x86/entry_32: Clear CPU buffers after register restore in NMI return x86/entry_32: Do not clobber user EFLAGS.ZF x86/resctrl: Annotate get_mem_config() functions as __init x86/resctrl: Avoid overflow in MB settings in bw_validate() x86/amd_nb: Add new PCI ID for AMD family 1Ah model 20h
2024-10-20KVM: nSVM: Ignore nCR3[4:0] when loading PDPTEs from memorySean Christopherson
Ignore nCR3[4:0] when loading PDPTEs from memory for nested SVM, as bits 4:0 of CR3 are ignored when PAE paging is used, and thus VMRUN doesn't enforce 32-byte alignment of nCR3. In the absolute worst case scenario, failure to ignore bits 4:0 can result in an out-of-bounds read, e.g. if the target page is at the end of a memslot, and the VMM isn't using guard pages. Per the APM: The CR3 register points to the base address of the page-directory-pointer table. The page-directory-pointer table is aligned on a 32-byte boundary, with the low 5 address bits 4:0 assumed to be 0. And the SDM's much more explicit: 4:0 Ignored Note, KVM gets this right when loading PDPTRs, it's only the nSVM flow that is broken. Fixes: e4e517b4be01 ("KVM: MMU: Do not unconditionally read PDPTE from guest memory") Reported-by: Kirk Swidowski <swidowski@google.com> Cc: Andy Nguyen <theflow@google.com> Cc: 3pvd <3pvd@google.com> Cc: stable@vger.kernel.org Signed-off-by: Sean Christopherson <seanjc@google.com> Message-ID: <20241009140838.1036226-1-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
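The arithmetic at the heart of the fix, in isolation (a sketch, not the nSVM code): derive the PDPTE table address by masking off the ignored low bits rather than trusting them.

    #include <stdint.h>

    /* With PAE paging, CR3 bits 4:0 are ignored; the PDPTE table is
     * 32-byte aligned, so drop those bits before reading the 4 PDPTEs. */
    static uint64_t pdpt_base_from_ncr3(uint64_t ncr3)
    {
            return ncr3 & ~0x1Full;
    }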
2024-10-20KVM: VMX: reset the segment cache after segment init in vmx_vcpu_reset()Maxim Levitsky
Reset the segment cache after segment initialization in vmx_vcpu_reset() to harden KVM against caching stale/uninitialized data. Without the recent fix to bypass the cache in kvm_arch_vcpu_put(), the following scenario is possible: - vCPU is just created, and the vCPU thread is preempted before SS.AR_BYTES is written in vmx_vcpu_reset(). - When scheduling out the vCPU task, kvm_arch_vcpu_in_kernel() => vmx_get_cpl() reads and caches '0' for SS.AR_BYTES. - vmx_vcpu_reset() => seg_setup() configures SS.AR_BYTES, but doesn't invoke vmx_segment_cache_clear() to invalidate the cache. As a result, KVM retains a stale value in the cache, which can be read, e.g. via KVM_GET_SREGS. Usually this is not a problem because the VMX segment cache is reset on each VM-Exit, but if the userspace VMM (e.g KVM selftests) reads and writes system registers just after the vCPU was created, _without_ modifying SS.AR_BYTES, userspace will write back the stale '0' value and ultimately will trigger a VM-Entry failure due to incorrect SS segment type. Invalidating the cache after writing the VMCS doesn't address the general issue of cache accesses from IRQ context being unsafe, but it does prevent KVM from clobbering the VMCS, i.e. mitigates the harm done _if_ KVM has a bug that results in an unsafe cache access. Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com> Fixes: 2fb92db1ec08 ("KVM: VMX: Cache vmcs segment fields") [sean: rework changelog to account for previous patch] Signed-off-by: Sean Christopherson <seanjc@google.com> Message-ID: <20241009175002.1118178-3-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
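The ordering being enforced, shown schematically (function names are taken from the commit text; this is not the literal patch):

    /* vmx_vcpu_reset(), roughly: */
    seg_setup(VCPU_SREG_SS);          /* write the real segment state into the VMCS */
    /* ...remaining segment/VMCS init... */
    vmx_segment_cache_clear(vmx);     /* then invalidate anything cached before/while doing so */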
2024-10-20KVM: x86/mmu: Add lockdep assert to enforce safe usage of kvm_unmap_gfn_range()Sean Christopherson
Add a lockdep assertion in kvm_unmap_gfn_range() to ensure that either mmu_invalidate_in_progress is elevated, or that the range is being zapped due to memslot removal (loosely detected by slots_lock being held). Zapping SPTEs without mmu_invalidate_{in_progress,seq} protection is unsafe as KVM's page fault path snapshots state before acquiring mmu_lock, and thus can create SPTEs with stale information if vCPUs aren't forced to retry faults (due to seeing an in-progress or past MMU invalidation). Memslot removal is a special case, as the memslot is retrieved outside of mmu_invalidate_seq, i.e. doesn't use the "standard" protections, and instead relies on SRCU synchronization to ensure any in-flight page faults are fully resolved before zapping SPTEs. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-ID: <20241009192345.1148353-3-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
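The assertion described above is roughly of this shape (a sketch based on the commit text, not necessarily the exact line added):

    /* In kvm_unmap_gfn_range(), roughly: */
    lockdep_assert_once(kvm->mmu_invalidate_in_progress ||
                        lockdep_is_held(&kvm->slots_lock));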
2024-10-20KVM: x86/mmu: Zap only SPs that shadow gPTEs when deleting memslotSean Christopherson
When performing a targeted zap on memslot removal, zap only MMU pages that shadow guest PTEs, as zapping all SPs that "match" the gfn is inexact and unnecessary. Furthermore, for_each_gfn_valid_sp() arguably shouldn't exist, because it doesn't do what most people would expect it to do. The "round gfn for level" adjustment that is done for direct SPs (no gPTE) means that the exact gfn comparison will not get a match, even when a SP does "cover" a gfn, or was even created specifically for a gfn. For memslot deletion specifically, KVM's behavior will vary significantly based on the size and alignment of a memslot, and in weird ways. E.g. for a 4KiB memslot, KVM will zap more SPs if the slot is 1GiB aligned than if it's only 4KiB aligned. And as described below, zapping SPs in the aligned case overzaps for direct MMUs, as odds are good the upper-level SPs are serving other memslots. To iterate over all potentially-relevant gfns, KVM would need to make a pass over the hash table for each level, with the gfn used for lookup rounded for said level. And then check that the SP is of the correct level, too, e.g. to avoid over-zapping. But even then, KVM would massively overzap, as processing every level is all but guaranteed to zap SPs that serve other memslots, especially if the memslot being removed is relatively small. KVM could mitigate that issue by processing only levels that can be possible guest huge pages, i.e. are less likely to be re-used for other memslots, but while somewhat logical, that's quite arbitrary and would be a bit of a mess to implement. So, zap only SPs with gPTEs, as the resulting behavior is easy to describe, is predictable, and is explicitly minimal, i.e. KVM only zaps SPs that absolutely must be zapped. Cc: Yan Zhao <yan.y.zhao@intel.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Yan Zhao <yan.y.zhao@intel.com> Tested-by: Yan Zhao <yan.y.zhao@intel.com> Message-ID: <20241009192345.1148353-2-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-10-20x86/kvm: Override default caching mode for SEV-SNP and TDXKirill A. Shutemov
AMD SEV-SNP and Intel TDX have limited access to MTRR: either it is not advertised in CPUID or it cannot be programmed (on TDX, due to #VE on CR0.CD clear). This results in guests using uncached mappings where it shouldn't and pmd/pud_set_huge() failures due to non-uniform memory type reported by mtrr_type_lookup(). Override MTRR state, making it WB by default as the kernel does for Hyper-V guests. Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Suggested-by: Binbin Wu <binbin.wu@intel.com> Cc: Juergen Gross <jgross@suse.com> Cc: Tom Lendacky <thomas.lendacky@amd.com> Reviewed-by: Juergen Gross <jgross@suse.com> Message-ID: <20241015095818.357915-1-kirill.shutemov@linux.intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2024-10-18PCI: Remove unused PCI_SUBTRACTIVE_DECODEIlpo Järvinen
2fe2abf896c1 ("PCI: augment bus resource table with a list") added PCI_SUBTRACTIVE_DECODE which is put into the struct pci_bus_resource flags field but is never read. There seems to never have been users for it. Remove both PCI_SUBTRACTIVE_DECODE and the flags field from the struct pci_bus_resource. Link: https://lore.kernel.org/r/20241017141111.44612-1-ilpo.jarvinen@linux.intel.com Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
2024-10-17Merge tag 'x86_bugs_post_ibpb' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 IBPB fixes from Borislav Petkov: "This fixes the IBPB implementation of older AMDs (< gen4) that do not flush the RSB (Return Address Stack) so you can still do some leaking when using a "=ibpb" mitigation for Retbleed or SRSO. Fix it by doing the flushing in software on those generations. IBPB is not the default setting so this is not likely to affect anybody in practice" * tag 'x86_bugs_post_ibpb' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/bugs: Do not use UNTRAIN_RET with IBPB on entry x86/bugs: Skip RSB fill at VMEXIT x86/entry: Have entry_ibpb() invalidate return predictions x86/cpufeatures: Add a IBPB_NO_RET BUG flag x86/cpufeatures: Define X86_FEATURE_AMD_IBPB_RET
2024-10-17x86/unwind/orc: Fix unwind for newly forked tasksZheng Yejian
When arch_stack_walk_reliable() is called to unwind for newly forked tasks, the return value is negative which means the call stack is unreliable. This obviously does not meet expectations. The root cause is that after commit 3aec4ecb3d1f ("x86: Rewrite ret_from_fork() in C"), the 'ret_addr' of a newly forked task is changed to 'ret_from_fork_asm' (see copy_thread()), so at the start of the unwind it is incorrectly interpreted as not being a "signal" one, because 'ret_from_fork' is still used to determine the initial "signal" (see __unwind_start()). Then the address gets incorrectly decremented in the call to orc_find() (see unwind_next_frame()), resulting in incorrect ORC data. To fix it, check 'ret_from_fork_asm' rather than 'ret_from_fork' in __unwind_start(). Fixes: 3aec4ecb3d1f ("x86: Rewrite ret_from_fork() in C") Signed-off-by: Zheng Yejian <zhengyejian@huaweicloud.com> Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
2024-10-17objtool: Detect non-relocated text referencesJosh Poimboeuf
When kernel IBT is enabled, objtool detects all text references in order to determine which functions can be indirectly branched to. In text, such references look like one of the following: "mov $0x0,%rax R_X86_64_32S .init.text+0x7e0a0" or "lea 0x0(%rip),%rax R_X86_64_PC32 autoremove_wake_function-0x4". Either way the function pointer is denoted by a relocation, so objtool just reads that. However there are some "lea xxx(%rip)" cases which don't use relocations because they're referencing code in the same translation unit. Objtool doesn't have visibility to those. The only currently known instances of that are a few hand-coded asm text references which don't actually need ENDBR. So it's not actually a problem at the moment. However if we enable -fpie, the compiler would start generating them and there would definitely be bugs in the IBT sealing. Detect non-relocated text references and handle them appropriately. [ Note: I removed the manual static_call_tramp check -- that should already be handled by the noendbr check. ] Reported-by: Ard Biesheuvel <ardb@kernel.org> Tested-by: Ard Biesheuvel <ardb@kernel.org> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
2024-10-16x86/acpi: Switch to irq_get_nr_irqs() and irq_set_nr_irqs()Bart Van Assche
Use the irq_get_nr_irqs() and irq_set_nr_irqs() functions instead of the global variable 'nr_irqs'. Prepare for changing 'nr_irqs' from an exported global variable into a variable with file scope. Signed-off-by: Bart Van Assche <bvanassche@acm.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/all/20241015190953.1266194-7-bvanassche@acm.org
2024-10-16virt: sev-guest: Carve out SNP message context structureNikunj A Dadhania
Currently, the sev-guest driver is the only user of SNP guest messaging. The snp_guest_dev structure holds all the allocated buffers, secrets page and VMPCK details. In preparation for adding messaging allocation and initialization APIs, decouple snp_guest_dev from messaging-related information by carving out the guest message context structure (snp_msg_desc). Incorporate this newly added context into snp_send_guest_request() and all related functions, replacing the use of snp_guest_dev. No functional change. Signed-off-by: Nikunj A Dadhania <nikunj@amd.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com> Link: https://lore.kernel.org/r/20241009092850.197575-7-nikunj@amd.com
2024-10-16virt: sev-guest: Consolidate SNP guest messaging parameters to a structNikunj A Dadhania
Add a snp_guest_req structure to eliminate the need to pass a long list of parameters. This structure will be used to call the SNP Guest message request API, simplifying the function arguments. Update the snp_issue_guest_request() prototype to include the new guest request structure. Signed-off-by: Nikunj A Dadhania <nikunj@amd.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com> Link: https://lore.kernel.org/r/20241009092850.197575-5-nikunj@amd.com
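Schematically, the consolidation replaces a long argument list with a single request descriptor; the fields below are purely illustrative, not the actual snp_guest_req layout:

    #include <stddef.h>
    #include <stdint.h>

    /* Hypothetical shape of such a request descriptor. */
    struct guest_req_sketch {
            uint8_t  msg_type;
            uint8_t  msg_version;
            void    *req_buf;
            size_t   req_sz;
            void    *resp_buf;
            size_t   resp_sz;
    };

With something of this shape, snp_issue_guest_request() and friends can take one descriptor pointer instead of receiving each of these values as a separate parameter.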
2024-10-16x86/sev: Cache the secrets page addressNikunj A Dadhania
Instead of calling get_secrets_page(), which parses the CC blob every time to get the secrets page physical address (secrets_pa), save the secrets page physical address during snp_init() from the CC blob. Since get_secrets_page() is no longer used, remove the function. Signed-off-by: Nikunj A Dadhania <nikunj@amd.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com> Link: https://lore.kernel.org/r/20241009092850.197575-4-nikunj@amd.com