Rename make_huge_page_split_spte() to make_small_spte(). This ensures
that the usage of "small_spte" and "huge_spte" is consistent between
make_huge_spte() and make_small_spte().
This should also reduce some confusion as make_huge_page_split_spte()
almost reads like it will create a huge SPTE, when in fact it is
creating a small SPTE to split the huge SPTE.
No functional change intended.
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: David Matlack <dmatlack@google.com>
Link: https://lore.kernel.org/r/20240823235648.3236880-6-dmatlack@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Recover TDP MMU huge page mappings in-place instead of zapping them when
dirty logging is disabled, and rename functions that recover huge page
mappings when dirty logging is disabled to move away from the "zap
collapsible spte" terminology.
Before KVM flushes TLBs, guest accesses may be translated through either
the (stale) small SPTE or the (new) huge SPTE. This is already possible
when KVM is doing eager page splitting (where TLB flushes are also
batched), and when vCPUs are faulting in huge mappings (where TLBs are
flushed after the new huge SPTE is installed).
Recovering huge pages reduces the number of page faults when dirty
logging is disabled:
$ perf stat -e kvm:kvm_page_fault -- ./dirty_log_perf_test -s anonymous_hugetlb_2mb -v 64 -e -b 4g
Before: 393,599 kvm:kvm_page_fault
After: 262,575 kvm:kvm_page_fault
vCPU throughput and the latency of disabling dirty-logging are about
equal compared to zapping, but avoiding faults can be beneficial to
remove vCPU jitter in extreme scenarios.
Signed-off-by: David Matlack <dmatlack@google.com>
Link: https://lore.kernel.org/r/20240823235648.3236880-5-dmatlack@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
When making a SPTE, set the Dirty bit in the SPTE as appropriate, even if
hardware A/D bits are disabled. Only EPT allows A/D bits to be disabled,
and for EPT, the bits are software-available (ignored by hardware) when
A/D bits are disabled, i.e. it is perfectly legal for KVM to use the Dirty
bit to track dirty pages in software.
Link: https://lore.kernel.org/r/20241011021051.1557902-17-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Use the Accessed bit in SPTEs even when A/D bits are disabled in hardware,
i.e. propagate accessed information to SPTE.Accessed even when KVM is
doing manual tracking by making SPTEs not-present. In addition to
eliminating a small amount of code in is_accessed_spte(), this also paves
the way for preserving Accessed information when a SPTE is zapped in
response to a mmu_notifier PROTECTION event, e.g. if a SPTE is zapped
because NUMA balancing kicks in.
Note, EPT is the only flavor of paging in which A/D bits are conditionally
enabled, and the Accessed (and Dirty) bit is software-available when A/D
bits are disabled.
Note #2, there are currently no concrete plans to preserve Accessed
information. Explorations on that front were the initial catalyst, but
the cleanup is the motivation for the actual commit.
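For illustration only, a rough sketch of the simplification this enables (the exact code may differ):

    /* Before: Accessed is only meaningful when hardware A/D bits are in use. */
    static bool is_accessed_spte(u64 spte)
    {
            u64 accessed_mask = spte_shadow_accessed_mask(spte);

            return accessed_mask ? spte & accessed_mask
                                 : !is_access_track_spte(spte);
    }

    /* After: SPTE.Accessed is always valid, even without hardware A/D bits. */
    static bool is_accessed_spte(u64 spte)
    {
            return spte & shadow_accessed_mask;
    }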
Link: https://lore.kernel.org/r/20241011021051.1557902-13-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Set shadow_dirty_mask to the architectural EPT Dirty bit value even if
A/D bits are disabled at the module level, i.e. even if KVM will never
enable A/D bits in hardware. Doing so provides consistent behavior for
Accessed and Dirty bits, i.e. doesn't leave KVM in a state where it sets
shadow_accessed_mask but not shadow_dirty_mask.
Functionally, this should be one big nop, as consumption of
shadow_dirty_mask is always guarded by a check that hardware A/D bits are
enabled.
Link: https://lore.kernel.org/r/20241011021051.1557902-12-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Now that KVM doesn't use shadow_accessed_mask to detect if hardware A/D
bits are enabled, set shadow_accessed_mask for EPT even when A/D bits
are disabled in hardware. This will allow using shadow_accessed_mask for
software purposes, e.g. to preserve accessed status in a non-present SPTE
across NUMA balancing, if something like that is ever desirable.
Link: https://lore.kernel.org/r/20241011021051.1557902-11-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Add a dedicated flag to track if KVM has enabled A/D bits at the module
level, instead of inferring the state based on whether or not the MMU's
shadow_accessed_mask is non-zero. This will allow defining and using
shadow_accessed_mask even when A/D bits aren't used by hardware.
Link: https://lore.kernel.org/r/20241011021051.1557902-10-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Now that make_spte() no longer uses a funky goto to bail out for a special
case of its unsync handling, combine all of the unsync vs. writable logic
into a single if-else statement.
No functional change intended.
Link: https://lore.kernel.org/r/20241011021051.1557902-4-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
When creating a SPTE, always set the Dirty bit if the Writable bit is set,
i.e. if KVM is creating a writable mapping. If two (or more) vCPUs are
racing to install a writable SPTE on a !PRESENT fault, only the "winning"
vCPU will create a SPTE with W=1 and D=1, all "losers" will generate a
SPTE with W=1 && D=0.
As a result, tdp_mmu_map_handle_target_level() will fail to detect that
the losing faults are effectively spurious, and will overwrite the D=1
SPTE with a D=0 SPTE. For normal VMs, overwriting a present SPTE is a
small performance blip; KVM blasts a remote TLB flush, but otherwise life
goes on.
For upcoming TDX VMs, overwriting a present SPTE is much more costly, and
can even lead to the VM being terminated if KVM isn't careful, e.g. if KVM
attempts TDH.MEM.PAGE.AUG because the TDX code doesn't detect that the
new SPTE is actually the same as the old SPTE (which would be a bug in its
own right).
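A minimal sketch of the idea in make_spte() (surrounding logic elided; the exact structure may differ):

    if (pte_access & ACC_WRITE_MASK) {
            spte |= PT_WRITABLE_MASK | shadow_mmu_writable_mask;
            /*
             * Always set the Dirty bit along with Writable so that vCPUs
             * racing on the same fault produce identical SPTEs and the
             * losing faults are detected as spurious.
             */
            spte |= spte_shadow_dirty_mask(spte);
    }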
Suggested-by: Sagi Shahar <sagis@google.com>
Cc: Yan Zhao <yan.y.zhao@intel.com>
Link: https://lore.kernel.org/r/20241011021051.1557902-3-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Don't force a remote TLB flush if KVM happens to effectively "refresh" a
read-only SPTE that is still MMU-Writable, as KVM allows MMU-Writable SPTEs
to have Writable TLB entries, even if the SPTE is !Writable. Remote TLBs
need to be flushed only when creating a read-only SPTE for write-tracking,
i.e. when installing a !MMU-Writable SPTE.
In practice, especially now that KVM doesn't overwrite existing SPTEs when
prefetching, KVM will rarely "refresh" a read-only, MMU-Writable SPTE,
i.e. this is unlikely to eliminate many, if any, TLB flushes. But, more
precise flushing makes it easier to understand exactly when KVM does and
doesn't need to flush.
Note, x86 architecturally requires relevant TLB entries to be invalidated
on a page fault, i.e. there is no risk of putting a vCPU into an infinite
loop of read-only page faults.
Cc: Yan Zhao <yan.y.zhao@intel.com>
Link: https://lore.kernel.org/r/20241011021051.1557902-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Move the marking of folios dirty from make_spte() out to its callers,
which have access to the _struct page_, not just the underlying pfn.
Once all architectures follow suit, this will allow removing KVM's ugly
hack where KVM elevates the refcount of VM_MIXEDMAP pfns that happen to
be struct page memory.
Tested-by: Alex Bennée <alex.bennee@linaro.org>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Tested-by: Dmitry Osipenko <dmitry.osipenko@collabora.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-ID: <20241010182427.1434605-42-seanjc@google.com>
|
|
Mark pages/folios dirty when creating SPTEs to map PFNs into the guest,
not when zapping or modifying SPTEs, as marking folios dirty when zapping
or modifying SPTEs can be extremely inefficient. E.g. when KVM is zapping
collapsible SPTEs to reconstitute a hugepage after disabling dirty logging,
KVM will mark every 4KiB pfn as dirty, even though _at least_ 512 pfns are
guaranteed to be in a single folio (the SPTE couldn't potentially be huge
if that weren't the case). The problem only becomes worse for 1GiB
HugeTLB pages, as KVM can mark a single folio dirty 512*512 times.
Marking a folio dirty when mapping is functionally safe as KVM drops all
relevant SPTEs in response to an mmu_notifier invalidation, i.e. ensures
that the guest can't dirty a folio after access has been removed.
And because KVM already marks folios dirty when zapping/modifying SPTEs
for KVM reasons, i.e. not in response to an mmu_notifier invalidation,
there is no danger of "prematurely" marking a folio dirty. E.g. if a
filesystems cleans a folio without first removing write access, then there
already exists races where KVM could mark a folio dirty before remote TLBs
are flushed, i.e. before guest writes are guaranteed to stop. Furthermore,
x86 is literally the only architecture that marks folios dirty on the
backend; every other KVM architecture marks folios dirty at map time.
x86's unique behavior likely stems from the fact that x86's MMU predates
mmu_notifiers. Long, long ago, before mmu_notifiers were added, marking
pages dirty when zapping SPTEs was logical, and perhaps even necessary, as
KVM held references to pages, i.e. kept a page's refcount elevated while
the page was mapped into the guest. At the time, KVM's rmap_remove()
simply did:
	if (is_writeble_pte(*spte))
		kvm_release_pfn_dirty(pfn);
	else
		kvm_release_pfn_clean(pfn);
i.e. dropped the refcount and marked the page dirty at the same time.
After mmu_notifiers were introduced, commit acb66dd051d0 ("KVM: MMU:
don't hold pagecount reference for mapped sptes pages") removed the
refcount logic, but kept the dirty logic, i.e. converted the above to:
	if (is_writeble_pte(*spte))
		kvm_release_pfn_dirty(pfn);
And for KVM x86, that's essentially how things have stayed over the last
~15 years, without anyone revisiting *why* KVM marks pages/folios dirty at
zap/modification time, e.g. the behavior was blindly carried forward to
the TDP MMU.
Practically speaking, the only downside to marking a folio dirty during
mapping is that KVM could trigger writeback of memory that was never
actually written. Except that can't actually happen if KVM marks folios
dirty if and only if a writable SPTE is created (as done here), because
KVM always marks writable SPTEs as dirty during make_spte(). See commit
9b51a63024bd ("KVM: MMU: Explicitly set D-bit for writable spte."), circa
2015.
Note, KVM's access tracking logic for prefetched SPTEs is a bit odd. If a
guest PTE is dirty and writable, KVM will create a writable SPTE, but then
mark the SPTE for access tracking. Which isn't wrong, just a bit odd, as
it results in _more_ precise dirty tracking for MMUs _without_ A/D bits.
To keep things simple, mark the folio dirty before access tracking comes
into play, as an access-tracked SPTE can be restored in the fast page
fault path, i.e. without holding mmu_lock. While writing SPTEs and
accessing memslots outside of mmu_lock is safe, marking a folio dirty is
not. E.g. if the fast path gets interrupted _just_ after setting a SPTE,
the primary MMU could theoretically invalidate and free a folio before KVM
marks it dirty. Unlike the shadow MMU, which waits for CPUs to respond to
an IPI, the TDP MMU only guarantees the page tables themselves won't be
freed (via RCU).
Opportunistically update a few stale comments.
Cc: David Matlack <dmatlack@google.com>
Tested-by: Alex Bennée <alex.bennee@linaro.org>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Tested-by: Dmitry Osipenko <dmitry.osipenko@collabora.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-ID: <20241010182427.1434605-9-seanjc@google.com>
|
|
Set the Accessed bit when making a "new" SPTE during SPTE synchronization,
as _clearing_ the Accessed bit is counter-productive, and even if the
Accessed bit wasn't set in the old SPTE, odds are very good the guest will
access the page in the near future, as the most common case where KVM
synchronizes a shadow-present SPTE is when the guest is making the gPTE
read-only for Copy-on-Write (CoW).
Preserving the Accessed bit will allow dropping the logic that propagates
the Accessed bit to the underlying struct page when overwriting an existing
SPTE, without undue risk of regressing page aging.
Note, KVM's current behavior is very deliberate, as SPTE synchronization
was the only "speculative" access type as of commit 947da5383069 ("KVM:
MMU: Set the accessed bit on non-speculative shadow ptes").
But, much has changed since 2008, and more changes are on the horizon.
Spurious clearing of the Accessed (and Dirty) bits was mitigated by commit
e6722d9211b2 ("KVM: x86/mmu: Reduce the update to the spte in
FNAME(sync_spte)"), which changed FNAME(sync_spte) to only overwrite SPTEs
if the protections are actually changing. I.e. KVM is already preserving
Accessed information for SPTEs that aren't dropping protections.
And with the aforementioned future change to NOT mark the page/folio as
accessed, KVM's SPTEs will become the "source of truth" so to speak, in
which case clearing the Accessed bit outside of page aging becomes very
undesirable.
Suggested-by: Yan Zhao <yan.y.zhao@intel.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Tested-by: Dmitry Osipenko <dmitry.osipenko@collabora.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-ID: <20241010182427.1434605-8-seanjc@google.com>
|
|
Invert the polarity of "can_unsync" and rename the parameter to
"synchronizing" to allow a future change to set the Accessed bit if KVM
is synchronizing an existing SPTE. Querying "can_unsync" in that case is
nonsensical, as the fact that KVM can't unsync SPTEs doesn't provide any
justification for setting the Accessed bit.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Tested-by: Dmitry Osipenko <dmitry.osipenko@collabora.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-ID: <20241010182427.1434605-7-seanjc@google.com>
|
|
Apply make_spte()'s optimization to skip trying to unsync shadow pages if
and only if the old SPTE was a leaf SPTE, as non-leaf SPTEs in direct MMUs
are always writable, i.e. could trigger a false positive and incorrectly
lead to KVM creating a SPTE without write-protecting or marking shadow
pages unsync.
This bug only affects the TDP MMU, as the shadow MMU only overwrites a
shadow-present SPTE when synchronizing SPTEs (and only 4KiB SPTEs can be
unsync). Specifically, mmu_set_spte() drops any non-leaf SPTEs *before*
calling make_spte(), whereas the TDP MMU can do a direct replacement of a
page table with the leaf SPTE.
Opportunistically update the comment to explain why skipping the unsync
stuff is safe, as opposed to simply saying "it's someone else's problem".
Cc: stable@vger.kernel.org
Tested-by: Alex Bennée <alex.bennee@linaro.org>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Tested-by: Dmitry Osipenko <dmitry.osipenko@collabora.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-ID: <20241010182427.1434605-5-seanjc@google.com>
|
|
Replace "removed" with "frozen" in comments as appropriate to complete the
rename of REMOVED_SPTE to FROZEN_SPTE.
Fixes: 964cea817196 ("KVM: x86/tdp_mmu: Rename REMOVED_SPTE to FROZEN_SPTE")
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Link: https://lore.kernel.org/r/20240712233438.518591-1-rick.p.edgecombe@intel.com
[sean: write changelog]
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Introduce kvm_x86_call() to streamline the usage of static calls of
kvm_x86_ops. The current implementation of these calls is verbose and
could lead to alignment challenges. This makes the code susceptible to
exceeding the "80 columns per single line of code" limit as defined in
the coding-style document. Another issue with the existing implementation
is that the addition of kvm_x86_ prefix to hooks at the static_call sites
hinders code readability and navigation. kvm_x86_call() is added to
improve code readability and maintainability, while adhering to the coding
style guidelines.
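For illustration, the macro essentially wraps static_call() and adds the kvm_x86_ prefix in one place (sketch; the exact definition lives in KVM's x86 headers):

    #define kvm_x86_call(func)	static_call(kvm_x86_##func)

    /* Before: the prefix is spelled out at every call site. */
    static_call(kvm_x86_flush_tlb_all)(vcpu);
    /* After: shorter, and easier to keep within 80 columns. */
    kvm_x86_call(flush_tlb_all)(vcpu);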
Signed-off-by: Wei Wang <wei.w.wang@intel.com>
Link: https://lore.kernel.org/r/20240507133103.15052-3-wei.w.wang@intel.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Tweak the definition of make_huge_page_split_spte() to eliminate an
unnecessarily long line, and opportunistically initialize child_spte to
make it more obvious that the child is directly derived from the huge
parent.
No functional change intended.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-ID: <20240712151335.1242633-3-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Bug the VM instead of simply warning if KVM tries to split a SPTE that is
non-present or not-huge. KVM is guaranteed to end up in a broken state as
the callers fully expect a valid SPTE, e.g. the shadow MMU will add an
rmap entry, and all MMUs will account the expected small page. Returning
'0' is also technically wrong now that SHADOW_NONPRESENT_VALUE exists,
i.e. would cause KVM to create a potential #VE SPTE.
While it would be possible to have the callers gracefully handle failure,
doing so would provide no practical value as the scenario really should be
impossible, while the error handling would add a non-trivial amount of
noise.
Fixes: a3fe5dbda0a4 ("KVM: x86/mmu: Split huge pages mapped by the TDP MMU when dirty logging is enabled")
Cc: David Matlack <dmatlack@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-ID: <20240712151335.1242633-2-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
KVM x86 misc changes for 6.11
- Add a global struct to consolidate tracking of host values, e.g. EFER, and
move "shadow_phys_bits" into the structure as "maxphyaddr".
- Add KVM_CAP_X86_APIC_BUS_CYCLES_NS to allow configuring the effective APIC
bus frequency, because TDX.
- Print the name of the APICv/AVIC inhibits in the relevant tracepoint.
- Clean up KVM's handling of vendor specific emulation to consistently act on
"compatible with Intel/AMD", versus checking for a specific vendor.
- Misc cleanups
|
|
Rename REMOVED_SPTE to FROZEN_SPTE so that it can be used for other
multi-part operations.
REMOVED_SPTE is used as a non-present intermediate value for multi-part
operations that can happen when a thread doesn't have an MMU write lock.
Today, such operations occur only when removing PTEs.
However, future changes will want to use the same concept for setting a
PTE. In that case the REMOVED_SPTE name does not quite fit. So rename it
to FROZEN_SPTE so it can be used for both types of operations.
Also rename the relevant helpers and comments that refer to "removed"
within the context of the SPTE value. Take care to not update naming
referring to the "remove" operations, which are still distinct.
Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Message-ID: <20240619223614.290657-2-rick.p.edgecombe@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Move shadow_phys_bits into "struct kvm_host_values", i.e. into KVM's
global "kvm_host" variable, so that it is automatically exported for use
in vendor modules. Rename the variable/field to maxphyaddr to more
clearly capture what value it holds, now that it's used outside of the
MMU (and because the "shadow" part is more than a bit misleading as the
variable is not at all unique to shadow paging).
Recomputing the raw/true host.MAXPHYADDR on every use can be subtly
expensive, e.g. it will incur a VM-Exit on the CPUID if KVM is running as
a nested hypervisor. Vendor code already has access to the information,
e.g. by directly doing CPUID or by invoking kvm_get_shadow_phys_bits(), so
there's no tangible benefit to making it MMU-only.
Link: https://lore.kernel.org/r/20240423221521.2923759-5-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Snapshot shadow_phys_bits when kvm.ko is loaded, not when a vendor module
is loaded, to guard against usage of shadow_phys_bits before it is
initialized. The computation isn't vendor specific in any way, i.e.
there is no reason to wait to snapshot the value until a vendor module is
loaded, nor is there any reason to recompute the value every time a vendor
module is loaded.
Opportunistically convert it from "read mostly" to "read-only after init".
Link: https://lore.kernel.org/r/20240423221521.2923759-4-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
TDX will use a different shadow PTE entry value for MMIO than VMX. Add a
member to kvm_arch and track the MMIO value per-VM instead of in a global
variable. By using the per-VM EPT entry value for MMIO, the existing VMX
logic keeps working. Introduce a separate setter function so that a guest
TD can use a different value later.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Message-Id: <229a18434e5d83f45b1fcd7bf1544d79db1becb6.1705965635.git.isaku.yamahata@intel.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
To make use of the same shadow_mmio_mask and shadow_present_mask values
for TDX and VMX, add the Suppress-VE bit to shadow_mmio_mask and
shadow_present_mask so that they can be common to both.
TDX will require shadow_mmio_mask and shadow_present_mask to include
VMX_SUPPRESS_VE for shared GPAs so that an EPT violation is triggered on
accesses to shared GPAs. For VMX, VMX_SUPPRESS_VE doesn't matter for MMIO
because the SPTE value is defined so as to cause an EPT misconfig.
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Message-Id: <97cc616b3563cd8277be91aaeb3e14bce23c3649.1705965635.git.isaku.yamahata@intel.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
For TD guests, the current way to emulate MMIO doesn't work any more, as
KVM is not able to access the private memory of a TD guest and do the
emulation. Instead, a TD guest expects to receive a #VE when it accesses
MMIO, at which point it can explicitly make a hypercall to KVM to get the
expected information.
To achieve this, the TDX module always enables "EPT-violation #VE" in the
VMCS control. Accordingly, for the MMIO SPTE for a shared GPA:
1. KVM needs to set the "suppress #VE" bit for the non-present SPTE so
that an EPT violation occurs when the TD accesses the MMIO range.
2. On the EPT violation, KVM sets the MMIO SPTE to clear the "suppress #VE"
bit so the TD guest receives the #VE instead of an EPT misconfiguration,
unlike the VMX case.
For a shared GPA that is not yet populated, an EPT violation needs to be
triggered when the TD guest accesses it, so the non-present SPTE value for
a shared GPA should set the "suppress #VE" bit.
Add "suppress #VE" bit (bit 63) to SHADOW_NONPRESENT_VALUE and
REMOVED_SPTE. Unconditionally set the "suppress #VE" bit (which is bit 63)
for both AMD and Intel as: 1) AMD hardware doesn't use this bit when
present bit is off; 2) for normal VMX guest, KVM never enables the
"EPT-violation #VE" in VMCS control and "suppress #VE" bit is ignored by
hardware.
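For reference, a sketch of the resulting definition (assuming 64-bit builds can use bit 63 directly):

    /* Bit 63 is EPT's "suppress #VE" bit; harmless for AMD, and for VMX
     * guests KVM never enables EPT-violation #VE. */
    #ifdef CONFIG_X86_64
    #define SHADOW_NONPRESENT_VALUE	BIT_ULL(63)
    #else
    #define SHADOW_NONPRESENT_VALUE	0ULL
    #endif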
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Message-Id: <a99cb866897c7083430dce7f24c63b17d7121134.1705965635.git.isaku.yamahata@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
The .change_pte() MMU notifier callback was intended as an
optimization. The original point of it was that KSM could tell KVM to flip
its secondary PTE to a new location without having to first zap it. At
the time there was also an .invalidate_page() callback; both of them were
*not* bracketed by calls to mmu_notifier_invalidate_range_{start,end}(),
and .invalidate_page() also doubled as a fallback implementation of
.change_pte().
Later on, however, both callbacks were changed to occur within an
invalidate_range_start/end() block.
In the case of .change_pte(), commit 6bdb913f0a70 ("mm: wrap calls to
set_pte_at_notify with invalidate_range_start and invalidate_range_end",
2012-10-09) did so to remove the fallback from .invalidate_page() to
.change_pte() and allow sleepable .invalidate_page() hooks.
This however made KVM's usage of the .change_pte() callback completely
moot, because KVM unmaps the sPTEs during .invalidate_range_start()
and therefore .change_pte() has no hope of finding a sPTE to change.
Drop the generic KVM code that dispatches to kvm_set_spte_gfn(), as
well as all the architecture specific implementations.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Acked-by: Anup Patel <anup@brainfault.org>
Acked-by: Michael Ellerman <mpe@ellerman.id.au> (powerpc)
Reviewed-by: Bibo Mao <maobibo@loongson.cn>
Message-ID: <20240405115815.3226315-2-pbonzini@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Convert all "runtime" assertions, i.e. assertions that can be triggered
while running vCPUs, from WARN_ON() to WARN_ON_ONCE(). Every WARN in the
MMU that is tied to running vCPUs, i.e. not contained to loading and
initializing KVM, is likely to fire _a lot_ when it does trigger. E.g. if
KVM ends up with a bug that causes a root to be invalidated before the
page fault handler is invoked, pretty much _every_ page fault VM-Exit
triggers the WARN.
If a WARN is triggered frequently, the resulting spam usually causes a lot
of damage of its own, e.g. consumes resources to log the WARN and pollutes
the kernel log, often to the point where other useful information can be
lost. In many cases, the damage caused by the spam is actually worse than
the bug itself, e.g. KVM can almost always recover from an unexpectedly
invalid root.
On the flip side, warning every time is rarely helpful for debug and
triage, i.e. a single splat is usually sufficient to point a debugger in
the right direction, and automated testing, e.g. syzkaller, typically runs
with panic_on_warn=1, i.e. will never get past the first WARN anyways.
Lastly, when an assertion fails multiple times, the stack traces in KVM
are almost always identical, i.e. the full splat only needs to be captured
once. And _if_ there is value in capturing information about the failed
assert, a ratelimited printk() is sufficient and less likely to rack up a
large amount of collateral damage.
Link: https://lore.kernel.org/r/20230729004722.1056172-8-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Delete KVM's pgprintk() and all its usage, as the code is very prone
to bitrot due to being buried behind MMU_DEBUG, and the functionality has
been rendered almost entirely obsolete by the tracepoints KVM has gained
over the years. And for the situations where the information provided by
KVM's tracepoints is insufficient, pgprintk() rarely fills in the gaps,
and is almost always far too noisy, i.e. developers end up implementing
custom prints anyways.
Link: https://lore.kernel.org/r/20230729004722.1056172-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Fix a small comment typo in make_spte().
Signed-off-by: Mathias Krause <minipli@grsecurity.net>
Link: https://lore.kernel.org/r/20230322013731.102955-6-minipli@grsecurity.net
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
SPTE_TDP_AD_ENABLED_MASK, SPTE_TDP_AD_DISABLED_MASK and
SPTE_TDP_AD_WRPROT_ONLY_MASK are actual values, not masks.
Remove "MASK" from their names.
Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com>
Link: https://lore.kernel.org/r/20230105100204.6521-1-jiangshanlai@gmail.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
|
|
Define pr_fmt using KBUILD_MODNAME for all KVM x86 code so that printks
use consistent formatting across common x86, Intel, and AMD code. In
addition to providing consistent print formatting, using KBUILD_MODNAME,
e.g. kvm_amd and kvm_intel, allows referencing SVM and VMX (and SEV and
SGX and ...) as technologies without generating weird messages, and
without causing naming conflicts with other kernel code, e.g. "SEV: ",
"tdx: ", "sgx: " etc.. are all used by the kernel for non-KVM subsystems.
Opportunistically move away from printk() for prints that need to be
modified anyways, e.g. to drop a manual "kvm: " prefix.
Opportunistically convert a few SGX WARNs that are similarly modified to
WARN_ONCE; in the very unlikely event that the WARNs fire, odds are good
that they would fire repeatedly and spam the kernel log without providing
unique information in each print.
Note, defining pr_fmt yields undesirable results for code that uses KVM's
printk wrappers, e.g. vcpu_unimpl(). But, that's a pre-existing problem
as SVM/kvm_amd already defines a pr_fmt, and thankfully use of KVM's
wrappers is relatively limited in KVM x86 code.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Paul Durrant <paul@xen.org>
Message-Id: <20221130230934.1014142-35-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Account and track NX huge pages for nonpaging MMUs so that a future
enhancement to precisely check if a shadow page can't be replaced by a NX
huge page doesn't get false positives. Without correct tracking, KVM can
get stuck in a loop if an instruction is fetching and writing data on the
same huge page, e.g. KVM installs a small executable page on the fetch
fault, replaces it with an NX huge page on the write fault, and faults
again on the fetch.
Alternatively, and perhaps ideally, KVM would simply not enforce the
workaround for nonpaging MMUs. The guest has no page tables to abuse
and KVM is guaranteed to switch to a different MMU on CR0.PG being
toggled so there's no security or performance concerns. However, getting
make_spte() to play nice now and in the future is unnecessarily complex.
In the current code base, make_spte() can enforce the mitigation if TDP
is enabled or the MMU is indirect, but make_spte() may not always have a
vCPU/MMU to work with, e.g. if KVM were to support in-line huge page
promotion when disabling dirty logging.
Without a vCPU/MMU, KVM could either pass in the correct information
and/or derive it from the shadow page, but the former is ugly and the
latter subtly non-trivial due to the possibility of direct shadow pages
in indirect MMUs. Given that using shadow paging with an unpaged guest
is far from top priority _and_ has been subjected to the workaround since
its inception, keep it simple and just fix the accounting glitch.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: David Matlack <dmatlack@google.com>
Reviewed-by: Mingwei Zhang <mizhang@google.com>
Message-Id: <20221019165618.927057-4-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Add compile-time and init-time sanity checks to ensure that the MMIO SPTE
mask doesn't overlap the MMIO SPTE generation or the MMU-present bit.
The generation currently avoids using bit 63, but that's as much
coincidence as it is strictly necessary. That will change in the future,
as TDX support will require setting bit 63 (SUPPRESS_VE) in the mask.
Explicitly carve out the bits that are allowed in the mask so that any
future shuffling of SPTE bits doesn't silently break MMIO caching (KVM
has broken MMIO caching more than once due to overlapping the generation
with other things).
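Illustrative only, a sketch of the kind of checks being described, as they might appear in the MMIO mask setter (macro names approximate those in spte.h):

    /* Compile-time: the MMIO generation must not overlap MMU-present. */
    BUILD_BUG_ON((MMIO_SPTE_GEN_LOW_MASK | MMIO_SPTE_GEN_HIGH_MASK) &
                 SPTE_MMU_PRESENT_MASK);

    /* Init-time: reject an MMIO mask that strays into those bits. */
    if (WARN_ON(mmio_mask & (MMIO_SPTE_GEN_LOW_MASK |
                             MMIO_SPTE_GEN_HIGH_MASK |
                             SPTE_MMU_PRESENT_MASK)))
            mmio_value = 0;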
Suggested-by: Kai Huang <kai.huang@intel.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Kai Huang <kai.huang@intel.com>
Message-Id: <20220805194133.86299-1-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Disable SEV-ES if MMIO caching is disabled as SEV-ES relies on MMIO SPTEs
generating #NPF(RSVD), which are reflected by the CPU into the guest as
a #VC. With SEV-ES, the untrusted host, a.k.a. KVM, doesn't have access
to the guest instruction stream or register state and so can't directly
emulate in response to a #NPF on an emulated MMIO GPA. Disabling MMIO
caching means guest accesses to emulated MMIO ranges cause #NPF(!PRESENT),
and those flavors of #NPF cause automatic VM-Exits, not #VC.
Adjust KVM's MMIO masks to account for the C-bit location prior to doing
SEV(-ES) setup, and document that dependency between adjusting the MMIO
SPTE mask and SEV(-ES) setup.
Fixes: b09763da4dd8 ("KVM: x86/mmu: Add module param to disable MMIO caching (for testing)")
Reported-by: Michael Roth <michael.roth@amd.com>
Tested-by: Michael Roth <michael.roth@amd.com>
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Cc: stable@vger.kernel.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220803224957.1285926-4-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Fully re-evaluate whether or not MMIO caching can be enabled when SPTE
masks change; simply clearing enable_mmio_caching when a configuration
isn't compatible with caching fails to handle the scenario where the
masks are updated, e.g. by VMX for EPT or by SVM to account for the C-bit
location, and toggle compatibility from false=>true.
Snapshot the original module param so that re-evaluating MMIO caching
preserves userspace's desire to allow caching. Use a snapshot approach
so that enable_mmio_caching still reflects KVM's actual behavior.
Fixes: 8b9e74bfbf8c ("KVM: x86/mmu: Use enable_mmio_caching to track if MMIO caching is enabled")
Reported-by: Michael Roth <michael.roth@amd.com>
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Cc: stable@vger.kernel.org
Tested-by: Michael Roth <michael.roth@amd.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Kai Huang <kai.huang@intel.com>
Message-Id: <20220803224957.1285926-3-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Add shadow_memtype_mask to capture that EPT needs a non-zero memtype mask
instead of relying on TDP being enabled, as NPT doesn't need a non-zero
mask. This is a glorified nop as kvm_x86_ops.get_mt_mask() returns zero
for NPT anyways.
No functional change intended.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20220715230016.3762909-4-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Currently make_huge_page_split_spte() assumes execute permissions can be
granted to any 4K SPTE when splitting huge pages. This is true for the
TDP MMU but is not necessarily true for the shadow MMU, since KVM may be
shadowing a non-executable huge page.
To fix this, pass in the role of the child shadow page where the huge
page will be split and derive the execution permission from that. This
is correct because huge pages are always split with a direct shadow page
and thus the shadow page role contains the correct access permissions.
No functional change intended.
Signed-off-by: David Matlack <dmatlack@google.com>
Message-Id: <20220516232138.1783324-19-dmatlack@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
In some cases, the NX hugepage mitigation for iTLB multihit is not
needed for all guests on a host. Allow disabling the mitigation on a
per-VM basis to avoid the performance hit of NX hugepages on trusted
workloads.
In order to disable NX hugepages on a VM, ensure that the userspace
actor has permission to reboot the system. Since disabling NX hugepages
would allow a guest to crash the system, it is similar to reboot
permissions.
Ideally, KVM would require userspace to prove it has access to KVM's
nx_huge_pages module param, e.g. so that userspace can opt out without
needing full reboot permissions. But getting access to the module param
file info is difficult because it is buried in layers of sysfs and module
glue. Requiring CAP_SYS_BOOT is sufficient for all known use cases.
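A hypothetical userspace snippet showing how a privileged VMM might opt out (capability name per this series; the cap must be enabled before vCPUs are created, and requires linux/kvm.h and sys/ioctl.h):

    struct kvm_enable_cap cap = {
            .cap = KVM_CAP_VM_DISABLE_NX_HUGE_PAGES,
    };

    /* Fails unless the caller has CAP_SYS_BOOT (reboot permission). */
    if (ioctl(vm_fd, KVM_ENABLE_CAP, &cap))
            perror("KVM_ENABLE_CAP");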
Suggested-by: Jim Mattson <jmattson@google.com>
Reviewed-by: David Matlack <dmatlack@google.com>
Reviewed-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Ben Gardon <bgardon@google.com>
Message-Id: <20220613212523.3436117-9-bgardon@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Separate the macros for KVM's shadow PTEs (SPTE) from guest 64-bit PTEs
(PT64). SPTE and PT64 are _mostly_ the same, but the few differences are
quite critical, e.g. *_BASE_ADDR_MASK must differentiate between host and
guest physical address spaces, and SPTE_PERM_MASK (was PT64_PERM_MASK) is
very much specific to SPTEs.
Opportunistically (and temporarily) move most guest macros into paging.h
to clearly associate them with shadow paging, and to ensure that they're
not used as of this commit. A future patch will eliminate them entirely.
Sadly, PT32_LEVEL_BITS is left behind in mmu_internal.h because it's
needed for the quadrant calculation in kvm_mmu_get_page(). The quadrant
calculation is hot enough (when using shadow paging with 32-bit guests)
that adding a per-context helper is undesirable, and burying the
computation in paging_tmpl.h with a forward declaration isn't exactly an
improvement.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220614233328.3896033-6-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
All of sync_page()'s existing checks filter out only !PRESENT gPTEs,
because without execute-only, all upper levels are guaranteed to be at
least READABLE. However, if EPT with execute-only support is in use by
L1, KVM can create an SPTE that is shadow-present but guest-inaccessible
(RWX=0) if the upper level combined permissions are R (or RW) and
the leaf EPTE is changed from R (or RW) to X. Because the EPTE is
considered present when viewed in isolation, and no reserved bits are set,
FNAME(prefetch_invalid_gpte) will consider the GPTE valid, and cause a
not-present SPTE to be created.
The SPTE is "correct": the guest translation is inaccessible because
the combined protections of all levels yield RWX=0, and KVM will just
redirect any vmexits to the guest. If EPT A/D bits are disabled, KVM
can mistake the SPTE for an access-tracked SPTE, but again such confusion
isn't fatal, as the "saved" protections are also RWX=0. However,
creating a useless SPTE in general means that KVM messed up something,
even if this particular goof didn't manifest as a functional bug.
So, drop SPTEs whose new protections will yield a RWX=0 SPTE, and
add a WARN in make_spte() to detect creation of SPTEs that will
result in RWX=0 protections.
Fixes: d95c55687e11 ("kvm: mmu: track read permission explicitly for shadow EPT page tables")
Cc: David Matlack <dmatlack@google.com>
Cc: Ben Gardon <bgardon@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220513195000.99371-2-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Intel MKTME KeyID bits (including Intel TDX private KeyID bits) should
never be set in SPTEs. Set shadow_me_value to 0 and shadow_me_mask to
include all MKTME KeyID bits so that they are included in shadow_zero_check.
Signed-off-by: Kai Huang <kai.huang@intel.com>
Message-Id: <27bc10e97a3c0b58a4105ff9107448c190328239.1650363789.git.kai.huang@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Intel Multi-Key Total Memory Encryption (MKTME) repurposes a couple of the
high physical address bits as 'KeyID' bits. Intel Trust Domain
Extensions (TDX) further steals part of the MKTME KeyID bits as TDX private
KeyID bits. TDX private KeyID bits cannot be set in any mapping in the
host kernel since they can only be accessed by software running inside a
new CPU isolated mode. And unlike AMD's SME, the host kernel doesn't set
any legacy MKTME KeyID bits in any mapping either. Therefore, it's not
legitimate for KVM to set any KeyID bits in a SPTE that maps guest
memory.
KVM maintains shadow_zero_check bits to represent which bits must be
zero for a SPTE that maps guest memory. MKTME KeyID bits should be added
to shadow_zero_check. Currently, shadow_me_mask is used by AMD to set
sme_me_mask in SPTEs, and shadow_me_mask is excluded from
shadow_zero_check. So initializing shadow_me_mask to represent all
MKTME KeyID bits doesn't work for VMX (where, on the contrary, those bits
must be set in shadow_zero_check).
Introduce a new 'shadow_me_value' to replace existing shadow_me_mask,
and repurpose shadow_me_mask as 'all possible memory encryption bits'.
The new schematic of them will be:
- shadow_me_value: the memory encryption bit(s) that will be set in the
SPTE (the original shadow_me_mask).
- shadow_me_mask: all possible memory encryption bits (which is a superset
of shadow_me_value).
- For now, shadow_me_value is supposed to be set by SVM and VMX
respectively, and it is a constant during KVM's lifetime. This
perhaps doesn't fit MKTME, but for now the host kernel doesn't support it
(and perhaps never will).
- Bits in shadow_me_mask are added to shadow_zero_check, except the bits
in shadow_me_value.
Introduce a new helper kvm_mmu_set_me_spte_mask() to initialize them.
Replace shadow_me_mask with shadow_me_value in almost all code paths,
except the one in PT64_PERM_MASK, which is used by need_remote_flush()
to determine whether remote TLB flush is needed. This should still use
shadow_me_mask as any encryption bit change should need a TLB flush.
And for AMD, move initializing shadow_me_value/shadow_me_mask from
kvm_mmu_reset_all_pte_masks() to svm_hardware_setup().
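A rough sketch of the new helper and the SVM-side usage implied above (the call site and exact checks are illustrative):

    void kvm_mmu_set_me_spte_mask(u64 me_value, u64 me_mask)
    {
            /* The value stamped into SPTEs must be a subset of the mask
             * of all possible encryption bits. */
            if (WARN_ON(me_value & ~me_mask))
                    me_value = me_mask = 0;

            shadow_me_value = me_value;
            shadow_me_mask = me_mask;
    }

    /* svm_hardware_setup(): with SME, the single C-bit is both the value
     * and the only possible encryption bit. */
    kvm_mmu_set_me_spte_mask(sme_me_mask, sme_me_mask);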
Signed-off-by: Kai Huang <kai.huang@intel.com>
Message-Id: <f90964b93a3398b1cf1c56f510f3281e0709e2ab.1650363789.git.kai.huang@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Use an atomic XCHG to write TDP MMU SPTEs that have volatile bits, even if
mmu_lock is held for write, as volatile SPTEs can be written by other
tasks/vCPUs outside of mmu_lock; otherwise KVM can drop A/D bits (and W
bits) in the TDP MMU.
Attempting to prove that bug exposed another notable goof, which has been
lurking for a decade, give or take: KVM treats _all_ MMU-writable SPTEs
as volatile, even though KVM never clears WRITABLE outside of MMU lock.
As a result, the legacy MMU (and the TDP MMU if not fixed) uses XCHG to
update writable SPTEs.
The fix does not seem to have an easily-measurable effect on performance;
page faults are so slow that wasting even a few hundred cycles is dwarfed
by the base cost.
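A minimal sketch of the resulting write path (helper names paraphrase the TDP MMU code):

    static u64 tdp_mmu_write_spte(tdp_ptep_t sptep, u64 old_spte, u64 new_spte)
    {
            /* Racy hardware/vCPU updates (A/D, and today W) can only be
             * lost if the old SPTE has volatile bits; otherwise a plain
             * write is sufficient and cheaper than an atomic XCHG. */
            if (is_shadow_present_pte(old_spte) &&
                spte_has_volatile_bits(old_spte))
                    return xchg(rcu_dereference(sptep), new_spte);

            WRITE_ONCE(*sptep, new_spte);
            return old_spte;
    }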
|
|
Move the is_shadow_present_pte() check out of spte_has_volatile_bits()
and into its callers. Well, caller, since only one of its two callers
doesn't already do the shadow-present check.
Opportunistically move the helper to spte.c/h so that it can be used by
the TDP MMU, which is also the primary motivation for the shadow-present
change. Unlike the legacy MMU, the TDP MMU uses a single path for clearing
leaf and non-leaf SPTEs, and to avoid unnecessary atomic updates, the TDP
MMU will need to check is_last_spte() prior to calling
spte_has_volatile_bits(). Calling is_last_spte() without first
calling is_shadow_present_pte() is at best odd, and at worst a violation
of KVM's loosely defined SPTE rules.
Note, mmu_spte_clear_track_bits() could likely skip the write entirely
for SPTEs that are not shadow-present. Leave that cleanup for a future
patch to avoid introducing a functional change, and because the
shadow-present check can likely be moved further up the stack, e.g.
drop_large_spte() appears to be the only path that doesn't already
explicitly check for a shadow-present SPTE.
No functional change intended.
Cc: stable@vger.kernel.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220423034752.1161007-3-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Clear enable_mmio_caching if hardware can't support MMIO caching and use
the dedicated flag to detect if MMIO caching is enabled instead of
assuming shadow_mmio_value==0 means MMIO caching is disabled. TDX will
use a zero value even when caching is enabled, and is_mmio_spte() isn't
so hot that it needs to avoid an extra memory access, i.e. there's no
reason to be super clever. And the clever approach may not even be more
performant, e.g. gcc-11 lands the extra check on a non-zero value inline,
but puts the enable_mmio_caching out-of-line, i.e. avoids the few extra
uops for non-MMIO SPTEs.
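For illustration, the resulting check is roughly the following (sketch as of this commit, before the later per-VM MMIO value):

    static inline bool is_mmio_spte(u64 spte)
    {
            return (spte & shadow_mmio_mask) == shadow_mmio_value &&
                   likely(enable_mmio_caching);
    }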
Cc: Isaku Yamahata <isaku.yamahata@intel.com>
Cc: Kai Huang <kai.huang@intel.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220420002747.3287931-1-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
When determining whether or not a SPTE needs to have SME/SEV's memory
encryption flag set, do the moderately expensive host MMIO pfn check if
and only if the memory encryption mask is non-zero.
Note, KVM could further optimize the host MMIO checks by making a single
call to kvm_is_mmio_pfn(), but the tdp_enabled path (for EPT's memtype
handling) will likely be split out to a separate flow[*]. At that point,
a better approach would be to shove the call to kvm_is_mmio_pfn() into
VMX code so that AMD+NPT without SME doesn't get hit with an unnecessary
lookup.
[*] https://lkml.kernel.org/r/20220321224358.1305530-3-bgardon@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20220415004909.2216670-1-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
When dirty logging is enabled without initially-all-set, try to split
all huge pages in the memslot down to 4KB pages so that vCPUs do not
have to take expensive write-protection faults to split huge pages.
Eager page splitting is best-effort only. This commit only adds the
support for the TDP MMU, and even there splitting may fail due to out
of memory conditions. Failure to split a huge page is fine from a
correctness standpoint because KVM will always follow up splitting by
write-protecting any remaining huge pages.
Eager page splitting moves the cost of splitting huge pages off of the
vCPU threads and onto the thread enabling dirty logging on the memslot.
This is useful because:
1. Splitting on the vCPU thread interrupts vCPUs execution and is
disruptive to customers whereas splitting on VM ioctl threads can
run in parallel with vCPU execution.
2. Splitting all huge pages at once is more efficient because it does
not require performing VM-exit handling or walking the page table for
every 4KiB page in the memslot, and greatly reduces the amount of
contention on the mmu_lock.
For example, when running dirty_log_perf_test with 96 virtual CPUs, 1GiB
per vCPU, and 1GiB HugeTLB memory, the time it takes vCPUs to write to
all of their memory after dirty logging is enabled decreased by 95% from
2.94s to 0.14s.
Eager Page Splitting is over 100x more efficient than the current
implementation of splitting on fault under the read lock. For example,
taking the same workload as above, Eager Page Splitting reduced the CPU
required to split all huge pages from ~270 CPU-seconds ((2.94s - 0.14s)
* 96 vCPU threads) to only 1.55 CPU-seconds.
Eager page splitting does increase the amount of time it takes to enable
dirty logging since it has to split all huge pages. For example, the time
it took to enable dirty logging in the 96GiB region of the
aforementioned test increased from 0.001s to 1.55s.
Reviewed-by: Peter Xu <peterx@redhat.com>
Signed-off-by: David Matlack <dmatlack@google.com>
Message-Id: <20220119230739.2234394-16-dmatlack@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Both "writeable" and "writable" are valid, but we should be consistent
about which we use. DEFAULT_SPTE_MMU_WRITEABLE was the odd one out in
the SPTE code, so rename it to DEFAULT_SPTE_MMU_WRITABLE.
No functional change intended.
Signed-off-by: David Matlack <dmatlack@google.com>
Message-Id: <20220125230713.1700406-1-dmatlack@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Check SPTE writable invariants when setting SPTEs rather than in
spte_can_locklessly_be_made_writable(). By the time KVM checks
spte_can_locklessly_be_made_writable(), the SPTE has long since been
corrupted.
Note that these invariants only apply to shadow-present leaf SPTEs (i.e.
not to MMIO SPTEs, non-leaf SPTEs, etc.). Add a comment explaining the
restriction and only instrument the code paths that set shadow-present
leaf SPTEs.
To account for access tracking, also check the SPTE writable invariants
when marking an SPTE as an access track SPTE. This also lets us remove
a redundant WARN from mark_spte_for_access_track().
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: David Matlack <dmatlack@google.com>
Message-Id: <20220125230518.1697048-3-dmatlack@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|