summaryrefslogtreecommitdiff
path: root/arch/x86/kvm/mmu/mmu.c
AgeCommit message (Collapse)Author
2022-06-08Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvmLinus Torvalds
Pull KVM fixes from Paolo Bonzini: - syzkaller NULL pointer dereference - TDP MMU performance issue with disabling dirty logging - 5.14 regression with SVM TSC scaling - indefinite stall on applying live patches - unstable selftest - memory leak from wrong copy-and-paste - missed PV TLB flush when racing with emulation * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: KVM: x86: do not report a vCPU as preempted outside instruction boundaries KVM: x86: do not set st->preempted when going back to user space KVM: SVM: fix tsc scaling cache logic KVM: selftests: Make hyperv_clock selftest more stable KVM: x86/MMU: Zap non-leaf SPTEs when disabling dirty logging x86: drop bogus "cc" clobber from __try_cmpxchg_user_asm() KVM: x86/mmu: Check every prev_roots in __kvm_mmu_free_obsolete_roots() entry/kvm: Exit to user mode when TIF_NOTIFY_SIGNAL is set KVM: Don't null dereference ops->destroy
2022-06-07KVM: x86/mmu: Check every prev_roots in __kvm_mmu_free_obsolete_roots()Shaoqin Huang
When freeing obsolete previous roots, check prev_roots as intended, not the current root. Signed-off-by: Shaoqin Huang <shaoqin.huang@intel.com> Fixes: 527d5cd7eece ("KVM: x86/mmu: Zap only obsolete roots if a root shadow page is zapped") Message-Id: <20220607005905.2933378-1-shaoqin.huang@intel.com> Cc: stable@vger.kernel.org Reviewed-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-05-26Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvmLinus Torvalds
Pull kvm updates from Paolo Bonzini: "S390: - ultravisor communication device driver - fix TEID on terminating storage key ops RISC-V: - Added Sv57x4 support for G-stage page table - Added range based local HFENCE functions - Added remote HFENCE functions based on VCPU requests - Added ISA extension registers in ONE_REG interface - Updated KVM RISC-V maintainers entry to cover selftests support ARM: - Add support for the ARMv8.6 WFxT extension - Guard pages for the EL2 stacks - Trap and emulate AArch32 ID registers to hide unsupported features - Ability to select and save/restore the set of hypercalls exposed to the guest - Support for PSCI-initiated suspend in collaboration with userspace - GICv3 register-based LPI invalidation support - Move host PMU event merging into the vcpu data structure - GICv3 ITS save/restore fixes - The usual set of small-scale cleanups and fixes x86: - New ioctls to get/set TSC frequency for a whole VM - Allow userspace to opt out of hypercall patching - Only do MSR filtering for MSRs accessed by rdmsr/wrmsr AMD SEV improvements: - Add KVM_EXIT_SHUTDOWN metadata for SEV-ES - V_TSC_AUX support Nested virtualization improvements for AMD: - Support for "nested nested" optimizations (nested vVMLOAD/VMSAVE, nested vGIF) - Allow AVIC to co-exist with a nested guest running - Fixes for LBR virtualizations when a nested guest is running, and nested LBR virtualization support - PAUSE filtering for nested hypervisors Guest support: - Decoupling of vcpu_is_preempted from PV spinlocks" * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (199 commits) KVM: x86: Fix the intel_pt PMI handling wrongly considered from guest KVM: selftests: x86: Sync the new name of the test case to .gitignore Documentation: kvm: reorder ARM-specific section about KVM_SYSTEM_EVENT_SUSPEND x86, kvm: use correct GFP flags for preemption disabled KVM: LAPIC: Drop pending LAPIC timer injection when canceling the timer x86/kvm: Alloc dummy async #PF token outside of raw spinlock KVM: x86: avoid calling x86 emulator without a decoded instruction KVM: SVM: Use kzalloc for sev ioctl interfaces to prevent kernel data leak x86/fpu: KVM: Set the base guest FPU uABI size to sizeof(struct kvm_xsave) s390/uv_uapi: depend on CONFIG_S390 KVM: selftests: x86: Fix test failure on arch lbr capable platforms KVM: LAPIC: Trace LAPIC timer expiration on every vmentry KVM: s390: selftest: Test suppression indication on key prot exception KVM: s390: Don't indicate suppression on dirtying, failing memop selftests: drivers/s390x: Add uvdevice tests drivers/s390/char: Add Ultravisor io device MAINTAINERS: Update KVM RISC-V entry to cover selftests support RISC-V: KVM: Introduce ISA extension register RISC-V: KVM: Cleanup stale TLB entries when host CPU changes RISC-V: KVM: Add remote HFENCE functions based on VCPU requests ...
2022-05-20KVM: x86/mmu: fix NULL pointer dereference on guest INVPCIDPaolo Bonzini
With shadow paging enabled, the INVPCID instruction results in a call to kvm_mmu_invpcid_gva. If INVPCID is executed with CR0.PG=0, the invlpg callback is not set and the result is a NULL pointer dereference. Fix it trivially by checking for mmu->invlpg before every call. There are other possibilities: - check for CR0.PG, because KVM (like all Intel processors after P5) flushes guest TLB on CR0.PG changes so that INVPCID/INVLPG are a nop with paging disabled - check for EFER.LMA, because KVM syncs and flushes when switching MMU contexts outside of 64-bit mode All of these are tricky, go for the simple solution. This is CVE-2022-1789. Reported-by: Yongkang Jia <kangel@zju.edu.cn> Cc: stable@vger.kernel.org Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-05-12KVM: x86/mmu: Update number of zapped pages even if page list is stableSean Christopherson
When zapping obsolete pages, update the running count of zapped pages regardless of whether or not the list has become unstable due to zapping a shadow page with its own child shadow pages. If the VM is backed by mostly 4kb pages, KVM can zap an absurd number of SPTEs without bumping the batch count and thus without yielding. In the worst case scenario, this can cause a soft lokcup. watchdog: BUG: soft lockup - CPU#12 stuck for 22s! [dirty_log_perf_:13020] RIP: 0010:workingset_activation+0x19/0x130 mark_page_accessed+0x266/0x2e0 kvm_set_pfn_accessed+0x31/0x40 mmu_spte_clear_track_bits+0x136/0x1c0 drop_spte+0x1a/0xc0 mmu_page_zap_pte+0xef/0x120 __kvm_mmu_prepare_zap_page+0x205/0x5e0 kvm_mmu_zap_all_fast+0xd7/0x190 kvm_mmu_invalidate_zap_pages_in_memslot+0xe/0x10 kvm_page_track_flush_slot+0x5c/0x80 kvm_arch_flush_shadow_memslot+0xe/0x10 kvm_set_memslot+0x1a8/0x5d0 __kvm_set_memory_region+0x337/0x590 kvm_vm_ioctl+0xb08/0x1040 Fixes: fbb158cb88b6 ("KVM: x86/mmu: Revert "Revert "KVM: MMU: zap pages in batch""") Reported-by: David Matlack <dmatlack@google.com> Reviewed-by: Ben Gardon <bgardon@google.com> Cc: stable@vger.kernel.org Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20220511145122.3133334-1-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-05-12KVM: x86/mmu: Speed up slot_rmap_walk_next for sparsely populated rmapsVipin Sharma
Avoid calling handlers on empty rmap entries and skip to the next non empty rmap entry. Empty rmap entries are noop in handlers. Signed-off-by: Vipin Sharma <vipinsh@google.com> Suggested-by: Sean Christopherson <seanjc@google.com> Message-Id: <20220502220347.174664-1-vipinsh@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-05-12KVM: x86/mmu: Add shadow_me_value and repurpose shadow_me_maskKai Huang
Intel Multi-Key Total Memory Encryption (MKTME) repurposes couple of high bits of physical address bits as 'KeyID' bits. Intel Trust Domain Extentions (TDX) further steals part of MKTME KeyID bits as TDX private KeyID bits. TDX private KeyID bits cannot be set in any mapping in the host kernel since they can only be accessed by software running inside a new CPU isolated mode. And unlike to AMD's SME, host kernel doesn't set any legacy MKTME KeyID bits to any mapping either. Therefore, it's not legitimate for KVM to set any KeyID bits in SPTE which maps guest memory. KVM maintains shadow_zero_check bits to represent which bits must be zero for SPTE which maps guest memory. MKTME KeyID bits should be set to shadow_zero_check. Currently, shadow_me_mask is used by AMD to set the sme_me_mask to SPTE, and shadow_me_shadow is excluded from shadow_zero_check. So initializing shadow_me_mask to represent all MKTME keyID bits doesn't work for VMX (as oppositely, they must be set to shadow_zero_check). Introduce a new 'shadow_me_value' to replace existing shadow_me_mask, and repurpose shadow_me_mask as 'all possible memory encryption bits'. The new schematic of them will be: - shadow_me_value: the memory encryption bit(s) that will be set to the SPTE (the original shadow_me_mask). - shadow_me_mask: all possible memory encryption bits (which is a super set of shadow_me_value). - For now, shadow_me_value is supposed to be set by SVM and VMX respectively, and it is a constant during KVM's life time. This perhaps doesn't fit MKTME but for now host kernel doesn't support it (and perhaps will never do). - Bits in shadow_me_mask are set to shadow_zero_check, except the bits in shadow_me_value. Introduce a new helper kvm_mmu_set_me_spte_mask() to initialize them. Replace shadow_me_mask with shadow_me_value in almost all code paths, except the one in PT64_PERM_MASK, which is used by need_remote_flush() to determine whether remote TLB flush is needed. This should still use shadow_me_mask as any encryption bit change should need a TLB flush. And for AMD, move initializing shadow_me_value/shadow_me_mask from kvm_mmu_reset_all_pte_masks() to svm_hardware_setup(). Signed-off-by: Kai Huang <kai.huang@intel.com> Message-Id: <f90964b93a3398b1cf1c56f510f3281e0709e2ab.1650363789.git.kai.huang@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-05-12KVM: x86/mmu: Rename reset_rsvds_bits_mask()Kai Huang
Rename reset_rsvds_bits_mask() to reset_guest_rsvds_bits_mask() to make it clearer that it resets the reserved bits check for guest's page table entries. Signed-off-by: Kai Huang <kai.huang@intel.com> Message-Id: <efdc174b85d55598880064b8bf09245d3791031d.1650363789.git.kai.huang@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-05-12KVM: x86/mmu: Expand and clean up page fault statsSean Christopherson
Expand and clean up the page fault stats. The current stats are at best incomplete, and at worst misleading. Differentiate between faults that are actually fixed vs those that result in an MMIO SPTE being created, track faults that are spurious, faults that trigger emulation, faults that that are fixed in the fast path, and last but not least, track the number of faults that are taken. Note, the number of faults that require emulation for write-protected shadow pages can roughly be calculated by subtracting the number of MMIO SPTEs created from the overall number of faults that trigger emulation. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20220423034752.1161007-10-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-05-12KVM: x86/mmu: Make all page fault handlers internal to the MMUSean Christopherson
Move kvm_arch_async_page_ready() to mmu.c where it belongs, and move all of the page fault handling collateral that was in mmu.h purely for the async #PF handler into mmu_internal.h, where it belongs. This will allow kvm_mmu_do_page_fault() to act on the RET_PF_* return without having to expose those enums outside of the MMU. No functional change intended. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20220423034752.1161007-8-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-05-12KVM: x86/mmu: Add RET_PF_CONTINUE to eliminate bool+int* "returns"Sean Christopherson
Add RET_PF_CONTINUE and use it in handle_abnormal_pfn() and kvm_faultin_pfn() to signal that the page fault handler should continue doing its thing. Aside from being gross and inefficient, using a boolean return to signal continue vs. stop makes it extremely difficult to add more helpers and/or move existing code to a helper. E.g. hypothetically, if nested MMUs were to gain a separate page fault handler in the future, everything up to the "is self-modifying PTE" check can be shared by all shadow MMUs, but communicating up the stack whether to continue on or stop becomes a nightmare. More concretely, proposed support for private guest memory ran into a similar issue, where it'll be forced to forego a helper in order to yield sane code: https://lore.kernel.org/all/YkJbxiL%2FAz7olWlq@google.com. No functional change intended. Cc: David Matlack <dmatlack@google.com> Cc: Chao Peng <chao.p.peng@linux.intel.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20220423034752.1161007-7-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-05-12KVM: x86/mmu: Drop exec/NX check from "page fault can be fast"Sean Christopherson
Tweak the "page fault can be fast" logic to explicitly check for !PRESENT faults in the access tracking case, and drop the exec/NX check that becomes redundant as a result. No sane hardware will generate an access that is both an instruct fetch and a write, i.e. it's a waste of cycles. If hardware goes off the rails, or KVM runs under a misguided hypervisor, spuriously running throught fast path is benign (KVM has been uknowingly being doing exactly that for years). Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20220423034752.1161007-6-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-05-12KVM: x86/mmu: Don't attempt fast page fault just because EPT is in useSean Christopherson
Check for A/D bits being disabled instead of the access tracking mask being non-zero when deciding whether or not to attempt to fix a page fault vian the fast path. Originally, the access tracking mask was non-zero if and only if A/D bits were disabled by _KVM_ (including not being supported by hardware), but that hasn't been true since nVMX was fixed to honor EPTP12's A/D enabling, i.e. since KVM allowed L1 to cause KVM to not use A/D bits while running L2 despite KVM using them while running L1. In other words, don't attempt the fast path just because EPT is enabled. Note, attempting the fast path for all !PRESENT faults can "fix" a very, _VERY_ tiny percentage of faults out of mmu_lock by detecting that the fault is spurious, i.e. has been fixed by a different vCPU, but again the odds of that happening are vanishingly small. E.g. booting an 8-vCPU VM gets less than 10 successes out of 30k+ faults, and that's likely one of the more favorable scenarios. Disabling dirty logging can likely lead to a rash of collisions between vCPUs for some workloads that operate on a common set of pages, but penalizing _all_ !PRESENT faults for that one case is unlikely to be a net positive, not to mention that that problem is best solved by not zapping in the first place. The number of spurious faults does scale with the number of vCPUs, e.g. a 255-vCPU VM using TDP "jumps" to ~60 spurious faults detected in the fast path (again out of 30k), but that's all of 0.2% of faults. Using legacy shadow paging does get more spurious faults, and a few more detected out of mmu_lock, but the percentage goes _down_ to 0.08% (and that's ignoring faults that are reflected into the guest), i.e. the extra detections are purely due to the sheer number of faults observed. On the other hand, getting a "negative" in the fast path takes in the neighborhood of 150-250 cycles. So while it is tempting to keep/extend the current behavior, such a change needs to come with hard numbers showing that it's actually a win in the grand scheme, or any scheme for that matter. Fixes: 995f00a61958 ("x86: kvm: mmu: use ept a/d in vmcs02 iff used in vmcs12") Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20220423034752.1161007-5-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-05-03Merge branch 'kvm-tdp-mmu-atomicity-fix' into HEADPaolo Bonzini
We are dropping A/D bits (and W bits) in the TDP MMU. Even if mmu_lock is held for write, as volatile SPTEs can be written by other tasks/vCPUs outside of mmu_lock. Attempting to prove that bug exposed another notable goof, which has been lurking for a decade, give or take: KVM treats _all_ MMU-writable SPTEs as volatile, even though KVM never clears WRITABLE outside of MMU lock. As a result, the legacy MMU (and the TDP MMU if not fixed) uses XCHG to update writable SPTEs. The fix does not seem to have an easily-measurable affect on performance; page faults are so slow that wasting even a few hundred cycles is dwarfed by the base cost.
2022-05-03KVM: x86/mmu: Move shadow-present check out of spte_has_volatile_bits()Sean Christopherson
Move the is_shadow_present_pte() check out of spte_has_volatile_bits() and into its callers. Well, caller, since only one of its two callers doesn't already do the shadow-present check. Opportunistically move the helper to spte.c/h so that it can be used by the TDP MMU, which is also the primary motivation for the shadow-present change. Unlike the legacy MMU, the TDP MMU uses a single path for clear leaf and non-leaf SPTEs, and to avoid unnecessary atomic updates, the TDP MMU will need to check is_last_spte() prior to calling spte_has_volatile_bits(), and calling is_last_spte() without first calling is_shadow_present_spte() is at best odd, and at worst a violation of KVM's loosely defines SPTE rules. Note, mmu_spte_clear_track_bits() could likely skip the write entirely for SPTEs that are not shadow-present. Leave that cleanup for a future patch to avoid introducing a functional change, and because the shadow-present check can likely be moved further up the stack, e.g. drop_large_spte() appears to be the only path that doesn't already explicitly check for a shadow-present SPTE. No functional change intended. Cc: stable@vger.kernel.org Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20220423034752.1161007-3-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-05-03KVM: x86/mmu: Don't treat fully writable SPTEs as volatile (modulo A/D)Sean Christopherson
Don't treat SPTEs that are truly writable, i.e. writable in hardware, as being volatile (unless they're volatile for other reasons, e.g. A/D bits). KVM _sets_ the WRITABLE bit out of mmu_lock, but never _clears_ the bit out of mmu_lock, so if the WRITABLE bit is set, it cannot magically get cleared just because the SPTE is MMU-writable. Rename the wrapper of MMU-writable to be more literal, the previous name of spte_can_locklessly_be_made_writable() is wrong and misleading. Fixes: c7ba5b48cc8d ("KVM: MMU: fast path of handling guest page fault") Cc: stable@vger.kernel.org Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20220423034752.1161007-2-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-04-29KVM: X86/MMU: Fix shadowing 5-level NPT for 4-level NPT L1 guestLai Jiangshan
When shadowing 5-level NPT for 4-level NPT L1 guest, the root_sp is allocated with role.level = 5 and the guest pagetable's root gfn. And root_sp->spt[0] is also allocated with the same gfn and the same role except role.level = 4. Luckily that they are different shadow pages, but only root_sp->spt[0] is the real translation of the guest pagetable. Here comes a problem: If the guest switches from gCR4_LA57=0 to gCR4_LA57=1 (or vice verse) and uses the same gfn as the root page for nested NPT before and after switching gCR4_LA57. The host (hCR4_LA57=1) might use the same root_sp for the guest even the guest switches gCR4_LA57. The guest will see unexpected page mapped and L2 may exploit the bug and hurt L1. It is lucky that the problem can't hurt L0. And three special cases need to be handled: The root_sp should be like role.direct=1 sometimes: its contents are not backed by gptes, root_sp->gfns is meaningless. (For a normal high level sp in shadow paging, sp->gfns is often unused and kept zero, but it could be relevant and meaningful if sp->gfns is used because they are backed by concrete gptes.) For such root_sp in the case, root_sp is just a portal to contribute root_sp->spt[0], and root_sp->gfns should not be used and root_sp->spt[0] should not be dropped if gpte[0] of the guest root pagetable is changed. Such root_sp should not be accounted too. So add role.passthrough to distinguish the shadow pages in the hash when gCR4_LA57 is toggled and fix above special cases by using it in kvm_mmu_page_{get|set}_gfn() and sp_has_gptes(). Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com> Message-Id: <20220420131204.2850-3-jiangshanlai@gmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-04-29KVM: X86/MMU: Add sp_has_gptes()Lai Jiangshan
Add sp_has_gptes() which equals to !sp->role.direct currently. Shadow page having gptes needs to be write-protected, accounted and responded to kvm_mmu_pte_write(). Use it in these places to replace !sp->role.direct and rename for_each_gfn_indirect_valid_sp. Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com> Message-Id: <20220420131204.2850-2-jiangshanlai@gmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-04-29KVM: x86/mmu: replace direct_map with root_role.directPaolo Bonzini
direct_map is always equal to the direct field of the root page's role: - for shadow paging, direct_map is true if CR0.PG=0 and root_role.direct is copied from cpu_role.base.direct - for TDP, it is always true and root_role.direct is also always true - for shadow TDP, it is always false and root_role.direct is also always false Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-04-29KVM: x86/mmu: replace root_level with cpu_role.base.levelPaolo Bonzini
Remove another duplicate field of struct kvm_mmu. This time it's the root level for page table walking; the separate field is always initialized as cpu_role.base.level, so its users can look up the CPU mode directly instead. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-04-29KVM: x86/mmu: replace shadow_root_level with root_role.levelPaolo Bonzini
root_role.level is always the same value as shadow_level: - it's kvm_mmu_get_tdp_level(vcpu) when going through init_kvm_tdp_mmu - it's the level argument when going through kvm_init_shadow_ept_mmu - it's assigned directly from new_role.base.level when going through shadow_mmu_init_context Remove the duplication and get the level directly from the role. Reviewed-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-04-29KVM: x86/mmu: pull CPU mode computation to kvm_init_mmuPaolo Bonzini
Do not lead init_kvm_*mmu into the temptation of poking into struct kvm_mmu_role_regs, by passing to it directly the CPU mode. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-04-29KVM: x86/mmu: simplify and/or inline computation of shadow MMU rolesPaolo Bonzini
Shadow MMUs compute their role from cpu_role.base, simply by adjusting the root level. It's one line of code, so do not place it in a separate function. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-04-29KVM: x86/mmu: remove redundant bits from extended rolePaolo Bonzini
Before the separation of the CPU and the MMU role, CR0.PG was not available in the base MMU role, because two-dimensional paging always used direct=1 in the MMU role. However, now that the raw role is snapshotted in mmu->cpu_role, the value of CR0.PG always matches both !cpu_role.base.direct and cpu_role.base.level > 0. There is no need to store it again in union kvm_mmu_extended_role; instead, write an is_cr0_pg accessor by hand that takes care of the conversion. Use cpu_role.base.level since the future of the direct field is unclear. Likewise, CR4.PAE is now always present in the CPU role as !cpu_role.base.has_4_byte_gpte. The inversion makes certain tests on the MMU role easier, and is easily hidden by the is_cr4_pae accessor when operating on the CPU role. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-04-29KVM: x86/mmu: rename kvm_mmu_role unionPaolo Bonzini
It is quite confusing that the "full" union is called kvm_mmu_role but is used for the "cpu_role" field of struct kvm_mmu. Rename it to kvm_cpu_role. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-04-29KVM: x86/mmu: remove extended bits from mmu_role, rename fieldPaolo Bonzini
mmu_role represents the role of the root of the page tables. It does not need any extended bits, as those govern only KVM's page table walking; the is_* functions used for page table walking always use the CPU role. ext.valid is not present anymore in the MMU role, but an all-zero MMU role is impossible because the level field is never zero in the MMU role. So just zap the whole mmu_role in order to force invalidation after CPUID is updated. While making this change, which requires touching almost every occurrence of "mmu_role", rename it to "root_role". Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-04-29KVM: x86/mmu: store shadow EFER.NX in the MMU rolePaolo Bonzini
Now that the MMU role is separate from the CPU role, it can be a truthful description of the format of the shadow pages. This includes whether the shadow pages use the NX bit; so force the efer_nx field of the MMU role when TDP is disabled, and remove the hardcoding it in the callers of reset_shadow_zero_bits_mask. In fact, the initialization of reserved SPTE bits can now be made common to shadow paging and shadow NPT; move it to shadow_mmu_init_context. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-04-29KVM: x86/mmu: cleanup computation of MMU roles for shadow pagingPaolo Bonzini
Pass the already-computed CPU role, instead of redoing it. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-04-29KVM: x86/mmu: cleanup computation of MMU roles for two-dimensional pagingPaolo Bonzini
Inline kvm_calc_mmu_role_common into its sole caller, and simplify it by removing the computation of unnecessary bits. Extended bits are unnecessary because page walking uses the CPU role, and EFER.NX/CR0.WP can be set to one unconditionally---matching the format of shadow pages rather than the format of guest pages. The MMU role for two dimensional paging does still depend on the CPU role, even if only barely so, due to SMM and guest mode; for consistency, pass it down to kvm_calc_tdp_mmu_root_page_role instead of querying the vcpu with is_smm or is_guest_mode. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-04-29KVM: x86/mmu: remove kvm_calc_shadow_root_page_role_commonPaolo Bonzini
kvm_calc_shadow_root_page_role_common is the same as kvm_calc_cpu_role except for the level, which is overwritten afterwards in kvm_calc_shadow_mmu_root_page_role and kvm_calc_shadow_npt_root_page_role. role.base.direct is already set correctly for the CPU role, and CR0.PG=1 is required for VMRUN so it will also be correct for nested NPT. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-04-29KVM: x86/mmu: remove ept_ad fieldPaolo Bonzini
The ept_ad field is used during page walk to determine if the guest PTEs have accessed and dirty bits. In the MMU role, the ad_disabled bit represents whether the *shadow* PTEs have the bits, so it would be incorrect to replace PT_HAVE_ACCESSED_DIRTY with just !mmu->mmu_role.base.ad_disabled. However, the similar field in the CPU mode, ad_disabled, is initialized correctly: to the opposite value of ept_ad for shadow EPT, and zero for non-EPT guest paging modes (which always have A/D bits). It is therefore possible to compute PT_HAVE_ACCESSED_DIRTY from the CPU mode, like other page-format fields; it just has to be inverted to account for the different polarity. In fact, now that the CPU mode is distinct from the MMU roles, it would even be possible to remove PT_HAVE_ACCESSED_DIRTY macro altogether, and use !mmu->cpu_role.base.ad_disabled instead. I am not doing this because the macro has a small effect in terms of dead code elimination: text data bss dec hex 103544 16665 112 120321 1d601 # as of this patch 103746 16665 112 120523 1d6cb # without PT_HAVE_ACCESSED_DIRTY Reviewed-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-04-29KVM: x86/mmu: do not recompute root level from kvm_mmu_role_regsPaolo Bonzini
The root_level can be found in the cpu_role (in fact the field is superfluous and could be removed, but one thing at a time). Since there is only one usage left of role_regs_to_root_level, inline it into kvm_calc_cpu_role. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-04-29KVM: x86/mmu: split cpu_role from mmu_rolePaolo Bonzini
Snapshot the state of the processor registers that govern page walk into a new field of struct kvm_mmu. This is a more natural representation than having it *mostly* in mmu_role but not exclusively; the delta right now is represented in other fields, such as root_level. The nested MMU now has only the CPU role; and in fact the new function kvm_calc_cpu_role is analogous to the previous kvm_calc_nested_mmu_role, except that it has role.base.direct equal to !CR0.PG. For a walk-only MMU, "direct" has no meaning, but we set it to !CR0.PG so that role.ext.cr0_pg can go away in a future patch. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-04-29KVM: x86/mmu: remove "bool base_only" argumentsPaolo Bonzini
The argument is always false now that kvm_mmu_calc_root_page_role has been removed. Reviewed-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-04-29KVM: x86/mmu: pull computation of kvm_mmu_role_regs to kvm_init_mmuPaolo Bonzini
The init_kvm_*mmu functions, with the exception of shadow NPT, do not need to know the full values of CR0/CR4/EFER; they only need to know the bits that make up the "role". This cleanup however will take quite a few incremental steps. As a start, pull the common computation of the struct kvm_mmu_role_regs into their caller: all of them extract the struct from the vcpu as the very first step. Reviewed-by: David Matlack <dmatlack@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-04-29KVM: x86/mmu: constify uses of struct kvm_mmu_role_regsPaolo Bonzini
struct kvm_mmu_role_regs is computed just once and then accessed. Use const to make this clearer, even though the const fields of struct kvm_mmu_role_regs already prevent (or make it harder...) to modify the contents of the struct. Reviewed-by: David Matlack <dmatlack@google.com> Reviewed-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-04-29KVM: x86/mmu: nested EPT cannot be used in SMMPaolo Bonzini
The role.base.smm flag is always zero when setting up shadow EPT, do not bother copying it over from vcpu->arch.root_mmu. Reviewed-by: David Matlack <dmatlack@google.com> Reviewed-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-04-29KVM: x86/mmu: Use enable_mmio_caching to track if MMIO caching is enabledSean Christopherson
Clear enable_mmio_caching if hardware can't support MMIO caching and use the dedicated flag to detect if MMIO caching is enabled instead of assuming shadow_mmio_value==0 means MMIO caching is disabled. TDX will use a zero value even when caching is enabled, and is_mmio_spte() isn't so hot that it needs to avoid an extra memory access, i.e. there's no reason to be super clever. And the clever approach may not even be more performant, e.g. gcc-11 lands the extra check on a non-zero value inline, but puts the enable_mmio_caching out-of-line, i.e. avoids the few extra uops for non-MMIO SPTEs. Cc: Isaku Yamahata <isaku.yamahata@intel.com> Cc: Kai Huang <kai.huang@intel.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20220420002747.3287931-1-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-04-29Merge branch 'kvm-fixes-for-5.18-rc5' into HEADPaolo Bonzini
Fixes for (relatively) old bugs, to be merged in both the -rc and next development trees. The merge reconciles the ABI fixes for KVM_EXIT_SYSTEM_EVENT between 5.18 and commit c24a950ec7d6 ("KVM, SEV: Add KVM_EXIT_SHUTDOWN metadata for SEV-ES", 2022-04-13).
2022-04-29KVM: x86/mmu: fix potential races when walking host page tableMingwei Zhang
KVM uses lookup_address_in_mm() to detect the hugepage size that the host uses to map a pfn. The function suffers from several issues: - no usage of READ_ONCE(*). This allows multiple dereference of the same page table entry. The TOCTOU problem because of that may cause KVM to incorrectly treat a newly generated leaf entry as a nonleaf one, and dereference the content by using its pfn value. - the information returned does not match what KVM needs; for non-present entries it returns the level at which the walk was terminated, as long as the entry is not 'none'. KVM needs level information of only 'present' entries, otherwise it may regard a non-present PXE entry as a present large page mapping. - the function is not safe for mappings that can be torn down, because it does not disable IRQs and because it returns a PTE pointer which is never safe to dereference after the function returns. So implement the logic for walking host page tables directly in KVM, and stop using lookup_address_in_mm(). Cc: Sean Christopherson <seanjc@google.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Mingwei Zhang <mizhang@google.com> Message-Id: <20220429031757.2042406-1-mizhang@google.com> [Inline in host_pfn_mapping_level, ensure no semantic change for its callers. - Paolo] Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-04-29KVM: x86/mmu: Do not create SPTEs for GFNs that exceed host.MAXPHYADDRSean Christopherson
Disallow memslots and MMIO SPTEs whose gpa range would exceed the host's MAXPHYADDR, i.e. don't create SPTEs for gfns that exceed host.MAXPHYADDR. The TDP MMU bounds its zapping based on host.MAXPHYADDR, and so if the guest, possibly with help from userspace, manages to coerce KVM into creating a SPTE for an "impossible" gfn, KVM will leak the associated shadow pages (page tables): WARNING: CPU: 10 PID: 1122 at arch/x86/kvm/mmu/tdp_mmu.c:57 kvm_mmu_uninit_tdp_mmu+0x4b/0x60 [kvm] Modules linked in: kvm_intel kvm irqbypass CPU: 10 PID: 1122 Comm: set_memory_regi Tainted: G W 5.18.0-rc1+ #293 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015 RIP: 0010:kvm_mmu_uninit_tdp_mmu+0x4b/0x60 [kvm] Call Trace: <TASK> kvm_arch_destroy_vm+0x130/0x1b0 [kvm] kvm_destroy_vm+0x162/0x2d0 [kvm] kvm_vm_release+0x1d/0x30 [kvm] __fput+0x82/0x240 task_work_run+0x5b/0x90 exit_to_user_mode_prepare+0xd2/0xe0 syscall_exit_to_user_mode+0x1d/0x40 entry_SYSCALL_64_after_hwframe+0x44/0xae </TASK> On bare metal, encountering an impossible gpa in the page fault path is well and truly impossible, barring CPU bugs, as the CPU will signal #PF during the gva=>gpa translation (or a similar failure when stuffing a physical address into e.g. the VMCS/VMCB). But if KVM is running as a VM itself, the MAXPHYADDR enumerated to KVM may not be the actual MAXPHYADDR of the underlying hardware, in which case the hardware will not fault on the illegal-from-KVM's-perspective gpa. Alternatively, KVM could continue allowing the dodgy behavior and simply zap the max possible range. But, for hosts with MAXPHYADDR < 52, that's a (minor) waste of cycles, and more importantly, KVM can't reasonably support impossible memslots when running on bare metal (or with an accurate MAXPHYADDR as a VM). Note, limiting the overhead by checking if KVM is running as a guest is not a safe option as the host isn't required to announce itself to the guest in any way, e.g. doesn't need to set the HYPERVISOR CPUID bit. A second alternative to disallowing the memslot behavior would be to disallow creating a VM with guest.MAXPHYADDR > host.MAXPHYADDR. That restriction is undesirable as there are legitimate use cases for doing so, e.g. using the highest host.MAXPHYADDR out of a pool of heterogeneous systems so that VMs can be migrated between hosts with different MAXPHYADDRs without running afoul of the allow_smaller_maxphyaddr mess. Note that any guest.MAXPHYADDR is valid with shadow paging, and it is even useful in order to test KVM with MAXPHYADDR=52 (i.e. without any reserved physical address bits). The now common kvm_mmu_max_gfn() is inclusive instead of exclusive. The memslot and TDP MMU code want an exclusive value, but the name implies the returned value is inclusive, and the MMIO path needs an inclusive check. Fixes: faaf05b00aec ("kvm: x86/mmu: Support zapping SPTEs in the TDP MMU") Fixes: 524a1e4e381f ("KVM: x86/mmu: Don't leak non-leaf SPTEs when zapping all SPTEs") Cc: stable@vger.kernel.org Cc: Maxim Levitsky <mlevitsk@redhat.com> Cc: Ben Gardon <bgardon@google.com> Cc: David Matlack <dmatlack@google.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20220428233416.2446833-1-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-04-13Merge branch 'kvm-older-features' into HEADPaolo Bonzini
Merge branch for features that did not make it into 5.18: * New ioctls to get/set TSC frequency for a whole VM * Allow userspace to opt out of hypercall patching Nested virtualization improvements for AMD: * Support for "nested nested" optimizations (nested vVMLOAD/VMSAVE, nested vGIF) * Allow AVIC to co-exist with a nested guest running * Fixes for LBR virtualizations when a nested guest is running, and nested LBR virtualization support * PAUSE filtering for nested hypervisors Guest support: * Decoupling of vcpu_is_preempted from PV spinlocks Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-04-05KVM: x86/mmu: Resolve nx_huge_pages when kvm.ko is loadedSean Christopherson
Resolve nx_huge_pages to true/false when kvm.ko is loaded, leaving it as -1 is technically undefined behavior when its value is read out by param_get_bool(), as boolean values are supposed to be '0' or '1'. Alternatively, KVM could define a custom getter for the param, but the auto value doesn't depend on the vendor module in any way, and printing "auto" would be unnecessarily unfriendly to the user. In addition to fixing the undefined behavior, resolving the auto value also fixes the scenario where the auto value resolves to N and no vendor module is loaded. Previously, -1 would result in Y being printed even though KVM would ultimately disable the mitigation. Rename the existing MMU module init/exit helpers to clarify that they're invoked with respect to the vendor module, and add comments to document why KVM has two separate "module init" flows. ========================================================================= UBSAN: invalid-load in kernel/params.c:320:33 load of value 255 is not a valid value for type '_Bool' CPU: 6 PID: 892 Comm: tail Not tainted 5.17.0-rc3+ #799 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015 Call Trace: <TASK> dump_stack_lvl+0x34/0x44 ubsan_epilogue+0x5/0x40 __ubsan_handle_load_invalid_value.cold+0x43/0x48 param_get_bool.cold+0xf/0x14 param_attr_show+0x55/0x80 module_attr_show+0x1c/0x30 sysfs_kf_seq_show+0x93/0xc0 seq_read_iter+0x11c/0x450 new_sync_read+0x11b/0x1a0 vfs_read+0xf0/0x190 ksys_read+0x5f/0xe0 do_syscall_64+0x3b/0xc0 entry_SYSCALL_64_after_hwframe+0x44/0xae </TASK> ========================================================================= Fixes: b8e8c8303ff2 ("kvm: mmu: ITLB_MULTIHIT mitigation") Cc: stable@vger.kernel.org Reported-by: Bruno Goncalves <bgoncalv@redhat.com> Reported-by: Jan Stancek <jstancek@redhat.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20220331221359.3912754-1-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-04-02KVM: x86/mmu: Don't rebuild page when the page is synced and no tlb flushing ↵Hou Wenlong
is required Before Commit c3e5e415bc1e6 ("KVM: X86: Change kvm_sync_page() to return true when remote flush is needed"), the return value of kvm_sync_page() indicates whether the page is synced, and kvm_mmu_get_page() would rebuild page when the sync fails. But now, kvm_sync_page() returns false when the page is synced and no tlb flushing is required, which leads to rebuild page in kvm_mmu_get_page(). So return the return value of mmu->sync_page() directly and check it in kvm_mmu_get_page(). If the sync fails, the page will be zapped and the invalid_list is not empty, so set flush as true is accepted in mmu_sync_children(). Cc: stable@vger.kernel.org Fixes: c3e5e415bc1e6 ("KVM: X86: Change kvm_sync_page() to return true when remote flush is needed") Signed-off-by: Hou Wenlong <houwenlong.hwl@antgroup.com> Acked-by: Lai Jiangshan <jiangshanlai@gmail.com> Message-Id: <0dabeeb789f57b0d793f85d073893063e692032d.1647336064.git.houwenlong.hwl@antgroup.com> [mmu_sync_children should not flush if the page is zapped. - Paolo] Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-04-02KVM: x86: mmu: trace kvm_mmu_set_spte after the new SPTE was setMaxim Levitsky
It makes more sense to print new SPTE value than the old value. Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com> Reviewed-by: Sean Christopherson <seanjc@google.com> Message-Id: <20220302102457.588450-1-mlevitsk@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-04-02KVM: X86: Handle implicit supervisor access with SMAPLai Jiangshan
There are two kinds of implicit supervisor access implicit supervisor access when CPL = 3 implicit supervisor access when CPL < 3 Current permission_fault() handles only the first kind for SMAP. But if the access is implicit when SMAP is on, data may not be read nor write from any user-mode address regardless the current CPL. So the second kind should be also supported. The first kind can be detect via CPL and access mode: if it is supervisor access and CPL = 3, it must be implicit supervisor access. But it is not possible to detect the second kind without extra information, so this patch adds an artificial PFERR_EXPLICIT_ACCESS into @access. This extra information also works for the first kind, so the logic is changed to use this information for both cases. The value of PFERR_EXPLICIT_ACCESS is deliberately chosen to be bit 48 which is in the most significant 16 bits of u64 and less likely to be forced to change due to future hardware uses it. This patch removes the call to ->get_cpl() for access mode is determined by @access. Not only does it reduce a function call, but also remove confusions when the permission is checked for nested TDP. The nested TDP shouldn't have SMAP checking nor even the L2's CPL have any bearing on it. The original code works just because it is always user walk for NPT and SMAP fault is not set for EPT in update_permission_bitmask. Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com> Message-Id: <20220311070346.45023-5-jiangshanlai@gmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-04-02KVM: X86: Fix comments in update_permission_bitmaskLai Jiangshan
The commit 09f037aa48f3 ("KVM: MMU: speedup update_permission_bitmask") refactored the code of update_permission_bitmask() and change the comments. It added a condition into a list to match the new code, so the number/order for conditions in the comments should be updated too. Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com> Message-Id: <20220311070346.45023-3-jiangshanlai@gmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-04-02KVM: X86: Change the type of access u32 to u64Lai Jiangshan
Change the type of access u32 to u64 for FNAME(walk_addr) and ->gva_to_gpa(). The kinds of accesses are usually combinations of UWX, and VMX/SVM's nested paging adds a new factor of access: is it an access for a guest page table or for a final guest physical address. And SMAP relies a factor for supervisor access: explicit or implicit. So @access in FNAME(walk_addr) and ->gva_to_gpa() is better to include all these information to do the walk. Although @access(u32) has enough bits to encode all the kinds, this patch extends it to u64: o Extra bits will be in the higher 32 bits, so that we can easily obtain the traditional access mode (UWX) by converting it to u32. o Reuse the value for the access kind defined by SVM's nested paging (PFERR_GUEST_FINAL_MASK and PFERR_GUEST_PAGE_MASK) as @error_code in kvm_handle_page_fault(). Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com> Message-Id: <20220311070346.45023-2-jiangshanlai@gmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-04-02KVM: x86/mmu: Zap only TDP MMU leafs in zap range and mmu_notifier unmapSean Christopherson
Re-introduce zapping only leaf SPTEs in kvm_zap_gfn_range() and kvm_tdp_mmu_unmap_gfn_range(), this time without losing a pending TLB flush when processing multiple roots (including nested TDP shadow roots). Dropping the TLB flush resulted in random crashes when running Hyper-V Server 2019 in a guest with KSM enabled in the host (or any source of mmu_notifier invalidations, KSM is just the easiest to force). This effectively revert commits 873dd122172f8cce329113cfb0dfe3d2344d80c0 and fcb93eb6d09dd302cbef22bd95a5858af75e4156, and thus restores commit cf3e26427c08ad9015956293ab389004ac6a338e, plus this delta on top: bool kvm_tdp_mmu_zap_leafs(struct kvm *kvm, int as_id, gfn_t start, gfn_t end, struct kvm_mmu_page *root; for_each_tdp_mmu_root_yield_safe(kvm, root, as_id) - flush = tdp_mmu_zap_leafs(kvm, root, start, end, can_yield, false); + flush = tdp_mmu_zap_leafs(kvm, root, start, end, can_yield, flush); return flush; } Cc: Ben Gardon <bgardon@google.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Tested-by: Vitaly Kuznetsov <vkuznets@redhat.com> Message-Id: <20220325230348.2587437-1-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-04-02KVM: MMU: propagate alloc_workqueue failurePaolo Bonzini
If kvm->arch.tdp_mmu_zap_wq cannot be created, the failure has to be propagated up to kvm_mmu_init_vm and kvm_arch_init_vm. kvm_arch_init_vm also has to undo all the initialization, so group all the MMU initialization code at the beginning and handle cleaning up of kvm_page_track_init. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>