summaryrefslogtreecommitdiff
path: root/arch/x86/kvm/x86.c
AgeCommit message (Collapse)Author
2022-07-19KVM: x86: Protect the unused bits in MSR exiting flagsAaron Lewis
The flags for KVM_CAP_X86_USER_SPACE_MSR and KVM_X86_SET_MSR_FILTER have no protection for their unused bits. Without protection, future development for these features will be difficult. Add the protection needed to make it possible to extend these features in the future. Signed-off-by: Aaron Lewis <aaronlewis@google.com> Message-Id: <20220714161314.1715227-1-aaronlewis@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-07-14KVM: x86: Fully initialize 'struct kvm_lapic_irq' in kvm_pv_kick_cpu_op()Vitaly Kuznetsov
'vector' and 'trig_mode' fields of 'struct kvm_lapic_irq' are left uninitialized in kvm_pv_kick_cpu_op(). While these fields are normally not needed for APIC_DM_REMRD, they're still referenced by __apic_accept_irq() for trace_kvm_apic_accept_irq(). Fully initialize the structure to avoid consuming random stack memory. Fixes: a183b638b61c ("KVM: x86: make apic_accept_irq tracepoint more generic") Reported-by: syzbot+d6caa905917d353f0d07@syzkaller.appspotmail.com Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Reviewed-by: Sean Christopherson <seanjc@google.com> Message-Id: <20220708125147.593975-1-vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-07-14kvm: stats: tell userspace which values are booleanPaolo Bonzini
Some of the statistics values exported by KVM are always only 0 or 1. It can be useful to export this fact to userspace so that it can track them specially (for example by polling the value every now and then to compute a % of time spent in a specific state). Therefore, add "boolean value" as a new "unit". While it is not exactly a unit, it walks and quacks like one. In particular, using the type would be wrong because boolean values could be instantaneous or peak values (e.g. "is the rmap allocated?") or even two-bucket histograms (e.g. "number of posted vs. non-posted interrupt injections"). Suggested-by: Amneesh Singh <natto@weirdnatto.in> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-06-27x86/kvm/vmx: Make noinstr cleanPeter Zijlstra
The recent mmio_stale_data fixes broke the noinstr constraints: vmlinux.o: warning: objtool: vmx_vcpu_enter_exit+0x15b: call to wrmsrl.constprop.0() leaves .noinstr.text section vmlinux.o: warning: objtool: vmx_vcpu_enter_exit+0x1bf: call to kvm_arch_has_assigned_device() leaves .noinstr.text section make it all happy again. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Borislav Petkov <bp@suse.de>
2022-06-14Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvmLinus Torvalds
Pull kvm fixes from Paolo Bonzini: "While last week's pull request contained miscellaneous fixes for x86, this one covers other architectures, selftests changes, and a bigger series for APIC virtualization bugs that were discovered during 5.20 development. The idea is to base 5.20 development for KVM on top of this tag. ARM64: - Properly reset the SVE/SME flags on vcpu load - Fix a vgic-v2 regression regarding accessing the pending state of a HW interrupt from userspace (and make the code common with vgic-v3) - Fix access to the idreg range for protected guests - Ignore 'kvm-arm.mode=protected' when using VHE - Return an error from kvm_arch_init_vm() on allocation failure - A bunch of small cleanups (comments, annotations, indentation) RISC-V: - Typo fix in arch/riscv/kvm/vmid.c - Remove broken reference pattern from MAINTAINERS entry x86-64: - Fix error in page tables with MKTME enabled - Dirty page tracking performance test extended to running a nested guest - Disable APICv/AVIC in cases that it cannot implement correctly" [ This merge also fixes a misplaced end parenthesis bug introduced in commit 3743c2f02517 ("KVM: x86: inhibit APICv/AVIC on changes to APIC ID or APIC base") pointed out by Sean Christopherson ] Link: https://lore.kernel.org/all/20220610191813.371682-1-seanjc@google.com/ * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (34 commits) KVM: selftests: Restrict test region to 48-bit physical addresses when using nested KVM: selftests: Add option to run dirty_log_perf_test vCPUs in L2 KVM: selftests: Clean up LIBKVM files in Makefile KVM: selftests: Link selftests directly with lib object files KVM: selftests: Drop unnecessary rule for STATIC_LIBS KVM: selftests: Add a helper to check EPT/VPID capabilities KVM: selftests: Move VMX_EPT_VPID_CAP_AD_BITS to vmx.h KVM: selftests: Refactor nested_map() to specify target level KVM: selftests: Drop stale function parameter comment for nested_map() KVM: selftests: Add option to create 2M and 1G EPT mappings KVM: selftests: Replace x86_page_size with PG_LEVEL_XX KVM: x86: SVM: fix nested PAUSE filtering when L0 intercepts PAUSE KVM: x86: SVM: drop preempt-safe wrappers for avic_vcpu_load/put KVM: x86: disable preemption around the call to kvm_arch_vcpu_{un|}blocking KVM: x86: disable preemption while updating apicv inhibition KVM: x86: SVM: fix avic_kick_target_vcpus_fast KVM: x86: SVM: remove avic's broken code that updated APIC ID KVM: x86: inhibit APICv/AVIC on changes to APIC ID or APIC base KVM: x86: document AVIC/APICv inhibit reasons KVM: x86/mmu: Set memory encryption "value", not "mask", in shadow PDPTRs ...
2022-06-14Merge tag 'x86-bugs-2022-06-01' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 MMIO stale data fixes from Thomas Gleixner: "Yet another hw vulnerability with a software mitigation: Processor MMIO Stale Data. They are a class of MMIO-related weaknesses which can expose stale data by propagating it into core fill buffers. Data which can then be leaked using the usual speculative execution methods. Mitigations include this set along with microcode updates and are similar to MDS and TAA vulnerabilities: VERW now clears those buffers too" * tag 'x86-bugs-2022-06-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/speculation/mmio: Print SMT warning KVM: x86/speculation: Disable Fill buffer clear within guests x86/speculation/mmio: Reuse SRBDS mitigation for SBDS x86/speculation/srbds: Update SRBDS mitigation selection x86/speculation/mmio: Add sysfs reporting for Processor MMIO Stale Data x86/speculation/mmio: Enable CPU Fill buffer clearing on idle x86/bugs: Group MDS, TAA & Processor MMIO Stale Data mitigations x86/speculation/mmio: Add mitigation for Processor MMIO Stale Data x86/speculation: Add a common function for MD_CLEAR mitigation update x86/speculation/mmio: Enumerate Processor MMIO Stale Data bug Documentation: Add documentation for Processor MMIO Stale Data
2022-06-09KVM: x86: disable preemption while updating apicv inhibitionMaxim Levitsky
Currently nothing prevents preemption in kvm_vcpu_update_apicv. On SVM, If the preemption happens after we update the vcpu->arch.apicv_active, the preemption itself will 'update' the inhibition since the AVIC will be first disabled on vCPU unload and then enabled, when the current task is loaded again. Then we will try to update it again, which will lead to a warning in __avic_vcpu_load, that the AVIC is already enabled. Fix this by disabling preemption in this code. Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com> Message-Id: <20220606180829.102503-6-mlevitsk@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-06-09Merge tag 'kvm-riscv-fixes-5.19-1' of https://github.com/kvm-riscv/linux ↵Paolo Bonzini
into HEAD KVM/riscv fixes for 5.19, take #1 - Typo fix in arch/riscv/kvm/vmid.c - Remove broken reference pattern from MAINTAINERS entry
2022-06-08Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvmLinus Torvalds
Pull KVM fixes from Paolo Bonzini: - syzkaller NULL pointer dereference - TDP MMU performance issue with disabling dirty logging - 5.14 regression with SVM TSC scaling - indefinite stall on applying live patches - unstable selftest - memory leak from wrong copy-and-paste - missed PV TLB flush when racing with emulation * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: KVM: x86: do not report a vCPU as preempted outside instruction boundaries KVM: x86: do not set st->preempted when going back to user space KVM: SVM: fix tsc scaling cache logic KVM: selftests: Make hyperv_clock selftest more stable KVM: x86/MMU: Zap non-leaf SPTEs when disabling dirty logging x86: drop bogus "cc" clobber from __try_cmpxchg_user_asm() KVM: x86/mmu: Check every prev_roots in __kvm_mmu_free_obsolete_roots() entry/kvm: Exit to user mode when TIF_NOTIFY_SIGNAL is set KVM: Don't null dereference ops->destroy
2022-06-08KVM: x86: do not report a vCPU as preempted outside instruction boundariesPaolo Bonzini
If a vCPU is outside guest mode and is scheduled out, it might be in the process of making a memory access. A problem occurs if another vCPU uses the PV TLB flush feature during the period when the vCPU is scheduled out, and a virtual address has already been translated but has not yet been accessed, because this is equivalent to using a stale TLB entry. To avoid this, only report a vCPU as preempted if sure that the guest is at an instruction boundary. A rescheduling request will be delivered to the host physical CPU as an external interrupt, so for simplicity consider any vmexit *not* instruction boundary except for external interrupts. It would in principle be okay to report the vCPU as preempted also if it is sleeping in kvm_vcpu_block(): a TLB flush IPI will incur the vmentry/vmexit overhead unnecessarily, and optimistic spinning is also unlikely to succeed. However, leave it for later because right now kvm_vcpu_check_block() is doing memory accesses. Even though the TLB flush issue only applies to virtual memory address, it's very much preferrable to be conservative. Reported-by: Jann Horn <jannh@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-06-08KVM: x86: do not set st->preempted when going back to user spacePaolo Bonzini
Similar to the Xen path, only change the vCPU's reported state if the vCPU was actually preempted. The reason for KVM's behavior is that for example optimistic spinning might not be a good idea if the guest is doing repeated exits to userspace; however, it is confusing and unlikely to make a difference, because well-tuned guests will hardly ever exit KVM_RUN in the first place. Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-06-05Merge tag 'x86-cleanups-2022-06-05' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 cleanups from Thomas Gleixner: "A set of small x86 cleanups: - Remove unused headers in the IDT code - Kconfig indendation and comment fixes - Fix all 'the the' typos in one go instead of waiting for bots to fix one at a time" * tag 'x86-cleanups-2022-06-05' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86: Fix all occurences of the "the the" typo x86/idt: Remove unused headers x86/Kconfig: Fix indentation of arch/x86/Kconfig.debug x86/Kconfig: Fix indentation and add endif comments to arch/x86/Kconfig
2022-05-27x86: Fix all occurences of the "the the" typoBo Liu
Rather than waiting for the bots to fix these one-by-one, fix all occurences of "the the" throughout arch/x86. Signed-off-by: Bo Liu <liubo03@inspur.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: Paolo Bonzini <pbonzini@redhat.com> Link: https://lore.kernel.org/r/20220527061400.5694-1-liubo03@inspur.com
2022-05-25KVM: x86: avoid calling x86 emulator without a decoded instructionSean Christopherson
Whenever x86_decode_emulated_instruction() detects a breakpoint, it returns the value that kvm_vcpu_check_breakpoint() writes into its pass-by-reference second argument. Unfortunately this is completely bogus because the expected outcome of x86_decode_emulated_instruction is an EMULATION_* value. Then, if kvm_vcpu_check_breakpoint() does "*r = 0" (corresponding to a KVM_EXIT_DEBUG userspace exit), it is misunderstood as EMULATION_OK and x86_emulate_instruction() is called without having decoded the instruction. This causes various havoc from running with a stale emulation context. The fix is to move the call to kvm_vcpu_check_breakpoint() where it was before commit 4aa2691dcbd3 ("KVM: x86: Factor out x86 instruction emulation with decoding") introduced x86_decode_emulated_instruction(). The other caller of the function does not need breakpoint checks, because it is invoked as part of a vmexit and the processor has already checked those before executing the instruction that #GP'd. This fixes CVE-2022-1852. Reported-by: Qiuhao Li <qiuhao@sysec.org> Reported-by: Gaoning Pan <pgn@zju.edu.cn> Reported-by: Yongkang Jia <kangel@zju.edu.cn> Fixes: 4aa2691dcbd3 ("KVM: x86: Factor out x86 instruction emulation with decoding") Cc: stable@vger.kernel.org Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20220311032801.3467418-2-seanjc@google.com> [Rewrote commit message according to Qiuhao's report, since a patch already existed to fix the bug. - Paolo] Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-05-25Merge tag 'kvmarm-5.19' of ↵Paolo Bonzini
git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD KVM/arm64 updates for 5.19 - Add support for the ARMv8.6 WFxT extension - Guard pages for the EL2 stacks - Trap and emulate AArch32 ID registers to hide unsupported features - Ability to select and save/restore the set of hypercalls exposed to the guest - Support for PSCI-initiated suspend in collaboration with userspace - GICv3 register-based LPI invalidation support - Move host PMU event merging into the vcpu data structure - GICv3 ITS save/restore fixes - The usual set of small-scale cleanups and fixes [Due to the conflict, KVM_SYSTEM_EVENT_SEV_TERM is relocated from 4 to 6. - Paolo]
2022-05-25KVM: LAPIC: Trace LAPIC timer expiration on every vmentryWanpeng Li
In commit ec0671d5684a ("KVM: LAPIC: Delay trace_kvm_wait_lapic_expire tracepoint to after vmexit", 2019-06-04), trace_kvm_wait_lapic_expire was moved after guest_exit_irqoff() because invoking tracepoints within kvm_guest_enter/kvm_guest_exit caused a lockdep splat. These days this is not necessary, because commit 87fa7f3e98a1 ("x86/kvm: Move context tracking where it belongs", 2020-07-09) restricted the RCU extended quiescent state to be closer to vmentry/vmexit. Moving the tracepoint back to __kvm_wait_lapic_expire is more accurate, because it will be reported even if vcpu_enter_guest causes multiple vmentries via the IPI/Timer fast paths, and it allows the removal of advance_expire_delta. Signed-off-by: Wanpeng Li <wanpengli@tencent.com> Message-Id: <1650961551-38390-1-git-send-email-wanpengli@tencent.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-05-21KVM: x86/speculation: Disable Fill buffer clear within guestsPawan Gupta
The enumeration of MD_CLEAR in CPUID(EAX=7,ECX=0).EDX{bit 10} is not an accurate indicator on all CPUs of whether the VERW instruction will overwrite fill buffers. FB_CLEAR enumeration in IA32_ARCH_CAPABILITIES{bit 17} covers the case of CPUs that are not vulnerable to MDS/TAA, indicating that microcode does overwrite fill buffers. Guests running in VMM environments may not be aware of all the capabilities/vulnerabilities of the host CPU. Specifically, a guest may apply MDS/TAA mitigations when a virtual CPU is enumerated as vulnerable to MDS/TAA even when the physical CPU is not. On CPUs that enumerate FB_CLEAR_CTRL the VMM may set FB_CLEAR_DIS to skip overwriting of fill buffers by the VERW instruction. This is done by setting FB_CLEAR_DIS during VMENTER and resetting on VMEXIT. For guests that enumerate FB_CLEAR (explicitly asking for fill buffer clear capability) the VMM will not use FB_CLEAR_DIS. Irrespective of guest state, host overwrites CPU buffers before VMENTER to protect itself from an MMIO capable guest, as part of mitigation for MMIO Stale Data vulnerabilities. Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com> Signed-off-by: Borislav Petkov <bp@suse.de>
2022-05-12KVM: x86: a vCPU with a pending triple fault is runnablePaolo Bonzini
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-05-12KVM: x86/mmu: Expand and clean up page fault statsSean Christopherson
Expand and clean up the page fault stats. The current stats are at best incomplete, and at worst misleading. Differentiate between faults that are actually fixed vs those that result in an MMIO SPTE being created, track faults that are spurious, faults that trigger emulation, faults that that are fixed in the fast path, and last but not least, track the number of faults that are taken. Note, the number of faults that require emulation for write-protected shadow pages can roughly be calculated by subtracting the number of MMIO SPTEs created from the overall number of faults that trigger emulation. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20220423034752.1161007-10-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-05-12KVM: x86/mmu: Make all page fault handlers internal to the MMUSean Christopherson
Move kvm_arch_async_page_ready() to mmu.c where it belongs, and move all of the page fault handling collateral that was in mmu.h purely for the async #PF handler into mmu_internal.h, where it belongs. This will allow kvm_mmu_do_page_fault() to act on the RET_PF_* return without having to expose those enums outside of the MMU. No functional change intended. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20220423034752.1161007-8-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-05-12KVM: x86: fix typo in __try_cmpxchg_user causing non-atomicnessMaxim Levitsky
This shows up as a TDP MMU leak when running nested. Non-working cmpxchg on L0 relies makes L1 install two different shadow pages under same spte, and one of them is leaked. Fixes: 1c2361f667f36 ("KVM: x86: Use __try_cmpxchg_user() to emulate atomic accesses") Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com> Message-Id: <20220512101420.306759-1-mlevitsk@redhat.com> Reviewed-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-05-02KVM: x86: avoid loading a vCPU after .vm_destroy was calledMaxim Levitsky
This can cause various unexpected issues, since VM is partially destroyed at that point. For example when AVIC is enabled, this causes avic_vcpu_load to access physical id page entry which is already freed by .vm_destroy. Fixes: 8221c1370056 ("svm: Manage vcpu load/unload when enable AVIC") Cc: stable@vger.kernel.org Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com> Message-Id: <20220322172449.235575-2-mlevitsk@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-04-29KVM: SVM: Introduce trace point for the slow-path of avic_kic_target_vcpusSuravee Suthikulpanit
This can help identify potential performance issues when handles AVIC incomplete IPI due vCPU not running. Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com> Signed-off-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-Id: <20220420154954.19305-3-suravee.suthikulpanit@amd.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-04-29KVM: x86/mmu: replace direct_map with root_role.directPaolo Bonzini
direct_map is always equal to the direct field of the root page's role: - for shadow paging, direct_map is true if CR0.PG=0 and root_role.direct is copied from cpu_role.base.direct - for TDP, it is always true and root_role.direct is also always true - for shadow TDP, it is always false and root_role.direct is also always false Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-04-29KVM: x86: Clean up and document nested #PF workaroundSean Christopherson
Replace the per-vendor hack-a-fix for KVM's #PF => #PF => #DF workaround with an explicit, common workaround in kvm_inject_emulated_page_fault(). Aside from being a hack, the current approach is brittle and incomplete, e.g. nSVM's KVM_SET_NESTED_STATE fails to set ->inject_page_fault(), and nVMX fails to apply the workaround when VMX is intercepting #PF due to allow_smaller_maxphyaddr=1. Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-04-29Merge branch 'kvm-fixes-for-5.18-rc5' into HEADPaolo Bonzini
Fixes for (relatively) old bugs, to be merged in both the -rc and next development trees. The merge reconciles the ABI fixes for KVM_EXIT_SYSTEM_EVENT between 5.18 and commit c24a950ec7d6 ("KVM, SEV: Add KVM_EXIT_SHUTDOWN metadata for SEV-ES", 2022-04-13).
2022-04-29Merge branch 'kvm-fixes-for-5.18-rc5' into HEADPaolo Bonzini
Fixes for (relatively) old bugs, to be merged in both the -rc and next development trees: * Fix potential races when walking host page table * Fix bad user ABI for KVM_EXIT_SYSTEM_EVENT * Fix shadow page table leak when KVM runs nested
2022-04-29KVM: fix bad user ABI for KVM_EXIT_SYSTEM_EVENTPaolo Bonzini
When KVM_EXIT_SYSTEM_EVENT was introduced, it included a flags member that at the time was unused. Unfortunately this extensibility mechanism has several issues: - x86 is not writing the member, so it would not be possible to use it on x86 except for new events - the member is not aligned to 64 bits, so the definition of the uAPI struct is incorrect for 32- on 64-bit userspace. This is a problem for RISC-V, which supports CONFIG_KVM_COMPAT, but fortunately usage of flags was only introduced in 5.18. Since padding has to be introduced, place a new field in there that tells if the flags field is valid. To allow further extensibility, in fact, change flags to an array of 16 values, and store how many of the values are valid. The availability of the new ndata field is tied to a system capability; all architectures are changed to fill in the field. To avoid breaking compilation of userspace that was using the flags field, provide a userspace-only union to overlap flags with data[0]. The new field is placed at the same offset for both 32- and 64-bit userspace. Cc: Will Deacon <will@kernel.org> Cc: Marc Zyngier <maz@kernel.org> Cc: Peter Gonda <pgonda@google.com> Cc: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Reported-by: kernel test robot <lkp@intel.com> Message-Id: <20220422103013.34832-1-pbonzini@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-04-29KVM: x86/mmu: Do not create SPTEs for GFNs that exceed host.MAXPHYADDRSean Christopherson
Disallow memslots and MMIO SPTEs whose gpa range would exceed the host's MAXPHYADDR, i.e. don't create SPTEs for gfns that exceed host.MAXPHYADDR. The TDP MMU bounds its zapping based on host.MAXPHYADDR, and so if the guest, possibly with help from userspace, manages to coerce KVM into creating a SPTE for an "impossible" gfn, KVM will leak the associated shadow pages (page tables): WARNING: CPU: 10 PID: 1122 at arch/x86/kvm/mmu/tdp_mmu.c:57 kvm_mmu_uninit_tdp_mmu+0x4b/0x60 [kvm] Modules linked in: kvm_intel kvm irqbypass CPU: 10 PID: 1122 Comm: set_memory_regi Tainted: G W 5.18.0-rc1+ #293 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015 RIP: 0010:kvm_mmu_uninit_tdp_mmu+0x4b/0x60 [kvm] Call Trace: <TASK> kvm_arch_destroy_vm+0x130/0x1b0 [kvm] kvm_destroy_vm+0x162/0x2d0 [kvm] kvm_vm_release+0x1d/0x30 [kvm] __fput+0x82/0x240 task_work_run+0x5b/0x90 exit_to_user_mode_prepare+0xd2/0xe0 syscall_exit_to_user_mode+0x1d/0x40 entry_SYSCALL_64_after_hwframe+0x44/0xae </TASK> On bare metal, encountering an impossible gpa in the page fault path is well and truly impossible, barring CPU bugs, as the CPU will signal #PF during the gva=>gpa translation (or a similar failure when stuffing a physical address into e.g. the VMCS/VMCB). But if KVM is running as a VM itself, the MAXPHYADDR enumerated to KVM may not be the actual MAXPHYADDR of the underlying hardware, in which case the hardware will not fault on the illegal-from-KVM's-perspective gpa. Alternatively, KVM could continue allowing the dodgy behavior and simply zap the max possible range. But, for hosts with MAXPHYADDR < 52, that's a (minor) waste of cycles, and more importantly, KVM can't reasonably support impossible memslots when running on bare metal (or with an accurate MAXPHYADDR as a VM). Note, limiting the overhead by checking if KVM is running as a guest is not a safe option as the host isn't required to announce itself to the guest in any way, e.g. doesn't need to set the HYPERVISOR CPUID bit. A second alternative to disallowing the memslot behavior would be to disallow creating a VM with guest.MAXPHYADDR > host.MAXPHYADDR. That restriction is undesirable as there are legitimate use cases for doing so, e.g. using the highest host.MAXPHYADDR out of a pool of heterogeneous systems so that VMs can be migrated between hosts with different MAXPHYADDRs without running afoul of the allow_smaller_maxphyaddr mess. Note that any guest.MAXPHYADDR is valid with shadow paging, and it is even useful in order to test KVM with MAXPHYADDR=52 (i.e. without any reserved physical address bits). The now common kvm_mmu_max_gfn() is inclusive instead of exclusive. The memslot and TDP MMU code want an exclusive value, but the name implies the returned value is inclusive, and the MMIO path needs an inclusive check. Fixes: faaf05b00aec ("kvm: x86/mmu: Support zapping SPTEs in the TDP MMU") Fixes: 524a1e4e381f ("KVM: x86/mmu: Don't leak non-leaf SPTEs when zapping all SPTEs") Cc: stable@vger.kernel.org Cc: Maxim Levitsky <mlevitsk@redhat.com> Cc: Ben Gardon <bgardon@google.com> Cc: David Matlack <dmatlack@google.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20220428233416.2446833-1-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-04-21KVM: SEV: add cache flush to solve SEV cache incoherency issuesMingwei Zhang
Flush the CPU caches when memory is reclaimed from an SEV guest (where reclaim also includes it being unmapped from KVM's memslots). Due to lack of coherency for SEV encrypted memory, failure to flush results in silent data corruption if userspace is malicious/broken and doesn't ensure SEV guest memory is properly pinned and unpinned. Cache coherency is not enforced across the VM boundary in SEV (AMD APM vol.2 Section 15.34.7). Confidential cachelines, generated by confidential VM guests have to be explicitly flushed on the host side. If a memory page containing dirty confidential cachelines was released by VM and reallocated to another user, the cachelines may corrupt the new user at a later time. KVM takes a shortcut by assuming all confidential memory remain pinned until the end of VM lifetime. Therefore, KVM does not flush cache at mmu_notifier invalidation events. Because of this incorrect assumption and the lack of cache flushing, malicous userspace can crash the host kernel: creating a malicious VM and continuously allocates/releases unpinned confidential memory pages when the VM is running. Add cache flush operations to mmu_notifier operations to ensure that any physical memory leaving the guest VM get flushed. In particular, hook mmu_notifier_invalidate_range_start and mmu_notifier_release events and flush cache accordingly. The hook after releasing the mmu lock to avoid contention with other vCPUs. Cc: stable@vger.kernel.org Suggested-by: Sean Christpherson <seanjc@google.com> Reported-by: Mingwei Zhang <mizhang@google.com> Signed-off-by: Mingwei Zhang <mizhang@google.com> Message-Id: <20220421031407.2516575-4-mizhang@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-04-21KVM: x86: Skip KVM_GUESTDBG_BLOCKIRQ APICv update if APICv is disabledSean Christopherson
Skip the APICv inhibit update for KVM_GUESTDBG_BLOCKIRQ if APICv is disabled at the module level to avoid having to acquire the mutex and potentially process all vCPUs. The DISABLE inhibit will (barring bugs) never be lifted, so piling on more inhibits is unnecessary. Fixes: cae72dcc3b21 ("KVM: x86: inhibit APICv when KVM_GUESTDBG_BLOCKIRQ active") Cc: Maxim Levitsky <mlevitsk@redhat.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com> Message-Id: <20220420013732.3308816-5-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-04-21KVM: x86: Pend KVM_REQ_APICV_UPDATE during vCPU creation to fix a raceSean Christopherson
Make a KVM_REQ_APICV_UPDATE request when creating a vCPU with an in-kernel local APIC and APICv enabled at the module level. Consuming kvm_apicv_activated() and stuffing vcpu->arch.apicv_active directly can race with __kvm_set_or_clear_apicv_inhibit(), as vCPU creation happens before the vCPU is fully onlined, i.e. it won't get the request made to "all" vCPUs. If APICv is globally inhibited between setting apicv_active and onlining the vCPU, the vCPU will end up running with APICv enabled and trigger KVM's sanity check. Mark APICv as active during vCPU creation if APICv is enabled at the module level, both to be optimistic about it's final state, e.g. to avoid additional VMWRITEs on VMX, and because there are likely bugs lurking since KVM checks apicv_active in multiple vCPU creation paths. While keeping the current behavior of consuming kvm_apicv_activated() is arguably safer from a regression perspective, force apicv_active so that vCPU creation runs with deterministic state and so that if there are bugs, they are found sooner than later, i.e. not when some crazy race condition is hit. WARNING: CPU: 0 PID: 484 at arch/x86/kvm/x86.c:9877 vcpu_enter_guest+0x2ae3/0x3ee0 arch/x86/kvm/x86.c:9877 Modules linked in: CPU: 0 PID: 484 Comm: syz-executor361 Not tainted 5.16.13 #2 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1ubuntu1~cloud0 04/01/2014 RIP: 0010:vcpu_enter_guest+0x2ae3/0x3ee0 arch/x86/kvm/x86.c:9877 Call Trace: <TASK> vcpu_run arch/x86/kvm/x86.c:10039 [inline] kvm_arch_vcpu_ioctl_run+0x337/0x15e0 arch/x86/kvm/x86.c:10234 kvm_vcpu_ioctl+0x4d2/0xc80 arch/x86/kvm/../../../virt/kvm/kvm_main.c:3727 vfs_ioctl fs/ioctl.c:51 [inline] __do_sys_ioctl fs/ioctl.c:874 [inline] __se_sys_ioctl fs/ioctl.c:860 [inline] __x64_sys_ioctl+0x16d/0x1d0 fs/ioctl.c:860 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x38/0x90 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x44/0xae The bug was hit by a syzkaller spamming VM creation with 2 vCPUs and a call to KVM_SET_GUEST_DEBUG. r0 = openat$kvm(0xffffffffffffff9c, &(0x7f0000000000), 0x0, 0x0) r1 = ioctl$KVM_CREATE_VM(r0, 0xae01, 0x0) ioctl$KVM_CAP_SPLIT_IRQCHIP(r1, 0x4068aea3, &(0x7f0000000000)) (async) r2 = ioctl$KVM_CREATE_VCPU(r1, 0xae41, 0x0) (async) r3 = ioctl$KVM_CREATE_VCPU(r1, 0xae41, 0x400000000000002) ioctl$KVM_SET_GUEST_DEBUG(r3, 0x4048ae9b, &(0x7f00000000c0)={0x5dda9c14aa95f5c5}) ioctl$KVM_RUN(r2, 0xae80, 0x0) Reported-by: Gaoning Pan <pgn@zju.edu.cn> Reported-by: Yongkang Jia <kangel@zju.edu.cn> Fixes: 8df14af42f00 ("kvm: x86: Add support for dynamic APICv activation") Cc: stable@vger.kernel.org Cc: Maxim Levitsky <mlevitsk@redhat.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com> Message-Id: <20220420013732.3308816-4-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-04-21KVM: x86: Tag APICv DISABLE inhibit, not ABSENT, if APICv is disabledSean Christopherson
Set the DISABLE inhibit, not the ABSENT inhibit, if APICv is disabled via module param. A recent refactoring to add a wrapper for setting/clearing inhibits unintentionally changed the flag, probably due to a copy+paste goof. Fixes: 4f4c4a3ee53c ("KVM: x86: Trace all APICv inhibit changes and capture overall status") Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com> Message-Id: <20220420013732.3308816-2-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-04-21KVM: Add helpers to wrap vcpu->srcu_idx and yell if it's abusedSean Christopherson
Add wrappers to acquire/release KVM's SRCU lock when stashing the index in vcpu->src_idx, along with rudimentary detection of illegal usage, e.g. re-acquiring SRCU and thus overwriting vcpu->src_idx. Because the SRCU index is (currently) either 0 or 1, illegal nesting bugs can go unnoticed for quite some time and only cause problems when the nested lock happens to get a different index. Wrap the WARNs in PROVE_RCU=y, and make them ONCE, otherwise KVM will likely yell so loudly that it will bring the kernel to its knees. Signed-off-by: Sean Christopherson <seanjc@google.com> Tested-by: Fabiano Rosas <farosas@linux.ibm.com> Message-Id: <20220415004343.2203171-4-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-04-21KVM: x86: Don't re-acquire SRCU lock in complete_emulated_io()Sean Christopherson
Don't re-acquire SRCU in complete_emulated_io() now that KVM acquires the lock in kvm_arch_vcpu_ioctl_run(). More importantly, don't overwrite vcpu->srcu_idx. If the index acquired by complete_emulated_io() differs from the one acquired by kvm_arch_vcpu_ioctl_run(), KVM will effectively leak a lock and hang if/when synchronize_srcu() is invoked for the relevant grace period. Fixes: 8d25b7beca7e ("KVM: x86: pull kvm->srcu read-side to kvm_arch_vcpu_ioctl_run") Cc: stable@vger.kernel.org Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com> Message-Id: <20220415004343.2203171-2-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-04-13KVM: x86: Bail to userspace if emulation of atomic user access faultsSean Christopherson
Exit to userspace when emulating an atomic guest access if the CMPXCHG on the userspace address faults. Emulating the access as a write and thus likely treating it as emulated MMIO is wrong, as KVM has already confirmed there is a valid, writable memslot. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20220202004945.2540433-6-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-04-13KVM: x86: Use __try_cmpxchg_user() to emulate atomic accessesSean Christopherson
Use the recently introduce __try_cmpxchg_user() to emulate atomic guest accesses via the associated userspace address instead of mapping the backing pfn into kernel address space. Using kvm_vcpu_map() is unsafe as it does not coordinate with KVM's mmu_notifier to ensure the hva=>pfn translation isn't changed/unmapped in the memremap() path, i.e. when there's no struct page and thus no elevated refcount. Fixes: 42e35f8072c3 ("KVM/X86: Use kvm_vcpu_map in emulator_cmpxchg_emulated") Cc: stable@vger.kernel.org Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20220202004945.2540433-5-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-04-13KVM: x86: Move .pmu_ops to kvm_x86_init_ops and tag as __initdataLike Xu
The pmu_ops should be moved to kvm_x86_init_ops and tagged as __initdata. That'll save those precious few bytes, and more importantly make the original ops unreachable, i.e. make it harder to sneak in post-init modification bugs. Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Like Xu <likexu@tencent.com> Reviewed-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20220329235054.3534728-4-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-04-13KVM: x86: Copy kvm_pmu_ops by value to eliminate layer of indirectionLike Xu
Replace the kvm_pmu_ops pointer in common x86 with an instance of the struct to save one pointer dereference when invoking functions. Copy the struct by value to set the ops during kvm_init(). Signed-off-by: Like Xu <likexu@tencent.com> [sean: Move pmc_is_enabled(), make kvm_pmu_ops static] Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20220329235054.3534728-3-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-04-13KVM: x86: Move kvm_ops_static_call_update() to x86.cLike Xu
The kvm_ops_static_call_update() is defined in kvm_host.h. That's completely unnecessary, it should have exactly one caller, kvm_arch_hardware_setup(). Move the helper to x86.c and have it do the actual memcpy() of the ops in addition to the static call updates. This will also allow for cleanly giving kvm_pmu_ops static_call treatment. Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Like Xu <likexu@tencent.com> [sean: Move memcpy() into the helper and rename accordingly] Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20220329235054.3534728-2-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-04-13Merge branch 'kvm-older-features' into HEADPaolo Bonzini
Merge branch for features that did not make it into 5.18: * New ioctls to get/set TSC frequency for a whole VM * Allow userspace to opt out of hypercall patching Nested virtualization improvements for AMD: * Support for "nested nested" optimizations (nested vVMLOAD/VMSAVE, nested vGIF) * Allow AVIC to co-exist with a nested guest running * Fixes for LBR virtualizations when a nested guest is running, and nested LBR virtualization support * PAUSE filtering for nested hypervisors Guest support: * Decoupling of vcpu_is_preempted from PV spinlocks Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-04-11KVM: x86: hyper-v: Avoid writing to TSC page without an active vCPUVitaly Kuznetsov
The following WARN is triggered from kvm_vm_ioctl_set_clock(): WARNING: CPU: 10 PID: 579353 at arch/x86/kvm/../../../virt/kvm/kvm_main.c:3161 mark_page_dirty_in_slot+0x6c/0x80 [kvm] ... CPU: 10 PID: 579353 Comm: qemu-system-x86 Tainted: G W O 5.16.0.stable #20 Hardware name: LENOVO 20UF001CUS/20UF001CUS, BIOS R1CET65W(1.34 ) 06/17/2021 RIP: 0010:mark_page_dirty_in_slot+0x6c/0x80 [kvm] ... Call Trace: <TASK> ? kvm_write_guest+0x114/0x120 [kvm] kvm_hv_invalidate_tsc_page+0x9e/0xf0 [kvm] kvm_arch_vm_ioctl+0xa26/0xc50 [kvm] ? schedule+0x4e/0xc0 ? __cond_resched+0x1a/0x50 ? futex_wait+0x166/0x250 ? __send_signal+0x1f1/0x3d0 kvm_vm_ioctl+0x747/0xda0 [kvm] ... The WARN was introduced by commit 03c0304a86bc ("KVM: Warn if mark_page_dirty() is called without an active vCPU") but the change seems to be correct (unlike Hyper-V TSC page update mechanism). In fact, there's no real need to actually write to guest memory to invalidate TSC page, this can be done by the first vCPU which goes through kvm_guest_time_update(). Reported-by: Maxim Levitsky <mlevitsk@redhat.com> Reported-by: Naresh Kamboju <naresh.kamboju@linaro.org> Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Message-Id: <20220407201013.963226-1-vkuznets@redhat.com>
2022-04-05KVM: x86/mmu: Resolve nx_huge_pages when kvm.ko is loadedSean Christopherson
Resolve nx_huge_pages to true/false when kvm.ko is loaded, leaving it as -1 is technically undefined behavior when its value is read out by param_get_bool(), as boolean values are supposed to be '0' or '1'. Alternatively, KVM could define a custom getter for the param, but the auto value doesn't depend on the vendor module in any way, and printing "auto" would be unnecessarily unfriendly to the user. In addition to fixing the undefined behavior, resolving the auto value also fixes the scenario where the auto value resolves to N and no vendor module is loaded. Previously, -1 would result in Y being printed even though KVM would ultimately disable the mitigation. Rename the existing MMU module init/exit helpers to clarify that they're invoked with respect to the vendor module, and add comments to document why KVM has two separate "module init" flows. ========================================================================= UBSAN: invalid-load in kernel/params.c:320:33 load of value 255 is not a valid value for type '_Bool' CPU: 6 PID: 892 Comm: tail Not tainted 5.17.0-rc3+ #799 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015 Call Trace: <TASK> dump_stack_lvl+0x34/0x44 ubsan_epilogue+0x5/0x40 __ubsan_handle_load_invalid_value.cold+0x43/0x48 param_get_bool.cold+0xf/0x14 param_attr_show+0x55/0x80 module_attr_show+0x1c/0x30 sysfs_kf_seq_show+0x93/0xc0 seq_read_iter+0x11c/0x450 new_sync_read+0x11b/0x1a0 vfs_read+0xf0/0x190 ksys_read+0x5f/0xe0 do_syscall_64+0x3b/0xc0 entry_SYSCALL_64_after_hwframe+0x44/0xae </TASK> ========================================================================= Fixes: b8e8c8303ff2 ("kvm: mmu: ITLB_MULTIHIT mitigation") Cc: stable@vger.kernel.org Reported-by: Bruno Goncalves <bgoncalv@redhat.com> Reported-by: Jan Stancek <jstancek@redhat.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20220331221359.3912754-1-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-04-02Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvmLinus Torvalds
Pull kvm fixes from Paolo Bonzini: - Only do MSR filtering for MSRs accessed by rdmsr/wrmsr - Documentation improvements - Prevent module exit until all VMs are freed - PMU Virtualization fixes - Fix for kvm_irq_delivery_to_apic_fast() NULL-pointer dereferences - Other miscellaneous bugfixes * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (42 commits) KVM: x86: fix sending PV IPI KVM: x86/mmu: do compare-and-exchange of gPTE via the user address KVM: x86: Remove redundant vm_entry_controls_clearbit() call KVM: x86: cleanup enter_rmode() KVM: x86: SVM: fix tsc scaling when the host doesn't support it kvm: x86: SVM: remove unused defines KVM: x86: SVM: move tsc ratio definitions to svm.h KVM: x86: SVM: fix avic spec based definitions again KVM: MIPS: remove reference to trap&emulate virtualization KVM: x86: document limitations of MSR filtering KVM: x86: Only do MSR filtering when access MSR by rdmsr/wrmsr KVM: x86/emulator: Emulate RDPID only if it is enabled in guest KVM: x86/pmu: Fix and isolate TSX-specific performance event logic KVM: x86: mmu: trace kvm_mmu_set_spte after the new SPTE was set KVM: x86/svm: Clear reserved bits written to PerfEvtSeln MSRs KVM: x86: Trace all APICv inhibit changes and capture overall status KVM: x86: Add wrappers for setting/clearing APICv inhibits KVM: x86: Make APICv inhibit reasons an enum and cleanup naming KVM: X86: Handle implicit supervisor access with SMAP KVM: X86: Rename variable smap to not_smap in permission_fault() ...
2022-04-02KVM: x86: optimize PKU branching in kvm_load_{guest|host}_xsave_stateJon Kohler
kvm_load_{guest|host}_xsave_state handles xsave on vm entry and exit, part of which is managing memory protection key state. The latest arch.pkru is updated with a rdpkru, and if that doesn't match the base host_pkru (which about 70% of the time), we issue a __write_pkru. To improve performance, implement the following optimizations: 1. Reorder if conditions prior to wrpkru in both kvm_load_{guest|host}_xsave_state. Flip the ordering of the || condition so that XFEATURE_MASK_PKRU is checked first, which when instrumented in our environment appeared to be always true and less overall work than kvm_read_cr4_bits. For kvm_load_guest_xsave_state, hoist arch.pkru != host_pkru ahead one position. When instrumented, I saw this be true roughly ~70% of the time vs the other conditions which were almost always true. With this change, we will avoid 3rd condition check ~30% of the time. 2. Wrap PKU sections with CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS, as if the user compiles out this feature, we should not have these branches at all. Signed-off-by: Jon Kohler <jon@nutanix.com> Message-Id: <20220324004439.6709-1-jon@nutanix.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-04-02KVM: x86: allow per cpu apicv inhibit reasonsMaxim Levitsky
Add optional callback .vcpu_get_apicv_inhibit_reasons returning extra inhibit reasons that prevent APICv from working on this vCPU. Suggested-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com> Message-Id: <20220322174050.241850-6-mlevitsk@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-04-02KVM: x86: Don't snapshot "max" TSC if host TSC is constantSean Christopherson
Don't snapshot tsc_khz into max_tsc_khz during KVM initialization if the host TSC is constant, in which case the actual TSC frequency will never change and thus capturing the "max" TSC during initialization is unnecessary, KVM can simply use tsc_khz during VM creation. On CPUs with constant TSC, but not a hardware-specified TSC frequency, snapshotting max_tsc_khz and using that to set a VM's default TSC frequency can lead to KVM thinking it needs to manually scale the guest's TSC if refining the TSC completes after KVM snapshots tsc_khz. The actual frequency never changes, only the kernel's calculation of what that frequency is changes. On systems without hardware TSC scaling, this either puts KVM into "always catchup" mode (extremely inefficient), or prevents creating VMs altogether. Ideally, KVM would not be able to race with TSC refinement, or would have a hook into tsc_refine_calibration_work() to get an alert when refinement is complete. Avoiding the race altogether isn't practical as refinement takes a relative eternity; it's deliberately put on a work queue outside of the normal boot sequence to avoid unnecessarily delaying boot. Adding a hook is doable, but somewhat gross due to KVM's ability to be built as a module. And if the TSC is constant, which is likely the case for every VMX/SVM-capable CPU produced in the last decade, the race can be hit if and only if userspace is able to create a VM before TSC refinement completes; refinement is slow, but not that slow. For now, punt on a proper fix, as not taking a snapshot can help some uses cases and not taking a snapshot is arguably correct irrespective of the race with refinement. [ dwmw2: Rebase on top of KVM-wide default_tsc_khz to ensure that all vCPUs get the same frequency even if we hit the race. ] Cc: Suleiman Souhlal <suleiman@google.com> Cc: Anton Romanov <romanton@google.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: David Woodhouse <dwmw@amazon.co.uk> Message-Id: <20220225145304.36166-3-dwmw2@infradead.org> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-04-02KVM: x86: Accept KVM_[GS]ET_TSC_KHZ as a VM ioctl.David Woodhouse
This sets the default TSC frequency for subsequently created vCPUs. Signed-off-by: David Woodhouse <dwmw@amazon.co.uk> Message-Id: <20220225145304.36166-2-dwmw2@infradead.org> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-04-02KVM: x86/xen: Advertise and document KVM_XEN_HVM_CONFIG_EVTCHN_SENDDavid Woodhouse
At the end of the patch series adding this batch of event channel acceleration features, finally add the feature bit which advertises them and document it all. For SCHEDOP_poll we need to wake a polling vCPU when a given port is triggered, even when it's masked — and we want to implement that in the kernel, for efficiency. So we want the kernel to know that it has sole ownership of event channel delivery. Thus, we allow userspace to make the 'promise' by setting the corresponding feature bit in its KVM_XEN_HVM_CONFIG call. As we implement SCHEDOP_poll bypass later, we will do so only if that promise has been made by userspace. Signed-off-by: David Woodhouse <dwmw@amazon.co.uk> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-Id: <20220303154127.202856-16-dwmw2@infradead.org> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2022-04-02KVM: x86/xen: Add KVM_XEN_VCPU_ATTR_TYPE_VCPU_IDDavid Woodhouse
In order to intercept hypercalls such as VCPUOP_set_singleshot_timer, we need to be aware of the Xen CPU numbering. This looks a lot like the Hyper-V handling of vpidx, for obvious reasons. Signed-off-by: David Woodhouse <dwmw@amazon.co.uk> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-Id: <20220303154127.202856-12-dwmw2@infradead.org> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>