summaryrefslogtreecommitdiff
path: root/arch/x86/kvm
AgeCommit message (Collapse)Author
2019-09-11kvm: Nested KVM MMUs need PAE root tooJiří Paleček
On AMD processors, in PAE 32bit mode, nested KVM instances don't work. The L0 host get a kernel OOPS, which is related to arch.mmu->pae_root being NULL. The reason for this is that when setting up nested KVM instance, arch.mmu is set to &arch.guest_mmu (while normally, it would be &arch.root_mmu). However, the initialization and allocation of pae_root only creates it in root_mmu. KVM code (ie. in mmu_alloc_shadow_roots) then accesses arch.mmu->pae_root, which is the unallocated arch.guest_mmu->pae_root. This fix just allocates (and frees) pae_root in both guest_mmu and root_mmu (and also lm_root if it was allocated). The allocation is subject to previous restrictions ie. it won't allocate anything on 64-bit and AFAIK not on Intel. Fixes: https://bugzilla.kernel.org/show_bug.cgi?id=203923 Fixes: 14c07ad89f4d ("x86/kvm/mmu: introduce guest_mmu") Signed-off-by: Jiri Palecek <jpalecek@web.de> Tested-by: Jiri Palecek <jpalecek@web.de> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-09-11KVM: x86: set ctxt->have_exception in x86_decode_insn()Jan Dakinevich
x86_emulate_instruction() takes into account ctxt->have_exception flag during instruction decoding, but in practice this flag is never set in x86_decode_insn(). Fixes: 6ea6e84309ca ("KVM: x86: inject exceptions produced by x86_decode_insn") Cc: stable@vger.kernel.org Cc: Denis Lunev <den@virtuozzo.com> Cc: Roman Kagan <rkagan@virtuozzo.com> Cc: Denis Plotnikov <dplotnikov@virtuozzo.com> Signed-off-by: Jan Dakinevich <jan.dakinevich@virtuozzo.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-09-11KVM: x86: always stop emulation on page faultJan Dakinevich
inject_emulated_exception() returns true if and only if nested page fault happens. However, page fault can come from guest page tables walk, either nested or not nested. In both cases we should stop an attempt to read under RIP and give guest to step over its own page fault handler. This is also visible when an emulated instruction causes a #GP fault and the VMware backdoor is enabled. To handle the VMware backdoor, KVM intercepts #GP faults; with only the next patch applied, x86_emulate_instruction() injects a #GP but returns EMULATE_FAIL instead of EMULATE_DONE. EMULATE_FAIL causes handle_exception_nmi() (or gp_interception() for SVM) to re-inject the original #GP because it thinks emulation failed due to a non-VMware opcode. This patch prevents the issue as x86_emulate_instruction() will return EMULATE_DONE after injecting the #GP. Fixes: 6ea6e84309ca ("KVM: x86: inject exceptions produced by x86_decode_insn") Cc: stable@vger.kernel.org Cc: Denis Lunev <den@virtuozzo.com> Cc: Roman Kagan <rkagan@virtuozzo.com> Cc: Denis Plotnikov <dplotnikov@virtuozzo.com> Signed-off-by: Jan Dakinevich <jan.dakinevich@virtuozzo.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-09-11KVM: nVMX: trace nested VM-Enter failures detected by H/WSean Christopherson
Use the recently added tracepoint for logging nested VM-Enter failures instead of spamming the kernel log when hardware detects a consistency check failure. Take the opportunity to print the name of the error code instead of dumping the raw hex number, but limit the symbol table to error codes that can reasonably be encountered by KVM. Add an equivalent tracepoint in nested_vmx_check_vmentry_hw(), e.g. so that tracing of "invalid control field" errors isn't suppressed when nested early checks are enabled. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-09-11KVM: nVMX: add tracepoint for failed nested VM-EnterSean Christopherson
Debugging a failed VM-Enter is often like searching for a needle in a haystack, e.g. there are over 80 consistency checks that funnel into the "invalid control field" error code. One way to expedite debug is to run the buggy code as an L1 guest under KVM (and pray that the failing check is detected by KVM). However, extracting useful debug information out of L0 KVM requires attaching a debugger to KVM and/or modifying the source, e.g. to log which check is failing. Make life a little less painful for VMM developers and add a tracepoint for failed VM-Enter consistency checks. Ideally the tracepoint would capture both what check failed and precisely why it failed, but logging why a checked failed is difficult to do in a generic tracepoint without resorting to invasive techniques, e.g. generating a custom string on failure. That being said, for the vast majority of VM-Enter failures the most difficult step is figuring out exactly what to look at, e.g. figuring out which bit was incorrectly set in a control field is usually not too painful once the guilty field as been identified. To reach a happy medium between precision and ease of use, simply log the code that detected a failed check, using a macro to execute the check and log the trace event on failure. This approach enables tracing arbitrary code, e.g. it's not limited to function calls or specific formats of checks, and the changes to the existing code are minimally invasive. A macro with a two-character name is desirable as usage of the macro doesn't result in overly long lines or confusing alignment, while still retaining some amount of readability. I.e. a one-character name is a little too terse, and a three-character name results in the contents being passed to the macro aligning with an indented line when the macro is used an in if-statement, e.g.: if (VCC(nested_vmx_check_long_line_one(...) && nested_vmx_check_long_line_two(...))) return -EINVAL; And that is the story of how the CC(), a.k.a. Consistency Check, macro got its name. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-09-11x86: KVM: svm: Fix a check in nested_svm_vmrun()Dan Carpenter
We refactored this code a bit and accidentally deleted the "-" character from "-EINVAL". The kvm_vcpu_map() function never returns positive EINVAL. Fixes: c8e16b78c614 ("x86: KVM: svm: eliminate hardcoded RIP advancement from vmrun_interception()") Cc: stable@vger.kernel.org Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com> Reviewed-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-09-11KVM: x86: Return to userspace with internal error on unexpected exit reasonLiran Alon
Receiving an unexpected exit reason from hardware should be considered as a severe bug in KVM. Therefore, instead of just injecting #UD to guest and ignore it, exit to userspace on internal error so that it could handle it properly (probably by terminating guest). In addition, prefer to use vcpu_unimpl() instead of WARN_ONCE() as handling unexpected exit reason should be a rare unexpected event (that was expected to never happen) and we prefer to print a message on it every time it occurs to guest. Furthermore, dump VMCS/VMCB to dmesg to assist diagnosing such cases. Reviewed-by: Mihai Carabas <mihai.carabas@oracle.com> Reviewed-by: Nikita Leshenko <nikita.leshchenko@oracle.com> Reviewed-by: Joao Martins <joao.m.martins@oracle.com> Signed-off-by: Liran Alon <liran.alon@oracle.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-09-10KVM: x86: Add kvm_emulate_{rd,wr}msr() to consolidate VXM/SVM codeSean Christopherson
Move RDMSR and WRMSR emulation into common x86 code to consolidate nearly identical SVM and VMX code. Note, consolidating RDMSR introduces an extra indirect call, i.e. retpoline, due to reaching {svm,vmx}_get_msr() via kvm_x86_ops, but a guest kernel likely has bigger problems if increasing the latency of RDMSR VM-Exits by ~70 cycles has a measurable impact on overall VM performance. E.g. the only recurring RDMSR VM-Exits (after booting) on my system running Linux 5.2 in the guest are for MSR_IA32_TSC_ADJUST via arch_cpu_idle_enter(). No functional change intended. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-09-10KVM: x86: Refactor up kvm_{g,s}et_msr() to simplify callersSean Christopherson
Refactor the top-level MSR accessors to take/return the index and value directly instead of requiring the caller to dump them into a msr_data struct. No functional change intended. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-09-10KVM: X86: Tune PLE Window tracepointPeter Xu
The PLE window tracepoint triggers even if the window is not changed, and the wording can be a bit confusing too. One example line: kvm_ple_window: vcpu 0: ple_window 4096 (shrink 4096) It easily let people think of "the window now is 4096 which is shrinked", but the truth is the value actually didn't change (4096). Let's only dump this message if the value really changed, and we make the message even simpler like: kvm_ple_window: vcpu 4 old 4096 new 8192 (growed) Signed-off-by: Peter Xu <peterx@redhat.com> Reviewed-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-09-10KVM: VMX: Change ple_window type to unsigned intPeter Xu
The VMX ple_window is 32 bits wide, so logically it can overflow with an int. The module parameter is declared as unsigned int which is good, however the dynamic variable is not. Switching all the ple_window references to use unsigned int. The tracepoint changes will also affect SVM, but SVM is using an even smaller width (16 bits) so it's always fine. Suggested-by: Sean Christopherson <sean.j.christopherson@intel.com> Reviewed-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Peter Xu <peterx@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-09-10KVM: X86: Remove tailing newline for tracepointsPeter Xu
It's done by TP_printk() already. Reviewed-by: Krish Sadhukhan <krish.sadhukhan@oracle.com> Reviewed-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Peter Xu <peterx@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-09-10KVM: X86: Trace vcpu_id for vmexitPeter Xu
Tracing the ID helps to pair vmenters and vmexits for guests with multiple vCPUs. Reviewed-by: Krish Sadhukhan <krish.sadhukhan@oracle.com> Reviewed-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Peter Xu <peterx@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-09-10KVM: x86: Manually calculate reserved bits when loading PDPTRSSean Christopherson
Manually generate the PDPTR reserved bit mask when explicitly loading PDPTRs. The reserved bits that are being tracked by the MMU reflect the current paging mode, which is unlikely to be PAE paging in the vast majority of flows that use load_pdptrs(), e.g. CR0 and CR4 emulation, __set_sregs(), etc... This can cause KVM to incorrectly signal a bad PDPTR, or more likely, miss a reserved bit check and subsequently fail a VM-Enter due to a bad VMCS.GUEST_PDPTR. Add a one off helper to generate the reserved bits instead of sharing code across the MMU's calculations and the PDPTR emulation. The PDPTR reserved bits are basically set in stone, and pushing a helper into the MMU's calculation adds unnecessary complexity without improving readability. Oppurtunistically fix/update the comment for load_pdptrs(). Note, the buggy commit also introduced a deliberate functional change, "Also remove bit 5-6 from rsvd_bits_mask per latest SDM.", which was effectively (and correctly) reverted by commit cd9ae5fe47df ("KVM: x86: Fix page-tables reserved bits"). A bit of SDM archaeology shows that the SDM from late 2008 had a bug (likely a copy+paste error) where it listed bits 6:5 as AVL and A for PDPTEs used for 4k entries but reserved for 2mb entries. I.e. the SDM contradicted itself, and bits 6:5 are and always have been reserved. Fixes: 20c466b56168d ("KVM: Use rsvd_bits_mask in load_pdptrs()") Cc: stable@vger.kernel.org Cc: Nadav Amit <nadav.amit@gmail.com> Reported-by: Doug Reiland <doug.reiland@intel.com> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Reviewed-by: Peter Xu <peterx@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-09-10KVM: x86: Disable posted interrupts for non-standard IRQs delivery modesAlexander Graf
We can easily route hardware interrupts directly into VM context when they target the "Fixed" or "LowPriority" delivery modes. However, on modes such as "SMI" or "Init", we need to go via KVM code to actually put the vCPU into a different mode of operation, so we can not post the interrupt Add code in the VMX and SVM PI logic to explicitly refuse to establish posted mappings for advanced IRQ deliver modes. This reflects the logic in __apic_accept_irq() which also only ever passes Fixed and LowPriority interrupts as posted interrupts into the guest. This fixes a bug I have with code which configures real hardware to inject virtual SMIs into my guest. Signed-off-by: Alexander Graf <graf@amazon.com> Reviewed-by: Liran Alon <liran.alon@oracle.com> Reviewed-by: Sean Christopherson <sean.j.christopherson@intel.com> Reviewed-by: Wanpeng Li <wanpengli@tencent.com> Cc: stable@vger.kernel.org Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-08-27KVM: x86: Don't update RIP or do single-step on faulting emulationSean Christopherson
Don't advance RIP or inject a single-step #DB if emulation signals a fault. This logic applies to all state updates that are conditional on clean retirement of the emulation instruction, e.g. updating RFLAGS was previously handled by commit 38827dbd3fb85 ("KVM: x86: Do not update EFLAGS on faulting emulation"). Not advancing RIP is likely a nop, i.e. ctxt->eip isn't updated with ctxt->_eip until emulation "retires" anyways. Skipping #DB injection fixes a bug reported by Andy Lutomirski where a #UD on SYSCALL due to invalid state with EFLAGS.TF=1 would loop indefinitely due to emulation overwriting the #UD with #DB and thus restarting the bad SYSCALL over and over. Cc: Nadav Amit <nadav.amit@gmail.com> Cc: stable@vger.kernel.org Reported-by: Andy Lutomirski <luto@kernel.org> Fixes: 663f4c61b803 ("KVM: x86: handle singlestep during emulation") Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2019-08-27KVM: x86: hyper-v: don't crash on KVM_GET_SUPPORTED_HV_CPUID when ↵Vitaly Kuznetsov
kvm_intel.nested is disabled If kvm_intel is loaded with nested=0 parameter an attempt to perform KVM_GET_SUPPORTED_HV_CPUID results in OOPS as nested_get_evmcs_version hook in kvm_x86_ops is NULL (we assign it in nested_vmx_hardware_setup() and this only happens in case nested is enabled). Check that kvm_x86_ops->nested_get_evmcs_version is not NULL before calling it. With this, we can remove the stub from svm as it is no longer needed. Cc: <stable@vger.kernel.org> Fixes: e2e871ab2f02 ("x86/kvm/hyper-v: Introduce nested_get_evmcs_version() helper") Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Reviewed-by: Jim Mattson <jmattson@google.com> Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2019-08-22KVM: VMX: Fix and tweak the comments for VM-EnterSean Christopherson
Fix an incorrect/stale comment regarding the vmx_vcpu pointer, as guest registers are now loaded using a direct pointer to the start of the register array. Opportunistically add a comment to document why the vmx_vcpu pointer is needed, its consumption via 'call vmx_update_host_rsp' is rather subtle. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Reviewed-by: Jim Mattson <jmattson@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-08-22KVM: Assert that struct kvm_vcpu is always as offset zeroSean Christopherson
KVM implementations that wrap struct kvm_vcpu with a vendor specific struct, e.g. struct vcpu_vmx, must place the vcpu member at offset 0, otherwise the usercopy region intended to encompass struct kvm_vcpu_arch will instead overlap random chunks of the vendor specific struct. E.g. padding a large number of bytes before struct kvm_vcpu triggers a usercopy warn when running with CONFIG_HARDENED_USERCOPY=y. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Reviewed-by: Jim Mattson <jmattson@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-08-22KVM: X86: Add pv tlb shootdown tracepointWanpeng Li
Add pv tlb shootdown tracepoint. Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Radim Krčmář <rkrcmar@redhat.com> Signed-off-by: Wanpeng Li <wanpengli@tencent.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-08-22KVM: x86: Unconditionally call x86 ops that are always implementedSean Christopherson
Remove a few stale checks for non-NULL ops now that the ops in question are implemented by both VMX and SVM. Note, this is **not** stable material, the Fixes tags are there purely to show when a particular op was first supported by both VMX and SVM. Fixes: 74f169090b6f ("kvm/svm: Setup MCG_CAP on AMD properly") Fixes: b31c114b82b2 ("KVM: X86: Provide a capability to disable PAUSE intercepts") Fixes: 411b44ba80ab ("svm: Implements update_pi_irte hook to setup posted interrupt") Cc: Krish Sadhukhan <krish.sadhukhan@oracle.com> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-08-22KVM: x86/mmu: Consolidate "is MMIO SPTE" codeSean Christopherson
Replace the open-coded "is MMIO SPTE" checks in the MMU warnings related to software-based access/dirty tracking to make the code slightly more self-documenting. No functional change intended. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-08-22KVM: x86/mmu: Add explicit access mask for MMIO SPTEsSean Christopherson
When shadow paging is enabled, KVM tracks the allowed access type for MMIO SPTEs so that it can do a permission check on a MMIO GVA cache hit without having to walk the guest's page tables. The tracking is done by retaining the WRITE and USER bits of the access when inserting the MMIO SPTE (read access is implicitly allowed), which allows the MMIO page fault handler to retrieve and cache the WRITE/USER bits from the SPTE. Unfortunately for EPT, the mask used to retain the WRITE/USER bits is hardcoded using the x86 paging versions of the bits. This funkiness happens to work because KVM uses a completely different mask/value for MMIO SPTEs when EPT is enabled, and the EPT mask/value just happens to overlap exactly with the x86 WRITE/USER bits[*]. Explicitly define the access mask for MMIO SPTEs to accurately reflect that EPT does not want to incorporate any access bits into the SPTE, and so that KVM isn't subtly relying on EPT's WX bits always being set in MMIO SPTEs, e.g. attempting to use other bits for experimentation breaks horribly. Note, vcpu_match_mmio_gva() explicits prevents matching GVA==0, and all TDP flows explicit set mmio_gva to 0, i.e. zeroing vcpu->arch.access for EPT has no (known) functional impact. [*] Using WX to generate EPT misconfigurations (equivalent to reserved bit page fault) ensures KVM can employ its MMIO page fault tricks even platforms without reserved address bits. Fixes: ce88decffd17 ("KVM: MMU: mmio page fault support") Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-08-22KVM: x86: Rename access permissions cache member in struct kvm_vcpu_archSean Christopherson
Rename "access" to "mmio_access" to match the other MMIO cache members and to make it more obvious that it's tracking the access permissions for the MMIO cache. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-08-22x86: KVM: svm: eliminate hardcoded RIP advancement from vmrun_interception()Vitaly Kuznetsov
Just like we do with other intercepts, in vmrun_interception() we should be doing kvm_skip_emulated_instruction() and not just RIP += 3. Also, it is wrong to increment RIP before nested_svm_vmrun() as it can result in kvm_inject_gp(). We can't call kvm_skip_emulated_instruction() after nested_svm_vmrun() so move it inside. Suggested-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Reviewed-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-08-22x86: KVM: svm: eliminate weird goto from vmrun_interception()Vitaly Kuznetsov
Regardless of whether or not nested_svm_vmrun_msrpm() fails, we return 1 from vmrun_interception() so there's no point in doing goto. Also, nested_svm_vmrun_msrpm() call can be made from nested_svm_vmrun() where other nested launch issues are handled. nested_svm_vmrun() returns a bool, however, its result is ignored in vmrun_interception() as we always return '1'. As a preparatory change to putting kvm_skip_emulated_instruction() inside nested_svm_vmrun() make nested_svm_vmrun() return an int (always '1' for now). Suggested-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Reviewed-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-08-22x86: KVM: svm: remove hardcoded instruction length from interceptsVitaly Kuznetsov
Various intercepts hard-code the respective instruction lengths to optimize skip_emulated_instruction(): when next_rip is pre-set we skip kvm_emulate_instruction(vcpu, EMULTYPE_SKIP). The optimization is, however, incorrect: different (redundant) prefixes could be used to enlarge the instruction. We can't really avoid decoding. svm->next_rip is not used when CPU supports 'nrips' (X86_FEATURE_NRIPS) feature: next RIP is provided in VMCB. The feature is not really new (Opteron G3s had it already) and the change should have zero affect. Remove manual svm->next_rip setting with hard-coded instruction lengths. The only case where we now use svm->next_rip is EXIT_IOIO: the instruction length is provided to us by hardware. Hardcoded RIP advancement remains in vmrun_interception(), this is going to be taken care of separately. Reported-by: Jim Mattson <jmattson@google.com> Reviewed-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-08-22x86: KVM: add xsetbv to the emulatorVitaly Kuznetsov
To avoid hardcoding xsetbv length to '3' we need to support decoding it in the emulator. Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-08-22x86: KVM: clear interrupt shadow on EMULTYPE_SKIPVitaly Kuznetsov
When doing x86_emulate_instruction(EMULTYPE_SKIP) interrupt shadow has to be cleared if and only if the skipping is successful. There are two immediate issues: - In SVM skip_emulated_instruction() we are not zapping interrupt shadow in case kvm_emulate_instruction(EMULTYPE_SKIP) is used to advance RIP (!nrpip_save). - In VMX handle_ept_misconfig() when running as a nested hypervisor we (static_cpu_has(X86_FEATURE_HYPERVISOR) case) forget to clear interrupt shadow. Note that we intentionally don't handle the case when the skipped instruction is supposed to prolong the interrupt shadow ("MOV/POP SS") as skip-emulation of those instructions should not happen under normal circumstances. Suggested-by: Sean Christopherson <sean.j.christopherson@intel.com> Reviewed-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-08-22x86: kvm: svm: propagate errors from skip_emulated_instruction()Vitaly Kuznetsov
On AMD, kvm_x86_ops->skip_emulated_instruction(vcpu) can, in theory, fail: in !nrips case we call kvm_emulate_instruction(EMULTYPE_SKIP). Currently, we only do printk(KERN_DEBUG) when this happens and this is not ideal. Propagate the error up the stack. On VMX, skip_emulated_instruction() doesn't fail, we have two call sites calling it explicitly: handle_exception_nmi() and handle_task_switch(), we can just ignore the result. On SVM, we also have two explicit call sites: svm_queue_exception() and it seems we don't need to do anything there as we check if RIP was advanced or not. In task_switch_interception(), however, we are better off not proceeding to kvm_task_switch() in case skip_emulated_instruction() failed. Suggested-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-08-22x86: KVM: svm: don't pretend to advance RIP in case wrmsr_interception() ↵Vitaly Kuznetsov
results in #GP svm->next_rip is only used by skip_emulated_instruction() and in case kvm_set_msr() fails we rightfully don't do that. Move svm->next_rip advancement to 'else' branch to avoid creating false impression that it's always advanced (and make it look like rdmsr_interception()). This is a preparatory change to removing hardcoded RIP advancement from instruction intercepts, no functional change. Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Reviewed-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-08-22KVM: x86: Fix x86_decode_insn() return when fetching insn bytes failsSean Christopherson
Jump to the common error handling in x86_decode_insn() if __do_insn_fetch_bytes() fails so that its error code is converted to the appropriate return type. Although the various helpers used by x86_decode_insn() return X86EMUL_* values, x86_decode_insn() itself returns EMULATION_FAILED or EMULATION_OK. This doesn't cause a functional issue as the sole caller, x86_emulate_instruction(), currently only cares about success vs. failure, and success is indicated by '0' for both types (X86EMUL_CONTINUE and EMULATION_OK). Fixes: 285ca9e948fa ("KVM: emulate: speed up do_insn_fetch") Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-08-22KVM: x86: use Intel speculation bugs and features as derived in generic x86 codePaolo Bonzini
Similar to AMD bits, set the Intel bits from the vendor-independent feature and bug flags, because KVM_GET_SUPPORTED_CPUID does not care about the vendor and they should be set on AMD processors as well. Suggested-by: Jim Mattson <jmattson@google.com> Reviewed-by: Jim Mattson <jmattson@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-08-22KVM: x86: always expose VIRT_SSBD to guestsPaolo Bonzini
Even though it is preferrable to use SPEC_CTRL (represented by X86_FEATURE_AMD_SSBD) instead of VIRT_SPEC, VIRT_SPEC is always supported anyway because otherwise it would be impossible to migrate from old to new CPUs. Make this apparent in the result of KVM_GET_SUPPORTED_CPUID as well. However, we need to hide the bit on Intel processors, so move the setting to svm_set_supported_cpuid. Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Reported-by: Eduardo Habkost <ehabkost@redhat.com> Reviewed-by: Jim Mattson <jmattson@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-08-22KVM: x86: fix reporting of AMD speculation bug CPUID leafPaolo Bonzini
The AMD_* bits have to be set from the vendor-independent feature and bug flags, because KVM_GET_SUPPORTED_CPUID does not care about the vendor and they should be set on Intel processors as well. On top of this, SSBD, STIBP and AMD_SSB_NO bit were not set, and VIRT_SSBD does not have to be added manually because it is a cpufeature that comes directly from the host's CPUID bit. Reviewed-by: Jim Mattson <jmattson@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-08-21Revert "KVM: x86/mmu: Zap only the relevant pages when removing a memslot"Paolo Bonzini
This reverts commit 4e103134b862314dc2f2f18f2fb0ab972adc3f5f. Alex Williamson reported regressions with device assignment with this patch. Even though the bug is probably elsewhere and still latent, this is needed to fix the regression. Fixes: 4e103134b862 ("KVM: x86/mmu: Zap only the relevant pages when removing a memslot", 2019-02-05) Reported-by: Alex Willamson <alex.williamson@redhat.com> Cc: stable@vger.kernel.org Cc: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-08-14KVM: x86: svm: remove redundant assignment of var new_entryMiaohe Lin
new_entry is reassigned a new value next line. So it's redundant and remove it. Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-08-14kvm: x86: skip populating logical dest map if apic is not sw enabledRadim Krcmar
recalculate_apic_map does not santize ldr and it's possible that multiple bits are set. In that case, a previous valid entry can potentially be overwritten by an invalid one. This condition is hit when booting a 32 bit, >8 CPU, RHEL6 guest and then triggering a crash to boot a kdump kernel. This is the sequence of events: 1. Linux boots in bigsmp mode and enables PhysFlat, however, it still writes to the LDR which probably will never be used. 2. However, when booting into kdump, the stale LDR values remain as they are not cleared by the guest and there isn't a apic reset. 3. kdump boots with 1 cpu, and uses Logical Destination Mode but the logical map has been overwritten and points to an inactive vcpu. Signed-off-by: Radim Krcmar <rkrcmar@redhat.com> Signed-off-by: Bandan Das <bsd@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-08-09Merge tag 'kvmarm-fixes-for-5.3' of ↵Paolo Bonzini
git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD KVM/arm fixes for 5.3 - A bunch of switch/case fall-through annotation, fixing one actual bug - Fix PMU reset bug - Add missing exception class debug strings
2019-08-05KVM: no need to check return value of debugfs_create functionsGreg KH
When calling debugfs functions, there is no need to ever check the return value. The function can work or not, but the code logic should never do something different based on this. Also, when doing this, change kvm_arch_create_vcpu_debugfs() to return void instead of an integer, as we should not care at all about if this function actually does anything or not. Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: "Radim Krčmář" <rkrcmar@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: Borislav Petkov <bp@alien8.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: <x86@kernel.org> Cc: <kvm@vger.kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-08-05KVM: remove kvm_arch_has_vcpu_debugfs()Paolo Bonzini
There is no need for this function as all arches have to implement kvm_arch_create_vcpu_debugfs() no matter what. A #define symbol let us actually simplify the code. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-08-05KVM: Fix leak vCPU's VMCS value into other pCPUWanpeng Li
After commit d73eb57b80b (KVM: Boost vCPUs that are delivering interrupts), a five years old bug is exposed. Running ebizzy benchmark in three 80 vCPUs VMs on one 80 pCPUs Skylake server, a lot of rcu_sched stall warning splatting in the VMs after stress testing: INFO: rcu_sched detected stalls on CPUs/tasks: { 4 41 57 62 77} (detected by 15, t=60004 jiffies, g=899, c=898, q=15073) Call Trace: flush_tlb_mm_range+0x68/0x140 tlb_flush_mmu.part.75+0x37/0xe0 tlb_finish_mmu+0x55/0x60 zap_page_range+0x142/0x190 SyS_madvise+0x3cd/0x9c0 system_call_fastpath+0x1c/0x21 swait_active() sustains to be true before finish_swait() is called in kvm_vcpu_block(), voluntarily preempted vCPUs are taken into account by kvm_vcpu_on_spin() loop greatly increases the probability condition kvm_arch_vcpu_runnable(vcpu) is checked and can be true, when APICv is enabled the yield-candidate vCPU's VMCS RVI field leaks(by vmx_sync_pir_to_irr()) into spinning-on-a-taken-lock vCPU's current VMCS. This patch fixes it by checking conservatively a subset of events. Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Radim Krčmář <rkrcmar@redhat.com> Cc: Christian Borntraeger <borntraeger@de.ibm.com> Cc: Marc Zyngier <Marc.Zyngier@arm.com> Cc: stable@vger.kernel.org Fixes: 98f4a1467 (KVM: add kvm_arch_vcpu_runnable() test to kvm_vcpu_on_spin() loop) Signed-off-by: Wanpeng Li <wanpengli@tencent.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-08-05KVM: LAPIC: Don't need to wakeup vCPU twice afer timer fireWanpeng Li
kvm_set_pending_timer() will take care to wake up the sleeping vCPU which has pending timer, don't need to check this in apic_timer_expired() again. Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Radim Krčmář <rkrcmar@redhat.com> Signed-off-by: Wanpeng Li <wanpengli@tencent.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-08-01KVM: LAPIC: Mark hrtimer to expire in hard interrupt contextSebastian Andrzej Siewior
On PREEMPT_RT enabled kernels unmarked hrtimers are moved into soft interrupt expiry mode by default. While that's not a functional requirement for the KVM local APIC timer emulation, it's a latency issue which can be avoided by marking the timer so hard interrupt context expiry is enforced. No functional change. [ tglx: Split out from larger combo patch. Add changelog. ] Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Paolo Bonzini <pbonzini@redhat.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20190726185753.363363474@linutronix.de
2019-07-24KVM: X86: Boost queue head vCPU to mitigate lock waiter preemptionWanpeng Li
Commit 11752adb (locking/pvqspinlock: Implement hybrid PV queued/unfair locks) introduces hybrid PV queued/unfair locks - queued mode (no starvation) - unfair mode (good performance on not heavily contended lock) The lock waiter goes into the unfair mode especially in VMs with over-commit vCPUs since increaing over-commitment increase the likehood that the queue head vCPU may have been preempted and not actively spinning. However, reschedule queue head vCPU timely to acquire the lock still can get better performance than just depending on lock stealing in over-subscribe scenario. Testing on 80 HT 2 socket Xeon Skylake server, with 80 vCPUs VM 80GB RAM: ebizzy -M vanilla boosting improved 1VM 23520 25040 6% 2VM 8000 13600 70% 3VM 3100 5400 74% The lock holder vCPU yields to the queue head vCPU when unlock, to boost queue head vCPU which is involuntary preemption or the one which is voluntary halt due to fail to acquire the lock after a short spin in the guest. Cc: Waiman Long <longman@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Radim Krčmář <rkrcmar@redhat.com> Signed-off-by: Wanpeng Li <wanpengli@tencent.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-07-24Documentation: move Documentation/virtual to Documentation/virtChristoph Hellwig
Renaming docs seems to be en vogue at the moment, so fix on of the grossly misnamed directories. We usually never use "virtual" as a shortcut for virtualization in the kernel, but always virt, as seen in the virt/ top-level directory. Fix up the documentation to match that. Fixes: ed16648eb5b8 ("Move kvm, uml, and lguest subdirectories under a common "virtual" directory, I.E:") Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-07-22KVM: nVMX: Set cached_vmcs12 and cached_shadow_vmcs12 NULL after freeJan Kiszka
Shall help finding use-after-free bugs earlier. Suggested-by: Liran Alon <liran.alon@oracle.com> Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-07-22KVM: X86: Dynamically allocate user_fpuWanpeng Li
After reverting commit 240c35a3783a (kvm: x86: Use task structs fpu field for user), struct kvm_vcpu is 19456 bytes on my server, PAGE_ALLOC_COSTLY_ORDER(3) is the order at which allocations are deemed costly to service. In serveless scenario, one host can service hundreds/thoudands firecracker/kata-container instances, howerver, new instance will fail to launch after memory is too fragmented to allocate kvm_vcpu struct on host, this was observed in some cloud provider product environments. This patch dynamically allocates user_fpu, kvm_vcpu is 15168 bytes now on my Skylake server. Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Radim Krčmář <rkrcmar@redhat.com> Signed-off-by: Wanpeng Li <wanpengli@tencent.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-07-22KVM: X86: Fix fpu state crash in kvm guestWanpeng Li
The idea before commit 240c35a37 (which has just been reverted) was that we have the following FPU states: userspace (QEMU) guest --------------------------------------------------------------------------- processor vcpu->arch.guest_fpu >>> KVM_RUN: kvm_load_guest_fpu vcpu->arch.user_fpu processor >>> preempt out vcpu->arch.user_fpu current->thread.fpu >>> preempt in vcpu->arch.user_fpu processor >>> back to userspace >>> kvm_put_guest_fpu processor vcpu->arch.guest_fpu --------------------------------------------------------------------------- With the new lazy model we want to get the state back to the processor when schedule in from current->thread.fpu. Reported-by: Thomas Lambertz <mail@thomaslambertz.de> Reported-by: anthony <antdev66@gmail.com> Tested-by: anthony <antdev66@gmail.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Radim Krčmář <rkrcmar@redhat.com> Cc: Thomas Lambertz <mail@thomaslambertz.de> Cc: anthony <antdev66@gmail.com> Cc: stable@vger.kernel.org Fixes: 5f409e20b (x86/fpu: Defer FPU state load until return to userspace) Signed-off-by: Wanpeng Li <wanpengli@tencent.com> [Add a comment in front of the warning. - Paolo] Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-07-22Revert "kvm: x86: Use task structs fpu field for user"Paolo Bonzini
This reverts commit 240c35a3783ab9b3a0afaba0dde7291295680a6b ("kvm: x86: Use task structs fpu field for user", 2018-11-06). The commit is broken and causes QEMU's FPU state to be destroyed when KVM_RUN is preempted. Fixes: 240c35a3783a ("kvm: x86: Use task structs fpu field for user") Cc: stable@vger.kernel.org Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>