summaryrefslogtreecommitdiff
AgeCommit message (Collapse)Author
2017-09-01KVM: PPC: Book3S HV: Fix memory leak in kvm_vm_ioctl_get_htab_fdnixiaoming
We do ctx = kzalloc(sizeof(*ctx), GFP_KERNEL) and then later on call anon_inode_getfd(), but if that fails we don't free ctx, so that memory gets leaked. To fix it, this adds kfree(ctx) in the failure path. Signed-off-by: nixiaoming <nixiaoming@huawei.com> Reviewed-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2017-08-31Merge remote-tracking branch 'remotes/powerpc/topic/ppc-kvm' into kvm-ppc-nextPaul Mackerras
This merges in the 'ppc-kvm' topic branch from the powerpc tree in order to bring in some fixes which touch both powerpc and KVM code. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2017-08-31KVM: PPC: Book3S HV: Report storage key support to userspacePaul Mackerras
This adds information about storage keys to the struct returned by the KVM_PPC_GET_SMMU_INFO ioctl. The new fields replace a pad field, which was zeroed by previous kernel versions. Thus userspace that knows about the new fields will see zeroes when running on an older kernel, indicating that storage keys are not supported. The size of the structure has not changed. The number of keys is hard-coded for the CPUs supported by HV KVM, which is just POWER7, POWER8 and POWER9. Signed-off-by: Paul Mackerras <paulus@ozlabs.org> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2017-08-31KVM: PPC: Book3S HV: Fix case where HDEC is treated as 32-bit on POWER9Paul Mackerras
Commit 2f2724630f7a ("KVM: PPC: Book3S HV: Cope with host using large decrementer mode", 2017-05-22) added code to treat the hypervisor decrementer (HDEC) as a 64-bit value on POWER9 rather than 32-bit. Unfortunately, that commit missed one place where HDEC is treated as a 32-bit value. This fixes it. This bug should not have any user-visible consequences that I can think of, beyond an occasional unnecessary exit to the host kernel. If the hypervisor decrementer has gone negative, then the bottom 32 bits will be negative for about 4 seconds after that, so as long as we get out of the guest within those 4 seconds we won't conclude that the HDEC interrupt is spurious. Reported-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com> Fixes: 2f2724630f7a ("KVM: PPC: Book3S HV: Cope with host using large decrementer mode") Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2017-08-31KVM: PPC: Book3S HV: Fix invalid use of register expressionAndreas Schwab
binutils >= 2.26 now warns about misuse of register expressions in assembler operands that are actually literals. In this instance r0 is being used where a literal 0 should be used. Signed-off-by: Andreas Schwab <schwab@linux-m68k.org> [mpe: Split into separate KVM patch, tweak change log] Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2017-08-31KVM: PPC: Book3S HV: Fix H_REGISTER_VPA VPA size validationNicholas Piggin
KVM currently validates the size of the VPA registered by the client against sizeof(struct lppaca), however we align (and therefore size) that struct to 1kB to avoid crossing a 4kB boundary in the client. PAPR calls for sizes >= 640 bytes to be accepted. Hard code this with a comment. Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2017-08-31KVM: PPC: Book3S HV: Fix setting of storage key in H_ENTERRam Pai
In handling a H_ENTER hypercall, the code in kvmppc_do_h_enter clobbers the high-order two bits of the storage key, which is stored in a split field in the second doubleword of the HPTE. Any storage key number above 7 hence fails to operate correctly. This makes sure we preserve all the bits of the storage key. Acked-by: Balbir Singh <bsingharora@gmail.com> Signed-off-by: Ram Pai <linuxram@us.ibm.com> Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2017-08-31KVM: PPC: e500mc: Fix a NULL dereferenceDan Carpenter
We should set "err = -ENOMEM;", otherwise it means we're returning ERR_PTR(0) which is NULL. It results in a NULL pointer dereference in the caller. Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2017-08-31KVM: PPC: e500: Fix some NULL dereferences on errorDan Carpenter
There are some error paths in kvmppc_core_vcpu_create_e500() where we forget to set the error code. It means that we return ERR_PTR(0) which is NULL and it results in a NULL pointer dereference in the caller. Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
2017-08-29KVM: PPC: Book3S HV: POWER9 does not require secondary thread managementNicholas Piggin
POWER9 CPUs have independent MMU contexts per thread, so KVM does not need to quiesce secondary threads, so the hwthread_req/hwthread_state protocol does not have to be used. So patch it away on POWER9, and patch away the branch from the Linux idle wakeup to kvm_start_guest that is never used. Add a warning and error out of kvmppc_grab_hwthread in case it is ever called on POWER9. This avoids a hwsync in the idle wakeup path on POWER9. Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Acked-by: Paul Mackerras <paulus@ozlabs.org> [mpe: Use WARN(...) instead of WARN_ON()/pr_err(...)] Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-08-25kvm: nVMX: Validate the virtual-APIC address on nested VM-entryJim Mattson
According to the SDM, if the "use TPR shadow" VM-execution control is 1, bits 11:0 of the virtual-APIC address must be 0 and the address should set any bits beyond the processor's physical-address width. Signed-off-by: Jim Mattson <jmattson@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-08-24KVM: nVMX: Fix trying to cancel vmlauch/vmresumeWanpeng Li
------------[ cut here ]------------ WARNING: CPU: 7 PID: 3861 at /home/kernel/ssd/kvm/arch/x86/kvm//vmx.c:11299 nested_vmx_vmexit+0x176e/0x1980 [kvm_intel] CPU: 7 PID: 3861 Comm: qemu-system-x86 Tainted: G W OE 4.13.0-rc4+ #11 RIP: 0010:nested_vmx_vmexit+0x176e/0x1980 [kvm_intel] Call Trace: ? kvm_multiple_exception+0x149/0x170 [kvm] ? handle_emulation_failure+0x79/0x230 [kvm] ? load_vmcs12_host_state+0xa80/0xa80 [kvm_intel] ? check_chain_key+0x137/0x1e0 ? reexecute_instruction.part.168+0x130/0x130 [kvm] nested_vmx_inject_exception_vmexit+0xb7/0x100 [kvm_intel] ? nested_vmx_inject_exception_vmexit+0xb7/0x100 [kvm_intel] vmx_queue_exception+0x197/0x300 [kvm_intel] kvm_arch_vcpu_ioctl_run+0x1b0c/0x2c90 [kvm] ? kvm_arch_vcpu_runnable+0x220/0x220 [kvm] ? preempt_count_sub+0x18/0xc0 ? restart_apic_timer+0x17d/0x300 [kvm] ? kvm_lapic_restart_hv_timer+0x37/0x50 [kvm] ? kvm_arch_vcpu_load+0x1d8/0x350 [kvm] kvm_vcpu_ioctl+0x4e4/0x910 [kvm] ? kvm_vcpu_ioctl+0x4e4/0x910 [kvm] ? kvm_dev_ioctl+0xbe0/0xbe0 [kvm] The flag "nested_run_pending", which can override the decision of which should run next, L1 or L2. nested_run_pending=1 means that we *must* run L2 next, not L1. This is necessary in particular when L1 did a VMLAUNCH of L2 and therefore expects L2 to be run (and perhaps be injected with an event it specified, etc.). Nested_run_pending is especially intended to avoid switching to L1 in the injection decision-point. This can be handled just like the other cases in vmx_check_nested_events, instead of having a special case in vmx_queue_exception. Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Radim Krčmář <rkrcmar@redhat.com> Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-08-24KVM: X86: Fix loss of exception which has not yet been injectedWanpeng Li
vmx_complete_interrupts() assumes that the exception is always injected, so it can be dropped by kvm_clear_exception_queue(). However, an exception cannot be injected immediately if it is: 1) originally destined to a nested guest; 2) trapped to cause a vmexit; 3) happening right after VMLAUNCH/VMRESUME, i.e. when nested_run_pending is true. This patch applies to exceptions the same algorithm that is used for NMIs, replacing exception.reinject with "exception.injected" (equivalent to nmi_injected). exception.pending now represents an exception that is queued and whose side effects (e.g., update RFLAGS.RF or DR7) have not been applied yet. If exception.pending is true, the exception might result in a nested vmexit instead, too (in which case the side effects must not be applied). exception.injected instead represents an exception that is going to be injected into the guest at the next vmentry. Reported-by: Radim Krčmář <rkrcmar@redhat.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Radim Krčmář <rkrcmar@redhat.com> Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-08-24KVM: VMX: use kvm_event_needs_reinjectionWanpeng Li
Use kvm_event_needs_reinjection() encapsulation. Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Radim Krčmář <rkrcmar@redhat.com> Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-08-24KVM: MMU: speedup update_permission_bitmaskPaolo Bonzini
update_permission_bitmask currently does a 128-iteration loop to, essentially, compute a constant array. Computing the 8 bits in parallel reduces it to 16 iterations, and is enough to speed it up substantially because many boolean operations in the inner loop become constants or simplify noticeably. Because update_permission_bitmask is actually the top item in the profile for nested vmexits, this speeds up an L2->L1 vmexit by about ten thousand clock cycles, or up to 30%: before after cpuid 35173 25954 vmcall 35122 27079 inl_from_pmtimer 52635 42675 inl_from_qemu 53604 44599 inl_from_kernel 38498 30798 outl_to_kernel 34508 28816 wr_tsc_adjust_msr 34185 26818 rd_tsc_adjust_msr 37409 27049 mmio-no-eventfd:pci-mem 50563 45276 mmio-wildcard-eventfd:pci-mem 34495 30823 mmio-datamatch-eventfd:pci-mem 35612 31071 portio-no-eventfd:pci-io 44925 40661 portio-wildcard-eventfd:pci-io 29708 27269 portio-datamatch-eventfd:pci-io 31135 27164 (I wrote a small C program to compare the tables for all values of CR0.WP, CR4.SMAP and CR4.SMEP, and they match). Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-08-24KVM: MMU: Expose the LA57 feature to VM.Yu Zhang
This patch exposes 5 level page table feature to the VM. At the same time, the canonical virtual address checking is extended to support both 48-bits and 57-bits address width. Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-08-24KVM: MMU: Add 5 level EPT & Shadow page table support.Yu Zhang
Extends the shadow paging code, so that 5 level shadow page table can be constructed if VM is running in 5 level paging mode. Also extends the ept code, so that 5 level ept table can be constructed if maxphysaddr of VM exceeds 48 bits. Unlike the shadow logic, KVM should still use 4 level ept table for a VM whose physical address width is less than 48 bits, even when the VM is running in 5 level paging mode. Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com> [Unconditionally reset the MMU context in kvm_cpuid_update. Changing MAXPHYADDR invalidates the reserved bit bitmasks. - Paolo] Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-08-24KVM: MMU: Rename PT64_ROOT_LEVEL to PT64_ROOT_4LEVEL.Yu Zhang
Now we have 4 level page table and 5 level page table in 64 bits long mode, let's rename the PT64_ROOT_LEVEL to PT64_ROOT_4LEVEL, then we can use PT64_ROOT_5LEVEL for 5 level page table, it's helpful to make the code more clear. Also PT64_ROOT_MAX_LEVEL is defined as 4, so that we can just redefine it to 5 whenever a replacement is needed for 5 level paging. Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-08-24KVM: MMU: check guest CR3 reserved bits based on its physical address width.Yu Zhang
Currently, KVM uses CR3_L_MODE_RESERVED_BITS to check the reserved bits in CR3. Yet the length of reserved bits in guest CR3 should be based on the physical address width exposed to the VM. This patch changes CR3 check logic to calculate the reserved bits at runtime. Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-08-24KVM: x86: Add return value to kvm_cpuid().Yu Zhang
Return false in kvm_cpuid() when it fails to find the cpuid entry. Also, this routine(and its caller) is optimized with a new argument - check_limit, so that the check_cpuid_limit() fall back can be avoided. Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-08-24kvm: vmx: Raise #UD on unsupported XSAVES/XRSTORSPaolo Bonzini
A guest may not be configured to support XSAVES/XRSTORS, even when the host does. If the guest does not support XSAVES/XRSTORS, clear the secondary execution control so that the processor will raise #UD. Also clear the "allowed-1" bit for XSAVES/XRSTORS exiting in the IA32_VMX_PROCBASED_CTLS2 MSR, and pass through VMCS12's control in the VMCS02. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-08-24kvm: vmx: Raise #UD on unsupported RDSEEDJim Mattson
A guest may not be configured to support RDSEED, even when the host does. If the guest does not support RDSEED, intercept the instruction and synthesize #UD. Also clear the "allowed-1" bit for RDSEED exiting in the IA32_VMX_PROCBASED_CTLS2 MSR. Signed-off-by: Jim Mattson <jmattson@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-08-24kvm: vmx: Raise #UD on unsupported RDRANDJim Mattson
A guest may not be configured to support RDRAND, even when the host does. If the guest does not support RDRAND, intercept the instruction and synthesize #UD. Also clear the "allowed-1" bit for RDRAND exiting in the IA32_VMX_PROCBASED_CTLS2 MSR. Signed-off-by: Jim Mattson <jmattson@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-08-24KVM: VMX: cache secondary exec controlsPaolo Bonzini
Currently, secondary execution controls are divided in three groups: - static, depending mostly on the module arguments or the processor (vmx_secondary_exec_control) - static, depending on CPUID (vmx_cpuid_update) - dynamic, depending on nested VMX or local APIC state Because walking CPUID is expensive, prepare_vmcs02 is using only the first group. This however is unnecessarily complicated. Just cache the static secondary execution controls, and then prepare_vmcs02 does not need to compute them every time. Computation of all static secondary execution controls is now kept in a single function, vmx_compute_secondary_exec_control. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-08-23KVM: SVM: Enable Virtual GIF featureJanakarajan Natarajan
Enable the Virtual GIF feature. This is done by setting bit 25 at position 60h in the vmcb. With this feature enabled, the processor uses bit 9 at position 60h as the virtual GIF when executing STGI/CLGI instructions. Since the execution of STGI by the L1 hypervisor does not cause a return to the outermost (L0) hypervisor, the enable_irq_window and enable_nmi_window are modified. The IRQ window will be opened even if GIF is not set, under the assumption that on resuming the L1 hypervisor the IRQ will be held pending until the processor executes the STGI instruction. For the NMI window, the STGI intercept is set. This will assist in opening the window only when GIF=1. Signed-off-by: Janakarajan Natarajan <Janakarajan.Natarajan@amd.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-08-23KVM: SVM: Add Virtual GIF feature definitionJanakarajan Natarajan
Add a new cpufeature definition for Virtual GIF. Signed-off-by: Janakarajan Natarajan <Janakarajan.Natarajan@amd.com> Reviewed-by: Borislav Petkov <bp@suse.de> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-08-18KVM: VMX: always require WB memory type for EPTDavid Hildenbrand
We already always set that type but don't check if it is supported. Also for nVMX, we only support WB for now. Let's just require it. Signed-off-by: David Hildenbrand <david@redhat.com> Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2017-08-18KVM: VMX: cleanup EPTP definitionsDavid Hildenbrand
Don't use shifts, tag them correctly as EPTP and use better matching names (PWL vs. GAW). Signed-off-by: David Hildenbrand <david@redhat.com> Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2017-08-18KVM: SVM: delete avic_vm_id_bitmap (2 megabyte static array)Denys Vlasenko
With lightly tweaked defconfig: text data bss dec hex filename 11259661 5109408 2981888 19350957 12745ad vmlinux.before 11259661 5109408 884736 17253805 10745ad vmlinux.after Only compile-tested. Signed-off-by: Denys Vlasenko <dvlasenk@redhat.com> Cc: Joerg Roedel <joro@8bytes.org> Cc: pbonzini@redhat.com Cc: rkrcmar@redhat.com Cc: tglx@linutronix.de Cc: mingo@redhat.com Cc: hpa@zytor.com Cc: x86@kernel.org Cc: kvm@vger.kernel.org Cc: linux-kernel@vger.kernel.org Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2017-08-18KVM: x86: fix use of L1 MMIO areas in nested guestsPaolo Bonzini
There is currently some confusion between nested and L1 GPAs. The assignment to "direct" in kvm_mmu_page_fault tries to fix that, but it is not enough. What this patch does is fence off the MMIO cache completely when using shadow nested page tables, since we have neither a GVA nor an L1 GPA to put in the cache. This also allows some simplifications in kvm_mmu_page_fault and FNAME(page_fault). The EPT misconfig likewise does not have an L1 GPA to pass to kvm_io_bus_write, so that must be skipped for guest mode. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Reviewed-by: David Hildenbrand <david@redhat.com> [Changed comment to say "GPAs" instead of "L1's physical addresses", as per David's review. - Radim] Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2017-08-18KVM: x86: Avoid guest page table walk when gpa_available is setBrijesh Singh
When a guest causes a page fault which requires emulation, the vcpu->arch.gpa_available flag is set to indicate that cr2 contains a valid GPA. Currently, emulator_read_write_onepage() makes use of gpa_available flag to avoid a guest page walk for a known MMIO regions. Lets not limit the gpa_available optimization to just MMIO region. The patch extends the check to avoid page walk whenever gpa_available flag is set. Signed-off-by: Brijesh Singh <brijesh.singh@amd.com> [Fix EPT=0 according to Wanpeng Li's fix, plus ensure VMX also uses the new code. - Paolo] Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Reviewed-by: David Hildenbrand <david@redhat.com> [Moved "ret < 0" to the else brach, as per David's review. - Radim] Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2017-08-18KVM: x86: simplify ept_misconfigPaolo Bonzini
Calling handle_mmio_page_fault() has been unnecessary since commit e9ee956e311d ("KVM: x86: MMU: Move handle_mmio_page_fault() call to kvm_mmu_page_fault()", 2016-02-22). handle_mmio_page_fault() can now be made static. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Reviewed-by: David Hildenbrand <david@redhat.com> Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2017-08-17powerpc/mm: Rename find_linux_pte_or_hugepte()Aneesh Kumar K.V
Add newer helpers to make the function usage simpler. It is always recommended to use find_current_mm_pte() for walking the page table. If we cannot use find_current_mm_pte(), it should be documented why the said usage of __find_linux_pte() is safe against a parallel THP split. For now we have KVM code using __find_linux_pte(). This is because kvm code ends up calling __find_linux_pte() in real mode with MSR_EE=0 but with PACA soft_enabled = 1. We may want to fix that later and make sure we keep the MSR_EE and PACA soft_enabled in sync. When we do that we can switch kvm to use find_linux_pte(). Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2017-08-15kvm: avoid uninitialized-variable warningsArnd Bergmann
When PAGE_OFFSET is not a compile-time constant, we run into warnings from the use of kvm_is_error_hva() that the compiler cannot optimize out: arch/arm/kvm/../../../virt/kvm/kvm_main.c: In function '__kvm_gfn_to_hva_cache_init': arch/arm/kvm/../../../virt/kvm/kvm_main.c:1978:14: error: 'nr_pages_avail' may be used uninitialized in this function [-Werror=maybe-uninitialized] arch/arm/kvm/../../../virt/kvm/kvm_main.c: In function 'gfn_to_page_many_atomic': arch/arm/kvm/../../../virt/kvm/kvm_main.c:1660:5: error: 'entry' may be used uninitialized in this function [-Werror=maybe-uninitialized] This adds fake initializations to the two instances I ran into. Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2017-08-11kvm: x86: Disallow illegal IA32_APIC_BASE MSR valuesJim Mattson
Host-initiated writes to the IA32_APIC_BASE MSR do not have to follow local APIC state transition constraints, but the value written must be valid. Signed-off-by: Jim Mattson <jmattson@google.com> Reviewed-by: David Hildenbrand <david@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-08-11KVM: MMU: Bail out immediately if there is no available mmu pageWanpeng Li
Bailing out immediately if there is no available mmu page to alloc. Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Radim Krčmář <rkrcmar@redhat.com> Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-08-11KVM: MMU: Fix softlockup due to mmu_lock is held too longWanpeng Li
watchdog: BUG: soft lockup - CPU#5 stuck for 22s! [warn_test:3089] irq event stamp: 20532 hardirqs last enabled at (20531): [<ffffffff8e9b6908>] restore_regs_and_iret+0x0/0x1d hardirqs last disabled at (20532): [<ffffffff8e9b7ae8>] apic_timer_interrupt+0x98/0xb0 softirqs last enabled at (8266): [<ffffffff8e9badc6>] __do_softirq+0x206/0x4c1 softirqs last disabled at (8253): [<ffffffff8e083918>] irq_exit+0xf8/0x100 CPU: 5 PID: 3089 Comm: warn_test Tainted: G OE 4.13.0-rc3+ #8 RIP: 0010:kvm_mmu_prepare_zap_page+0x72/0x4b0 [kvm] Call Trace: make_mmu_pages_available.isra.120+0x71/0xc0 [kvm] kvm_mmu_load+0x1cf/0x410 [kvm] kvm_arch_vcpu_ioctl_run+0x1316/0x1bf0 [kvm] kvm_vcpu_ioctl+0x340/0x700 [kvm] ? kvm_vcpu_ioctl+0x340/0x700 [kvm] ? __fget+0xfc/0x210 do_vfs_ioctl+0xa4/0x6a0 ? __fget+0x11d/0x210 SyS_ioctl+0x79/0x90 entry_SYSCALL_64_fastpath+0x23/0xc2 ? __this_cpu_preempt_check+0x13/0x20 This can be reproduced readily by ept=N and running syzkaller tests since many syzkaller testcases don't setup any memory regions. However, if ept=Y rmode identity map will be created, then kvm_mmu_calculate_mmu_pages() will extend the number of VM's mmu pages to at least KVM_MIN_ALLOC_MMU_PAGES which just hide the issue. I saw the scenario kvm->arch.n_max_mmu_pages == 0 && kvm->arch.n_used_mmu_pages == 1, so there is one active mmu page on the list, kvm_mmu_prepare_zap_page() fails to zap any pages, however prepare_zap_oldest_mmu_page() always returns true. It incurs infinite loop in make_mmu_pages_available() which causes mmu->lock softlockup. This patch fixes it by setting the return value of prepare_zap_oldest_mmu_page() according to whether or not there is mmu page zapped. Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Radim Krčmář <rkrcmar@redhat.com> Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-08-11KVM: nVMX: validate eptp pointerDavid Hildenbrand
Let's reuse the function introduced with eptp switching. We don't explicitly have to check against enable_ept_ad_bits, as this is implicitly done when checking against nested_vmx_ept_caps in valid_ept_address(). Signed-off-by: David Hildenbrand <david@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-08-11KVM: MAINTAINERS improvementsAndrew Jones
Remove nonexistent files, allow less awkward expressions when extracting arch-specific information, and only return relevant information when using arch-specific expressions. Additionally add include/trace/events/kvm.h, arch/*/include/uapi/asm/kvm*, and arch/powerpc/kernel/kvm* to appropriate sections. The arch- specific expressions are now: /KVM/ -- All KVM /\(KVM\)|\(KVM\/x86\)/ -- X86 /\(KVM\)|\(KVM\/x86\)|\(KVM\/amd\)/ -- X86 plus AMD /\(KVM\)|\(KVM\/arm\)/ -- ARM /\(KVM\)|\(KVM\/arm\)|\(KVM\/arm64\)/ -- ARM plus ARM64 /\(KVM\)|\(KVM\/powerpc\)/ -- POWERPC /\(KVM\)|\(KVM\/s390\)/ -- S390 /\(KVM\)|\(KVM\/mips\)/ -- MIPS Signed-off-by: Andrew Jones <drjones@redhat.com> Acked-by: Cornelia Huck <cohuck@redhat.com> Acked-by: Joerg Roedel <jroedel@suse.de> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-08-10kvm: nVMX: Add support for fast unprotection of nested guest page tablesPaolo Bonzini
This is the same as commit 147277540bbc ("kvm: svm: Add support for additional SVM NPF error codes", 2016-11-23), but for Intel processors. In this case, the exit qualification field's bit 8 says whether the EPT violation occurred while translating the guest's final physical address or rather while translating the guest page tables. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-08-10KVM: SVM: Limit PFERR_NESTED_GUEST_PAGE error_code check to L1 guestBrijesh Singh
Commit 147277540bbc ("kvm: svm: Add support for additional SVM NPF error codes", 2016-11-23) added a new error code to aid nested page fault handling. The commit unprotects (kvm_mmu_unprotect_page) the page when we get a NPF due to guest page table walk where the page was marked RO. However, if an L0->L2 shadow nested page table can also be marked read-only when a page is read only in L1's nested page table. If such a page is accessed by L2 while walking page tables it can cause a nested page fault (page table walks are write accesses). However, after kvm_mmu_unprotect_page we may get another page fault, and again in an endless stream. To cover this use case, we qualify the new error_code check with vcpu->arch.mmu_direct_map so that the error_code check would run on L1 guest, and not the L2 guest. This avoids hitting the above scenario. Fixes: 147277540bbc54119172481c8ef6d930cc9fbfc2 Cc: stable@vger.kernel.org Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Radim Krčmář <rkrcmar@redhat.com> Cc: Thomas Lendacky <thomas.lendacky@amd.com> Signed-off-by: Brijesh Singh <brijesh.singh@amd.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-08-10KVM: X86: Fix residual mmio emulation request to userspaceWanpeng Li
Reported by syzkaller: The kvm-intel.unrestricted_guest=0 WARNING: CPU: 5 PID: 1014 at /home/kernel/data/kvm/arch/x86/kvm//x86.c:7227 kvm_arch_vcpu_ioctl_run+0x38b/0x1be0 [kvm] CPU: 5 PID: 1014 Comm: warn_test Tainted: G W OE 4.13.0-rc3+ #8 RIP: 0010:kvm_arch_vcpu_ioctl_run+0x38b/0x1be0 [kvm] Call Trace: ? put_pid+0x3a/0x50 ? rcu_read_lock_sched_held+0x79/0x80 ? kmem_cache_free+0x2f2/0x350 kvm_vcpu_ioctl+0x340/0x700 [kvm] ? kvm_vcpu_ioctl+0x340/0x700 [kvm] ? __fget+0xfc/0x210 do_vfs_ioctl+0xa4/0x6a0 ? __fget+0x11d/0x210 SyS_ioctl+0x79/0x90 entry_SYSCALL_64_fastpath+0x23/0xc2 ? __this_cpu_preempt_check+0x13/0x20 The syszkaller folks reported a residual mmio emulation request to userspace due to vm86 fails to emulate inject real mode interrupt(fails to read CS) and incurs a triple fault. The vCPU returns to userspace with vcpu->mmio_needed == true and KVM_EXIT_SHUTDOWN exit reason. However, the syszkaller testcase constructs several threads to launch the same vCPU, the thread which lauch this vCPU after the thread whichs get the vcpu->mmio_needed == true and KVM_EXIT_SHUTDOWN will trigger the warning. #define _GNU_SOURCE #include <pthread.h> #include <stdio.h> #include <stdlib.h> #include <string.h> #include <sys/wait.h> #include <sys/types.h> #include <sys/stat.h> #include <sys/mman.h> #include <fcntl.h> #include <unistd.h> #include <linux/kvm.h> #include <stdio.h> int kvmcpu; struct kvm_run *run; void* thr(void* arg) { int res; res = ioctl(kvmcpu, KVM_RUN, 0); printf("ret1=%d exit_reason=%d suberror=%d\n", res, run->exit_reason, run->internal.suberror); return 0; } void test() { int i, kvm, kvmvm; pthread_t th[4]; kvm = open("/dev/kvm", O_RDWR); kvmvm = ioctl(kvm, KVM_CREATE_VM, 0); kvmcpu = ioctl(kvmvm, KVM_CREATE_VCPU, 0); run = (struct kvm_run*)mmap(0, 4096, PROT_READ|PROT_WRITE, MAP_SHARED, kvmcpu, 0); srand(getpid()); for (i = 0; i < 4; i++) { pthread_create(&th[i], 0, thr, 0); usleep(rand() % 10000); } for (i = 0; i < 4; i++) pthread_join(th[i], 0); } int main() { for (;;) { int pid = fork(); if (pid < 0) exit(1); if (pid == 0) { test(); exit(0); } int status; while (waitpid(pid, &status, __WALL) != pid) {} } return 0; } This patch fixes it by resetting the vcpu->mmio_needed once we receive the triple fault to avoid the residue. Reported-by: Dmitry Vyukov <dvyukov@google.com> Tested-by: Dmitry Vyukov <dvyukov@google.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Radim Krčmář <rkrcmar@redhat.com> Cc: Dmitry Vyukov <dvyukov@google.com> Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-08-08KVM: arm: implements the kvm_arch_vcpu_in_kernel()Longpeng(Mike)
This implements the kvm_arch_vcpu_in_kernel() for ARM, and adjusts the calls to kvm_vcpu_on_spin(). Signed-off-by: Longpeng(Mike) <longpeng2@huawei.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-08-08KVM: s390: implements the kvm_arch_vcpu_in_kernel()Longpeng(Mike)
This implements kvm_arch_vcpu_in_kernel() for s390. DIAG is a privileged operation, so it cannot be called from problem state (user mode). Signed-off-by: Longpeng(Mike) <longpeng2@huawei.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-08-08KVM: X86: implement the logic for spinlock optimizationLongpeng(Mike)
get_cpl requires vcpu_load, so we must cache the result (whether the vcpu was preempted when its cpl=0) in kvm_vcpu_arch. Signed-off-by: Longpeng(Mike) <longpeng2@huawei.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-08-08KVM: add spinlock optimization frameworkLongpeng(Mike)
If a vcpu exits due to request a user mode spinlock, then the spinlock-holder may be preempted in user mode or kernel mode. (Note that not all architectures trap spin loops in user mode, only AMD x86 and ARM/ARM64 currently do). But if a vcpu exits in kernel mode, then the holder must be preempted in kernel mode, so we should choose a vcpu in kernel mode as a more likely candidate for the lock holder. This introduces kvm_arch_vcpu_in_kernel() to decide whether the vcpu is in kernel-mode when it's preempted. kvm_vcpu_on_spin's new argument says the same of the spinning VCPU. Signed-off-by: Longpeng(Mike) <longpeng2@huawei.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-08-07KVM: x86: use general helpers for some cpuid manipulationRadim Krčmář
Add guest_cpuid_clear() and use it instead of kvm_find_cpuid_entry(). Also replace some uses of kvm_find_cpuid_entry() with guest_cpuid_has(). Signed-off-by: Radim Krčmář <rkrcmar@redhat.com> Reviewed-by: David Hildenbrand <david@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-08-07KVM: x86: generalize guest_cpuid_has_ helpersRadim Krčmář
This patch turns guest_cpuid_has_XYZ(cpuid) into guest_cpuid_has(cpuid, X86_FEATURE_XYZ), which gets rid of many very similar helpers. When seeing a X86_FEATURE_*, we can know which cpuid it belongs to, but this information isn't in common code, so we recreate it for KVM. Add some BUILD_BUG_ONs to make sure that it runs nicely. Signed-off-by: Radim Krčmář <rkrcmar@redhat.com> Reviewed-by: David Hildenbrand <david@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-08-07KVM: x86: X86_FEATURE_NRIPS is not scattered anymoreRadim Krčmář
bit(X86_FEATURE_NRIPS) is 3 since 2ccd71f1b278 ("x86/cpufeature: Move some of the scattered feature bits to x86_capability"), so we can simplify the code. Signed-off-by: Radim Krčmář <rkrcmar@redhat.com> Reviewed-by: David Hildenbrand <david@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-08-07KVM: nVMX: Emulate EPTP switching for the L1 hypervisorBandan Das
When L2 uses vmfunc, L0 utilizes the associated vmexit to emulate a switching of the ept pointer by reloading the guest MMU. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Bandan Das <bsd@redhat.com> Acked-by: David Hildenbrand <david@redhat.com> Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>