path: root/arch/x86/kvm
Age | Commit message | Author
2016-11-02 | KVM: LAPIC: guarantee the timer is in tsc-deadline mode | Wanpeng Li
Check apic_lvtt_tscdeadline() mode directly instead of apic_lvtt_oneshot() and apic_lvtt_period() to guarantee the timer is in tsc-deadline mode when rdmsr MSR_IA32_TSCDEADLINE. Suggested-by: Radim Krčmář <rkrcmar@redhat.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Radim Krčmář <rkrcmar@redhat.com> Cc: Yunhong Jiang <yunhong.jiang@intel.com> Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com> Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
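The difference between the two checks can be sketched as follows. This is a minimal illustration, not the kernel code: the `LVT_TIMER_*` macros and the `lvtt_*`/`read_tscdeadline` helpers are hypothetical names, and the only fact relied on is that the LVT Timer mode field occupies bits 18:17 with 0 = one-shot, 1 = periodic, 2 = TSC-deadline and 3 reserved (Intel SDM).

```c
#include <stdbool.h>
#include <stdint.h>

/* Timer-mode field of the LVT Timer register, bits 18:17 (Intel SDM). */
#define LVT_TIMER_MODE_MASK   (3u << 17)
#define LVT_TIMER_ONESHOT     (0u << 17)
#define LVT_TIMER_PERIODIC    (1u << 17)
#define LVT_TIMER_TSCDEADLINE (2u << 17)

/* Indirect test: "neither one-shot nor periodic". It is also true for the
 * reserved encoding 3, which makes it weaker than a direct test. */
static bool lvtt_not_oneshot_or_periodic(uint32_t lvtt)
{
	uint32_t mode = lvtt & LVT_TIMER_MODE_MASK;

	return mode != LVT_TIMER_ONESHOT && mode != LVT_TIMER_PERIODIC;
}

/* Direct test: true only for the TSC-deadline encoding. */
static bool lvtt_is_tscdeadline(uint32_t lvtt)
{
	return (lvtt & LVT_TIMER_MODE_MASK) == LVT_TIMER_TSCDEADLINE;
}

/* Hypothetical read helper: report the armed deadline only in
 * TSC-deadline mode, otherwise read as zero. */
static uint64_t read_tscdeadline(uint32_t lvtt, uint64_t armed_deadline)
{
	return lvtt_is_tscdeadline(lvtt) ? armed_deadline : 0;
}
```

With the reserved encoding `3u << 17`, the indirect test passes while the direct one does not, which is the gap the direct check closes.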
2016-11-02 | KVM: LAPIC: extract start_sw_period() to handle periodic/oneshot mode | Wanpeng Li
Extract start_sw_period() to handle periodic/oneshot mode, it will be used by later patch. Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Radim Krčmář <rkrcmar@redhat.com> Cc: Yunhong Jiang <yunhong.jiang@intel.com> Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com> Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2016-11-02 | kvm: x86: remove the misleading comment in vmx_handle_external_intr | Longpeng(Mike)
Since Paolo has removed irq-enable-operation in vmx_handle_external_intr (KVM: x86: use guest_exit_irqoff), the original comment about the IF bit in rflags is incorrect and stale now, so remove it. Signed-off-by: Longpeng(Mike) <longpeng2@huawei.com> Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2016-11-02 | KVM: x86: add track_flush_slot page track notifier | Xiaoguang Chen
When a memory slot is being moved or removed users of page track can be notified. So users can drop write-protection for the pages in that memory slot. This notifier type is needed by KVMGT to sync up its shadow page table when memory slot is being moved or removed. Register the notifier type track_flush_slot to receive memslot move and remove event. Reviewed-by: Xiao Guangrong <guangrong.xiao@intel.com> Signed-off-by: Chen Xiaoguang <xiaoguang.chen@intel.com> [Squashed commits to avoid bisection breakage and reworded the subject.] Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2016-11-02 | KVM: VMX: refactor setup of global page-sized bitmaps | Radim Krčmář
We've had 10 page-sized bitmaps that were being allocated and freed one by one when we could just use a loop. Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
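A user-space sketch of the loop-based pattern, assuming hypothetical names (`demo_bitmap`, `DEMO_BITMAP_NR`) and `calloc()`/`free()` in place of the kernel page allocator:

```c
#include <stdlib.h>

#define DEMO_BITMAP_NR 10     /* count taken from the commit message */
#define DEMO_PAGE_SIZE 4096   /* assumed page size */

static unsigned long *demo_bitmap[DEMO_BITMAP_NR];

/* Free every bitmap; safe to call on a partially allocated array. */
static void demo_free_bitmaps(void)
{
	for (int i = 0; i < DEMO_BITMAP_NR; i++) {
		free(demo_bitmap[i]);   /* free(NULL) is a no-op */
		demo_bitmap[i] = NULL;
	}
}

/* Returns 0 on success, -1 if any allocation failed (nothing is leaked). */
static int demo_alloc_bitmaps(void)
{
	for (int i = 0; i < DEMO_BITMAP_NR; i++) {
		demo_bitmap[i] = calloc(1, DEMO_PAGE_SIZE);
		if (!demo_bitmap[i]) {
			demo_free_bitmaps();
			return -1;
		}
	}
	return 0;
}
```

Keeping the bitmaps in one array means a single error path covers all of them, instead of one unwind label per bitmap.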
2016-11-02 | KVM: VMX: join functions that disable x2apic msr intercepts | Radim Krčmář
vmx_disable_intercept_msr_read_x2apic() and vmx_disable_intercept_msr_write_x2apic() differed only in the type. Pass the type to a new function. [Ordered and commented TPR intercept according to Paolo's suggestion.] Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
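Passing the access type as a parameter can be sketched like this. The function and macro names are hypothetical; the sketch assumes the VMX MSR-bitmap layout from the Intel SDM, where read bits for MSRs 0x0000-0x1fff start at byte 0x000 and write bits at byte 0x800, and it ignores the high MSR range.

```c
#include <stdint.h>

#define DEMO_MSR_TYPE_R 1
#define DEMO_MSR_TYPE_W 2

/* Clear the read and/or write intercept bit for a low-range MSR.
 * A set bit means "intercept"; all bits are assumed set initially. */
static void demo_disable_intercept(uint8_t *bitmap, uint32_t msr, int type)
{
	if (msr > 0x1fff)
		return;   /* high MSR range is not modelled in this sketch */
	if (type & DEMO_MSR_TYPE_R)
		bitmap[0x000 + msr / 8] &= (uint8_t)~(1u << (msr % 8));
	if (type & DEMO_MSR_TYPE_W)
		bitmap[0x800 + msr / 8] &= (uint8_t)~(1u << (msr % 8));
}

/* Query one intercept bit; type must be exactly R or W. */
static int demo_intercepted(const uint8_t *bitmap, uint32_t msr, int type)
{
	unsigned int base = (type == DEMO_MSR_TYPE_R) ? 0x000 : 0x800;

	if (msr > 0x1fff)
		return 1; /* not modelled here; treated as intercepted */
	return (bitmap[base + msr / 8] >> (msr % 8)) & 1;
}
```

For example, disabling only the read intercept for the x2APIC TPR MSR (0x808) leaves its write intercept, and every other MSR, still intercepted.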
2016-11-02 | KVM: VMX: remove functions that enable msr intercepts | Radim Krčmář
All intercepts are enabled at the beginning, so they can only be used if we disabled an intercept that we wanted to have enabled. This was done for TMCCT to simplify a loop that disables all x2APIC MSR intercepts, but just keeping TMCCT enabled yields better results. Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2016-11-02 | kvm: nVMX: Update MSR load counts on a VMCS switch | Jim Mattson
When L0 establishes (or removes) an MSR entry in the VM-entry or VM-exit MSR load lists, the change should affect the dormant VMCS as well as the current VMCS. Moreover, the vmcs02 MSR-load addresses should be initialized. Signed-off-by: Jim Mattson <jmattson@google.com> Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2016-11-02 | kvm: nVMX: Fetch VM_INSTRUCTION_ERROR from vmcs02 on vmx->fail | Jim Mattson
When forwarding a hardware VM-entry failure to L1, fetch the VM_INSTRUCTION_ERROR field from vmcs02 before loading vmcs01. (Note that there is an implicit assumption that the VM-entry failure was on the first VM-entry to vmcs02 after nested_vmx_run; otherwise, L1 is going to be very confused.) Signed-off-by: Jim Mattson <jmattson@google.com> Reviewed-by: Peter Feiner <pfeiner@google.com> Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2016-11-02 | KVM: X86: MMU: no mmu_notifier_seq++ in kvm_age_hva | Peter Feiner
The MMU notifier sequence number keeps GPA->HPA mappings in sync when GPA->HPA lookups are done outside of the MMU lock (e.g., in tdp_page_fault). Since kvm_age_hva doesn't change GPA->HPA, it's unnecessary to increment the sequence number. Signed-off-by: Peter Feiner <pfeiner@google.com> Reviewed-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
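The role of the sequence number can be illustrated with a single-threaded sketch. All names here (`demo_kvm`, `demo_invalidate`, `demo_age`, `demo_must_retry`) are hypothetical, and real code performs the final comparison under the MMU lock.

```c
#include <stdbool.h>

struct demo_kvm {
	unsigned long notifier_seq;   /* bumped whenever a GPA->HPA mapping changes */
};

/* Invalidation path: the mapping changed, so in-flight lookups must retry. */
static void demo_invalidate(struct demo_kvm *kvm)
{
	kvm->notifier_seq++;
}

/* Aging path: only accessed state is touched and the mapping is unchanged,
 * so there is nothing for in-flight lookups to retry. */
static void demo_age(struct demo_kvm *kvm)
{
	(void)kvm;
}

/* A fault handler snapshots the counter before its lockless lookup and
 * compares it again before installing the result. */
static bool demo_must_retry(const struct demo_kvm *kvm, unsigned long snapshot)
{
	return kvm->notifier_seq != snapshot;
}
```

Under this model, an aging pass between snapshot and install does not force a retry, while an invalidation does.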
2016-11-02 | KVM: VMX: Better name x2apic msr bitmaps | Wanpeng Li
Renames x2apic_apicv_inactive msr_bitmaps to x2apic and original x2apic bitmaps to x2apic_apicv. Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Radim Krčmář <rkrcmar@redhat.com> Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com> Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2016-11-02 | kvm: x86: Check memopp before dereference (CVE-2016-8630) | Owen Hofmann
Commit 41061cdb98 ("KVM: emulate: do not initialize memopp") removes a check for non-NULL under incorrect assumptions. An undefined instruction with a ModR/M byte with Mod=0 and R/M=5 (e.g. 0xc7 0x15) will attempt to dereference a null pointer here. Fixes: 41061cdb98a0bec464278b4db8e894a3121671f5 Message-Id: <1477592752-126650-2-git-send-email-osh@google.com> Signed-off-by: Owen Hofmann <osh@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
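As an illustration, the ModR/M byte from the example can be decoded by hand, and the guarded dereference can be sketched with a stand-in operand type. `demo_modrm`, `demo_operand` and `demo_needs_rip_fixup` are hypothetical names; only the ModR/M field layout (mod in bits 7:6, reg in bits 5:3, r/m in bits 2:0) is taken as fact.

```c
#include <stddef.h>
#include <stdint.h>

struct demo_modrm {
	unsigned int mod, reg, rm;
};

/* Split a ModR/M byte into its three fields. */
static struct demo_modrm demo_decode_modrm(uint8_t byte)
{
	struct demo_modrm m = {
		.mod = (byte >> 6) & 3,
		.reg = (byte >> 3) & 7,
		.rm  = byte & 7,
	};
	return m;
}

/* Stand-in for the emulator's memory operand descriptor. */
struct demo_operand {
	int is_mem;
};

/* The pointer can be NULL (per the commit, for an undefined instruction),
 * so it is tested before any field is read. */
static int demo_needs_rip_fixup(const struct demo_operand *memopp,
				int rip_relative)
{
	return memopp != NULL && memopp->is_mem && rip_relative;
}
```

Decoding 0x15 yields mod=0, reg=2, r/m=5, matching the case named in the commit message.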
2016-11-02 | kvm: nVMX: VMCLEAR an active shadow VMCS after last use | Jim Mattson
After a successful VM-entry with the "VMCS shadowing" VM-execution control set, the shadow VMCS referenced by the VMCS link pointer field in the current VMCS becomes active on the logical processor. A VMCS that is made active on more than one logical processor may become corrupted. Therefore, before an active VMCS can be migrated to another logical processor, the first logical processor must execute a VMCLEAR for the active VMCS. VMCLEAR both ensures that all VMCS data are written to memory and makes the VMCS inactive. Signed-off-by: Jim Mattson <jmattson@google.com> Reviewed-By: David Matlack <dmatlack@google.com> Message-Id: <1477668579-22555-1-git-send-email-jmattson@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-11-02 | KVM: x86: drop TSC offsetting kvm_x86_ops to fix KVM_GET/SET_CLOCK | Paolo Bonzini
Since commit a545ab6a0085 ("kvm: x86: add tsc_offset field to struct kvm_vcpu_arch", 2016-09-07) the offset between host and L1 TSC is cached and need not be fished out of the VMCS or VMCB. This means that we can implement adjust_tsc_offset_guest and read_l1_tsc entirely in generic code. The simplification is particularly significant for VMX code, where vmx->nested.vmcs01_tsc_offset was duplicating what is now in vcpu->arch.tsc_offset. Therefore the vmcs01_tsc_offset can be dropped completely. More importantly, this fixes KVM_GET_CLOCK/KVM_SET_CLOCK which, after commit 108b249c453d ("KVM: x86: introduce get_kvmclock_ns", 2016-09-01) called read_l1_tsc while the VMCS was not loaded. It thus returned bogus values on Intel CPUs. Fixes: 108b249c453dd7132599ab6dc7e435a7036c193f Reported-by: Roman Kagan <rkagan@virtuozzo.com> Reviewed-by: Radim Krčmář <rkrcmar@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
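Once the offset is cached per vCPU, both operations reduce to plain arithmetic. The sketch below uses hypothetical names (`demo_tsc_vcpu`, `demo_read_l1_tsc`, `demo_adjust_tsc_offset_guest`) and assumes no TSC scaling, which the real code also has to account for.

```c
#include <stdint.h>

struct demo_tsc_vcpu {
	uint64_t tsc_offset;   /* cached offset between host TSC and L1 TSC */
};

/* L1 TSC = host TSC + cached offset. Unsigned wraparound makes a
 * "negative" offset work without special handling. */
static uint64_t demo_read_l1_tsc(const struct demo_tsc_vcpu *v,
				 uint64_t host_tsc)
{
	return host_tsc + v->tsc_offset;
}

/* Shift the guest-visible TSC by a signed amount. */
static void demo_adjust_tsc_offset_guest(struct demo_tsc_vcpu *v,
					 int64_t adjustment)
{
	v->tsc_offset += (uint64_t)adjustment;
}
```

Neither helper touches a VMCS or VMCB, so in this model they can be called whether or not the hardware control structure is loaded.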
2016-11-01 | x86/fpu, kvm: Remove host CR0.TS manipulation | Andy Lutomirski
Now that x86 always uses eager FPU switching on the host, there's no need for KVM to manipulate the host's CR0.TS. This should be both simpler and faster. Signed-off-by: Andy Lutomirski <luto@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: Fenghua Yu <fenghua.yu@intel.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Quentin Casasnovas <quentin.casasnovas@oracle.com> Cc: Rik van Riel <riel@redhat.com> Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: kvm list <kvm@vger.kernel.org> Link: http://lkml.kernel.org/r/b212064922537c05d0c81d931fc4dbe769127ce7.1477951965.git.luto@kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-11-01 | Merge branch 'core/urgent' into x86/fpu, to merge fixes | Ingo Molnar
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-10-28 | KVM: x86: fix wbinvd_dirty_mask use-after-free | Ido Yariv
vcpu->arch.wbinvd_dirty_mask may still be used after freeing it, corrupting memory. For example, the following call trace may set a bit in an already freed cpu mask: kvm_arch_vcpu_load vcpu_load vmx_free_vcpu_nested vmx_free_vcpu kvm_arch_vcpu_free Fix this by deferring freeing of wbinvd_dirty_mask. Cc: stable@vger.kernel.org Signed-off-by: Ido Yariv <ido@wizery.com> Reviewed-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2016-10-27 | kvm/x86: Show WRMSR data is in hex | Borislav Petkov
Add the "0x" prefix to the error messages format to make it unambiguous about what kind of value we're talking about. Signed-off-by: Borislav Petkov <bp@suse.de> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: "Radim Krčmář" <rkrcmar@redhat.com> Message-Id: <20161027181445.25319-1-bp@alien8.de> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
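The effect of the prefix can be seen in a small formatting sketch. The helper name and the exact message wording are illustrative assumptions, not the kernel's text.

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Format an "unhandled WRMSR" style message with both values
 * explicitly marked as hexadecimal. Returns snprintf()'s result. */
static int demo_format_wrmsr(char *buf, size_t len, uint32_t msr, uint64_t data)
{
	return snprintf(buf, len, "unhandled wrmsr: 0x%x data 0x%llx",
			(unsigned int)msr, (unsigned long long)data);
}
```

Without the prefix, a value such as `10` could be read as decimal ten or as hexadecimal sixteen; `0x10` removes the ambiguity.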
2016-10-27 | kvm: nVMX: Fix kernel panics induced by illegal INVEPT/INVVPID types | Jim Mattson
Bitwise shifts by amounts greater than or equal to the width of the left operand are undefined. A malicious guest can exploit this to crash a 32-bit host, due to the BUG_ON(1)'s in handle_{invept,invvpid}. Signed-off-by: Jim Mattson <jmattson@google.com> Message-Id: <1477496318-17681-1-git-send-email-jmattson@google.com> [Change 1UL to 1, to match the range check on the shift count. - Paolo] Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
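A minimal sketch of the safe pattern, with a hypothetical helper name: the guest-controlled shift count is range-checked before it is used, so the shift is always defined.

```c
#include <stdbool.h>
#include <stdint.h>

/* Return true if bit `type` is set in a 32-bit mask of supported types.
 * `type` comes from the guest and may be arbitrarily large. */
static bool demo_type_supported(uint32_t supported_types, uint64_t type)
{
	/* Range check first: shifting a 32-bit value by 32 or more is
	 * undefined behaviour in C. */
	if (type >= 32)
		return false;
	return (supported_types >> type) & 1u;
}
```

An out-of-range type is then rejected as unsupported instead of reaching code that assumes only valid types get through.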
2016-10-20 | kvm: x86: memset whole irq_eoi | Jiri Slaby
gcc 7 warns: arch/x86/kvm/ioapic.c: In function 'kvm_ioapic_reset': arch/x86/kvm/ioapic.c:597:2: warning: 'memset' used with length equal to number of elements without multiplication by element size [-Wmemset-elt-size] And it is right. Memset whole array using sizeof operator. Signed-off-by: Jiri Slaby <jslaby@suse.cz> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Radim Krčmář <rkrcmar@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: x86@kernel.org Cc: kvm@vger.kernel.org Cc: linux-kernel@vger.kernel.org Cc: stable@vger.kernel.org Reviewed-by: Paolo Bonzini <pbonzini@redhat.com> [Added x86 subject tag] Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
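The bug class can be reproduced in isolation. The struct, the pin count of 24 and the function names below are assumptions for illustration; the point is only the difference between an element count and a byte count.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define DEMO_NUM_PINS 24   /* assumed element count */

struct demo_ioapic {
	uint32_t irq_eoi[DEMO_NUM_PINS];
};

/* Buggy pattern: the length is the number of elements, so only the
 * first DEMO_NUM_PINS bytes (a quarter of the array) are cleared. */
static void demo_reset_buggy(struct demo_ioapic *io)
{
	size_t len = DEMO_NUM_PINS;   /* elements, not bytes */

	memset(io->irq_eoi, 0, len);
}

/* Fixed pattern: sizeof yields the size of the whole array in bytes. */
static void demo_reset_fixed(struct demo_ioapic *io)
{
	memset(io->irq_eoi, 0, sizeof(io->irq_eoi));
}
```

With 4-byte elements, the buggy variant leaves the last 18 of the 24 entries untouched.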
2016-10-20 | kvm/x86: Fix unused variable warning in kvm_timer_init() | Borislav Petkov
When CONFIG_CPU_FREQ is not set, int cpu is unused and gcc rightfully warns about it: arch/x86/kvm/x86.c: In function ‘kvm_timer_init’: arch/x86/kvm/x86.c:5697:6: warning: unused variable ‘cpu’ [-Wunused-variable] int cpu; ^~~ But since it is used only in the CONFIG_CPU_FREQ block, simply move it there, thus squashing the warning too. Signed-off-by: Borislav Petkov <bp@suse.de> Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2016-10-16 | Merge tag 'v4.9-rc1' into x86/fpu, to resolve conflict | Ingo Molnar
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-10-11 | kthread: kthread worker API cleanup | Petr Mladek
A good practice is to prefix the names of functions by the name of the subsystem. The kthread worker API is a mix of classic kthreads and workqueues. Each worker has a dedicated kthread. It runs a generic function that processes queued works. It is implemented as part of the kthread subsystem.
This patch renames the existing kthread worker API to use the corresponding name from the workqueues API prefixed by kthread_:
  __init_kthread_worker() -> __kthread_init_worker()
  init_kthread_worker() -> kthread_init_worker()
  init_kthread_work() -> kthread_init_work()
  insert_kthread_work() -> kthread_insert_work()
  queue_kthread_work() -> kthread_queue_work()
  flush_kthread_work() -> kthread_flush_work()
  flush_kthread_worker() -> kthread_flush_worker()
Note that the names of DEFINE_KTHREAD_WORK*() macros stay as they are. It is common that the "DEFINE_" prefix has precedence over the subsystem names.
Note that INIT() macros and init() functions use a different naming scheme. There is no good solution. There are several reasons for this solution:
  + "init" in the function names stands for the verb "initialize" aka "initialize worker". While "INIT" in the macro names stands for the noun "INITIALIZER" aka "worker initializer".
  + INIT() macros are used only in DEFINE() macros
  + init() functions are used close to the other kthread() functions. It looks much better if all the functions use the same scheme.
  + There will be also kthread_destroy_worker() that will be used close to kthread_cancel_work(). It is related to the init() function. Again it looks better if all functions use the same naming scheme.
  + there are several precedents for such init() function names, e.g. amd_iommu_init_device(), free_area_init_node(), jump_label_init_type(), regmap_init_mmio_clk(),
  + It is not an argument but it was inconsistent even before.
[arnd@arndb.de: fix linux-next merge conflict] Link: http://lkml.kernel.org/r/20160908135724.1311726-1-arnd@arndb.de Link: http://lkml.kernel.org/r/1470754545-17632-3-git-send-email-pmladek@suse.com Suggested-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Petr Mladek <pmladek@suse.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Tejun Heo <tj@kernel.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> Cc: Josh Triplett <josh@joshtriplett.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Jiri Kosina <jkosina@suse.cz> Cc: Borislav Petkov <bp@suse.de> Cc: Michal Hocko <mhocko@suse.cz> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-10-07 | x86/fpu, kvm: Remove KVM vcpu->fpu_counter | Rik van Riel
With the removal of the lazy FPU code, this field is no longer used. Get rid of it. Signed-off-by: Rik van Riel <riel@redhat.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: Fenghua Yu <fenghua.yu@intel.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Quentin Casasnovas <quentin.casasnovas@oracle.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: pbonzini@redhat.com Link: http://lkml.kernel.org/r/1475627678-20788-7-git-send-email-riel@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-10-07 | x86/fpu: Remove use_eager_fpu() | Andy Lutomirski
This removes all the obvious code paths that depend on lazy FPU mode. It shouldn't change the generated code at all. Signed-off-by: Andy Lutomirski <luto@kernel.org> Signed-off-by: Rik van Riel <riel@redhat.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: Fenghua Yu <fenghua.yu@intel.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Quentin Casasnovas <quentin.casasnovas@oracle.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: pbonzini@redhat.com Link: http://lkml.kernel.org/r/1475627678-20788-5-git-send-email-riel@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-10-06 | Merge tag 'kvm-4.9-1' of git://git.kernel.org/pub/scm/virt/kvm/kvm | Linus Torvalds
Pull KVM updates from Radim Krčmář:
"All architectures:
  - move `make kvmconfig` stubs from x86
  - use 64 bits for debugfs stats
ARM:
  - Important fixes for not using an in-kernel irqchip
  - handle SError exceptions and present them to guests if appropriate
  - proxying of GICV access at EL2 if guest mappings are unsafe
  - GICv3 on AArch32 on ARMv8
  - preparations for GICv3 save/restore, including ABI docs
  - cleanups and a bit of optimizations
MIPS:
  - A couple of fixes in preparation for supporting MIPS EVA host kernels
  - MIPS SMP host & TLB invalidation fixes
PPC:
  - Fix the bug which caused guests to falsely report lockups
  - other minor fixes
  - a small optimization
s390:
  - Lazy enablement of runtime instrumentation
  - up to 255 CPUs for nested guests
  - rework of machine check delivery
  - cleanups and fixes
x86:
  - IOMMU part of AMD's AVIC for vmexit-less interrupt delivery
  - Hyper-V TSC page
  - per-vcpu tsc_offset in debugfs
  - accelerated INS/OUTS in nVMX
  - cleanups and fixes"
* tag 'kvm-4.9-1' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (140 commits)
  KVM: MIPS: Drop dubious EntryHi optimisation
  KVM: MIPS: Invalidate TLB by regenerating ASIDs
  KVM: MIPS: Split kernel/user ASID regeneration
  KVM: MIPS: Drop other CPU ASIDs on guest MMU changes
  KVM: arm/arm64: vgic: Don't flush/sync without a working vgic
  KVM: arm64: Require in-kernel irqchip for PMU support
  KVM: PPC: Book3s PR: Allow access to unprivileged MMCR2 register
  KVM: PPC: Book3S PR: Support 64kB page size on POWER8E and POWER8NVL
  KVM: PPC: Book3S: Remove duplicate setting of the B field in tlbie
  KVM: PPC: BookE: Fix a sanity check
  KVM: PPC: Book3S HV: Take out virtual core piggybacking code
  KVM: PPC: Book3S: Treat VTB as a per-subcore register, not per-thread
  ARM: gic-v3: Work around definition of gic_write_bpr1
  KVM: nVMX: Fix the NMI IDT-vectoring handling
  KVM: VMX: Enable MSR-BASED TPR shadow even if APICv is inactive
  KVM: nVMX: Fix reload apic access page warning
  kvmconfig: add virtio-gpu to config fragment
  config: move x86 kvm_guest.config to a common location
  arm64: KVM: Remove duplicating init code for setting VMID
  ARM: KVM: Support vgic-v3
  ...
2016-09-23 | KVM: nVMX: Fix the NMI IDT-vectoring handling | Wanpeng Li
Run kvm-unit-tests/eventinj.flat in L1:
  Sending NMI to self
  After NMI to self
  FAIL: NMI
This test scenario checks whether the VMM handles NMI IDT-vectoring info correctly. At the beginning, L2 writes the LAPIC to send a self NMI; the EPT page tables on both L1 and L0 are empty, so:
  - The L2 memory access generates an EPT violation, which is intercepted by L0. The EPT violation vmexit occurs during delivery of this NMI, and the NMI info is recorded in vmcs02's IDT-vectoring info.
  - L0 walks L1's EPT12, sees that the mapping is invalid, and injects the EPT violation into L1. The vmcs02's IDT-vectoring info is reflected to vmcs12's IDT-vectoring info since it is a nested vmexit.
  - L1 receives the EPT violation, then fixes its EPT12.
  - L1 executes VMRESUME to resume L2, which generates a vmexit and causes L1 to exit to L0.
  - L0 emulates the VMRESUME called from L1, then returns to L2. L0 merges the requirement of vmcs12's IDT-vectoring info and injects it into L2 through vmcs02.
  - L2 re-executes the faulting instruction and causes an EPT violation again.
  - Since L1's EPT12 is now valid, L0 can fix its EPT02.
  - L0 resumes L2. The EPT violation vmexit occurs during delivery of this NMI again, and the NMI info is recorded in vmcs02's IDT-vectoring info. L0 should inject the NMI through vmentry event injection since it is caused by EPT02's EPT violation.
However, vmx_inject_nmi() refuses to inject an NMI from IDT-vectoring info if the vCPU is in guest mode. This patch fixes that by permitting the injection when it is L0's responsibility to inject the NMI from IDT-vectoring info into L2.
Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Radim Krčmář <rkrcmar@redhat.com> Cc: Jan Kiszka <jan.kiszka@siemens.com> Cc: Bandan Das <bsd@redhat.com> Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com> Reviewed-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2016-09-23 | KVM: VMX: Enable MSR-BASED TPR shadow even if APICv is inactive | Wanpeng Li
I observed that kvmvapic (which optimizes flexpriority=N or AMD) is used to boost TPR access when testing the kvm-unit-test/eventinj.flat tpr case on my Haswell desktop (w/ flexpriority, w/o APICv). Commit 8d14695f9542 ("x86, apicv: add virtual x2apic support") disables virtual x2apic mode completely if w/o APICv, and the author also told me that Windows guests could not enter x2apic mode when he developed the APICv feature several years ago. However, that is no longer true: Interrupt Remapping and vIOMMU have been added to qemu, and developers from Intel recently tested that Windows 8 can work in x2apic mode w/ Interrupt Remapping enabled. This patch enables TPR shadow for virtual x2apic mode to boost Windows guests in x2apic mode even if w/o APICv. It passes the kvm-unit-test. Suggested-by: Radim Krčmář <rkrcmar@redhat.com> Suggested-by: Wincy Van <fanwenyi0529@gmail.com> Reviewed-by: Radim Krčmář <rkrcmar@redhat.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Radim Krčmář <rkrcmar@redhat.com> Cc: Wincy Van <fanwenyi0529@gmail.com> Cc: Yang Zhang <yang.zhang.wz@gmail.com> Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com> Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2016-09-23 | KVM: nVMX: Fix reload apic access page warning | Wanpeng Li
WARNING: CPU: 1 PID: 4230 at kernel/sched/core.c:7564 __might_sleep+0x7e/0x80 do not call blocking ops when !TASK_RUNNING; state=1 set at [<ffffffff8d0de7f9>] prepare_to_swait+0x39/0xa0 CPU: 1 PID: 4230 Comm: qemu-system-x86 Not tainted 4.8.0-rc5+ #47 Call Trace: dump_stack+0x99/0xd0 __warn+0xd1/0xf0 warn_slowpath_fmt+0x4f/0x60 ? prepare_to_swait+0x39/0xa0 ? prepare_to_swait+0x39/0xa0 __might_sleep+0x7e/0x80 __gfn_to_pfn_memslot+0x156/0x480 [kvm] gfn_to_pfn+0x2a/0x30 [kvm] gfn_to_page+0xe/0x20 [kvm] kvm_vcpu_reload_apic_access_page+0x32/0xa0 [kvm] nested_vmx_vmexit+0x765/0xca0 [kvm_intel] ? _raw_spin_unlock_irqrestore+0x36/0x80 vmx_check_nested_events+0x49/0x1f0 [kvm_intel] kvm_arch_vcpu_runnable+0x2d/0xe0 [kvm] kvm_vcpu_check_block+0x12/0x60 [kvm] kvm_vcpu_block+0x94/0x4c0 [kvm] kvm_arch_vcpu_ioctl_run+0x619/0x1aa0 [kvm] ? kvm_arch_vcpu_ioctl_run+0xdf1/0x1aa0 [kvm] kvm_vcpu_ioctl+0x2d3/0x7c0 [kvm] =============================== [ INFO: suspicious RCU usage. ] 4.8.0-rc5+ #47 Not tainted ------------------------------- ./include/linux/kvm_host.h:535 suspicious rcu_dereference_check() usage! other info that might help us debug this: rcu_scheduler_active = 1, debug_locks = 0 1 lock held by qemu-system-x86/4230: #0: (&vcpu->mutex){+.+.+.}, at: [<ffffffffc062975c>] vcpu_load+0x1c/0x60 [kvm] stack backtrace: CPU: 1 PID: 4230 Comm: qemu-system-x86 Not tainted 4.8.0-rc5+ #47 Call Trace: dump_stack+0x99/0xd0 lockdep_rcu_suspicious+0xe7/0x120 gfn_to_memslot+0x12a/0x140 [kvm] gfn_to_pfn+0x12/0x30 [kvm] gfn_to_page+0xe/0x20 [kvm] kvm_vcpu_reload_apic_access_page+0x32/0xa0 [kvm] nested_vmx_vmexit+0x765/0xca0 [kvm_intel] ? _raw_spin_unlock_irqrestore+0x36/0x80 vmx_check_nested_events+0x49/0x1f0 [kvm_intel] kvm_arch_vcpu_runnable+0x2d/0xe0 [kvm] kvm_vcpu_check_block+0x12/0x60 [kvm] kvm_vcpu_block+0x94/0x4c0 [kvm] kvm_arch_vcpu_ioctl_run+0x619/0x1aa0 [kvm] ? kvm_arch_vcpu_ioctl_run+0xdf1/0x1aa0 [kvm] kvm_vcpu_ioctl+0x2d3/0x7c0 [kvm] ? __fget+0xfd/0x210 ? 
__lock_is_held+0x54/0x70 do_vfs_ioctl+0x96/0x6a0 ? __fget+0x11c/0x210 ? __fget+0x5/0x210 SyS_ioctl+0x79/0x90 do_syscall_64+0x81/0x220 entry_SYSCALL64_slow_path+0x25/0x25 These can be triggered by running kvm-unit-test: ./x86-run x86/vmx.flat The nested preemption timer is based on an hrtimer which is started on L2 entry, stopped on L2 exit and evaluated via the new check_nested_events hook. The current logic adds the vCPU to a simple waitqueue (TASK_INTERRUPTIBLE) if it needs to yield the pCPU, and it does not hold the srcu read lock when it accesses memslots; both can happen in the nested preemption timer evaluation path, which results in the warnings above. This patch fixes it by using a request bit to reload the APIC access page asynchronously before vmentry, instead of reloading it directly during the nested preemption timer evaluation. This is safe since the vmcs01 is loaded and the current event is a nested vmexit. Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Radim Krčmář <rkrcmar@redhat.com> Cc: Yunhong Jiang <yunhong.jiang@intel.com> Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com> Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
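The request-bit approach can be sketched as a deferred-work flag. The names and the bit number below are hypothetical, and the "reload" is reduced to a counter; the sketch only shows that the restricted path records the request while the actual work runs later in a safe context.

```c
#include <stdbool.h>

#define DEMO_REQ_APIC_PAGE_RELOAD 0   /* hypothetical request bit number */

struct demo_req_vcpu {
	unsigned long requests;   /* one bit per pending request */
	int reloads;              /* counts how often the reload really ran */
};

static void demo_make_request(struct demo_req_vcpu *v, int req)
{
	v->requests |= 1UL << req;
}

/* Test and clear a pending request. */
static bool demo_check_request(struct demo_req_vcpu *v, int req)
{
	if (!(v->requests & (1UL << req)))
		return false;
	v->requests &= ~(1UL << req);
	return true;
}

/* Nested-vmexit path: may run where sleeping is not allowed,
 * so it only flags the work. */
static void demo_nested_vmexit(struct demo_req_vcpu *v)
{
	demo_make_request(v, DEMO_REQ_APIC_PAGE_RELOAD);
}

/* Just before vmentry: a context where the reload may block. */
static void demo_before_vmentry(struct demo_req_vcpu *v)
{
	if (demo_check_request(v, DEMO_REQ_APIC_PAGE_RELOAD))
		v->reloads++;   /* stand-in for the actual page reload */
}
```

Because the bit is cleared when it is consumed, the reload runs once per request no matter how many entries follow.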
2016-09-20 | kvm: svm: fix unsigned compare less than zero comparison | Colin Ian King
vm_data->avic_vm_id is a u32, so the check for an error return (less than zero) such as -EAGAIN from avic_get_next_vm_id currently has no effect whatsoever. Fix this by using a temporary int for the comparison and then assigning it to vm_data->avic_vm_id. I used an explicit u32 cast in the assignment to show why vm_data->avic_vm_id cannot be used in the assign/compare steps. Signed-off-by: Colin Ian King <colin.king@canonical.com> Acked-by: Joerg Roedel <jroedel@suse.de> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
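The fixed pattern can be sketched with stand-in types. `demo_vm`, `demo_get_next_vm_id()` and `demo_vm_init()` are hypothetical; the essential step is that the return value lands in a signed temporary before it is compared.

```c
#include <errno.h>
#include <stdint.h>

struct demo_vm {
	uint32_t avic_vm_id;   /* unsigned: can never compare below zero */
};

/* Hypothetical stand-in for the ID allocator: negative errno on failure. */
static int demo_get_next_vm_id(int exhausted)
{
	return exhausted ? -EAGAIN : 1;
}

static int demo_vm_init(struct demo_vm *vm, int exhausted)
{
	int id = demo_get_next_vm_id(exhausted);   /* signed temporary */

	if (id < 0)
		return id;                 /* the error is actually seen here */
	vm->avic_vm_id = (uint32_t)id;     /* explicit cast documents the intent */
	return 0;
}
```

Storing the result straight into the `uint32_t` field would turn `-EAGAIN` into a large positive number, and a later `< 0` test on that field would always be false.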
2016-09-20 | KVM: x86: Hyper-V tsc page setup | Paolo Bonzini
The TSC page was recently implemented but filled with empty values. This patch sets up the TSC page scale and offset based on the vcpu tsc, tsc_khz and the HV_X64_MSR_TIME_REF_COUNT value. A valid TSC page drops the HV_X64_MSR_TIME_REF_COUNT MSR read count to zero, which potentially improves performance. Signed-off-by: Andrey Smetanin <asmetanin@virtuozzo.com> Reviewed-by: Peter Hornyack <peterhornyack@google.com> Reviewed-by: Radim Krčmář <rkrcmar@redhat.com> CC: Paolo Bonzini <pbonzini@redhat.com> CC: Roman Kagan <rkagan@virtuozzo.com> CC: Denis V. Lunev <den@openvz.org> [Computation of TSC page parameters rewritten to use the Linux timekeeper parameters. - Paolo] Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
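For orientation, the guest-side use of the scale and offset can be sketched as below. This is not the computation from the patch (which derives the parameters from the Linux timekeeper); it assumes the formula documented for the Hyper-V reference TSC page, reference time = ((tsc * scale) >> 64) + offset in 100 ns units, and uses the `unsigned __int128` GCC/Clang extension. The helper names are hypothetical.

```c
#include <stdint.h>

/* Scale for a TSC running at tsc_khz: 2^64 * 10^4 / tsc_khz, i.e. the
 * number of 100 ns units per TSC tick as a 0.64 fixed-point fraction.
 * Only valid for tsc_khz > 10000, otherwise it does not fit in 64 bits. */
static uint64_t demo_tsc_scale(uint64_t tsc_khz)
{
	return (uint64_t)((((unsigned __int128)10000) << 64) / tsc_khz);
}

/* Reference time in 100 ns units from a raw TSC reading. */
static uint64_t demo_ref_time(uint64_t tsc, uint64_t scale, uint64_t offset)
{
	return (uint64_t)(((unsigned __int128)tsc * scale) >> 64) + offset;
}
```

Since the guest only needs a multiply, a shift and an add on values it reads from shared memory, no MSR access (and hence no vmexit) is required per reading.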
2016-09-20 | KVM: x86: introduce get_kvmclock_ns | Paolo Bonzini
Introduce a function that reads the exact nanoseconds value that is provided to the guest in kvmclock. This crystallizes the notion of kvmclock as a thin veneer over a stable TSC, that the guest will (hopefully) convert with NTP. In other words, kvmclock is *not* a paravirtualized host-to-guest NTP. Drop the get_kernel_ns() function, that was used both to get the base value of the master clock and to get the current value of kvmclock. The former use is replaced by ktime_get_boot_ns(), the latter is the purpose of get_kvmclock_ns(). This also allows KVM to provide a Hyper-V time reference counter that is synchronized with the time that is computed from the TSC page. Reviewed-by: Roman Kagan <rkagan@virtuozzo.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-09-20 | KVM: x86: initialize kvmclock_offset | Paolo Bonzini
Make the guest's kvmclock count up from zero, not from the host boot time. The guest cannot rely on that anyway because it changes on migration, the numbers are easier on the eye and finally it matches the desired semantics of the Hyper-V time reference counter. Reviewed-by: Roman Kagan <rkagan@virtuozzo.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-09-20 | KVM: x86: always fill in vcpu->arch.hv_clock | Paolo Bonzini
We will use it in the next patches for KVM_GET_CLOCK and as a basis for the contents of the Hyper-V TSC page. Get the values from the Linux timekeeper even if kvmclock is not enabled. Reviewed-by: Roman Kagan <rkagan@virtuozzo.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-09-20 | Merge branch 'linus' into x86/asm, to pick up fixes | Ingo Molnar
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-09-18 | Merge branch 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip | Linus Torvalds
Pull perf fixes from Thomas Gleixner:
"A couple of small fixes to x86 perf drivers:
  - Measure L2 for HW_CACHE* events on AMD
  - Fix the address filter handling in the intel/pt driver
  - Handle the BTS disabling at the proper place"
* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  perf/x86/amd: Make HW_CACHE_REFERENCES and HW_CACHE_MISSES measure L2
  perf/x86/intel/pt: Do validate the size of a kernel address filter
  perf/x86/intel/pt: Fix kernel address filter's offset validation
  perf/x86/intel/pt: Fix an off-by-one in address filter configuration
  perf/x86/intel: Don't disable "intel_bts" around "intel" event batching
2016-09-16 | kvm: x86: export TSC information to user-space | Luiz Capitulino
This commit exports the following information to user-space via the newly created per-vcpu debugfs directory:
  - TSC offset (as a signed number)
  - TSC scaling ratio
  - TSC scaling ratio fractional bits
The original intention of this commit was to export only the TSC offset, but the TSC scaling information is exported for completeness. We need to retrieve the TSC offset from user-space in order to support the merging of host and guest traces in trace-cmd. Today, we use the kvm_write_tsc_offset tracepoint, but it has a number of problems (mainly, it requires a running VM to be rebooted, ftrace setup, and also tracepoints are not supposed to be ABIs). The merging of host and guest traces is explained in more detail in this thread:
  [Qemu-devel] [RFC] host and guest kernel trace merging
  https://lists.nongnu.org/archive/html/qemu-devel/2016-03/msg00887.html
This commit creates the following files in debugfs:
  /sys/kernel/debug/kvm/66828-10/vcpu0/tsc-offset
  /sys/kernel/debug/kvm/66828-10/vcpu0/tsc-scaling-ratio
  /sys/kernel/debug/kvm/66828-10/vcpu0/tsc-scaling-ratio-frac-bits
The last two are only created if TSC scaling is supported.
Signed-off-by: Luiz Capitulino <lcapitulino@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
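A user-space reader for the exported offset might look like the sketch below. The helper names are hypothetical, the path pattern is copied from the example paths in the commit message, and it is an assumption that the file holds a single signed decimal number.

```c
#include <stdio.h>
#include <string.h>

/* Build the path of a vCPU's tsc-offset file. `vm_dir` is the per-VM
 * directory name as it appears under /sys/kernel/debug/kvm/. */
static int demo_tsc_offset_path(char *buf, size_t len,
				const char *vm_dir, int vcpu)
{
	return snprintf(buf, len, "/sys/kernel/debug/kvm/%s/vcpu%d/tsc-offset",
			vm_dir, vcpu);
}

/* Read the offset as a signed number. Returns 0 on success, -1 if the
 * file cannot be opened or parsed (e.g. debugfs is not mounted). */
static int demo_read_tsc_offset(const char *path, long long *offset)
{
	FILE *f = fopen(path, "r");
	int ok;

	if (!f)
		return -1;
	ok = fscanf(f, "%lld", offset) == 1;
	fclose(f);
	return ok ? 0 : -1;
}
```

A trace-merging tool could read this file once per vCPU and subtract the offset from guest timestamps, without needing a tracepoint or a VM reboot.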
2016-09-16 | kvm: add stubs for arch specific debugfs support | Luiz Capitulino
Two stubs are added: o kvm_arch_has_vcpu_debugfs(): must return true if the arch supports creating debugfs entries in the vcpu debugfs dir (which will be implemented by the next commit) o kvm_arch_create_vcpu_debugfs(): code that creates debugfs entries in the vcpu debugfs dir For x86, this commit introduces a new file to avoid growing arch/x86/kvm/x86.c even more. Signed-off-by: Luiz Capitulino <lcapitulino@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-09-16 | kvm: x86: drop read_tsc_offset() | Luiz Capitulino
The TSC offset can now be read directly from struct kvm_vcpu_arch. Signed-off-by: Luiz Capitulino <lcapitulino@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-09-16 | kvm: x86: add tsc_offset field to struct kvm_vcpu_arch | Luiz Capitulino
A future commit will want to easily read a vCPU's TSC offset, so we store it in struct kvm_vcpu_arch for easy access. Signed-off-by: Luiz Capitulino <lcapitulino@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-09-16 | perf/x86/amd: Make HW_CACHE_REFERENCES and HW_CACHE_MISSES measure L2 | Matt Fleming
While the Intel PMU monitors the LLC when perf enables the HW_CACHE_REFERENCES and HW_CACHE_MISSES events, these events monitor L1 instruction cache fetches (0x0080) and instruction cache misses (0x0081) on the AMD PMU. This is extremely confusing when monitoring the same workload across Intel and AMD machines, since parameters like, $ perf stat -e cache-references,cache-misses measure completely different things. Instead, make the AMD PMU measure instruction/data cache and TLB fill requests to the L2 and instruction/data cache and TLB misses in the L2 when HW_CACHE_REFERENCES and HW_CACHE_MISSES are enabled, respectively. That way the events measure unified caches on both platforms. Signed-off-by: Matt Fleming <matt@codeblueprint.co.uk> Acked-by: Peter Zijlstra <peterz@infradead.org> Cc: <stable@vger.kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/1472044328-21302-1-git-send-email-matt@codeblueprint.co.uk Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-09-15 | kvm: x86: correctly reset dest_map->vector when restoring LAPIC state | Paolo Bonzini
When userspace sends KVM_SET_LAPIC, KVM schedules a check between the vCPU's IRR and ISR and the IOAPIC redirection table, in order to re-establish the IOAPIC's dest_map (the list of CPUs servicing the real-time clock interrupt with the corresponding vectors). However, __rtc_irq_eoi_tracking_restore_one was forgetting to set dest_map->vectors. Because of this, the IOAPIC did not process the real-time clock interrupt EOI, ioapic->rtc_status.pending_eoi got stuck at a non-zero value, and further RTC interrupts were reported to userspace as coalesced. Fixes: 9e4aabe2bb3454c83dac8139cf9974503ee044db Fixes: 4d99ba898dd0c521ca6cdfdde55c9b58aea3cb3d Cc: stable@vger.kernel.org Cc: Joerg Roedel <jroedel@suse.de> Cc: David Gilbert <dgilbert@redhat.com> Reviewed-by: Radim Krčmář <rkrcmar@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-09-15Merge branch 'linus' into x86/asm, to pick up recent fixesIngo Molnar
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-09-08svm: Implements update_pi_irte hook to setup posted interruptSuravee Suthikulpanit
This patch implements the update_pi_irte function hook to allow SVM to communicate with the IOMMU driver regarding how to set up the IRTE for handling posted interrupts. When AVIC is enabled, SVM needs to update the IOMMU IRTE with the appropriate host physical APIC ID during vcpu_load/unload. Also, on vcpu_blocking/unblocking, SVM needs to update the is-running bit in the IOMMU IRTE. Both are achieved by calling amd_iommu_update_ga(). However, if GA mode is not enabled for the pass-through device, the IOMMU driver simply returns from amd_iommu_update_ga(). Signed-off-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com> Reviewed-by: Radim Krčmář <rkrcmar@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-09-08svm: Introduce AMD IOMMU avic_ga_log_notifierSuravee Suthikulpanit
This patch introduces avic_ga_log_notifier, which will be called by IOMMU driver whenever it handles the Guest vAPIC (GA) log entry. Reviewed-by: Radim Krčmář <rkrcmar@redhat.com> Signed-off-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-09-08svm: Introduces AVIC per-VM IDSuravee Suthikulpanit
Introduces a per-VM AVIC ID and helper functions to manage the IDs. Currently, the ID is used to implement the 32-bit AVIC IOMMU GA tag. The ID is a 24-bit, one-based index value, and is managed via helper functions that get the next ID or free an ID once a VM is destroyed. There should be no ID conflict among active VMs. Reviewed-by: Radim Krčmář <rkrcmar@redhat.com> Signed-off-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
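A minimal user-space sketch of such an ID allocator, for illustration only. The names `vm_id_get`, `vm_id_free` and `ga_tag` are hypothetical, the bitmap scan is just one possible strategy, and the tag layout (24-bit VM ID in the upper bits, the remaining 8 bits holding a vCPU ID) is an assumption derived from the 24-bit and 32-bit widths stated above, not something the commit message spells out.

```c
#include <stdbool.h>
#include <stdint.h>

#define VM_ID_BITS   24
#define VM_ID_MAX    ((1u << VM_ID_BITS) - 1) /* largest valid ID */
#define VCPU_ID_BITS 8                        /* assumption: 32 - 24 */
#define VCPU_ID_MASK ((1u << VCPU_ID_BITS) - 1)

/* One bit per possible ID; bit 0 is never handed out (IDs are one-based). */
static uint64_t id_bitmap[(VM_ID_MAX + 1) / 64];

static bool id_in_use(uint32_t id)
{
    return (id_bitmap[id / 64] >> (id % 64)) & 1;
}

/* Return the lowest free one-based ID, or 0 if every ID is taken. */
uint32_t vm_id_get(void)
{
    for (uint32_t id = 1; id <= VM_ID_MAX; id++) {
        if (!id_in_use(id)) {
            id_bitmap[id / 64] |= 1ULL << (id % 64);
            return id;
        }
    }
    return 0;
}

/* Release an ID when its VM is destroyed; out-of-range IDs are ignored. */
void vm_id_free(uint32_t id)
{
    if (id >= 1 && id <= VM_ID_MAX)
        id_bitmap[id / 64] &= ~(1ULL << (id % 64));
}

/* Pack a VM ID and a vCPU ID into a 32-bit tag (assumed layout). */
uint32_t ga_tag(uint32_t vm_id, uint32_t vcpu_id)
{
    return ((vm_id & VM_ID_MAX) << VCPU_ID_BITS) | (vcpu_id & VCPU_ID_MASK);
}
```

Reserving 0 as "no ID" is what makes the one-based scheme convenient: a zero return can signal exhaustion without a separate error code. A real implementation would also need locking around the bitmap.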
2016-09-07KVM: nVMX: expose INS/OUTS information supportJan Dakinevich
Expose the feature to the L1 hypervisor if the host CPU supports it, since certain hypervisors require it for their own purposes. According to Intel SDM A.1, if the CPU supports the feature, the VMX_INSTRUCTION_INFO field of the VMCS will contain detailed information about the handling of INS/OUTS instructions. This field is already copied to VMCS12 for the L1 hypervisor (see the prepare_vmcs12 routine) independently of the feature's presence. Signed-off-by: Jan Dakinevich <jan.dakinevich@gmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
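As a rough illustration of the capability check involved: Intel SDM Appendix A.1 defines bit 54 of the IA32_VMX_BASIC MSR as the indicator that INS/OUTS information is reported in the VM-exit instruction-information field. The sketch below tests that bit on an MSR value the caller has already read. The macro and function names are hypothetical, and the code makes no claim about how the patch itself implements the check.

```c
#include <stdbool.h>
#include <stdint.h>

/* Bit 54 of IA32_VMX_BASIC: INS/OUTS instruction information is reported. */
#define VMX_BASIC_INOUT_BIT 54

/* vmx_basic is the raw 64-bit MSR value, obtained elsewhere (e.g. rdmsr). */
bool vmx_basic_reports_ins_outs(uint64_t vmx_basic)
{
    return (vmx_basic >> VMX_BASIC_INOUT_BIT) & 1;
}
```

A nested hypervisor setup would typically only advertise this bit to L1 when the same check succeeds on the host value.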
2016-09-07KVM: VMX: not use vmcs_config in setup_vmcs_configPaolo Bonzini
setup_vmcs_config takes a pointer to the vmcs_config global. The indirection is somewhat pointless, but just keep things consistent for now. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-09-07KVM: x86: remove stale commentsPaolo Bonzini
handle_external_intr does not enable interrupts anymore, vcpu_enter_guest does it after calling guest_exit_irqoff. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-09-07KVM: x86: ratelimit and decrease severity for guest-triggered printkPaolo Bonzini
These are mostly related to nested VMX. They needn't have a loglevel as high as KERN_WARNING, and mustn't be allowed to pollute the host logs. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>