summaryrefslogtreecommitdiff
path: root/arch/x86/kvm/svm.c
AgeCommit message (Collapse)Author
2015-06-04KVM: x86: pass host_initiated to functions that read MSRsPaolo Bonzini
SMBASE is only readable from SMM for the VCPU, but it must be always accessible if userspace is accessing it. Thus, all functions that read MSRs are changed to accept a struct msr_data; the host_initiated and index fields are pre-initialized, while the data field is filled on return. Reviewed-by: Radim Krčmář <rkrcmar@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-05-20Merge branch 'kvm-master' into kvm-nextPaolo Bonzini
Grab MPX bugfix, and fix conflicts against Rik's adaptive FPU deactivation patch. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-05-20Revert "KVM: x86: drop fpu_activate hook"Paolo Bonzini
This reverts commit 4473b570a7ebb502f63f292ccfba7df622e5fdd3. We'll use the hook again. Cc: stable@vger.kernel.org Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-05-13tracing: Rename ftrace_event.h to trace_events.hSteven Rostedt (Red Hat)
The term "ftrace" is really the infrastructure of the function hooks, and not the trace events. Rename ftrace_event.h to trace_events.h to represent the trace_event infrastructure and decouple the term ftrace from it. Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2015-05-07KVM: x86: fix initial PAT valueRadim Krčmář
PAT should be 0007_0406_0007_0406h on RESET and not modified on INIT. VMX used a wrong value (host's PAT) and while SVM used the right one, it never got to arch.pat. This is not an issue with QEMU as it will force the correct value. Signed-off-by: Radim Krčmář <rkrcmar@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-05-07KVM: x86: INIT and reset sequences are differentNadav Amit
x86 architecture defines differences between the reset and INIT sequences. INIT does not initialize the FPU (including MMX, XMM, YMM, etc.), TSC, PMU, MSRs (in general), MTRRs machine-check, APIC ID, APIC arbitration ID and BSP. References (from Intel SDM): "If the MP protocol has completed and a BSP is chosen, subsequent INITs (either to a specific processor or system wide) do not cause the MP protocol to be repeated." [8.4.2: MP Initialization Protocol Requirements and Restrictions] [Table 9-1. IA-32 Processor States Following Power-up, Reset, or INIT] "If the processor is reset by asserting the INIT# pin, the x87 FPU state is not changed." [9.2: X87 FPU INITIALIZATION] "The state of the local APIC following an INIT reset is the same as it is after a power-up or hardware reset, except that the APIC ID and arbitration ID registers are not affected." [10.4.7.3: Local APIC State After an INIT Reset ("Wait-for-SIPI" State)] Signed-off-by: Nadav Amit <namit@cs.technion.ac.il> Message-Id: <1428924848-28212-1-git-send-email-namit@cs.technion.ac.il> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-05-07KVM: x86: Support for disabling quirksNadav Amit
Introducing KVM_CAP_DISABLE_QUIRKS for disabling x86 quirks that were previous created in order to overcome QEMU issues. Those issue were mostly result of invalid VM BIOS. Currently there are two quirks that can be disabled: 1. KVM_QUIRK_LINT0_REENABLED - LINT0 was enabled after boot 2. KVM_QUIRK_CD_NW_CLEARED - CD and NW are cleared after boot These two issues are already resolved in recent releases of QEMU, and would therefore be disabled by QEMU. Signed-off-by: Nadav Amit <namit@cs.technion.ac.il> Message-Id: <1428879221-29996-1-git-send-email-namit@cs.technion.ac.il> [Report capability from KVM_CHECK_EXTENSION too. - Paolo] Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-04-08KVM: x86: BSP in MSR_IA32_APICBASE is writableNadav Amit
After reset, the CPU can change the BSP, which will be used upon INIT. Reset should return the BSP which QEMU asked for, and therefore handled accordingly. To quote: "If the MP protocol has completed and a BSP is chosen, subsequent INITs (either to a specific processor or system wide) do not cause the MP protocol to be repeated." [Intel SDM 8.4.2: MP Initialization Protocol Requirements and Restrictions] Signed-off-by: Nadav Amit <namit@cs.technion.ac.il> Message-Id: <1427933438-12782-3-git-send-email-namit@cs.technion.ac.il> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-03-18KVM: SVM: Fix confusing message if no exit handlers are installedBandan Das
I hit this path on a AMD box and thought someone was playing a April Fool's joke on me. Signed-off-by: Bandan Das <bsd@redhat.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2015-03-17KVM: x86: For the symbols used locally only should be static typeXiubo Li
This patch fix the following sparse warnings: for arch/x86/kvm/x86.c: warning: symbol 'emulator_read_write' was not declared. Should it be static? warning: symbol 'emulator_write_emulated' was not declared. Should it be static? warning: symbol 'emulator_get_dr' was not declared. Should it be static? warning: symbol 'emulator_set_dr' was not declared. Should it be static? for arch/x86/kvm/pmu.c: warning: symbol 'fixed_pmc_events' was not declared. Should it be static? Signed-off-by: Xiubo Li <lixiubo@cmss.chinamobile.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2015-03-13x86: svm: use cr_interception for SVM_EXIT_CR0_SEL_WRITEDavid Kaplan
Another patch in my war on emulate_on_interception() use as a svm exit handler. These were pulled out of a larger patch at the suggestion of Radim Krcmar, see https://lkml.org/lkml/2015/2/25/559 Changes since v1: * fixed typo introduced after test, retested Signed-off-by: David Kaplan <david.kaplan@amd.com> [separated out just cr_interception part from larger removal of INTERCEPT_CR0_WRITE, forward ported, tested] Signed-off-by: Joel Schopp <joel.schopp@amd.com> Reviewed-by: Radim Krčmář <rkrcmar@redhat.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2015-03-10kvm: svm: make wbinvd fasterDavid Kaplan
No need to re-decode WBINVD since we know what it is from the intercept. Signed-off-by: David Kaplan <David.Kaplan@amd.com> [extracted from larger unlrelated patch, forward ported, tested,style cleanup] Signed-off-by: Joel Schopp <joel.schopp@amd.com> Reviewed-by: Radim Krčmář <rkrcmar@redhat.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2015-03-10kvm: x86: make kvm_emulate_* consistantJoel Schopp
Currently kvm_emulate() skips the instruction but kvm_emulate_* sometimes don't. The end reult is the caller ends up doing the skip themselves. Let's make them consistant. Signed-off-by: Joel Schopp <joel.schopp@amd.com> Reviewed-by: Radim Krčmář <rkrcmar@redhat.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2015-03-10KVM: SVM: use kvm_register_write()/read()David Kaplan
KVM has nice wrappers to access the register values, clean up a few places that should use them but currently do not. Signed-off-by: David Kaplan <david.kaplan@amd.com> [forward port and testing] Signed-off-by: Joel Schopp <joel.schopp@amd.com> Acked-by: Borislav Petkov <bp@suse.de> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2015-03-02KVM: SVM: fix interrupt injection (apic->isr_count always 0)Radim Krčmář
In commit b4eef9b36db4, we started to use hwapic_isr_update() != NULL instead of kvm_apic_vid_enabled(vcpu->kvm). This didn't work because SVM had it defined and "apicv" path in apic_{set,clear}_isr() does not change apic->isr_count, because it should always be 1. The initial value of apic->isr_count was based on kvm_apic_vid_enabled(vcpu->kvm), which is always 0 for SVM, so KVM could have injected interrupts when it shouldn't. Fix it by implicitly setting SVM's hwapic_isr_update to NULL and make the initial isr_count depend on hwapic_isr_update() for good measure. Fixes: b4eef9b36db4 ("kvm: x86: vmx: NULL out hwapic_isr_update() in case of !enable_apicv") Reported-and-tested-by: Borislav Petkov <bp@suse.de> Signed-off-by: Radim Krčmář <rkrcmar@redhat.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2015-02-16Merge branch 'perf-core-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 perf updates from Ingo Molnar: "This series tightens up RDPMC permissions: currently even highly sandboxed x86 execution environments (such as seccomp) have permission to execute RDPMC, which may leak various perf events / PMU state such as timing information and other CPU execution details. This 'all is allowed' RDPMC mode is still preserved as the (non-default) /sys/devices/cpu/rdpmc=2 setting. The new default is that RDPMC access is only allowed if a perf event is mmap-ed (which is needed to correctly interpret RDPMC counter values in any case). As a side effect of these changes CR4 handling is cleaned up in the x86 code and a shadow copy of the CR4 value is added. The extra CR4 manipulation adds ~ <50ns to the context switch cost between rdpmc-capable and rdpmc-non-capable mms" * 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: perf/x86: Add /sys/devices/cpu/rdpmc=2 to allow rdpmc for all tasks perf/x86: Only allow rdpmc if a perf_event is mapped perf: Pass the event to arch_perf_update_userpage() perf: Add pmu callbacks to track event mapping and unmapping x86: Add a comment clarifying LDT context switching x86: Store a per-cpu shadow copy of CR4 x86: Clean up cr4 manipulation
2015-02-04x86: Store a per-cpu shadow copy of CR4Andy Lutomirski
Context switches and TLB flushes can change individual bits of CR4. CR4 reads take several cycles, so store a shadow copy of CR4 in a per-cpu variable. To avoid wasting a cache line, I added the CR4 shadow to cpu_tlbstate, which is already touched in switch_mm. The heaviest users of the cr4 shadow will be switch_mm and __switch_to_xtra, and __switch_to_xtra is called shortly after switch_mm during context switch, so the cacheline is likely to be hot. Signed-off-by: Andy Lutomirski <luto@amacapital.net> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Kees Cook <keescook@chromium.org> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Vince Weaver <vince@deater.net> Cc: "hillf.zj" <hillf.zj@alibaba-inc.com> Cc: Valdis Kletnieks <Valdis.Kletnieks@vt.edu> Cc: Paul Mackerras <paulus@samba.org> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/r/3a54dd3353fffbf84804398e00dfdc5b7c1afd7d.1414190806.git.luto@amacapital.net Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-01-08KVM: x86: mmu: remove argument to kvm_init_shadow_mmu and ↵Paolo Bonzini
kvm_init_shadow_ept_mmu The initialization function in mmu.c can always use walk_mmu, which is known to be vcpu->arch.mmu. Only init_kvm_nested_mmu is used to initialize vcpu->arch.nested_mmu. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-12-05kvm: x86: Add kvm_x86_ops hook that enables XSAVES for guestWanpeng Li
Expose the XSAVES feature to the guest if the kvm_x86_ops say it is available. Signed-off-by: Wanpeng Li <wanpeng.li@linux.intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-11-13kvm: svm: move WARN_ON in svm_adjust_tsc_offsetChris J Arges
When running the tsc_adjust kvm-unit-test on an AMD processor with the IA32_TSC_ADJUST feature enabled, the WARN_ON in svm_adjust_tsc_offset can be triggered. This WARN_ON checks for a negative adjustment in case __scale_tsc is called; however it may trigger unnecessary warnings. This patch moves the WARN_ON to trigger only if __scale_tsc will actually be called from svm_adjust_tsc_offset. In addition make adj in kvm_set_msr_common s64 since this can have signed values. Signed-off-by: Chris J Arges <chris.j.arges@canonical.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-11-03KVM: vmx: Unavailable DR4/5 is checked before CPLNadav Amit
If DR4/5 is accessed when it is unavailable (since CR4.DE is set), then #UD should be generated even if CPL>0. This is according to Intel SDM Table 6-2: "Priority Among Simultaneous Exceptions and Interrupts". Note, that this may happen on the first DR access, even if the host does not sets debug breakpoints. Obviously, it occurs when the host debugs the guest. This patch moves the DR4/5 checks from __kvm_set_dr/_kvm_get_dr to handle_dr. The emulator already checks DR4/5 availability in check_dr_read. Nested virutalization related calls to kvm_set_dr/kvm_get_dr would not like to inject exceptions to the guest. As for SVM, the patch follows the previous logic as much as possible. Anyhow, it appears the DR interception code might be buggy - even if the DR access may cause an exception, the instruction is skipped. Signed-off-by: Nadav Amit <namit@cs.technion.ac.il> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-10-24kvm: x86: don't kill guest on unknown exit reasonMichael S. Tsirkin
KVM_EXIT_UNKNOWN is a kvm bug, we don't really know whether it was triggered by a priveledged application. Let's not kill the guest: WARN and inject #UD instead. Cc: stable@vger.kernel.org Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-10-24KVM: x86: Check non-canonical addresses upon WRMSRNadav Amit
Upon WRMSR, the CPU should inject #GP if a non-canonical value (address) is written to certain MSRs. The behavior is "almost" identical for AMD and Intel (ignoring MSRs that are not implemented in either architecture since they would anyhow #GP). However, IA32_SYSENTER_ESP and IA32_SYSENTER_EIP cause #GP if non-canonical address is written on Intel but not on AMD (which ignores the top 32-bits). Accordingly, this patch injects a #GP on the MSRs which behave identically on Intel and AMD. To eliminate the differences between the architecutres, the value which is written to IA32_SYSENTER_ESP and IA32_SYSENTER_EIP is turned to canonical value before writing instead of injecting a #GP. Some references from Intel and AMD manuals: According to Intel SDM description of WRMSR instruction #GP is expected on WRMSR "If the source register contains a non-canonical address and ECX specifies one of the following MSRs: IA32_DS_AREA, IA32_FS_BASE, IA32_GS_BASE, IA32_KERNEL_GS_BASE, IA32_LSTAR, IA32_SYSENTER_EIP, IA32_SYSENTER_ESP." According to AMD manual instruction manual: LSTAR/CSTAR (SYSCALL): "The WRMSR instruction loads the target RIP into the LSTAR and CSTAR registers. If an RIP written by WRMSR is not in canonical form, a general-protection exception (#GP) occurs." IA32_GS_BASE and IA32_FS_BASE (WRFSBASE/WRGSBASE): "The address written to the base field must be in canonical form or a #GP fault will occur." IA32_KERNEL_GS_BASE (SWAPGS): "The address stored in the KernelGSbase MSR must be in canonical form." This patch fixes CVE-2014-3610. Cc: stable@vger.kernel.org Signed-off-by: Nadav Amit <namit@cs.technion.ac.il> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-10-15Merge branch 'for-3.18-consistent-ops' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu Pull percpu consistent-ops changes from Tejun Heo: "Way back, before the current percpu allocator was implemented, static and dynamic percpu memory areas were allocated and handled separately and had their own accessors. The distinction has been gone for many years now; however, the now duplicate two sets of accessors remained with the pointer based ones - this_cpu_*() - evolving various other operations over time. During the process, we also accumulated other inconsistent operations. This pull request contains Christoph's patches to clean up the duplicate accessor situation. __get_cpu_var() uses are replaced with with this_cpu_ptr() and __this_cpu_ptr() with raw_cpu_ptr(). Unfortunately, the former sometimes is tricky thanks to C being a bit messy with the distinction between lvalues and pointers, which led to a rather ugly solution for cpumask_var_t involving the introduction of this_cpu_cpumask_var_ptr(). This converts most of the uses but not all. Christoph will follow up with the remaining conversions in this merge window and hopefully remove the obsolete accessors" * 'for-3.18-consistent-ops' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu: (38 commits) irqchip: Properly fetch the per cpu offset percpu: Resolve ambiguities in __get_cpu_var/cpumask_var_t -fix ia64: sn_nodepda cannot be assigned to after this_cpu conversion. Use __this_cpu_write. percpu: Resolve ambiguities in __get_cpu_var/cpumask_var_t Revert "powerpc: Replace __get_cpu_var uses" percpu: Remove __this_cpu_ptr clocksource: Replace __this_cpu_ptr with raw_cpu_ptr sparc: Replace __get_cpu_var uses avr32: Replace __get_cpu_var with __this_cpu_write blackfin: Replace __get_cpu_var uses tile: Use this_cpu_ptr() for hardware counters tile: Replace __get_cpu_var uses powerpc: Replace __get_cpu_var uses alpha: Replace __get_cpu_var ia64: Replace __get_cpu_var uses s390: cio driver &__get_cpu_var replacements s390: Replace __get_cpu_var uses mips: Replace __get_cpu_var uses MIPS: Replace __get_cpu_var uses in FPU emulator. arm: Replace __this_cpu_ptr with raw_cpu_ptr ...
2014-09-11kvm: Use APIC_DEFAULT_PHYS_BASE macro as the apic access page address.Tang Chen
We have APIC_DEFAULT_PHYS_BASE defined as 0xfee00000, which is also the address of apic access page. So use this macro. Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com> Reviewed-by: Gleb Natapov <gleb@kernel.org> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-09-03KVM: nSVM: propagate the NPF EXITINFO to the guestPaolo Bonzini
This is similar to what the EPT code does with the exit qualification. This allows the guest to see a valid value for bits 33:32. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-08-29KVM: remove garbage arg to *hardware_{en,dis}ableRadim Krčmář
In the beggining was on_each_cpu(), which required an unused argument to kvm_arch_ops.hardware_{en,dis}able, but this was soon forgotten. Remove unnecessary arguments that stem from this. Signed-off-by: Radim Krčmář <rkrcmar@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-08-29KVM: x86: fix some sparse warningsPaolo Bonzini
Sparse reports the following easily fixed warnings: arch/x86/kvm/vmx.c:8795:48: sparse: Using plain integer as NULL pointer arch/x86/kvm/vmx.c:2138:5: sparse: symbol vmx_read_l1_tsc was not declared. Should it be static? arch/x86/kvm/vmx.c:6151:48: sparse: Using plain integer as NULL pointer arch/x86/kvm/vmx.c:8851:6: sparse: symbol vmx_sched_in was not declared. Should it be static? arch/x86/kvm/svm.c:2162:5: sparse: symbol svm_read_l1_tsc was not declared. Should it be static? Cc: Radim Krčmář <rkrcmar@redhat.com> Signed-off-by: Fengguang Wu <fengguang.wu@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-08-26x86: Replace __get_cpu_var usesChristoph Lameter
__get_cpu_var() is used for multiple purposes in the kernel source. One of them is address calculation via the form &__get_cpu_var(x). This calculates the address for the instance of the percpu variable of the current processor based on an offset. Other use cases are for storing and retrieving data from the current processors percpu area. __get_cpu_var() can be used as an lvalue when writing data or on the right side of an assignment. __get_cpu_var() is defined as : #define __get_cpu_var(var) (*this_cpu_ptr(&(var))) __get_cpu_var() always only does an address determination. However, store and retrieve operations could use a segment prefix (or global register on other platforms) to avoid the address calculation. this_cpu_write() and this_cpu_read() can directly take an offset into a percpu area and use optimized assembly code to read and write per cpu variables. This patch converts __get_cpu_var into either an explicit address calculation using this_cpu_ptr() or into a use of this_cpu operations that use the offset. Thereby address calculations are avoided and less registers are used when code is generated. Transformations done to __get_cpu_var() 1. Determine the address of the percpu instance of the current processor. DEFINE_PER_CPU(int, y); int *x = &__get_cpu_var(y); Converts to int *x = this_cpu_ptr(&y); 2. Same as #1 but this time an array structure is involved. DEFINE_PER_CPU(int, y[20]); int *x = __get_cpu_var(y); Converts to int *x = this_cpu_ptr(y); 3. Retrieve the content of the current processors instance of a per cpu variable. DEFINE_PER_CPU(int, y); int x = __get_cpu_var(y) Converts to int x = __this_cpu_read(y); 4. Retrieve the content of a percpu struct DEFINE_PER_CPU(struct mystruct, y); struct mystruct x = __get_cpu_var(y); Converts to memcpy(&x, this_cpu_ptr(&y), sizeof(x)); 5. Assignment to a per cpu variable DEFINE_PER_CPU(int, y) __get_cpu_var(y) = x; Converts to __this_cpu_write(y, x); 6. Increment/Decrement etc of a per cpu variable DEFINE_PER_CPU(int, y); __get_cpu_var(y)++ Converts to __this_cpu_inc(y) Cc: Thomas Gleixner <tglx@linutronix.de> Cc: x86@kernel.org Acked-by: H. Peter Anvin <hpa@linux.intel.com> Acked-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Christoph Lameter <cl@linux.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2014-08-21KVM: x86: introduce sched_in to kvm_x86_opsRadim Krčmář
sched_in preempt notifier is available for x86, allow its use in specific virtualization technlogies as well. Signed-off-by: Radim Krčmář <rkrcmar@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-08-19KVM: x86: drop fpu_activate hookWanpeng Li
The only user of the fpu_activate hook was dropped in commit 2d04a05bd7e9 (KVM: x86 emulator: emulate CLTS internally, 2011-04-20). vmx_fpu_activate and svm_fpu_activate are still called on #NM (and for Intel CLTS), but never from common code; hence, there's no need for a hook. Reviewed-by: Yang Zhang <yang.z.zhang@intel.com> Signed-off-by: Wanpeng Li <wanpeng.li@linux.intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-07-11KVM: x86: return all bits from get_interrupt_shadowPaolo Bonzini
For the next patch we will need to know the full state of the interrupt shadow; we will then set KVM_REQ_EVENT when one bit is cleared. However, right now get_interrupt_shadow only returns the one corresponding to the emulated instruction, or an unconditional 0 if the emulated instruction does not have an interrupt shadow. This is confusing and does not allow us to check for cleared bits as mentioned above. Clean the callback up, and modify toggle_interruptibility to match the comment above the call. As a small result, the call to set_interrupt_shadow will be skipped in the common case where int_shadow == 0 && mask == 0. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-07-11KVM: Synthesize G bit for all segments.Jim Mattson
We have noticed that qemu-kvm hangs early in the BIOS when runnning nested under some versions of VMware ESXi. The problem we believe is because KVM assumes that the platform preserves the 'G' but for any segment register. The SVM specification itemizes the segment attribute bits that are observed by the CPU, but the (G)ranularity bit is not one of the bits itemized, for any segment. Though current AMD CPUs keep track of the (G)ranularity bit for all segment registers other than CS, the specification does not require it. VMware's virtual CPU may not track the (G)ranularity bit for any segment register. Since kvm already synthesizes the (G)ranularity bit for the CS segment. It should do so for all segments. The patch below does that, and helps get rid of the hangs. Patch applies on top of Linus' tree. Signed-off-by: Jim Mattson <jmattson@vmware.com> Signed-off-by: Alok N Kataria <akataria@vmware.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-07-09KVM: nSVM: Set correct port for IOIO interception evaluationJan Kiszka
Obtaining the port number from DX is bogus as a) there are immediate port accesses and b) user space may have changed the register content while processing the PIO access. Forward the correct value from the instruction emulator instead. Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-07-09KVM: nSVM: Fix IOIO size reported on emulationJan Kiszka
The access size of an in/ins is reported in dst_bytes, and that of out/outs in src_bytes. Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-07-09KVM: nSVM: Fix IOIO bitmap evaluationJan Kiszka
First, kvm_read_guest returns 0 on success. And then we need to take the access size into account when testing the bitmap: intercept if any of bits corresponding to the access is set. Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-07-09KVM: nSVM: Do not report CLTS via SVM_EXIT_WRITE_CR0 to L1Jan Kiszka
CLTS only changes TS which is not monitored by selected CR0 interception. So skip any attempt to translate WRITE_CR0 to CR0_SEL_WRITE for this instruction. Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-06-30KVM: SVM: Fix CPL export via SS.DPLJan Kiszka
We import the CPL via SS.DPL since ae9fedc793. However, we fail to export it this way so far. This caused spurious guest crashes, e.g. of Linux when accessing the vmport from guest user space which triggered register saving/restoring to/from host user space. Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-05-22KVM: x86: get CPL from SS.DPLPaolo Bonzini
CS.RPL is not equal to the CPL in the few instructions between setting CR0.PE and reloading CS. And CS.DPL is also not equal to the CPL for conforming code segments. However, SS.DPL *is* always equal to the CPL except for the weird case of SYSRET on AMD processors, which sets SS.DPL=SS.RPL from the value in the STAR MSR, but force CPL=3 (Intel instead forces SS.DPL=SS.RPL=CPL=3). So this patch: - modifies SVM to update the CPL from SS.DPL rather than CS.RPL; the above case with SYSRET is not broken further, and the way to fix it would be to pass the CPL to userspace and back - modifies VMX to always return the CPL from SS.DPL (except forcing it to 0 if we are emulating real mode via vm86 mode; in vm86 mode all DPLs have to be 3, but real mode does allow privileged instructions). It also removes the CPL cache, which becomes a duplicate of the SS access rights cache. This fixes doing KVM_IOCTL_SET_SREGS exactly after setting CR0.PE=1 but before CS has been reloaded. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-05-08kvm: x86: emulate monitor and mwait instructions as nopGabriel L. Somlo
Treat monitor and mwait instructions as nop, which is architecturally correct (but inefficient) behavior. We do this to prevent misbehaving guests (e.g. OS X <= 10.7) from crashing after they fail to check for monitor/mwait availability via cpuid. Since mwait-based idle loops relying on these nop-emulated instructions would keep the host CPU pegged at 100%, do NOT advertise their presence via cpuid, to prevent compliant guests from using them inadvertently. Signed-off-by: Gabriel L. Somlo <somlo@cmu.edu> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-04-02Merge tag 'kvm-3.15-1' of git://git.kernel.org/pub/scm/virt/kvm/kvmLinus Torvalds
Pull kvm updates from Paolo Bonzini: "PPC and ARM do not have much going on this time. Most of the cool stuff, instead, is in s390 and (after a few releases) x86. ARM has some caching fixes and PPC has transactional memory support in guests. MIPS has some fixes, with more probably coming in 3.16 as QEMU will soon get support for MIPS KVM. For x86 there are optimizations for debug registers, which trigger on some Windows games, and other important fixes for Windows guests. We now expose to the guest Broadwell instruction set extensions and also Intel MPX. There's also a fix/workaround for OS X guests, nested virtualization features (preemption timer), and a couple kvmclock refinements. For s390, the main news is asynchronous page faults, together with improvements to IRQs (floating irqs and adapter irqs) that speed up virtio devices" * tag 'kvm-3.15-1' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (96 commits) KVM: PPC: Book3S HV: Save/restore host PMU registers that are new in POWER8 KVM: PPC: Book3S HV: Fix decrementer timeouts with non-zero TB offset KVM: PPC: Book3S HV: Don't use kvm_memslots() in real mode KVM: PPC: Book3S HV: Return ENODEV error rather than EIO KVM: PPC: Book3S: Trim top 4 bits of physical address in RTAS code KVM: PPC: Book3S HV: Add get/set_one_reg for new TM state KVM: PPC: Book3S HV: Add transactional memory support KVM: Specify byte order for KVM_EXIT_MMIO KVM: vmx: fix MPX detection KVM: PPC: Book3S HV: Fix KVM hang with CONFIG_KVM_XICS=n KVM: PPC: Book3S: Introduce hypervisor call H_GET_TCE KVM: PPC: Book3S HV: Fix incorrect userspace exit on ioeventfd write KVM: s390: clear local interrupts at cpu initial reset KVM: s390: Fix possible memory leak in SIGP functions KVM: s390: fix calculation of idle_mask array size KVM: s390: randomize sca address KVM: ioapic: reinject pending interrupts on KVM_SET_IRQCHIP KVM: Bump KVM_MAX_IRQ_ROUTES for s390 KVM: s390: irq routing for adapter interrupts. KVM: s390: adapter interrupt sources ...
2014-03-17KVM: x86: handle missing MPX in nested virtualizationPaolo Bonzini
When doing nested virtualization, we may be able to read BNDCFGS but still not be allowed to write to GUEST_BNDCFGS in the VMCS. Guard writes to the field with vmx_mpx_supported(), and similarly hide the MSR from userspace if the processor does not support the field. We could work around this with the generic MSR save/load machinery, but there is only a limited number of MSR save/load slots and it is not really worthwhile to waste one for a scenario that should not happen except in the nested virtualization case. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-03-12KVM: SVM: fix cr8 intercept windowRadim Krčmář
We always disable cr8 intercept in its handler, but only re-enable it if handling KVM_REQ_EVENT, so there can be a window where we do not intercept cr8 writes, which allows an interrupt to disrupt a higher priority task. Fix this by disabling intercepts in the same function that re-enables them when needed. This fixes BSOD in Windows 2008. Cc: <stable@vger.kernel.org> Signed-off-by: Radim Krčmář <rkrcmar@redhat.com> Reviewed-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-03-11KVM: svm: Allow the guest to run with dirty debug registersPaolo Bonzini
When not running in guest-debug mode (i.e. the guest controls the debug registers, having to take an exit for each DR access is a waste of time. If the guest gets into a state where each context switch causes DR to be saved and restored, this can take away as much as 40% of the execution time from the guest. If the guest is running with vcpu->arch.db == vcpu->arch.eff_db, we can let it write freely to the debug registers and reload them on the next exit. We still need to exit on the first access, so that the KVM_DEBUGREG_WONT_EXIT flag is set in switch_db_regs; after that, further accesses to the debug registers will not cause a vmexit. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-03-11KVM: svm: set/clear all DR intercepts in one swoopPaolo Bonzini
Unlike other intercepts, debug register intercepts will be modified in hot paths if the guest OS is bad or otherwise gets tricked into doing so. Avoid calling recalc_intercepts 16 times for debug registers. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-03-11KVM: x86: Remove return code from enable_irq/nmi_windowJan Kiszka
It's no longer possible to enter enable_irq_window in guest mode when L1 intercepts external interrupts and we are entering L2. This is now caught in vcpu_enter_guest. So we can remove the check from the VMX version of enable_irq_window, thus the need to return an error code from both enable_irq_window and enable_nmi_window. Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-02-18KVM: SVM: fix NMI window after iretRadim Krčmář
We should open NMI window right after an iret, but SVM exits before it. We wanted to single step using the trap flag and then open it. (or we could emulate the iret instead) We don't do it since commit 3842d135ff2 (likely), because the iret exit handler does not request an event, so NMI window remains closed until the next exit. Fix this by making KVM_REQ_EVENT request in the iret handler. Signed-off-by: Radim Krčmář <rkrcmar@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-01-17KVM: SVM: Fix reading of DR6Jan Kiszka
In contrast to VMX, SVM dose not automatically transfer DR6 into the VCPU's arch.dr6. So if we face a DR6 read, we must consult a new vendor hook to obtain the current value. And as SVM now picks the DR6 state from its VMCB, we also need a set callback in order to write updates of DR6 back. Fixes a regression of 020df0794f. Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2013-10-03KVM: mmu: change useless int return types to voidPaolo Bonzini
kvm_mmu initialization is mostly filling in function pointers, there is no way for it to fail. Clean up unused return values. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Gleb Natapov <gleb@redhat.com>
2013-06-27kvm: Add a tracepoint write_tsc_offsetYoshihiro YUNOMAE
Add a tracepoint write_tsc_offset for tracing TSC offset change. We want to merge ftrace's trace data of guest OSs and the host OS using TSC for timestamp in chronological order. We need "TSC offset" values for each guest when merge those because the TSC value on a guest is always the host TSC plus guest's TSC offset. If we get the TSC offset values, we can calculate the host TSC value for each guest events from the TSC offset and the event TSC value. The host TSC values of the guest events are used when we want to merge trace data of guests and the host in chronological order. (Note: the trace_clock of both the host and the guest must be set x86-tsc in this case) This tracepoint also records vcpu_id which can be used to merge trace data for SMP guests. A merge tool will read TSC offset for each vcpu, then the tool converts guest TSC values to host TSC values for each vcpu. TSC offset is stored in the VMCS by vmx_write_tsc_offset() or vmx_adjust_tsc_offset(). KVM executes the former function when a guest boots. The latter function is executed when kvm clock is updated. Only host can read TSC offset value from VMCS, so a host needs to output TSC offset value when TSC offset is changed. Since the TSC offset is not often changed, it could be overwritten by other frequent events while tracing. To avoid that, I recommend to use a special instance for getting this event: 1. set a instance before booting a guest # cd /sys/kernel/debug/tracing/instances # mkdir tsc_offset # cd tsc_offset # echo x86-tsc > trace_clock # echo 1 > events/kvm/kvm_write_tsc_offset/enable 2. boot a guest Signed-off-by: Yoshihiro YUNOMAE <yoshihiro.yunomae.ez@hitachi.com> Cc: Joerg Roedel <joro@8bytes.org> Cc: Marcelo Tosatti <mtosatti@redhat.com> Cc: Gleb Natapov <gleb@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Acked-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Gleb Natapov <gleb@redhat.com>