summaryrefslogtreecommitdiff
AgeCommit message (Collapse)Author
2018-03-19KVM: arm64: Avoid storing the vcpu pointer on the stackChristoffer Dall
We already have the percpu area for the host cpu state, which points to the VCPU, so there's no need to store the VCPU pointer on the stack on every context switch. We can be a little more clever and just use tpidr_el2 for the percpu offset and load the VCPU pointer from the host context. This has the benefit of being able to retrieve the host context even when our stack is corrupted, and it has a potential performance benefit because we trade a store plus a load for an mrs and a load on a round trip to the guest. This does require us to calculate the percpu offset without including the offset from the kernel mapping of the percpu array to the linear mapping of the array (which is what we store in tpidr_el1), because a PC-relative generated address in EL2 is already giving us the hyp alias of the linear mapping of a kernel address. We do this in __cpu_init_hyp_mode() by using kvm_ksym_ref(). The code that accesses ESR_EL2 was previously using an alternative to use the _EL1 accessor on VHE systems, but this was actually unnecessary as the _EL1 accessor aliases the ESR_EL2 register on VHE, and the _EL2 accessor does the same thing on both systems. Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org> Reviewed-by: Marc Zyngier <marc.zyngier@arm.com> Reviewed-by: Andrew Jones <drjones@redhat.com> Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org> Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
2018-03-19KVM: arm/arm64: Move vcpu_load call after kvm_vcpu_first_run_initChristoffer Dall
Moving the call to vcpu_load() in kvm_arch_vcpu_ioctl_run() to after we've called kvm_vcpu_first_run_init() simplifies some of the vgic and there is also no need to do vcpu_load() for things such as handling the immediate_exit flag. Reviewed-by: Julien Grall <julien.grall@arm.com> Reviewed-by: Marc Zyngier <marc.zyngier@arm.com> Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org> Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
2018-03-19KVM: arm/arm64: Avoid vcpu_load for other vcpu ioctls than KVM_RUNChristoffer Dall
Calling vcpu_load() registers preempt notifiers for this vcpu and calls kvm_arch_vcpu_load(). The latter will soon be doing a lot of heavy lifting on arm/arm64 and will try to do things such as enabling the virtual timer and setting us up to handle interrupts from the timer hardware. Loading state onto hardware registers and enabling hardware to signal interrupts can be problematic when we're not actually about to run the VCPU, because it makes it difficult to establish the right context when handling interrupts from the timer, and it makes the register access code difficult to reason about. Luckily, now when we call vcpu_load in each ioctl implementation, we can simply remove the call from the non-KVM_RUN vcpu ioctls, and our kvm_arch_vcpu_load() is only used for loading vcpu content to the physical CPU when we're actually going to run the vcpu. Reviewed-by: Julien Grall <julien.grall@arm.com> Reviewed-by: Marc Zyngier <marc.zyngier@arm.com> Reviewed-by: Andrew Jones <drjones@redhat.com> Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org> Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
2018-03-16x86/kvm/vmx: avoid expensive rdmsr for MSR_GS_BASEVitaly Kuznetsov
vmx_save_host_state() is only called from kvm_arch_vcpu_ioctl_run() so the context is pretty well defined and as we're past 'swapgs' MSR_GS_BASE should contain kernel's GS base which we point to irq_stack_union. Add new kernelmode_gs_base() API, irq_stack_union needs to be exported as KVM can be build as module. Acked-by: Andy Lutomirski <luto@kernel.org> Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2018-03-16x86/kvm/vmx: read MSR_{FS,KERNEL_GS}_BASE from current->threadVitaly Kuznetsov
vmx_save_host_state() is only called from kvm_arch_vcpu_ioctl_run() so the context is pretty well defined. Read MSR_{FS,KERNEL_GS}_BASE from current->thread after calling save_fsgs() which takes care of X86_BUG_NULL_SEG case now and will do RD[FG,GS]BASE when FSGSBASE extensions are exposed to userspace (currently they are not). Acked-by: Andy Lutomirski <luto@kernel.org> Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2018-03-16KVM: X86: Provide a capability to disable PAUSE interceptsWanpeng Li
Allow to disable pause loop exit/pause filtering on a per VM basis. If some VMs have dedicated host CPUs, they won't be negatively affected due to needlessly intercepted PAUSE instructions. Thanks to Jan H. Schönherr's initial patch. Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Radim Krčmář <rkrcmar@redhat.com> Cc: Jan H. Schönherr <jschoenh@amazon.de> Signed-off-by: Wanpeng Li <wanpengli@tencent.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2018-03-16KVM: X86: Provide a capability to disable HLT interceptsWanpeng Li
If host CPUs are dedicated to a VM, we can avoid VM exits on HLT. This patch adds the per-VM capability to disable them. Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Radim Krčmář <rkrcmar@redhat.com> Cc: Jan H. Schönherr <jschoenh@amazon.de> Signed-off-by: Wanpeng Li <wanpengli@tencent.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2018-03-16KVM: X86: Provide a capability to disable MWAIT interceptsWanpeng Li
Allowing a guest to execute MWAIT without interception enables a guest to put a (physical) CPU into a power saving state, where it takes longer to return from than what may be desired by the host. Don't give a guest that power over a host by default. (Especially, since nothing prevents a guest from using MWAIT even when it is not advertised via CPUID.) Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Radim Krčmář <rkrcmar@redhat.com> Cc: Jan H. Schönherr <jschoenh@amazon.de> Signed-off-by: Wanpeng Li <wanpengli@tencent.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2018-03-16Merge tag 'kvm-s390-next-4.17-1' of ↵Paolo Bonzini
git://git.kernel.org/pub/scm/linux/kernel/git/kvms390/linux into HEAD KVM: s390: fixes and features - more kvm stat counters - virtio gpu plumbing. The 3 non-KVM/s390 patches have Acks from Bartlomiej Zolnierkiewicz, Heiko Carstens and Greg Kroah-Hartman but all belong together to make virtio-gpu work as a tty. So I carried them in the KVM/s390 tree. - document some KVM_CAPs - cpu-model only facilities - cleanups
2018-03-16KVM: x86: Add support for VMware backdoor Pseudo-PMCsArbel Moshe
VMware exposes the following Pseudo PMCs: 0x10000: Physical host TSC 0x10001: Elapsed real time in ns 0x10002: Elapsed apparent time in ns For more info refer to: https://www.vmware.com/files/pdf/techpaper/Timekeeping-In-VirtualMachines.pdf VMware allows access to these Pseduo-PMCs even when read via RDPMC in Ring3 and CR4.PCE=0. Therefore, commit modifies x86 emulator to allow access to these PMCs in this situation. In addition, emulation of these PMCs were added to kvm_pmu_rdpmc(). Signed-off-by: Arbel Moshe <arbel.moshe@oracle.com> Signed-off-by: Liran Alon <liran.alon@oracle.com> Reviewed-by: Radim Krčmář <rkrcmar@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2018-03-16KVM: x86: SVM: Intercept #GP to support access to VMware backdoor portsLiran Alon
If KVM enable_vmware_backdoor module parameter is set, the commit change VMX to now intercept #GP instead of being directly deliviered from CPU to guest. It is done to support access to VMware Backdoor I/O ports even if TSS I/O permission denies it. In that case: 1. A #GP will be raised and intercepted. 2. #GP intercept handler will simulate I/O port access instruction. 3. I/O port access instruction simulation will allow access to VMware backdoor ports specifically even if TSS I/O permission bitmap denies it. Note that the above change introduce slight performance hit as now #GPs are now not deliviered directly from CPU to guest but instead cause #VMExit and instruction emulation. However, this behavior is introduced only when enable_vmware_backdoor KVM module parameter is set. Signed-off-by: Liran Alon <liran.alon@oracle.com> Reviewed-by: Nikita Leshenko <nikita.leshchenko@oracle.com> Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Reviewed-by: Radim Krčmář <rkrcmar@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2018-03-16KVM: x86: VMX: Intercept #GP to support access to VMware backdoor portsLiran Alon
If KVM enable_vmware_backdoor module parameter is set, the commit change VMX to now intercept #GP instead of being directly deliviered from CPU to guest. It is done to support access to VMware backdoor I/O ports even if TSS I/O permission denies it. In that case: 1. A #GP will be raised and intercepted. 2. #GP intercept handler will simulate I/O port access instruction. 3. I/O port access instruction simulation will allow access to VMware backdoor ports specifically even if TSS I/O permission bitmap denies it. Note that the above change introduce slight performance hit as now #GPs are not deliviered directly from CPU to guest but instead cause #VMExit and instruction emulation. However, this behavior is introduced only when enable_vmware_backdoor KVM module parameter is set. Signed-off-by: Liran Alon <liran.alon@oracle.com> Reviewed-by: Nikita Leshenko <nikita.leshchenko@oracle.com> Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Reviewed-by: Radim Krčmář <rkrcmar@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2018-03-16KVM: x86: Emulate only IN/OUT instructions when accessing VMware backdoorLiran Alon
Access to VMware backdoor ports is done by one of the IN/OUT/INS/OUTS instructions. These ports must be allowed access even if TSS I/O permission bitmap don't allow it. To handle this, VMX/SVM will be changed in future commits to intercept #GP which was raised by such access and handle it by calling x86 emulator to emulate instruction. If it was one of these instructions, the x86 emulator already handles it correctly (Since commit "KVM: x86: Always allow access to VMware backdoor I/O ports") by not checking these ports against TSS I/O permission bitmap. One may wonder why checking for specific instructions is necessary as we can just forward all #GPs to the x86 emulator. There are multiple reasons for doing so: 1. We don't want the x86 emulator to be reached easily by guest by just executing an instruction that raises #GP as that exposes the x86 emulator as a bigger attack surface. 2. The x86 emulator is incomplete and therefore certain instructions that can cause #GP cannot be emulated. Such an example is "INT x" (opcode 0xcd) which reaches emulate_int() which can only emulate the instruction if vCPU is in real-mode. Signed-off-by: Liran Alon <liran.alon@oracle.com> Reviewed-by: Nikita Leshenko <nikita.leshchenko@oracle.com> Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Reviewed-by: Radim Krčmář <rkrcmar@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2018-03-16KVM: x86: Add emulation_type to not raise #UD on emulation failureLiran Alon
Next commits are going introduce support for accessing VMware backdoor ports even though guest's TSS I/O permissions bitmap doesn't allow access. This mimic VMware hypervisor behavior. In order to support this, next commits will change VMX/SVM to intercept #GP which was raised by such access and handle it by calling the x86 emulator to emulate instruction. Since commit "KVM: x86: Always allow access to VMware backdoor I/O ports", the x86 emulator handles access to these I/O ports by not checking these ports against the TSS I/O permission bitmap. However, there could be cases that CPU rasies a #GP on instruction that fails to be disassembled by the x86 emulator (Because of incomplete implementation for example). In those cases, we would like the #GP intercept to just forward #GP as-is to guest as if there was no intercept to begin with. However, current emulator code always queues #UD exception in case emulator fails (including disassembly failures) which is not what is wanted in this flow. This commit addresses this issue by adding a new emulation_type flag that will allow the #GP intercept handler to specify that it wishes to be aware when instruction emulation fails and doesn't want #UD exception to be queued. Signed-off-by: Liran Alon <liran.alon@oracle.com> Reviewed-by: Nikita Leshenko <nikita.leshchenko@oracle.com> Reviewed-by: Radim Krčmář <rkrcmar@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2018-03-16KVM: x86: Always allow access to VMware backdoor I/O portsLiran Alon
VMware allows access to these ports even if denied by TSS I/O permission bitmap. Mimic behavior. Signed-off-by: Liran Alon <liran.alon@oracle.com> Reviewed-by: Nikita Leshenko <nikita.leshchenko@oracle.com> Reviewed-by: Radim Krčmář <rkrcmar@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2018-03-16KVM: x86: Add module parameter for supporting VMware backdoorLiran Alon
Support access to VMware backdoor requires KVM to intercept #GP exceptions from guest which introduce slight performance hit. Therefore, control this support by module parameter. Note that module parameter is exported as it should be consumed by kvm_intel & kvm_amd to determine if they should intercept #GP or not. This commit doesn't change semantics. It is done as a preparation for future commits. Signed-off-by: Liran Alon <liran.alon@oracle.com> Reviewed-by: Nikita Leshenko <nikita.leshchenko@oracle.com> Reviewed-by: Radim Krčmář <rkrcmar@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2018-03-16KVM: x86: add kvm_fast_pio() to consolidate fast PIO codeSean Christopherson
Add kvm_fast_pio() to consolidate duplicate code in VMX and SVM. Unexport kvm_fast_pio_in() and kvm_fast_pio_out(). Suggested-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2018-03-16KVM: VMX: use kvm_fast_pio_in for handling IN I/OSean Christopherson
Fast emulation of processor I/O for IN was disabled on x86 (both VMX and SVM) some years ago due to a buggy implementation. The addition of kvm_fast_pio_in(), used by SVM, re-introduced (functional!) fast emulation of IN. Piggyback SVM's work and use kvm_fast_pio_in() on VMX instead of performing full emulation of IN. Reviewed-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2018-03-16KVM: vVMX: signal failure for nested VMEntry if emulation_requiredSean Christopherson
Fail a nested VMEntry with EXIT_REASON_INVALID_STATE if L2 guest state is invalid, i.e. vmcs12 contained invalid guest state, and unrestricted guest is disabled in L0 (and by extension disabled in L1). WARN_ON_ONCE in handle_invalid_guest_state() if we're attempting to emulate L2, i.e. nested_run_pending is true, to aid debug in the (hopefully unlikely) scenario that we somehow skip the nested VMEntry consistency check, e.g. due to a L0 bug. Note: KVM relies on hardware to detect the scenario where unrestricted guest is enabled in L0 but disabled in L1 and vmcs12 contains invalid guest state, i.e. checking emulation_required in prepare_vmcs02 is required only to handle the case were unrestricted guest is disabled in L0 since L0 never actually attempts VMLAUNCH/VMRESUME with vmcs02. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2018-03-16KVM: VMX: WARN on a MOV CR3 exit w/ unrestricted guestSean Christopherson
CR3 load/store exiting are always off when unrestricted guest is enabled. WARN on the associated CR3 VMEXIT to detect code that would re-introduce CR3 load/store exiting for unrestricted guest. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2018-03-16KVM: VMX: give unrestricted guest full control of CR3Sean Christopherson
Now CR3 is not forced to a host-controlled value when paging is disabled in an unrestricted guest, CR3 load/store exiting can be left disabled (for an unrestricted guest). And because CR0.WP and CR4.PAE/PSE are also not force to host-controlled values, all of ept_update_paging_mode_cr0() is no longer needed, i.e. skip ept_update_paging_mode_cr0() for an unrestricted guest. Because MOV CR3 no longer exits when paging is disabled for an unrestricted guest, vmx_decache_cr3() must always read GUEST_CR3 from the VMCS for an unrestricted guest. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2018-03-16KVM: VMX: don't force CR4.PAE/PSE for unrestricted guestSean Christopherson
CR4.PAE - Unrestricted guest can only be enabled when EPT is enabled, and vmx_set_cr4() clears hardware CR0.PAE based on the guest's CR4.PAE, i.e. CR4.PAE always follows the guest's value when unrestricted guest is enabled. CR4.PSE - Unrestricted guest no longer uses the identity mapped IA32 page tables since CR0.PG can be cleared in hardware, thus there is no need to set CR4.PSE when paging is disabled in the guest (and EPT is enabled). Define KVM_VM_CR4_ALWAYS_ON_UNRESTRICTED_GUEST (to X86_CR4_VMXE) and use it in lieu of KVM_*MODE_VM_CR4_ALWAYS_ON when unrestricted guest is enabled, which removes the forcing of CR4.PAE. Skip the manipulation of CR4.PAE/PSE for EPT when unrestricted guest is enabled, as CR4.PAE isn't forced and so doesn't need to be manually cleared, and CR4.PSE does not need to be set when paging is disabled since the identity mapped IA32 page tables are not used. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2018-03-16KVM: VMX: remove CR0.WP from ..._ALWAYS_ON_UNRESTRICTED_GUESTSean Christopherson
Unrestricted guest can only be enabled when EPT is enabled, and when EPT is enabled, ept_update_paging_mode_cr0() will clear hardware CR0.WP based on the guest's CR0.WP, i.e. CR0.WP always follows the guest's value when unrestricted guest is enabled. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2018-03-16KVM: VMX: don't configure EPT identity map for unrestricted guestSean Christopherson
An unrestricted guest can run with hardware CR0.PG==0, i.e. IA32 paging disabled, in which case there is no need to load the guest's CR3 with identity mapped IA32 page tables since hardware will effectively ignore CR3. If unrestricted guest is enabled, don't configure the identity mapped IA32 page table and always load the guest's desired CR3. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2018-03-16KVM: VMX: don't configure RM TSS for unrestricted guestSean Christopherson
An unrestricted guest can run with CR0.PG==0 and/or CR0.PE==0, e.g. it can run in Real Mode without requiring host emulation. The RM TSS is only used for emulating RM, i.e. it will never be used when unrestricted guest is enabled and so doesn't need to be configured. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2018-03-16x86/kvm/hyper-v: inject #GP only when invalid SINTx vector is unmaskedVitaly Kuznetsov
Hyper-V 2016 on KVM with SynIC enabled doesn't boot with the following trace: kvm_entry: vcpu 0 kvm_exit: reason MSR_WRITE rip 0xfffff8000131c1e5 info 0 0 kvm_hv_synic_set_msr: vcpu_id 0 msr 0x40000090 data 0x10000 host 0 kvm_msr: msr_write 40000090 = 0x10000 (#GP) kvm_inj_exception: #GP (0x0) KVM acts according to the following statement from TLFS: " 11.8.4 SINTx Registers ... Valid values for vector are 16-255 inclusive. Specifying an invalid vector number results in #GP. " However, I checked and genuine Hyper-V doesn't #GP when we write 0x10000 to SINTx. I checked with Microsoft and they confirmed that if either the Masked bit (bit 16) or the Polling bit (bit 18) is set to 1, then they ignore the value of Vector. Make KVM act accordingly. Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Reviewed-by: Roman Kagan <rkagan@virtuozzo.com> Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2018-03-16x86/kvm/hyper-v: remove stale entries from vec_bitmap/auto_eoi_bitmap on ↵Vitaly Kuznetsov
vector change When a new vector is written to SINx we update vec_bitmap/auto_eoi_bitmap but we forget to remove old vector from these masks (in case it is not present in some other SINTx). Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Reviewed-by: Roman Kagan <rkagan@virtuozzo.com> Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2018-03-16x86/kvm/hyper-v: add reenlightenment MSRs supportVitaly Kuznetsov
Nested Hyper-V/Windows guest running on top of KVM will use TSC page clocksource in two cases: - L0 exposes invariant TSC (CPUID.80000007H:EDX[8]). - L0 provides Hyper-V Reenlightenment support (CPUID.40000003H:EAX[13]). Exposing invariant TSC effectively blocks migration to hosts with different TSC frequencies, providing reenlightenment support will be needed when we start migrating nested workloads. Implement rudimentary support for reenlightenment MSRs. For now, these are just read/write MSRs with no effect. Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Reviewed-by: Roman Kagan <rkagan@virtuozzo.com> Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2018-03-16KVM: x86: Update the exit_qualification access bits while walking an addressKarimAllah Ahmed
... to avoid having a stale value when handling an EPT misconfig for MMIO regions. MMIO regions that are not passed-through to the guest are handled through EPT misconfigs. The first time a certain MMIO page is touched it causes an EPT violation, then KVM marks the EPT entry to cause an EPT misconfig instead. Any subsequent accesses to the entry will generate an EPT misconfig. Things gets slightly complicated with nested guest handling for MMIO regions that are not passed through from L0 (i.e. emulated by L0 user-space). An EPT violation for one of these MMIO regions from L2, exits to L0 hypervisor. L0 would then look at the EPT12 mapping for L1 hypervisor and realize it is not present (or not sufficient to serve the request). Then L0 injects an EPT violation to L1. L1 would then update its EPT mappings. The EXIT_QUALIFICATION value for L1 would come from exit_qualification variable in "struct vcpu". The problem is that this variable is only updated on EPT violation and not on EPT misconfig. So if an EPT violation because of a read happened first, then an EPT misconfig because of a write happened afterwards. The L0 hypervisor will still contain exit_qualification value from the previous read instead of the write and end up injecting an EPT violation to the L1 hypervisor with an out of date EXIT_QUALIFICATION. The EPT violation that is injected from L0 to L1 needs to have the correct EXIT_QUALIFICATION specially for the access bits because the individual access bits for MMIO EPTs are updated only on actual access of this specific type. So for the example above, the L1 hypervisor will keep updating only the read bit in the EPT then resume the L2 guest. The L2 guest would end up causing another exit where the L0 *again* will inject another EPT violation to L1 hypervisor with *again* an out of date exit_qualification which indicates a read and not a write. Then this ping-pong just keeps happening without making any forward progress. The behavior of mapping MMIO regions changed in: commit a340b3e229b24 ("kvm: Map PFN-type memory regions as writable (if possible)") ... where an EPT violation for a read would also fixup the write bits to avoid another EPT violation which by acciddent would fix the bug mentioned above. This commit fixes this situation and ensures that the access bits for the exit_qualifcation is up to date. That ensures that even L1 hypervisor running with a KVM version before the commit mentioned above would still work. ( The description above assumes EPT to be available and used by L1 hypervisor + the L1 hypervisor is passing through the MMIO region to the L2 guest while this MMIO region is emulated by the L0 user-space ). Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Radim Krčmář <rkrcmar@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: x86@kernel.org Cc: kvm@vger.kernel.org Cc: linux-kernel@vger.kernel.org Signed-off-by: KarimAllah Ahmed <karahmed@amazon.de> Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2018-03-16KVM: x86: Make enum conversion explicit in kvm_pdptr_read()Matthias Kaehlcke
The type 'enum kvm_reg_ex' is an extension of 'enum kvm_reg', however the extension is only semantical and the compiler doesn't know about the relationship between the two types. In kvm_pdptr_read() a value of the extended type is passed to kvm_x86_ops->cache_reg(), which expects a value of the base type. Clang raises the following warning about the type mismatch: arch/x86/kvm/kvm_cache_regs.h:44:32: warning: implicit conversion from enumeration type 'enum kvm_reg_ex' to different enumeration type 'enum kvm_reg' [-Wenum-conversion] kvm_x86_ops->cache_reg(vcpu, VCPU_EXREG_PDPTR); Cast VCPU_EXREG_PDPTR to 'enum kvm_reg' to make the compiler happy. Signed-off-by: Matthias Kaehlcke <mka@chromium.org> Reviewed-by: Guenter Roeck <groeck@chromium.org> Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2018-03-16KVM: lapic: stop advertising DIRECTED_EOI when in-kernel IOAPIC is in useVitaly Kuznetsov
Devices which use level-triggered interrupts under Windows 2016 with Hyper-V role enabled don't work: Windows disables EOI broadcast in SPIV unconditionally. Our in-kernel IOAPIC implementation emulates an old IOAPIC version which has no EOI register so EOI never happens. The issue was discovered and discussed a while ago: https://www.spinics.net/lists/kvm/msg148098.html While this is a guest OS bug (it should check that IOAPIC has the required capabilities before disabling EOI broadcast) we can workaround it in KVM: advertising DIRECTED_EOI with in-kernel IOAPIC makes little sense anyway. Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2018-03-16KVM: x86: Add support for AMD Core Perf Extension in guestJanakarajan Natarajan
Add support for AMD Core Performance counters in the guest. The base event select and counter MSRs are changed. In addition, with the core extension, there are 2 extra counters available for performance measurements for a total of 6. With the new MSRs, the logic to map them to the gp_counters[] is changed. New functions are added to check the validity of the get/set MSRs. If the guest has the X86_FEATURE_PERFCTR_CORE cpuid flag set, the number of counters available to the vcpu is set to 6. It the flag is not set then it is 4. Signed-off-by: Janakarajan Natarajan <Janakarajan.Natarajan@amd.com> [Squashed "Expose AMD Core Perf Extension flag to guests" - Radim.] Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2018-03-16x86/msr: Add AMD Core Perf Extension MSRsJanakarajan Natarajan
Add the EventSelect and Counter MSRs for AMD Core Perf Extension. Signed-off-by: Janakarajan Natarajan <Janakarajan.Natarajan@amd.com> Acked-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2018-03-14KVM: s390: provide counters for all interrupt injects/deliveryChristian Borntraeger
For testing the exitless interrupt support it turned out useful to have separate counters for inject and delivery of I/O interrupt. While at it do the same for all interrupt types. For timer related interrupts (clock comparator and cpu timer) we even had no delivery counters. Fix this as well. On this way some counters are being renamed to have a similar name. Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com> Reviewed-by: Cornelia Huck <cohuck@redhat.com>
2018-03-14KVM: add machine check counter to kvm_statQingFeng Hao
This counter can be used for administration, debug or test purposes. Suggested-by: Vladislav Mironov <mironov@de.ibm.com> Signed-off-by: QingFeng Hao <haoqf@linux.vnet.ibm.com> Reviewed-by: Christian Borntraeger <borntraeger@de.ibm.com> Reviewed-by: Cornelia Huck <cohuck@redhat.com> Reviewed-by: David Hildenbrand <david@redhat.com> Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
2018-03-14s390/setup: enable display support for KVM guestFarhan Ali
The S390 architecture does not support any graphics hardware, but with the latest support for Virtio GPU in Linux and Virtio GPU emulation in QEMU, it's possible to enable graphics for S390 using the Virtio GPU device. To enable display we need to enable the Linux Virtual Terminal (VT) layer for S390. But the VT subsystem initializes quite early at boot so we need a dummy console driver till the Virtio GPU driver is initialized and we can run the framebuffer console. The framebuffer console over a Virtio GPU device can be run in combination with the serial SCLP console (default on S390). The SCLP console can still be accessed by management applications (eg: via Libvirt's virsh console). Signed-off-by: Farhan Ali <alifm@linux.vnet.ibm.com> Acked-by: Christian Borntraeger <borntraeger@de.ibm.com> Reviewed-by: Thomas Huth <thuth@redhat.com> Message-Id: <e23b61f4f599ba23881727a1e8880e9d60cc6a48.1519315352.git.alifm@linux.vnet.ibm.com> Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com> Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-03-14s390/char: Rename EBCDIC keymap variablesFarhan Ali
The Linux Virtual Terminal (VT) layer provides a default keymap which is compiled when VT layer is enabled. But at the same time we are also compiling the EBCDIC keymap and this causes the linker to complain. So let's rename the EBCDIC keymap variables to prevent linker conflict. Signed-off-by: Farhan Ali <alifm@linux.vnet.ibm.com> Acked-by: Christian Borntraeger <borntraeger@de.ibm.com> Acked-by: Heiko Carstens <heiko.carstens@de.ibm.com> Reviewed-by: Thomas Huth <thuth@redhat.com> Message-Id: <f670a2698d2372e1e990c48a29334ffe894804b1.1519315352.git.alifm@linux.vnet.ibm.com> Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
2018-03-14Kconfig: Remove HAS_IOMEM dependency for Graphics supportFarhan Ali
The 'commit e25df1205f37 ("[S390] Kconfig: menus with depends on HAS_IOMEM.")' added the HAS_IOMEM dependecy for "Graphics support". This disabled the "Graphics support" menu for S390. But if we enable VT layer for S390, we would also need to enable the dummy console. So let's remove the HAS_IOMEM dependency. Move this dependency to sub menu items and console drivers that use io memory. Signed-off-by: Farhan Ali <alifm@linux.vnet.ibm.com> Reviewed-by: Thomas Huth <thuth@redhat.com> Message-Id: <6e8ef238162df5be4462126be155975c722e9863.1519315352.git.alifm@linux.vnet.ibm.com> Acked-by: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com> Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
2018-03-14KVM: s390: fix fallthrough annotationSebastian Ott
A case statement in kvm_s390_shadow_tables uses fallthrough annotations which are not recognized by gcc because they are hidden within a block. Move these annotations out of the block to fix (W=1) warnings like below: arch/s390/kvm/gaccess.c: In function 'kvm_s390_shadow_tables': arch/s390/kvm/gaccess.c:1029:26: warning: this statement may fall through [-Wimplicit-fallthrough=] case ASCE_TYPE_REGION1: { ^ Signed-off-by: Sebastian Ott <sebott@linux.vnet.ibm.com> Reviewed-by: Cornelia Huck <cohuck@redhat.com> Reviewed-by: David Hildenbrand <david@redhat.com> Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
2018-03-14KVM: s390: add exit io request stats and simplify codeChristian Borntraeger
We want to count IO exit requests in kvm_stat. At the same time we can get rid of the handle_noop function. Reviewed-by: David Hildenbrand <david@redhat.com> Reviewed-by: Cornelia Huck <cohuck@redhat.com> Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
2018-03-14KVM: document KVM_CAP_S390_[BPB|PSW|GMAP|COW]Christian Borntraeger
commit 35b3fde6203b ("KVM: s390: wire up bpb feature") has no documentation for KVM_CAP_S390_BPB. While adding this let's also add other missing capabilities like KVM_CAP_S390_PSW, KVM_CAP_S390_GMAP and KVM_CAP_S390_COW. Reviewed-by: Cornelia Huck <cohuck@redhat.com> Reviewed-by: David Hildenbrand <david@redhat.com> Reviewed-by: Janosch Frank <frankja@linux.vnet.ibm.com> Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
2018-03-14kvm: arm/arm64: vgic-v3: Tighten synchronization for guests using v2 on v3Marc Zyngier
On guest exit, and when using GICv2 on GICv3, we use a dsb(st) to force synchronization between the memory-mapped guest view and the system-register view that the hypervisor uses. This is incorrect, as the spec calls out the need for "a DSB whose required access type is both loads and stores with any Shareability attribute", while we're only synchronizing stores. We also lack an isb after the dsb to ensure that the latter has actually been executed before we start reading stuff from the sysregs. The fix is pretty easy: turn dsb(st) into dsb(sy), and slap an isb() just after. Cc: stable@vger.kernel.org Fixes: f68d2b1b73cc ("arm64: KVM: Implement vgic-v3 save/restore") Acked-by: Christoffer Dall <cdall@kernel.org> Reviewed-by: Andre Przywara <andre.przywara@arm.com> Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
2018-03-14KVM: arm/arm64: vgic: Don't populate multiple LRs with the same vintidMarc Zyngier
The vgic code is trying to be clever when injecting GICv2 SGIs, and will happily populate LRs with the same interrupt number if they come from multiple vcpus (after all, they are distinct interrupt sources). Unfortunately, this is against the letter of the architecture, and the GICv2 architecture spec says "Each valid interrupt stored in the List registers must have a unique VirtualID for that virtual CPU interface.". GICv3 has similar (although slightly ambiguous) restrictions. This results in guests locking up when using GICv2-on-GICv3, for example. The obvious fix is to stop trying so hard, and inject a single vcpu per SGI per guest entry. After all, pending SGIs with multiple source vcpus are pretty rare, and are mostly seen in scenario where the physical CPUs are severely overcomitted. But as we now only inject a single instance of a multi-source SGI per vcpu entry, we may delay those interrupts for longer than strictly necessary, and run the risk of injecting lower priority interrupts in the meantime. In order to address this, we adopt a three stage strategy: - If we encounter a multi-source SGI in the AP list while computing its depth, we force the list to be sorted - When populating the LRs, we prevent the injection of any interrupt of lower priority than that of the first multi-source SGI we've injected. - Finally, the injection of a multi-source SGI triggers the request of a maintenance interrupt when there will be no pending interrupt in the LRs (HCR_NPIE). At the point where the last pending interrupt in the LRs switches from Pending to Active, the maintenance interrupt will be delivered, allowing us to add the remaining SGIs using the same process. Cc: stable@vger.kernel.org Fixes: 0919e84c0fc1 ("KVM: arm/arm64: vgic-new: Add IRQ sync/flush framework") Acked-by: Christoffer Dall <cdall@kernel.org> Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
2018-03-14KVM: arm/arm64: Reduce verbosity of KVM init logArd Biesheuvel
On my GICv3 system, the following is printed to the kernel log at boot: kvm [1]: 8-bit VMID kvm [1]: IDMAP page: d20e35000 kvm [1]: HYP VA range: 800000000000:ffffffffffff kvm [1]: vgic-v2@2c020000 kvm [1]: GIC system register CPU interface enabled kvm [1]: vgic interrupt IRQ1 kvm [1]: virtual timer IRQ4 kvm [1]: Hyp mode initialized successfully The KVM IDMAP is a mapping of a statically allocated kernel structure, and so printing its physical address leaks the physical placement of the kernel when physical KASLR in effect. So change the kvm_info() to kvm_debug() to remove it from the log output. While at it, trim the output a bit more: IRQ numbers can be found in /proc/interrupts, and the HYP VA and vgic-v2 lines are not highly informational either. Cc: <stable@vger.kernel.org> Acked-by: Will Deacon <will.deacon@arm.com> Acked-by: Christoffer Dall <cdall@kernel.org> Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org> Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
2018-03-14KVM: arm/arm64: Reset mapped IRQs on VM resetChristoffer Dall
We currently don't allow resetting mapped IRQs from userspace, because their state is controlled by the hardware. But we do need to reset the state when the VM is reset, so we provide a function for the 'owner' of the mapped interrupt to reset the interrupt state. Currently only the timer uses mapped interrupts, so we call this function from the timer reset logic. Cc: stable@vger.kernel.org Fixes: 4c60e360d6df ("KVM: arm/arm64: Provide a get_input_level for the arch timer") Signed-off-by: Christoffer Dall <cdall@kernel.org> Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
2018-03-14KVM: arm/arm64: Avoid vcpu_load for other vcpu ioctls than KVM_RUNChristoffer Dall
Calling vcpu_load() registers preempt notifiers for this vcpu and calls kvm_arch_vcpu_load(). The latter will soon be doing a lot of heavy lifting on arm/arm64 and will try to do things such as enabling the virtual timer and setting us up to handle interrupts from the timer hardware. Loading state onto hardware registers and enabling hardware to signal interrupts can be problematic when we're not actually about to run the VCPU, because it makes it difficult to establish the right context when handling interrupts from the timer, and it makes the register access code difficult to reason about. Luckily, now when we call vcpu_load in each ioctl implementation, we can simply remove the call from the non-KVM_RUN vcpu ioctls, and our kvm_arch_vcpu_load() is only used for loading vcpu content to the physical CPU when we're actually going to run the vcpu. Cc: stable@vger.kernel.org Fixes: 9b062471e52a ("KVM: Move vcpu_load to arch-specific kvm_arch_vcpu_ioctl") Reviewed-by: Julien Grall <julien.grall@arm.com> Reviewed-by: Marc Zyngier <marc.zyngier@arm.com> Reviewed-by: Andrew Jones <drjones@redhat.com> Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org> Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
2018-03-14KVM: arm/arm64: vgic: Add missing irq_lock to vgic_mmio_read_pendingAndre Przywara
Our irq_is_pending() helper function accesses multiple members of the vgic_irq struct, so we need to hold the lock when calling it. Add that requirement as a comment to the definition and take the lock around the call in vgic_mmio_read_pending(), where we were missing it before. Fixes: 96b298000db4 ("KVM: arm/arm64: vgic-new: Add PENDING registers handlers") Signed-off-by: Andre Przywara <andre.przywara@arm.com> Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
2018-03-09KVM: s390: Refactor host cmma and pfmfi interpretation controlsJanosch Frank
use_cmma in kvm_arch means that the KVM hypervisor is allowed to use cmma, whereas use_cmma in the mm context means cmm has been used before. Let's rename the context one to uses_cmm, as the vm does use collaborative memory management but the host uses the cmm assist (interpretation facility). Also let's introduce use_pfmfi, so we can remove the pfmfi disablement when we activate cmma and rather not activate it in the first place. Signed-off-by: Janosch Frank <frankja@linux.vnet.ibm.com> Message-Id: <1518779775-256056-2-git-send-email-frankja@linux.vnet.ibm.com> Reviewed-by: David Hildenbrand <david@redhat.com> Reviewed-by: Christian Borntraeger <borntraeger@de.ibm.com> Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
2018-03-09KVM: s390: implement CPU model only facilitiesChristian Borntraeger
Some facilities should only be provided to the guest, if they are enabled by a CPU model. This allows us to avoid capabilities and to simply fall back to the cpumodel for deciding about a facility without enabling it for older QEMUs or QEMUs without a CPU model. Reviewed-by: David Hildenbrand <david@redhat.com> Reviewed-by: Cornelia Huck <cohuck@redhat.com> Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
2018-03-08KVM: nVMX: Enforce NMI controls on vmentry of L2 guestsKrish Sadhukhan
According to Intel SDM 26.2.1.1, the following rules should be enforced on vmentry: * If the "NMI exiting" VM-execution control is 0, "Virtual NMIs" VM-execution control must be 0. * If the “virtual NMIs” VM-execution control is 0, the “NMI-window exiting” VM-execution control must be 0. This patch enforces these rules when entering an L2 guest. Signed-off-by: Krish Sadhukhan <krish.sadhukhan@oracle.com> Reviewed-by: Liran Alon <liran.alon@oracle.com> Reviewed-by: Jim Mattson <jmattson@google.com> Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>