Age | Commit message (Collapse) | Author |
|
We have numerous checks around that checks if the HCR_EL2 has the RW bit
set to figure out if we're running an AArch64 or AArch32 VM. In some
cases, directly checking the RW bit (given its unintuitive name), is a
bit confusing, and that's not going to improve as we move logic around
for the following patches that optimize KVM on AArch64 hosts with VHE.
Therefore, introduce a helper, vcpu_el1_is_32bit, and replace existing
direct checks of HCR_EL2.RW with the helper.
Reviewed-by: Julien Grall <julien.grall@arm.com>
Reviewed-by: Julien Thierry <julien.thierry@arm.com>
Acked-by: Marc Zyngier <marc.zyngier@arm.com>
Reviewed-by: Andrew Jones <drjones@redhat.com>
Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
|
|
As we are about to move a bunch of save/restore logic for VHE kernels to
the load and put functions, we need some infrastructure to do this.
Reviewed-by: Andrew Jones <drjones@redhat.com>
Acked-by: Marc Zyngier <marc.zyngier@arm.com>
Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
|
|
We currently have a separate read-modify-write of the HCR_EL2 on entry
to the guest for the sole purpose of setting the VF and VI bits, if set.
Since this is most rarely the case (only when using userspace IRQ chip
and interrupts are in flight), let's get rid of this operation and
instead modify the bits in the vcpu->arch.hcr[_el2] directly when
needed.
Acked-by: Marc Zyngier <marc.zyngier@arm.com>
Reviewed-by: Andrew Jones <drjones@redhat.com>
Reviewed-by: Julien Thierry <julien.thierry@arm.com>
Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
|
|
We always set the IMO and FMO bits in the HCR_EL2 when running the
guest, regardless if we use the vgic or not. By moving these flags to
HCR_GUEST_FLAGS we can avoid one of the extra save/restore operations of
HCR_EL2 in the world switch code, and we can also soon get rid of the
other one.
This is safe, because even though the IMO and FMO bits control both
taking the interrupts to EL2 and remapping ICC_*_EL1 to ICV_*_EL1 when
executed at EL1, as long as we ensure that these bits are clear when
running the EL1 host, we're OK, because we reset the HCR_EL2 to only
have the HCR_RW bit set when returning to EL1 on non-VHE systems.
Reviewed-by: Marc Zyngier <marc.zyngier@arm.com>
Signed-off-by: Shih-Wei Li <shihwei@cs.columbia.edu>
Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
|
|
VHE actually doesn't rely on clearing the VTTBR when returning to the
host kernel, and that is the current key mechanism of hyp_panic to
figure out how to attempt to return to a state good enough to print a
panic statement.
Therefore, we split the hyp_panic function into two functions, a VHE and
a non-VHE, keeping the non-VHE version intact, but changing the VHE
behavior.
The vttbr_el2 check on VHE doesn't really make that much sense, because
the only situation where we can get here on VHE is when the hypervisor
assembly code actually called into hyp_panic, which only happens when
VBAR_EL2 has been set to the KVM exception vectors. On VHE, we can
always safely disable the traps and restore the host registers at this
point, so we simply do that unconditionally and call into the panic
function directly.
Acked-by: Marc Zyngier <marc.zyngier@arm.com>
Reviewed-by: Andrew Jones <drjones@redhat.com>
Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
|
|
We already have the percpu area for the host cpu state, which points to
the VCPU, so there's no need to store the VCPU pointer on the stack on
every context switch. We can be a little more clever and just use
tpidr_el2 for the percpu offset and load the VCPU pointer from the host
context.
This has the benefit of being able to retrieve the host context even
when our stack is corrupted, and it has a potential performance benefit
because we trade a store plus a load for an mrs and a load on a round
trip to the guest.
This does require us to calculate the percpu offset without including
the offset from the kernel mapping of the percpu array to the linear
mapping of the array (which is what we store in tpidr_el1), because a
PC-relative generated address in EL2 is already giving us the hyp alias
of the linear mapping of a kernel address. We do this in
__cpu_init_hyp_mode() by using kvm_ksym_ref().
The code that accesses ESR_EL2 was previously using an alternative to
use the _EL1 accessor on VHE systems, but this was actually unnecessary
as the _EL1 accessor aliases the ESR_EL2 register on VHE, and the _EL2
accessor does the same thing on both systems.
Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Reviewed-by: Marc Zyngier <marc.zyngier@arm.com>
Reviewed-by: Andrew Jones <drjones@redhat.com>
Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
|
|
Moving the call to vcpu_load() in kvm_arch_vcpu_ioctl_run() to after
we've called kvm_vcpu_first_run_init() simplifies some of the vgic and
there is also no need to do vcpu_load() for things such as handling the
immediate_exit flag.
Reviewed-by: Julien Grall <julien.grall@arm.com>
Reviewed-by: Marc Zyngier <marc.zyngier@arm.com>
Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
|
|
Calling vcpu_load() registers preempt notifiers for this vcpu and calls
kvm_arch_vcpu_load(). The latter will soon be doing a lot of heavy
lifting on arm/arm64 and will try to do things such as enabling the
virtual timer and setting us up to handle interrupts from the timer
hardware.
Loading state onto hardware registers and enabling hardware to signal
interrupts can be problematic when we're not actually about to run the
VCPU, because it makes it difficult to establish the right context when
handling interrupts from the timer, and it makes the register access
code difficult to reason about.
Luckily, now when we call vcpu_load in each ioctl implementation, we can
simply remove the call from the non-KVM_RUN vcpu ioctls, and our
kvm_arch_vcpu_load() is only used for loading vcpu content to the
physical CPU when we're actually going to run the vcpu.
Reviewed-by: Julien Grall <julien.grall@arm.com>
Reviewed-by: Marc Zyngier <marc.zyngier@arm.com>
Reviewed-by: Andrew Jones <drjones@redhat.com>
Signed-off-by: Christoffer Dall <christoffer.dall@linaro.org>
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
|
|
This adds code to the radix hypervisor page fault handler to handle the
case where the guest memory is backed by 1GB hugepages, and put them
into the partition-scoped radix tree at the PUD level. The code is
essentially analogous to the code for 2MB pages. This also rearranges
kvmppc_create_pte() to make it easier to follow.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
|
|
When using the radix MMU, we can get hypervisor page fault interrupts
with the DSISR_SET_RC bit set in DSISR/HSRR1, indicating that an
attempt to set the R (reference) or C (change) bit in a PTE atomically
failed. Previously we would find the corresponding Linux PTE and
check the permission and dirty bits there, but this is not really
necessary since we only need to do what the hardware was trying to
do, namely set R or C atomically. This removes the code that reads
the Linux PTE and just update the partition-scoped PTE, having first
checked that it is still present, and if the access is a write, that
the PTE still has write permission.
Furthermore, we now check whether any other relevant bits are set
in DSISR, and if there are, then we proceed with the rest of the
function in order to handle whatever condition they represent,
instead of returning to the guest as we did previously.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
|
|
This improves the handling of transparent huge pages in the radix
hypervisor page fault handler. Previously, if a small page is faulted
in to a 2MB region of guest physical space, that means that there is
a page table pointer at the PMD level, which could never be replaced
by a leaf (2MB) PMD entry. This adds the code to clear the PMD,
invlidate the page walk cache and free the page table page in this
situation, so that the leaf PMD entry can be created.
This also adds code to check whether a PMD or PTE being inserted is
the same as is already there (because of a race with another CPU that
faulted on the same page) and if so, we don't replace the existing
entry, meaning that we don't invalidate the PTE or PMD and do a TLB
invalidation.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
|
|
Since commit fb1522e099f0 ("KVM: update to new mmu_notifier semantic
v2", 2017-08-31), the MMU notifier code in KVM no longer calls the
kvm_unmap_hva callback. This removes the PPC implementations of
kvm_unmap_hva().
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
|
|
vmx_save_host_state() is only called from kvm_arch_vcpu_ioctl_run() so
the context is pretty well defined and as we're past 'swapgs' MSR_GS_BASE
should contain kernel's GS base which we point to irq_stack_union.
Add new kernelmode_gs_base() API, irq_stack_union needs to be exported
as KVM can be build as module.
Acked-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
vmx_save_host_state() is only called from kvm_arch_vcpu_ioctl_run() so
the context is pretty well defined. Read MSR_{FS,KERNEL_GS}_BASE from
current->thread after calling save_fsgs() which takes care of
X86_BUG_NULL_SEG case now and will do RD[FG,GS]BASE when FSGSBASE
extensions are exposed to userspace (currently they are not).
Acked-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Allow to disable pause loop exit/pause filtering on a per VM basis.
If some VMs have dedicated host CPUs, they won't be negatively affected
due to needlessly intercepted PAUSE instructions.
Thanks to Jan H. Schönherr's initial patch.
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: Jan H. Schönherr <jschoenh@amazon.de>
Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
If host CPUs are dedicated to a VM, we can avoid VM exits on HLT.
This patch adds the per-VM capability to disable them.
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: Jan H. Schönherr <jschoenh@amazon.de>
Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Allowing a guest to execute MWAIT without interception enables a guest
to put a (physical) CPU into a power saving state, where it takes
longer to return from than what may be desired by the host.
Don't give a guest that power over a host by default. (Especially,
since nothing prevents a guest from using MWAIT even when it is not
advertised via CPUID.)
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: Jan H. Schönherr <jschoenh@amazon.de>
Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/kvms390/linux into HEAD
KVM: s390: fixes and features
- more kvm stat counters
- virtio gpu plumbing. The 3 non-KVM/s390 patches have Acks from
Bartlomiej Zolnierkiewicz, Heiko Carstens and Greg Kroah-Hartman
but all belong together to make virtio-gpu work as a tty. So
I carried them in the KVM/s390 tree.
- document some KVM_CAPs
- cpu-model only facilities
- cleanups
|
|
VMware exposes the following Pseudo PMCs:
0x10000: Physical host TSC
0x10001: Elapsed real time in ns
0x10002: Elapsed apparent time in ns
For more info refer to:
https://www.vmware.com/files/pdf/techpaper/Timekeeping-In-VirtualMachines.pdf
VMware allows access to these Pseduo-PMCs even when read via RDPMC
in Ring3 and CR4.PCE=0. Therefore, commit modifies x86 emulator
to allow access to these PMCs in this situation. In addition,
emulation of these PMCs were added to kvm_pmu_rdpmc().
Signed-off-by: Arbel Moshe <arbel.moshe@oracle.com>
Signed-off-by: Liran Alon <liran.alon@oracle.com>
Reviewed-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
If KVM enable_vmware_backdoor module parameter is set,
the commit change VMX to now intercept #GP instead of being directly
deliviered from CPU to guest.
It is done to support access to VMware Backdoor I/O ports
even if TSS I/O permission denies it.
In that case:
1. A #GP will be raised and intercepted.
2. #GP intercept handler will simulate I/O port access instruction.
3. I/O port access instruction simulation will allow access to VMware
backdoor ports specifically even if TSS I/O permission bitmap denies it.
Note that the above change introduce slight performance hit as now #GPs
are now not deliviered directly from CPU to guest but instead
cause #VMExit and instruction emulation.
However, this behavior is introduced only when enable_vmware_backdoor
KVM module parameter is set.
Signed-off-by: Liran Alon <liran.alon@oracle.com>
Reviewed-by: Nikita Leshenko <nikita.leshchenko@oracle.com>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Reviewed-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
If KVM enable_vmware_backdoor module parameter is set,
the commit change VMX to now intercept #GP instead of being directly
deliviered from CPU to guest.
It is done to support access to VMware backdoor I/O ports
even if TSS I/O permission denies it.
In that case:
1. A #GP will be raised and intercepted.
2. #GP intercept handler will simulate I/O port access instruction.
3. I/O port access instruction simulation will allow access to VMware
backdoor ports specifically even if TSS I/O permission bitmap denies it.
Note that the above change introduce slight performance hit as now #GPs
are not deliviered directly from CPU to guest but instead
cause #VMExit and instruction emulation.
However, this behavior is introduced only when enable_vmware_backdoor
KVM module parameter is set.
Signed-off-by: Liran Alon <liran.alon@oracle.com>
Reviewed-by: Nikita Leshenko <nikita.leshchenko@oracle.com>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Reviewed-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Access to VMware backdoor ports is done by one of the IN/OUT/INS/OUTS
instructions. These ports must be allowed access even if TSS I/O
permission bitmap don't allow it.
To handle this, VMX/SVM will be changed in future commits
to intercept #GP which was raised by such access and
handle it by calling x86 emulator to emulate instruction.
If it was one of these instructions, the x86 emulator already handles
it correctly (Since commit "KVM: x86: Always allow access to VMware
backdoor I/O ports") by not checking these ports against TSS I/O
permission bitmap.
One may wonder why checking for specific instructions is necessary
as we can just forward all #GPs to the x86 emulator.
There are multiple reasons for doing so:
1. We don't want the x86 emulator to be reached easily
by guest by just executing an instruction that raises #GP as that
exposes the x86 emulator as a bigger attack surface.
2. The x86 emulator is incomplete and therefore certain instructions
that can cause #GP cannot be emulated. Such an example is "INT x"
(opcode 0xcd) which reaches emulate_int() which can only emulate
the instruction if vCPU is in real-mode.
Signed-off-by: Liran Alon <liran.alon@oracle.com>
Reviewed-by: Nikita Leshenko <nikita.leshchenko@oracle.com>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Reviewed-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Next commits are going introduce support for accessing VMware backdoor
ports even though guest's TSS I/O permissions bitmap doesn't allow
access. This mimic VMware hypervisor behavior.
In order to support this, next commits will change VMX/SVM to
intercept #GP which was raised by such access and handle it by calling
the x86 emulator to emulate instruction. Since commit "KVM: x86:
Always allow access to VMware backdoor I/O ports", the x86 emulator
handles access to these I/O ports by not checking these ports against
the TSS I/O permission bitmap.
However, there could be cases that CPU rasies a #GP on instruction
that fails to be disassembled by the x86 emulator (Because of
incomplete implementation for example).
In those cases, we would like the #GP intercept to just forward #GP
as-is to guest as if there was no intercept to begin with.
However, current emulator code always queues #UD exception in case
emulator fails (including disassembly failures) which is not what is
wanted in this flow.
This commit addresses this issue by adding a new emulation_type flag
that will allow the #GP intercept handler to specify that it wishes
to be aware when instruction emulation fails and doesn't want #UD
exception to be queued.
Signed-off-by: Liran Alon <liran.alon@oracle.com>
Reviewed-by: Nikita Leshenko <nikita.leshchenko@oracle.com>
Reviewed-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
VMware allows access to these ports even if denied
by TSS I/O permission bitmap. Mimic behavior.
Signed-off-by: Liran Alon <liran.alon@oracle.com>
Reviewed-by: Nikita Leshenko <nikita.leshchenko@oracle.com>
Reviewed-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Support access to VMware backdoor requires KVM to intercept #GP
exceptions from guest which introduce slight performance hit.
Therefore, control this support by module parameter.
Note that module parameter is exported as it should be consumed by
kvm_intel & kvm_amd to determine if they should intercept #GP or not.
This commit doesn't change semantics.
It is done as a preparation for future commits.
Signed-off-by: Liran Alon <liran.alon@oracle.com>
Reviewed-by: Nikita Leshenko <nikita.leshchenko@oracle.com>
Reviewed-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Add kvm_fast_pio() to consolidate duplicate code in VMX and SVM.
Unexport kvm_fast_pio_in() and kvm_fast_pio_out().
Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Fast emulation of processor I/O for IN was disabled on x86 (both VMX
and SVM) some years ago due to a buggy implementation. The addition
of kvm_fast_pio_in(), used by SVM, re-introduced (functional!) fast
emulation of IN. Piggyback SVM's work and use kvm_fast_pio_in() on
VMX instead of performing full emulation of IN.
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
Fail a nested VMEntry with EXIT_REASON_INVALID_STATE if L2 guest state
is invalid, i.e. vmcs12 contained invalid guest state, and unrestricted
guest is disabled in L0 (and by extension disabled in L1).
WARN_ON_ONCE in handle_invalid_guest_state() if we're attempting to
emulate L2, i.e. nested_run_pending is true, to aid debug in the
(hopefully unlikely) scenario that we somehow skip the nested VMEntry
consistency check, e.g. due to a L0 bug.
Note: KVM relies on hardware to detect the scenario where unrestricted
guest is enabled in L0 but disabled in L1 and vmcs12 contains invalid
guest state, i.e. checking emulation_required in prepare_vmcs02 is
required only to handle the case were unrestricted guest is disabled
in L0 since L0 never actually attempts VMLAUNCH/VMRESUME with vmcs02.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
|
|
CR3 load/store exiting are always off when unrestricted guest
is enabled. WARN on the associated CR3 VMEXIT to detect code
that would re-introduce CR3 load/store exiting for unrestricted
guest.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
|
|
Now CR3 is not forced to a host-controlled value when paging is
disabled in an unrestricted guest, CR3 load/store exiting can be
left disabled (for an unrestricted guest). And because CR0.WP
and CR4.PAE/PSE are also not force to host-controlled values,
all of ept_update_paging_mode_cr0() is no longer needed, i.e.
skip ept_update_paging_mode_cr0() for an unrestricted guest.
Because MOV CR3 no longer exits when paging is disabled for an
unrestricted guest, vmx_decache_cr3() must always read GUEST_CR3
from the VMCS for an unrestricted guest.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
|
|
CR4.PAE - Unrestricted guest can only be enabled when EPT is
enabled, and vmx_set_cr4() clears hardware CR0.PAE based on
the guest's CR4.PAE, i.e. CR4.PAE always follows the guest's
value when unrestricted guest is enabled.
CR4.PSE - Unrestricted guest no longer uses the identity mapped
IA32 page tables since CR0.PG can be cleared in hardware, thus
there is no need to set CR4.PSE when paging is disabled in the
guest (and EPT is enabled).
Define KVM_VM_CR4_ALWAYS_ON_UNRESTRICTED_GUEST (to X86_CR4_VMXE)
and use it in lieu of KVM_*MODE_VM_CR4_ALWAYS_ON when unrestricted
guest is enabled, which removes the forcing of CR4.PAE.
Skip the manipulation of CR4.PAE/PSE for EPT when unrestricted
guest is enabled, as CR4.PAE isn't forced and so doesn't need to
be manually cleared, and CR4.PSE does not need to be set when
paging is disabled since the identity mapped IA32 page tables
are not used.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
|
|
Unrestricted guest can only be enabled when EPT is enabled, and
when EPT is enabled, ept_update_paging_mode_cr0() will clear
hardware CR0.WP based on the guest's CR0.WP, i.e. CR0.WP always
follows the guest's value when unrestricted guest is enabled.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
|
|
An unrestricted guest can run with hardware CR0.PG==0, i.e.
IA32 paging disabled, in which case there is no need to load
the guest's CR3 with identity mapped IA32 page tables since
hardware will effectively ignore CR3. If unrestricted guest
is enabled, don't configure the identity mapped IA32 page
table and always load the guest's desired CR3.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
|
|
An unrestricted guest can run with CR0.PG==0 and/or CR0.PE==0,
e.g. it can run in Real Mode without requiring host emulation.
The RM TSS is only used for emulating RM, i.e. it will never
be used when unrestricted guest is enabled and so doesn't need
to be configured.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
|
|
Hyper-V 2016 on KVM with SynIC enabled doesn't boot with the following
trace:
kvm_entry: vcpu 0
kvm_exit: reason MSR_WRITE rip 0xfffff8000131c1e5 info 0 0
kvm_hv_synic_set_msr: vcpu_id 0 msr 0x40000090 data 0x10000 host 0
kvm_msr: msr_write 40000090 = 0x10000 (#GP)
kvm_inj_exception: #GP (0x0)
KVM acts according to the following statement from TLFS:
"
11.8.4 SINTx Registers
...
Valid values for vector are 16-255 inclusive. Specifying an invalid
vector number results in #GP.
"
However, I checked and genuine Hyper-V doesn't #GP when we write 0x10000
to SINTx. I checked with Microsoft and they confirmed that if either the
Masked bit (bit 16) or the Polling bit (bit 18) is set to 1, then they
ignore the value of Vector. Make KVM act accordingly.
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Reviewed-by: Roman Kagan <rkagan@virtuozzo.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
|
|
vector change
When a new vector is written to SINx we update vec_bitmap/auto_eoi_bitmap
but we forget to remove old vector from these masks (in case it is not
present in some other SINTx).
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Reviewed-by: Roman Kagan <rkagan@virtuozzo.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
|
|
Nested Hyper-V/Windows guest running on top of KVM will use TSC page
clocksource in two cases:
- L0 exposes invariant TSC (CPUID.80000007H:EDX[8]).
- L0 provides Hyper-V Reenlightenment support (CPUID.40000003H:EAX[13]).
Exposing invariant TSC effectively blocks migration to hosts with different
TSC frequencies, providing reenlightenment support will be needed when we
start migrating nested workloads.
Implement rudimentary support for reenlightenment MSRs. For now, these are
just read/write MSRs with no effect.
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Reviewed-by: Roman Kagan <rkagan@virtuozzo.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
|
|
... to avoid having a stale value when handling an EPT misconfig for MMIO
regions.
MMIO regions that are not passed-through to the guest are handled through
EPT misconfigs. The first time a certain MMIO page is touched it causes an
EPT violation, then KVM marks the EPT entry to cause an EPT misconfig
instead. Any subsequent accesses to the entry will generate an EPT
misconfig.
Things gets slightly complicated with nested guest handling for MMIO
regions that are not passed through from L0 (i.e. emulated by L0
user-space).
An EPT violation for one of these MMIO regions from L2, exits to L0
hypervisor. L0 would then look at the EPT12 mapping for L1 hypervisor and
realize it is not present (or not sufficient to serve the request). Then L0
injects an EPT violation to L1. L1 would then update its EPT mappings. The
EXIT_QUALIFICATION value for L1 would come from exit_qualification variable
in "struct vcpu". The problem is that this variable is only updated on EPT
violation and not on EPT misconfig. So if an EPT violation because of a
read happened first, then an EPT misconfig because of a write happened
afterwards. The L0 hypervisor will still contain exit_qualification value
from the previous read instead of the write and end up injecting an EPT
violation to the L1 hypervisor with an out of date EXIT_QUALIFICATION.
The EPT violation that is injected from L0 to L1 needs to have the correct
EXIT_QUALIFICATION specially for the access bits because the individual
access bits for MMIO EPTs are updated only on actual access of this
specific type. So for the example above, the L1 hypervisor will keep
updating only the read bit in the EPT then resume the L2 guest. The L2
guest would end up causing another exit where the L0 *again* will inject
another EPT violation to L1 hypervisor with *again* an out of date
exit_qualification which indicates a read and not a write. Then this
ping-pong just keeps happening without making any forward progress.
The behavior of mapping MMIO regions changed in:
commit a340b3e229b24 ("kvm: Map PFN-type memory regions as writable (if possible)")
... where an EPT violation for a read would also fixup the write bits to
avoid another EPT violation which by acciddent would fix the bug mentioned
above.
This commit fixes this situation and ensures that the access bits for the
exit_qualifcation is up to date. That ensures that even L1 hypervisor
running with a KVM version before the commit mentioned above would still
work.
( The description above assumes EPT to be available and used by L1
hypervisor + the L1 hypervisor is passing through the MMIO region to the L2
guest while this MMIO region is emulated by the L0 user-space ).
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: x86@kernel.org
Cc: kvm@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Signed-off-by: KarimAllah Ahmed <karahmed@amazon.de>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
|
|
The type 'enum kvm_reg_ex' is an extension of 'enum kvm_reg', however
the extension is only semantical and the compiler doesn't know about the
relationship between the two types. In kvm_pdptr_read() a value of the
extended type is passed to kvm_x86_ops->cache_reg(), which expects a
value of the base type. Clang raises the following warning about the
type mismatch:
arch/x86/kvm/kvm_cache_regs.h:44:32: warning: implicit conversion from
enumeration type 'enum kvm_reg_ex' to different enumeration type
'enum kvm_reg' [-Wenum-conversion]
kvm_x86_ops->cache_reg(vcpu, VCPU_EXREG_PDPTR);
Cast VCPU_EXREG_PDPTR to 'enum kvm_reg' to make the compiler happy.
Signed-off-by: Matthias Kaehlcke <mka@chromium.org>
Reviewed-by: Guenter Roeck <groeck@chromium.org>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
|
|
Devices which use level-triggered interrupts under Windows 2016 with
Hyper-V role enabled don't work: Windows disables EOI broadcast in SPIV
unconditionally. Our in-kernel IOAPIC implementation emulates an old IOAPIC
version which has no EOI register so EOI never happens.
The issue was discovered and discussed a while ago:
https://www.spinics.net/lists/kvm/msg148098.html
While this is a guest OS bug (it should check that IOAPIC has the required
capabilities before disabling EOI broadcast) we can workaround it in KVM:
advertising DIRECTED_EOI with in-kernel IOAPIC makes little sense anyway.
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
|
|
Add support for AMD Core Performance counters in the guest. The base
event select and counter MSRs are changed. In addition, with the core
extension, there are 2 extra counters available for performance
measurements for a total of 6.
With the new MSRs, the logic to map them to the gp_counters[] is changed.
New functions are added to check the validity of the get/set MSRs.
If the guest has the X86_FEATURE_PERFCTR_CORE cpuid flag set, the number
of counters available to the vcpu is set to 6. It the flag is not set
then it is 4.
Signed-off-by: Janakarajan Natarajan <Janakarajan.Natarajan@amd.com>
[Squashed "Expose AMD Core Perf Extension flag to guests" - Radim.]
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
|
|
Add the EventSelect and Counter MSRs for AMD Core Perf Extension.
Signed-off-by: Janakarajan Natarajan <Janakarajan.Natarajan@amd.com>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
|
|
For testing the exitless interrupt support it turned out useful to
have separate counters for inject and delivery of I/O interrupt.
While at it do the same for all interrupt types. For timer
related interrupts (clock comparator and cpu timer) we even had
no delivery counters. Fix this as well. On this way some counters
are being renamed to have a similar name.
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Reviewed-by: Cornelia Huck <cohuck@redhat.com>
|
|
This counter can be used for administration, debug or test purposes.
Suggested-by: Vladislav Mironov <mironov@de.ibm.com>
Signed-off-by: QingFeng Hao <haoqf@linux.vnet.ibm.com>
Reviewed-by: Christian Borntraeger <borntraeger@de.ibm.com>
Reviewed-by: Cornelia Huck <cohuck@redhat.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
|
|
The S390 architecture does not support any graphics hardware,
but with the latest support for Virtio GPU in Linux and Virtio
GPU emulation in QEMU, it's possible to enable graphics for
S390 using the Virtio GPU device.
To enable display we need to enable the Linux Virtual Terminal (VT)
layer for S390. But the VT subsystem initializes quite early
at boot so we need a dummy console driver till the Virtio GPU
driver is initialized and we can run the framebuffer console.
The framebuffer console over a Virtio GPU device can be run
in combination with the serial SCLP console (default on S390).
The SCLP console can still be accessed by management applications
(eg: via Libvirt's virsh console).
Signed-off-by: Farhan Ali <alifm@linux.vnet.ibm.com>
Acked-by: Christian Borntraeger <borntraeger@de.ibm.com>
Reviewed-by: Thomas Huth <thuth@redhat.com>
Message-Id: <e23b61f4f599ba23881727a1e8880e9d60cc6a48.1519315352.git.alifm@linux.vnet.ibm.com>
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
The Linux Virtual Terminal (VT) layer provides a default keymap
which is compiled when VT layer is enabled. But at the same time
we are also compiling the EBCDIC keymap and this causes the linker
to complain.
So let's rename the EBCDIC keymap variables to prevent linker
conflict.
Signed-off-by: Farhan Ali <alifm@linux.vnet.ibm.com>
Acked-by: Christian Borntraeger <borntraeger@de.ibm.com>
Acked-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Reviewed-by: Thomas Huth <thuth@redhat.com>
Message-Id: <f670a2698d2372e1e990c48a29334ffe894804b1.1519315352.git.alifm@linux.vnet.ibm.com>
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
|
|
The 'commit e25df1205f37 ("[S390] Kconfig: menus with depends on HAS_IOMEM.")'
added the HAS_IOMEM dependecy for "Graphics support". This disabled the
"Graphics support" menu for S390. But if we enable VT layer for S390,
we would also need to enable the dummy console. So let's remove the
HAS_IOMEM dependency.
Move this dependency to sub menu items and console drivers that use
io memory.
Signed-off-by: Farhan Ali <alifm@linux.vnet.ibm.com>
Reviewed-by: Thomas Huth <thuth@redhat.com>
Message-Id: <6e8ef238162df5be4462126be155975c722e9863.1519315352.git.alifm@linux.vnet.ibm.com>
Acked-by: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com>
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
|
|
A case statement in kvm_s390_shadow_tables uses fallthrough annotations
which are not recognized by gcc because they are hidden within a block.
Move these annotations out of the block to fix (W=1) warnings like below:
arch/s390/kvm/gaccess.c: In function 'kvm_s390_shadow_tables':
arch/s390/kvm/gaccess.c:1029:26: warning: this statement may fall through [-Wimplicit-fallthrough=]
case ASCE_TYPE_REGION1: {
^
Signed-off-by: Sebastian Ott <sebott@linux.vnet.ibm.com>
Reviewed-by: Cornelia Huck <cohuck@redhat.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
|
|
We want to count IO exit requests in kvm_stat. At the same time
we can get rid of the handle_noop function.
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Cornelia Huck <cohuck@redhat.com>
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
|
|
commit 35b3fde6203b ("KVM: s390: wire up bpb feature") has no
documentation for KVM_CAP_S390_BPB. While adding this let's also add
other missing capabilities like KVM_CAP_S390_PSW, KVM_CAP_S390_GMAP and
KVM_CAP_S390_COW.
Reviewed-by: Cornelia Huck <cohuck@redhat.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Janosch Frank <frankja@linux.vnet.ibm.com>
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
|