summaryrefslogtreecommitdiff
path: root/arch/x86
AgeCommit message (Collapse)Author
2019-07-11Merge tag 'kvm-arm-for-5.3' of ↵Paolo Bonzini
git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD KVM/arm updates for 5.3 - Add support for chained PMU counters in guests - Improve SError handling - Handle Neoverse N1 erratum #1349291 - Allow side-channel mitigation status to be migrated - Standardise most AArch64 system register accesses to msr_s/mrs_s - Fix host MPIDR corruption on 32bit
2019-07-11KVM: x86: Unconditionally enable irqs in guest contextSean Christopherson
On VMX, KVM currently does not re-enable irqs until after it has exited the guest context. As a result, a tick that fires in the window between VM-Exit and guest_exit_irqoff() will be accounted as system time. While said window is relatively small, it's large enough to be problematic in some configurations, e.g. if VM-Exits are consistently occurring a hair earlier than the tick irq. Intentionally toggle irqs back off so that guest_exit_irqoff() can be used in lieu of guest_exit() in order to avoid the save/restore of flags in guest_exit(). On my Haswell system, "nop; cli; sti" is ~6 cycles, versus ~28 cycles for "pushf; pop <reg>; cli; push <reg>; popf". Fixes: f2485b3e0c6c0 ("KVM: x86: use guest_exit_irqoff") Reported-by: Wei Yang <w90p710@gmail.com> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-07-11KVM: x86: PMU Event FilterEric Hankland
Some events can provide a guest with information about other guests or the host (e.g. L3 cache stats); providing the capability to restrict access to a "safe" set of events would limit the potential for the PMU to be used in any side channel attacks. This change introduces a new VM ioctl that sets an event filter. If the guest attempts to program a counter for any blacklisted or non-whitelisted event, the kernel counter won't be created, so any RDPMC/RDMSR will show 0 instances of that event. Signed-off-by: Eric Hankland <ehankland@google.com> [Lots of changes. All remaining bugs are probably mine. - Paolo] Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-07-10kvm: x86: Fix -Wmissing-prototypes warningsYi Wang
We get a warning when build kernel W=1: arch/x86/kvm/../../../virt/kvm/eventfd.c:48:1: warning: no previous prototype for ‘kvm_arch_irqfd_allowed’ [-Wmissing-prototypes] kvm_arch_irqfd_allowed(struct kvm *kvm, struct kvm_irqfd *args) ^ The reason is kvm_arch_irqfd_allowed() is declared in arch/x86/kvm/irq.h, which is not included by eventfd.c. Considering kvm_arch_irqfd_allowed() is a weakly defined function in eventfd.c, remove the declaration to kvm_host.h can fix this. Signed-off-by: Yi Wang <wang.yi59@zte.com.cn> Reviewed-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-07-05KVM: LAPIC: Retry tune per-vCPU timer_advance_ns if adaptive tuning goes insaneWanpeng Li
Retry tune per-vCPU timer_advance_ns if adaptive tuning goes insane which can happen sporadically in product environment. Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Radim Krčmář <rkrcmar@redhat.com> Signed-off-by: Wanpeng Li <wanpengli@tencent.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-07-05kvm: LAPIC: write down valid APIC registersPaolo Bonzini
Replace a magic 64-bit mask with a list of valid registers, computing the same mask in the end. Suggested-by: Liran Alon <liran.alon@oracle.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-07-05KVM: LAPIC: ARBPRI is a reserved register for x2APICPaolo Bonzini
kvm-unit-tests were adjusted to match bare metal behavior, but KVM itself was not doing what bare metal does; fix that. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-07-05KVM nVMX: Check Host Segment Registers and Descriptor Tables on vmentry of ↵Krish Sadhukhan
nested guests According to section "Checks on Host Segment and Descriptor-Table Registers" in Intel SDM vol 3C, the following checks are performed on vmentry of nested guests: - In the selector field for each of CS, SS, DS, ES, FS, GS and TR, the RPL (bits 1:0) and the TI flag (bit 2) must be 0. - The selector fields for CS and TR cannot be 0000H. - The selector field for SS cannot be 0000H if the "host address-space size" VM-exit control is 0. - On processors that support Intel 64 architecture, the base-address fields for FS, GS and TR must contain canonical addresses. Signed-off-by: Krish Sadhukhan <krish.sadhukhan@oracle.com> Reviewed-by: Karl Heubaum <karl.heubaum@oracle.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-07-05KVM: nVMX: Stash L1's CR3 in vmcs01.GUEST_CR3 on nested entry w/o EPTSean Christopherson
KVM does not have 100% coverage of VMX consistency checks, i.e. some checks that cause VM-Fail may only be detected by hardware during a nested VM-Entry. In such a case, KVM must restore L1's state to the pre-VM-Enter state as L2's state has already been loaded into KVM's software model. L1's CR3 and PDPTRs in particular are loaded from vmcs01.GUEST_*. But when EPT is disabled, the associated fields hold KVM's shadow values, not L1's "real" values. Fortunately, when EPT is disabled the PDPTRs come from memory, i.e. are not cached in the VMCS. Which leaves CR3 as the sole anomaly. A previously applied workaround to handle CR3 was to force nested early checks if EPT is disabled: commit 2b27924bb1d48 ("KVM: nVMX: always use early vmcs check when EPT is disabled") Forcing nested early checks is undesirable as doing so adds hundreds of cycles to every nested VM-Entry. Rather than take this performance hit, handle CR3 by overwriting vmcs01.GUEST_CR3 with L1's CR3 during nested VM-Entry when EPT is disabled *and* nested early checks are disabled. By stuffing vmcs01.GUEST_CR3, nested_vmx_restore_host_state() will naturally restore the correct vcpu->arch.cr3 from vmcs01.GUEST_CR3. These shenanigans work because nested_vmx_restore_host_state() does a full kvm_mmu_reset_context(), i.e. unloads the current MMU, which guarantees vmcs01.GUEST_CR3 will be rewritten with a new shadow CR3 prior to re-entering L1. vcpu->arch.root_mmu.root_hpa is set to INVALID_PAGE via: nested_vmx_restore_host_state() -> kvm_mmu_reset_context() -> kvm_mmu_unload() -> kvm_mmu_free_roots() kvm_mmu_unload() has WARN_ON(root_hpa != INVALID_PAGE), i.e. we can bank on 'root_hpa == INVALID_PAGE' unless the implementation of kvm_mmu_reset_context() is changed. On the way into L1, VMCS.GUEST_CR3 is guaranteed to be written (on a successful entry) via: vcpu_enter_guest() -> kvm_mmu_reload() -> kvm_mmu_load() -> kvm_mmu_load_cr3() -> vmx_set_cr3() Stuff vmcs01.GUEST_CR3 if and only if nested early checks are disabled as a "late" VM-Fail should never happen win that case (KVM WARNs), and the conditional write avoids the need to restore the correct GUEST_CR3 when nested_vmx_check_vmentry_hw() fails. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20190607185534.24368-1-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-07-05KVM: x86: add tracepoints around __direct_map and FNAME(fetch)Paolo Bonzini
These are useful in debugging shadow paging. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-07-05KVM: x86: change kvm_mmu_page_get_gfn BUG_ON to WARN_ONPaolo Bonzini
Note that in such a case it is quite likely that KVM will BUG_ON in __pte_list_remove when the VM is closed. However, there is no immediate risk of memory corruption in the host so a WARN_ON is enough and it lets you gather traces for debugging. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-07-05KVM: x86: remove now unneeded hugepage gfn adjustmentPaolo Bonzini
After the previous patch, the low bits of the gfn are masked in both FNAME(fetch) and __direct_map, so we do not need to clear them in transparent_hugepage_adjust. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-07-05KVM: x86: make FNAME(fetch) and __direct_map more similarPaolo Bonzini
These two functions are basically doing the same thing through kvm_mmu_get_page, link_shadow_page and mmu_set_spte; yet, for historical reasons, their code looks very different. This patch tries to take the best of each and make them very similar, so that it is easy to understand changes that apply to both of them. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-07-05kvm: x86: Do not release the page inside mmu_set_spte()Junaid Shahid
Release the page at the call-site where it was originally acquired. This makes the exit code cleaner for most call sites, since they do not need to duplicate code between success and the failure label. Signed-off-by: Junaid Shahid <junaids@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-07-05KVM: cpuid: remove has_leaf_count from struct kvm_cpuid_paramPaolo Bonzini
The has_leaf_count member was originally added for KVM's paravirtualization CPUID leaves. However, since then the leaf count _has_ been added to those leaves as well, so we can drop that special case. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-07-05KVM: cpuid: rename do_cpuid_1_entPaolo Bonzini
do_cpuid_1_ent does not do the entire processing for a CPUID entry, it only retrieves the host's values. Rename it to match reality. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-07-05KVM: cpuid: set struct kvm_cpuid_entry2 flags in do_cpuid_1_entPaolo Bonzini
do_cpuid_1_ent is typically called in two places by __do_cpuid_func for CPUID functions that have subleafs. Both places have to set the KVM_CPUID_FLAG_SIGNIFCANT_INDEX. Set that flag, and KVM_CPUID_FLAG_STATEFUL_FUNC as well, directly in do_cpuid_1_ent. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-07-05KVM: cpuid: extract do_cpuid_7_mask and support multiple subleafsPaolo Bonzini
CPUID function 7 has multiple subleafs. Instead of having nested switch statements, move the logic to filter supported features to a separate function, and call it for each subleaf. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-07-05KVM: cpuid: do_cpuid_ent works on a whole CPUID functionPaolo Bonzini
Rename it as well as __do_cpuid_ent and __do_cpuid_ent_emulated to have "func" in its name, and drop the index parameter which is always 0. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-07-03KVM: LAPIC: remove the trailing newline used in the fmt parameter of TP_printkWanpeng Li
The trailing newlines will lead to extra newlines in the trace file which looks like the following output, so remove it. qemu-system-x86-15695 [002] ...1 15774.839240: kvm_hv_timer_state: vcpu_id 0 hv_timer 1 qemu-system-x86-15695 [002] ...1 15774.839309: kvm_hv_timer_state: vcpu_id 0 hv_timer 1 Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Radim Krčmář <rkrcmar@redhat.com> Signed-off-by: Wanpeng Li <wanpengli@tencent.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-07-03KVM: svm: add nrips module parameterPaolo Bonzini
Allow testing code for old processors that lack the next RIP save feature, by disabling usage of the next_rip field. Nested hypervisors however get the feature unconditionally. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-07-02kvm: x86: Pass through AMD_STIBP_ALWAYS_ON in GET_SUPPORTED_CPUIDJim Mattson
This bit is purely advisory. Passing it through to the guest indicates that the virtual processor, like the physical processor, prefers that STIBP is only set once during boot and not changed. Signed-off-by: Jim Mattson <jmattson@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-07-02kvm: nVMX: Remove unnecessary sync_roots from handle_inveptJim Mattson
When L0 is executing handle_invept(), the TDP MMU is active. Emulating an L1 INVEPT does require synchronizing the appropriate shadow EPT root(s), but a call to kvm_mmu_sync_roots in this context won't do that. Similarly, the hardware TLB and paging-structure-cache entries associated with the appropriate shadow EPT root(s) must be flushed, but requesting a TLB_FLUSH from this context won't do that either. How did this ever work? KVM always does a sync_roots and TLB flush (in the correct context) when transitioning from L1 to L2. That isn't the best choice for nested VM performance, but it effectively papers over the mistakes here. Remove the unnecessary operations and leave a comment to try to do better in the future. Reported-by: Junaid Shahid <junaids@google.com> Fixes: bfd0a56b90005f ("nEPT: Nested INVEPT") Cc: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com> Cc: Nadav Har'El <nyh@il.ibm.com> Cc: Jun Nakajima <jun.nakajima@intel.com> Cc: Xinhao Xu <xinhao.xu@intel.com> Cc: Yang Zhang <yang.z.zhang@Intel.com> Cc: Gleb Natapov <gleb@redhat.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Reviewed-by Peter Shier <pshier@google.com> Reviewed-by: Junaid Shahid <junaids@google.com> Signed-off-by: Jim Mattson <jmattson@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-07-02KVM: X86: Expose PV_SCHED_YIELD CPUID feature bit to guestWanpeng Li
Expose PV_SCHED_YIELD feature bit to guest, the guest can check this feature bit before using paravirtualized sched yield. Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Radim Krčmář <rkrcmar@redhat.com> Cc: Liran Alon <liran.alon@oracle.com> Signed-off-by: Wanpeng Li <wanpengli@tencent.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-07-02KVM: X86: Implement PV sched yield hypercallWanpeng Li
The target vCPUs are in runnable state after vcpu_kick and suitable as a yield target. This patch implements the sched yield hypercall. 17% performance increasement of ebizzy benchmark can be observed in an over-subscribe environment. (w/ kvm-pv-tlb disabled, testing TLB flush call-function IPI-many since call-function is not easy to be trigged by userspace workload). Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Radim Krčmář <rkrcmar@redhat.com> Cc: Liran Alon <liran.alon@oracle.com> Signed-off-by: Wanpeng Li <wanpengli@tencent.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-07-02KVM: X86: Yield to IPI target if necessaryWanpeng Li
When sending a call-function IPI-many to vCPUs, yield if any of the IPI target vCPUs was preempted, we just select the first preempted target vCPU which we found since the state of target vCPUs can change underneath and to avoid race conditions. Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Radim Krčmář <rkrcmar@redhat.com> Cc: Liran Alon <liran.alon@oracle.com> Signed-off-by: Wanpeng Li <wanpengli@tencent.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-07-02x86/kvm/nVMX: fix VMCLEAR when Enlightened VMCS is in useVitaly Kuznetsov
When Enlightened VMCS is in use, it is valid to do VMCLEAR and, according to TLFS, this should "transition an enlightened VMCS from the active to the non-active state". It is, however, wrong to assume that it is only valid to do VMCLEAR for the eVMCS which is currently active on the vCPU performing VMCLEAR. Currently, the logic in handle_vmclear() is broken: in case, there is no active eVMCS on the vCPU doing VMCLEAR we treat the argument as a 'normal' VMCS and kvm_vcpu_write_guest() to the 'launch_state' field irreversibly corrupts the memory area. So, in case the VMCLEAR argument is not the current active eVMCS on the vCPU, how can we know if the area it is pointing to is a normal or an enlightened VMCS? Thanks to the bug in Hyper-V (see commit 72aeb60c52bf7 ("KVM: nVMX: Verify eVMCS revision id match supported eVMCS version on eVMCS VMPTRLD")) we can not, the revision can't be used to distinguish between them. So let's assume it is always enlightened in case enlightened vmentry is enabled in the assist page. Also, check if vmx->nested.enlightened_vmcs_enabled to minimize the impact for 'unenlightened' workloads. Fixes: b8bbab928fb1 ("KVM: nVMX: implement enlightened VMPTRLD and VMCLEAR") Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-07-02x86/KVM/nVMX: don't use clean fields data on enlightened VMLAUNCHVitaly Kuznetsov
Apparently, Windows doesn't maintain clean fields data after it does VMCLEAR for an enlightened VMCS so we can only use it on VMRESUME. The issue went unnoticed because currently we do nested_release_evmcs() in handle_vmclear() and the consecutive enlightened VMPTRLD invalidates clean fields when a new eVMCS is mapped but we're going to change the logic. Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-07-02KVM: nVMX: list VMX MSRs in KVM_GET_MSR_INDEX_LISTPaolo Bonzini
This allows userspace to know which MSRs are supported by the hypervisor. Unfortunately userspace must resort to tricks for everything except MSR_IA32_VMX_VMFUNC (which was just added in the previous patch). One possibility is to use the feature control MSR, which is tied to nested VMX as well and is present on all KVM versions that support feature MSRs. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-07-02KVM: nVMX: allow setting the VMFUNC controls MSRPaolo Bonzini
Allow userspace to set a custom value for the VMFUNC controls MSR, as long as the capabilities it advertises do not exceed those of the host. Fixes: 27c42a1bb ("KVM: nVMX: Enable VMFUNC for the L1 hypervisor", 2017-08-03) Reviewed-by: Liran Alon <liran.alon@oracle.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-07-02KVM: nVMX: include conditional controls in /dev/kvm KVM_GET_MSRSPaolo Bonzini
Some secondary controls are automatically enabled/disabled based on the CPUID values that are set for the guest. However, they are still available at a global level and therefore should be present when KVM_GET_MSRS is sent to /dev/kvm. Fixes: 1389309c811 ("KVM: nVMX: expose VMX capabilities for nested hypervisors to userspace", 2018-02-26) Reviewed-by: Liran Alon <liran.alon@oracle.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-06-21x86/vdso: Prevent segfaults due to hoisted vclock readsAndy Lutomirski
GCC 5.5.0 sometimes cleverly hoists reads of the pvclock and/or hvclock pages before the vclock mode checks. This creates a path through vclock_gettime() in which no vclock is enabled at all (due to disabled TSC on old CPUs, for example) but the pvclock or hvclock page nevertheless read. This will segfault on bare metal. This fixes commit 459e3a21535a ("gcc-9: properly declare the {pv,hv}clock_page storage") in the sense that, before that commit, GCC didn't seem to generate the offending code. There was nothing wrong with that commit per se, and -stable maintainers should backport this to all supported kernels regardless of whether the offending commit was present, since the same crash could just as easily be triggered by the phase of the moon. On GCC 9.1.1, this doesn't seem to affect the generated code at all, so I'm not too concerned about performance regressions from this fix. Cc: stable@vger.kernel.org Cc: x86@kernel.org Cc: Borislav Petkov <bp@alien8.de> Reported-by: Duncan Roe <duncan_roe@optusnet.com.au> Signed-off-by: Andy Lutomirski <luto@kernel.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-06-21Merge tag 'spdx-5.2-rc6' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/spdx Pull still more SPDX updates from Greg KH: "Another round of SPDX updates for 5.2-rc6 Here is what I am guessing is going to be the last "big" SPDX update for 5.2. It contains all of the remaining GPLv2 and GPLv2+ updates that were "easy" to determine by pattern matching. The ones after this are going to be a bit more difficult and the people on the spdx list will be discussing them on a case-by-case basis now. Another 5000+ files are fixed up, so our overall totals are: Files checked: 64545 Files with SPDX: 45529 Compared to the 5.1 kernel which was: Files checked: 63848 Files with SPDX: 22576 This is a huge improvement. Also, we deleted another 20000 lines of boilerplate license crud, always nice to see in a diffstat" * tag 'spdx-5.2-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/spdx: (65 commits) treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 507 treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 506 treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 505 treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 504 treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 503 treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 502 treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 501 treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 500 treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 499 treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 498 treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 497 treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 496 treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 495 treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 491 treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 490 treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 489 treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 488 treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 487 treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 486 treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 485 ...
2019-06-20Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvmLinus Torvalds
Pull kvm fixes from Paolo Bonzini: "Fixes for ARM and x86, plus selftest patches and nicer structs for nested state save/restore" * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: KVM: nVMX: reorganize initial steps of vmx_set_nested_state KVM: arm/arm64: Fix emulated ptimer irq injection tests: kvm: Check for a kernel warning kvm: tests: Sort tests in the Makefile alphabetically KVM: x86/mmu: Allocate PAE root array when using SVM's 32-bit NPT KVM: x86: Modify struct kvm_nested_state to have explicit fields for data KVM: fix typo in documentation KVM: nVMX: use correct clean fields when copying from eVMCS KVM: arm/arm64: vgic: Fix kvm_device leak in vgic_its_destroy KVM: arm64: Filter out invalid core register IDs in KVM_GET_REG_LIST KVM: arm64: Implement vq_present() as a macro
2019-06-20KVM: nVMX: reorganize initial steps of vmx_set_nested_statePaolo Bonzini
Commit 332d079735f5 ("KVM: nVMX: KVM_SET_NESTED_STATE - Tear down old EVMCS state before setting new state", 2019-05-02) broke evmcs_test because the eVMCS setup must be performed even if there is no VMXON region defined, as long as the eVMCS bit is set in the assist page. While the simplest possible fix would be to add a check on kvm_state->flags & KVM_STATE_NESTED_EVMCS in the initial "if" that covers kvm_state->hdr.vmx.vmxon_pa == -1ull, that is quite ugly. Instead, this patch moves checks earlier in the function and conditionalizes them on kvm_state->hdr.vmx.vmxon_pa, so that vmx_set_nested_state always goes through vmx_leave_nested and nested_enable_evmcs. Fixes: 332d079735f5 ("KVM: nVMX: KVM_SET_NESTED_STATE - Tear down old EVMCS state before setting new state") Cc: Aaron Lewis <aaronlewis@google.com> Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-06-20KVM: x86: Fix apic dangling pointer in vcpuSaar Amar
The function kvm_create_lapic() attempts to allocate the apic structure and sets a pointer to it in the virtual processor structure. However, if get_zeroed_page() failed, the function frees the apic chunk, but forgets to set the pointer in the vcpu to NULL. It's not a security issue since there isn't a use of that pointer if kvm_create_lapic() returns error, but it's more accurate that way. Signed-off-by: Saar Amar <saaramar@microsoft.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-06-20KVM: VMX: check CPUID before allowing read/write of IA32_XSSWanpeng Li
Raise #GP when guest read/write IA32_XSS, but the CPUID bits say that it shouldn't exist. Fixes: 203000993de5 (kvm: vmx: add MSR logic for XSAVES) Reported-by: Xiaoyao Li <xiaoyao.li@linux.intel.com> Reported-by: Tao Xu <tao3.xu@intel.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Radim Krčmář <rkrcmar@redhat.com> Cc: stable@vger.kernel.org Signed-off-by: Wanpeng Li <wanpengli@tencent.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-06-19treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 504Thomas Gleixner
Based on 1 normalized pattern(s): this file is free software you can redistribute it and or modify it under the terms of version 2 of the gnu general public license as published by the free software foundation this program is distributed in the hope that it will be useful but without any warranty without even the implied warranty of merchantability or fitness for a particular purpose see the gnu general public license for more details you should have received a copy of the gnu general public license along with this program if not write to the free software foundation inc 51 franklin st fifth floor boston ma 02110 1301 usa extracted by the scancode license scanner the SPDX license identifier GPL-2.0-only has been chosen to replace the boilerplate/reference in 8 file(s). Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Allison Randal <allison@lohutok.net> Reviewed-by: Enrico Weigelt <info@metux.net> Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org> Cc: linux-spdx@vger.kernel.org Link: https://lkml.kernel.org/r/20190604081207.443595178@linutronix.de Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-06-19treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 500Thomas Gleixner
Based on 2 normalized pattern(s): this program is free software you can redistribute it and or modify it under the terms of the gnu general public license version 2 as published by the free software foundation this program is free software you can redistribute it and or modify it under the terms of the gnu general public license version 2 as published by the free software foundation # extracted by the scancode license scanner the SPDX license identifier GPL-2.0-only has been chosen to replace the boilerplate/reference in 4122 file(s). Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Enrico Weigelt <info@metux.net> Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org> Reviewed-by: Allison Randal <allison@lohutok.net> Cc: linux-spdx@vger.kernel.org Link: https://lkml.kernel.org/r/20190604081206.933168790@linutronix.de Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-06-19treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 499Thomas Gleixner
Based on 1 normalized pattern(s): this work is licensed under the terms of the gnu gpl version 2 see the copying file in the top level directory extracted by the scancode license scanner the SPDX license identifier GPL-2.0-only has been chosen to replace the boilerplate/reference in 35 file(s). Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org> Reviewed-by: Enrico Weigelt <info@metux.net> Reviewed-by: Allison Randal <allison@lohutok.net> Cc: linux-spdx@vger.kernel.org Link: https://lkml.kernel.org/r/20190604081206.797835076@linutronix.de Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-06-19treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 497Thomas Gleixner
Based on 1 normalized pattern(s): this file is part of the linux kernel and is made available under the terms of the gnu general public license version 2 extracted by the scancode license scanner the SPDX license identifier GPL-2.0-only has been chosen to replace the boilerplate/reference in 28 file(s). Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Enrico Weigelt <info@metux.net> Reviewed-by: Allison Randal <allison@lohutok.net> Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org> Cc: linux-spdx@vger.kernel.org Link: https://lkml.kernel.org/r/20190604081206.534229504@linutronix.de Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-06-19treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 477Thomas Gleixner
Based on 1 normalized pattern(s): subject to gplv2 extracted by the scancode license scanner the SPDX license identifier GPL-2.0-only has been chosen to replace the boilerplate/reference in 1 file(s). Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Enrico Weigelt <info@metux.net> Reviewed-by: Allison Randal <allison@lohutok.net> Cc: linux-spdx@vger.kernel.org Link: https://lkml.kernel.org/r/20190604081204.018005938@linutronix.de Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-06-19treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 474Thomas Gleixner
Based on 1 normalized pattern(s): subject to the gnu public license v 2 no warranty of any kind extracted by the scancode license scanner the SPDX license identifier GPL-2.0-only has been chosen to replace the boilerplate/reference in 2 file(s). Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Enrico Weigelt <info@metux.net> Reviewed-by: Allison Randal <allison@lohutok.net> Cc: linux-spdx@vger.kernel.org Link: https://lkml.kernel.org/r/20190604081203.641025917@linutronix.de Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-06-19treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 243Thomas Gleixner
Based on 1 normalized pattern(s): this file is licensed under the gpl v2 extracted by the scancode license scanner the SPDX license identifier GPL-2.0-only has been chosen to replace the boilerplate/reference in 3 file(s). Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Alexios Zavras <alexios.zavras@intel.com> Reviewed-by: Allison Randal <allison@lohutok.net> Reviewed-by: Armijn Hemel <armijn@tjaldur.nl> Cc: linux-spdx@vger.kernel.org Link: https://lkml.kernel.org/r/20190602204654.634736654@linutronix.de Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-06-19treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 230Thomas Gleixner
Based on 2 normalized pattern(s): this source code is licensed under the gnu general public license version 2 see the file copying for more details this source code is licensed under general public license version 2 see extracted by the scancode license scanner the SPDX license identifier GPL-2.0-only has been chosen to replace the boilerplate/reference in 52 file(s). Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Enrico Weigelt <info@metux.net> Reviewed-by: Allison Randal <allison@lohutok.net> Reviewed-by: Alexios Zavras <alexios.zavras@intel.com> Cc: linux-spdx@vger.kernel.org Link: https://lkml.kernel.org/r/20190602204653.449021192@linutronix.de Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-06-19KVM: x86/mmu: Allocate PAE root array when using SVM's 32-bit NPTSean Christopherson
SVM's Nested Page Tables (NPT) reuses x86 paging for the host-controlled page walk. For 32-bit KVM, this means PAE paging is used even when TDP is enabled, i.e. the PAE root array needs to be allocated. Fixes: ee6268ba3a68 ("KVM: x86: Skip pae_root shadow allocation if tdp enabled") Cc: stable@vger.kernel.org Reported-by: Jiri Palecek <jpalecek@web.de> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-06-19KVM: x86: Modify struct kvm_nested_state to have explicit fields for dataLiran Alon
Improve the KVM_{GET,SET}_NESTED_STATE structs by detailing the format of VMX nested state data in a struct. In order to avoid changing the ioctl values of KVM_{GET,SET}_NESTED_STATE, there is a need to preserve sizeof(struct kvm_nested_state). This is done by defining the data struct as "data.vmx[0]". It was the most elegant way I found to preserve struct size while still keeping struct readable and easy to maintain. It does have a misfortunate side-effect that now it has to be accessed as "data.vmx[0]" rather than just "data.vmx". Because we are already modifying these structs, I also modified the following: * Define the "format" field values as macros. * Rename vmcs_pa to vmcs12_pa for better readability. Signed-off-by: Liran Alon <liran.alon@oracle.com> [Remove SVM stubs, add KVM_STATE_NESTED_VMX_VMCS12_SIZE. - Paolo] Reviewed-by: Liran Alon <liran.alon@oracle.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-06-18KVM: nVMX: shadow pin based execution controlsPaolo Bonzini
The VMX_PREEMPTION_TIMER flag may be toggled frequently, though not *very* frequently. Since it does not affect KVM's dirty logic, e.g. the preemption timer value is loaded from vmcs12 even if vmcs12 is "clean", there is no need to mark vmcs12 dirty when L1 writes pin controls, and shadowing the field achieves that. Reviewed-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-06-18KVM: VMX: Leave preemption timer running when it's disabledSean Christopherson
VMWRITEs to the major VMCS controls, pin controls included, are deceptively expensive. CPUs with VMCS caching (Westmere and later) also optimize away consistency checks on VM-Entry, i.e. skip consistency checks if the relevant fields have not changed since the last successful VM-Entry (of the cached VMCS). Because uops are a precious commodity, uCode's dirty VMCS field tracking isn't as precise as software would prefer. Notably, writing any of the major VMCS fields effectively marks the entire VMCS dirty, i.e. causes the next VM-Entry to perform all consistency checks, which consumes several hundred cycles. As it pertains to KVM, toggling PIN_BASED_VMX_PREEMPTION_TIMER more than doubles the latency of the next VM-Entry (and again when/if the flag is toggled back). In a non-nested scenario, running a "standard" guest with the preemption timer enabled, toggling the timer flag is uncommon but not rare, e.g. roughly 1 in 10 entries. Disabling the preemption timer can change these numbers due to its use for "immediate exits", even when explicitly disabled by userspace. Nested virtualization in particular is painful, as the timer flag is set for the majority of VM-Enters, but prepare_vmcs02() initializes vmcs02's pin controls to *clear* the flag since its the timer's final state isn't known until vmx_vcpu_run(). I.e. the majority of nested VM-Enters end up unnecessarily writing pin controls *twice*. Rather than toggle the timer flag in pin controls, set the timer value itself to the largest allowed value to put it into a "soft disabled" state, and ignore any spurious preemption timer exits. Sadly, the timer is a 32-bit value and so theoretically it can fire before the head death of the universe, i.e. spurious exits are possible. But because KVM does *not* save the timer value on VM-Exit and because the timer runs at a slower rate than the TSC, the maximuma timer value is still sufficiently large for KVM's purposes. E.g. on a modern CPU with a timer that runs at 1/32 the frequency of a 2.4ghz constant-rate TSC, the timer will fire after ~55 seconds of *uninterrupted* guest execution. In other words, spurious VM-Exits are effectively only possible if the host is completely tickless on the logical CPU, the guest is not using the preemption timer, and the guest is not generating VM-Exits for any other reason. To be safe from bad/weird hardware, disable the preemption timer if its maximum delay is less than ten seconds. Ten seconds is mostly arbitrary and was selected in no small part because it's a nice round number. For simplicity and paranoia, fall back to __kvm_request_immediate_exit() if the preemption timer is disabled by KVM or userspace. Previously KVM continued to use the preemption timer to force immediate exits even when the timer was disabled by userspace. Now that KVM leaves the timer running instead of truly disabling it, allow userspace to kill it entirely in the unlikely event the timer (or KVM) malfunctions. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-06-18KVM: VMX: Drop hv_timer_armed from 'struct loaded_vmcs'Sean Christopherson
... now that it is fully redundant with the pin controls shadow. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>