path: root/arch/x86/kvm
2019-06-18kvm: x86: check kvm_apic_sw_enabled() is enoughWei Yang
On delivering an irq to the apic, we iterate over the vcpus and do a check like this:

    kvm_apic_present(vcpu)
    kvm_lapic_enabled(vcpu)
        kvm_apic_present(vcpu) && kvm_apic_sw_enabled(vcpu->arch.apic)

Since we have already checked kvm_apic_present(), it is reasonable to replace kvm_lapic_enabled() with kvm_apic_sw_enabled().

Signed-off-by: Wei Yang <richardw.yang@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
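A minimal sketch of the resulting call site (names taken from the message above, not the literal diff):

    /* before: kvm_lapic_enabled() re-checks apic presence internally */
    if (kvm_apic_present(vcpu) && kvm_lapic_enabled(vcpu))
            deliver(vcpu);          /* hypothetical delivery helper */

    /* after: presence is already established; only the SW-enable bit matters */
    if (kvm_apic_present(vcpu) && kvm_apic_sw_enabled(vcpu->arch.apic))
            deliver(vcpu);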
2019-06-18kvm: x86: add host poll control msrsMarcelo Tosatti
Add an MSR which allows the guest to disable host polling (specifically, cpuidle-haltpoll disables host-side polling while the guest is doing its own polling). Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-06-18kvm: vmx: segment limit check: use access lengthEugene Korenevsky
There is an imperfection in get_vmx_mem_address(): the access length is ignored when checking the segment limit. To fix this, pass the access length as a function argument. The access length is usually obvious, since it is used by callers after the get_vmx_mem_address() call, but for vmread/vmwrite it depends on whether the CPU is in 64-bit mode. Signed-off-by: Eugene Korenevsky <ekorenevsky@gmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
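A sketch of the changed helper's shape (the parameter list is an assumption based on the message, not the verbatim patch):

    int get_vmx_mem_address(struct kvm_vcpu *vcpu, unsigned long exit_qualification,
                            u32 vmx_instruction_info, bool wr, int len,
                            gva_t *ret);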
2019-06-18kvm: vmx: fix limit checking in get_vmx_mem_address()Eugene Korenevsky
Intel SDM vol. 3, 5.3:

    The processor causes a general-protection exception (or, if the segment is SS,
    a stack-fault exception) any time an attempt is made to access the following
    addresses in a segment:
    - A byte at an offset greater than the effective limit
    - A word at an offset greater than the (effective limit - 1)
    - A doubleword at an offset greater than the (effective limit - 3)
    - A quadword at an offset greater than the (effective limit - 7)

Therefore, the generic limit-checking error condition must be

    exn = (off > limit + 1 - access_len) = (off + access_len - 1 > limit)

and not

    exn = (off + access_len > limit)

as it is now. Also avoid integer overflow of `off` on 32-bit KVM by casting it to u64. Note: the access length is currently sizeof(u64), which is incorrect; this is fixed in the subsequent patch.

Signed-off-by: Eugene Korenevsky <ekorenevsky@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
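To see the off-by-one, a self-contained sketch (illustrative, not the patch itself):

    #include <stdbool.h>
    #include <stdint.h>

    /* true if an access of 'len' bytes at 'off' exceeds the segment limit */
    static bool limit_exceeded(uint64_t off, uint64_t len, uint64_t limit)
    {
            return off + len - 1 > limit;   /* == off > limit + 1 - len */
    }

    /*
     * Example: limit = 0xffff, a 1-byte access at off = 0xffff is legal.
     * Correct check: 0xffff + 1 - 1 = 0xffff > 0xffff -> false (no fault).
     * Old check:     0xffff + 1     = 0x10000 > 0xffff -> spurious fault.
     */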
2019-06-18KVM: x86: Add Intel CPUID.1F cpuid emulation supportLike Xu
Add support to expose Intel's V2 Extended Topology Enumeration Leaf on new systems with multiple software-visible dies within each package. Because unimplemented and unexposed leaves should be explicitly reported as zero, there is no need to limit cpuid.0.eax to the maximum value of the feature configuration; instead, limit it to the highest leaf implemented in the current code. A single clamping seems sufficient and cheaper. Co-developed-by: Xiaoyao Li <xiaoyao.li@linux.intel.com> Signed-off-by: Xiaoyao Li <xiaoyao.li@linux.intel.com> Signed-off-by: Like Xu <like.xu@linux.intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-06-18KVM: x86: Use DR_TRAP_BITS instead of hard-coded 15Liran Alon
Make all code consistent with kvm_deliver_exception_payload() by using the appropriate symbolic constant instead of a hard-coded number. Reviewed-by: Nikita Leshenko <nikita.leshchenko@oracle.com> Reviewed-by: Krish Sadhukhan <krish.sadhukhan@oracle.com> Signed-off-by: Liran Alon <liran.alon@oracle.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-06-13KVM: x86: clean up conditions for asynchronous page fault handlingPaolo Bonzini
Even when asynchronous page fault is disabled, KVM does not want to pause the host if a guest triggers a page fault; instead it puts the vCPU into an artificial HLT state that allows running other host processes while still allowing interrupt delivery into the guest. However, the way this feature is triggered is a bit confusing. First, it is not used for page faults while a nested guest is running; this is not an issue, though, since the artificial halt is completely invisible to the guest, whether L1 or L2. Second, it is used even if kvm_halt_in_guest() returns true; in this case, the guest probably should not pay the additional latency cost of the artificial halt, and thus the page fault should be handled in a completely synchronous way. By introducing a new function, kvm_can_deliver_async_pf, this patch commonizes the code that chooses whether to deliver an async page fault (kvm_arch_async_page_not_present) and the code that chooses whether a page fault should be handled synchronously (kvm_can_do_async_pf). Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-06-13KVM: nVMX: use correct clean fields when copying from eVMCSVitaly Kuznetsov
Unfortunately, a couple of mistakes were made while implementing Enlightened VMCS support; in particular, the wrong clean fields were used in copy_enlightened_to_vmcs12():

- exception_bitmap is covered by CONTROL_EXCPN;
- vm_exit_controls/pin_based_vm_exec_control/secondary_vm_exec_control are covered by CONTROL_GRP1.

Fixes: 945679e301ea0 ("KVM: nVMX: add enlightened VMCS state")
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
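A sketch of the corrected copy, with the Hyper-V clean-field constant assumed from the message (verify against the tree):

    /* exception_bitmap belongs to the EXCPN group, not GRP1 */
    if (unlikely(!(evmcs->hv_clean_fields &
                   HV_VMX_ENLIGHTENED_CLEAN_FIELD_CONTROL_EXCPN)))
            vmcs12->exception_bitmap = evmcs->exception_bitmap;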
2019-06-05treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 320Thomas Gleixner
Based on 1 normalized pattern(s): this program is free software you can redistribute it and or modify it under the terms and conditions of the gnu general public license version 2 as published by the free software foundation this program is distributed in the hope it will be useful but without any warranty without even the implied warranty of merchantability or fitness for a particular purpose see the gnu general public license for more details you should have received a copy of the gnu general public license along with this program if not write to the free software foundation inc 59 temple place suite 330 boston ma 02111 1307 usa extracted by the scancode license scanner the SPDX license identifier GPL-2.0-only has been chosen to replace the boilerplate/reference in 33 file(s). Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Allison Randal <allison@lohutok.net> Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org> Reviewed-by: Alexios Zavras <alexios.zavras@intel.com> Cc: linux-spdx@vger.kernel.org Link: https://lkml.kernel.org/r/20190530000435.254582722@linutronix.de Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-06-05kvm: Convert kvm_lock to a mutexJunaid Shahid
It doesn't seem as if there is any particular need for kvm_lock to be a spinlock, so convert the lock to a mutex so that sleepable functions (in particular cond_resched()) can be called while holding it. Signed-off-by: Junaid Shahid <junaids@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-06-05KVM: VMX: remove unneeded 'asm volatile ("")' from vmcs_write64Uros Bizjak
__vmcs_writel uses volatile asm, so there is no need to insert another one between the first and the second call to __vmcs_writel in order to prevent unwanted code movement for 32-bit targets. Signed-off-by: Uros Bizjak <ubizjak@gmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
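For context, a sketch of the resulting helper (structure assumed from the message; on 32-bit, a 64-bit field is written as two 32-bit halves):

    static __always_inline void vmcs_write64(unsigned long field, u64 value)
    {
            vmcs_check64(field);
            __vmcs_writel(field, value);
    #ifndef CONFIG_X86_64
            /* an asm volatile ("") barrier here was redundant:
             * __vmcs_writel() is itself volatile asm */
            __vmcs_writel(field + 1, value >> 32);
    #endif
    }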
2019-06-05x86/kvm/VMX: drop bad asm() clobber from nested_vmx_check_vmentry_hw()Jan Beulich
While upstream gcc doesn't detect conflicts on cc (yet), it really should, and hence "cc" should not be specified for asm()-s also having "=@cc<cond>" outputs. (It is quite pointless anyway to specify a "cc" clobber in x86 inline assembly, since the compiler assumes it to be always clobbered, and has no means [yet] to suppress this behavior.) Signed-off-by: Jan Beulich <jbeulich@suse.com> Fixes: bbc0b8239257 ("KVM: nVMX: Capture VM-Fail via CC_{SET,OUT} in nested early checks") Reviewed-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
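An illustration of the pattern in question (a flag-output constraint plus a redundant "cc" clobber; simplified, not the actual nested_vmx_check_vmentry_hw() asm):

    bool fail;

    /* "=@ccbe" already tells the compiler CF/ZF are outputs (CF or ZF
     * set means VM-Fail), so a "cc" clobber must not be listed too. */
    asm("vmlaunch" : "=@ccbe" (fail));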
2019-06-04KVM: X86: Emulate MSR_IA32_MISC_ENABLE MWAIT bitWanpeng Li
MSR IA32_MISC_ENABLE bit 18, according to the SDM:

| When this bit is set to 0, the MONITOR feature flag is not set (CPUID.01H:ECX[bit 3] = 0).
| This indicates that MONITOR/MWAIT are not supported.
|
| Software attempts to execute MONITOR/MWAIT will cause #UD when this bit is 0.
|
| When this bit is set to 1 (default), MONITOR/MWAIT are supported (CPUID.01H:ECX[bit 3] = 1).

CPUID.01H:ECX[bit 3] ought to mirror the value of the MSR bit, and it is a better guard than kvm_mwait_in_guest(): kvm_mwait_in_guest() affects the behavior of MONITOR/MWAIT, not their guest visibility. This patch implements toggling of the CPUID bit based on guest writes to the MSR.

Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: Sean Christopherson <sean.j.christopherson@intel.com>
Cc: Liran Alon <liran.alon@oracle.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
[Fixes for backwards compatibility - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
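A sketch of the MSR-write side (kvm_find_cpuid_entry() and MSR_IA32_MISC_ENABLE_MWAIT are real kernel names, but the exact wiring is illustrative):

    struct kvm_cpuid_entry2 *best = kvm_find_cpuid_entry(vcpu, 0x1, 0);

    if (best) {
            if (data & MSR_IA32_MISC_ENABLE_MWAIT)
                    best->ecx |= BIT(3);    /* CPUID.01H:ECX[3] = MONITOR */
            else
                    best->ecx &= ~BIT(3);
    }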
2019-06-04KVM: X86: Provide a capability to disable cstate msr read interceptsWanpeng Li
Allow the guest to read the CORE cstate when exposing host CPU power management capabilities to it. PKG cstate remains restricted so that a guest cannot observe whole-package information in multi-tenant scenarios. Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Radim Krčmář <rkrcmar@redhat.com> Cc: Sean Christopherson <sean.j.christopherson@intel.com> Cc: Liran Alon <liran.alon@oracle.com> Signed-off-by: Wanpeng Li <wanpengli@tencent.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-06-04kvm: x86: refine kvm_get_arch_capabilities()Xiaoyao Li
1. Use X86_FEATURE_ARCH_CAPABILITIES to enumerate the existence of MSR_IA32_ARCH_CAPABILITIES, avoiding rdmsrl_safe().
2. Since kvm_get_arch_capabilities() is only used in this file, make it static.

Signed-off-by: Xiaoyao Li <xiaoyao.li@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-06-04KVM: Directly return result from kvm_arch_check_processor_compat()Sean Christopherson
Add a wrapper to invoke kvm_arch_check_processor_compat() so that the boilerplate ugliness of checking virtualization support on all CPUs is hidden from the arch specific code. x86's implementation in particular is quite heinous, as it unnecessarily propagates the out-param pattern into kvm_x86_ops. While the x86 specific issue could be resolved solely by changing kvm_x86_ops, make the change for all architectures as returning a value directly is prettier and technically more robust, e.g. s390 doesn't set the out param, which could lead to subtle breakage in the (highly unlikely) scenario where the out-param was not pre-initialized by the caller. Opportunistically annotate svm_check_processor_compat() with __init. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Reviewed-by: Cornelia Huck <cohuck@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-06-04kvm: svm/avic: Do not send AVIC doorbell to selfSuthikulpanit, Suravee
The AVIC doorbell is used to notify a running vCPU that interrupts have been injected into its AVIC backing page. The current logic checks only whether the vCPU is running before sending a doorbell. However, the doorbell is not necessary if the destination CPU is the CPU we are currently running on. Add logic to check the currently running CPU before sending the doorbell. Signed-off-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com> Reviewed-by: Alexander Graf <graf@amazon.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
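A sketch of the added check (the doorbell MSR macro and helpers exist in the tree, but treat the exact code as illustrative):

    int cpu = READ_ONCE(vcpu->cpu);

    /* only kick remote CPUs; the local CPU will notice on VM-entry */
    if (cpu != get_cpu())
            wrmsrl(SVM_AVIC_DOORBELL, kvm_cpu_get_apicid(cpu));
    put_cpu();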
2019-06-04KVM: LAPIC: Optimize timer latency furtherWanpeng Li
Advancing the lapic timer tries to hide the hypervisor overhead between the host emulated timer firing and the guest becoming aware that the timer has fired. However, it only hides the time between apic_timer_fn/handle_preemption_timer -> wait_lapic_expire, instead of reaching the real position of vmentry, which is what the original commit d0659d946be0 ("KVM: x86: add option to advance tscdeadline hrtimer expiration") intended. There are 700+ CPU cycles between the end of wait_lapic_expire and the world switch on my Haswell desktop. This patch narrows that last gap (wait_lapic_expire -> world switch): it takes the real overhead between apic_timer_fn/handle_preemption_timer and the world switch into account when adaptively tuning the timer advancement. The patch reduces latency by 40% (~1600+ cycles to ~1000+ cycles on a Haswell desktop) for kvm-unit-tests/tscdeadline_latency when testing busy waits. Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Radim Krčmář <rkrcmar@redhat.com> Cc: Sean Christopherson <sean.j.christopherson@intel.com> Cc: Liran Alon <liran.alon@oracle.com> Reviewed-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Wanpeng Li <wanpengli@tencent.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-06-04KVM: LAPIC: Delay trace_kvm_wait_lapic_expire tracepoint to after vmexitWanpeng Li
The wait_lapic_expire() call was moved above guest_enter_irqoff() because of its tracepoint, which violated the RCU extended quiescent state entered by guest_enter_irqoff() [1][2]. This patch simply moves the tracepoint below guest_exit_irqoff() in vcpu_enter_guest(): snapshot the delta before VM-Enter, but trace it after VM-Exit. This will help move wait_lapic_expire() to just before vmentry in a later patch. [1] Commit 8b89fe1f6c43 ("kvm: x86: move tracepoints outside extended quiescent state") [2] https://patchwork.kernel.org/patch/7821111/ Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Radim Krčmář <rkrcmar@redhat.com> Cc: Liran Alon <liran.alon@oracle.com> Suggested-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Wanpeng Li <wanpengli@tencent.com> [Track whether wait_lapic_expire was called, and do not invoke the tracepoint if not. - Paolo] Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-06-04KVM: LAPIC: Extract adaptive tune timer advancement logicWanpeng Li
Extract the adaptive timer-advancement tuning logic into a single function. Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Radim Krčmář <rkrcmar@redhat.com> Cc: Sean Christopherson <sean.j.christopherson@intel.com> Cc: Liran Alon <liran.alon@oracle.com> Signed-off-by: Wanpeng Li <wanpengli@tencent.com> [Rename new function. - Paolo] Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-06-04KVM/nSVM: properly map nested VMCBVitaly Kuznetsov
Commit 8c5fbf1a7231 ("KVM/nSVM: Use the new mapping API for mapping guest memory") broke nested SVM completely: kvm_vcpu_map()'s second parameter is a GFN, so vmcb_gpa needs to be converted with gpa_to_gfn(), not the other way around. Fixes: 8c5fbf1a7231 ("KVM/nSVM: Use the new mapping API for mapping guest memory") Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Reviewed-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
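The shape of the fix, as a sketch (kvm_vcpu_map() takes a GFN, so the GPA must be converted):

    /* broken: rc = kvm_vcpu_map(&svm->vcpu, gfn_to_gpa(vmcb_gpa), &map); */
    rc = kvm_vcpu_map(&svm->vcpu, gpa_to_gfn(vmcb_gpa), &map);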
2019-06-04kvm: x86: Fix reserved bits related calculation errors caused by MKTMEKai Huang
Intel MKTME repurposes several high bits of the physical address as a 'keyID' for memory encryption, thus effectively reducing the platform's maximum physical address bits. Exactly how many bits are reduced is configured by the BIOS. To honor this HW behavior, the repurposed bits are subtracted from cpuinfo_x86->x86_phys_bits when MKTME is detected in CPU detection. Similarly, AMD SME/SEV also reduces physical address bits for memory encryption, and cpuinfo->x86_phys_bits is reduced when SME/SEV is detected, so for both MKTME and SME/SEV, boot_cpu_data.x86_phys_bits no longer holds the physical address bits reported by CPUID. Currently KVM treats bits from boot_cpu_data.x86_phys_bits to 51 as reserved bits, but that is no longer true with MKTME, since MKTME treats those reduced bits as 'keyID' bits, not reserved bits. Therefore boot_cpu_data.x86_phys_bits cannot be used to calculate reserved bits anymore, although we can still use it for AMD SME/SEV since SME/SEV treats the reduced bits differently -- they are treated as reserved bits, the same as other reserved bits in a page table entry [1]. Fix this by introducing a new 'shadow_phys_bits' variable in the KVM x86 MMU code to store the effective physical bits w/o reserved bits -- for MKTME, it equals the physical address bits reported by CPUID, and for SME/SEV, it is boot_cpu_data.x86_phys_bits. Note that the physical address bits reported to the guest should remain unchanged -- KVM should report the physical address bits from CPUID to the guest, not boot_cpu_data.x86_phys_bits. For Intel MKTME, there is no harm if the guest sets up 'keyID' bits in its page tables (since MKTME only works at the physical address level), and KVM does not even expose MKTME to the guest. Arguably, for AMD SME/SEV, the guest is aware of SEV and should itself adjust boot_cpu_data.x86_phys_bits when it detects SEV; therefore KVM should still report the physical address bits from CPUID to the guest. Reviewed-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Kai Huang <kai.huang@linux.intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
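A sketch of how shadow_phys_bits might be derived (the feature flag and CPUID leaf are real; the helper itself is illustrative):

    static u8 kvm_get_shadow_phys_bits(void)
    {
            /*
             * MKTME repurposes high physical-address bits as keyIDs, and
             * those bits are *not* reserved, so use the unadjusted value
             * from CPUID. For SME/SEV the reduced bits really are
             * reserved, so the adjusted x86_phys_bits is correct.
             */
            if (boot_cpu_has(X86_FEATURE_TME) &&
                boot_cpu_data.extended_cpuid_level >= 0x80000008)
                    return cpuid_eax(0x80000008) & 0xff;

            return boot_cpu_data.x86_phys_bits;
    }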
2019-06-04kvm: x86: Move kvm_set_mmio_spte_mask() from x86.c to mmu.cKai Huang
This is a prerequisite for fixing several SPTE reserved-bits calculation errors caused by MKTME; the fix requires kvm_set_mmio_spte_mask() to use a local static variable defined in mmu.c. Also move the call site of kvm_set_mmio_spte_mask() from kvm_arch_init() to kvm_mmu_module_init() so that kvm_set_mmio_spte_mask() can be static. Reviewed-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Kai Huang <kai.huang@linux.intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-05-28KVM: s390: Do not report unusable IDs via KVM_CAP_MAX_VCPU_IDThomas Huth
KVM_CAP_MAX_VCPU_ID currently always reports KVM_MAX_VCPU_ID on all architectures. However, on s390x, the number of usable CPUs is determined at runtime - it depends on the features of the machine the code is running on. Since we use the vcpu_id as an index into the SCA structures that are defined by the hardware (see e.g. the sca_add_vcpu() function), it is not only the number of CPUs that is limited by the hardware, but also the range of IDs that we can use. Thus KVM_CAP_MAX_VCPU_ID must be determined at runtime on s390x, too. So the handling of KVM_CAP_MAX_VCPU_ID has to be moved from the common code into the architecture-specific code, and on s390x we have to return the same value here as for KVM_CAP_MAX_VCPUS. This problem was discovered with the kvm_create_max_vcpus selftest; with this change applied, the selftest now passes on s390x, too. Reviewed-by: Andrew Jones <drjones@redhat.com> Reviewed-by: Cornelia Huck <cohuck@redhat.com> Reviewed-by: David Hildenbrand <david@redhat.com> Signed-off-by: Thomas Huth <thuth@redhat.com> Message-Id: <20190523164309.13345-9-thuth@redhat.com> Cc: stable@vger.kernel.org Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
2019-05-24KVM: x86: fix return value for reserved EFERPaolo Bonzini
Commit 11988499e62b ("KVM: x86: Skip EFER vs. guest CPUID checks for host-initiated writes", 2019-04-02) introduced a "return false" in a function returning int; in any case, set_efer has a "nonzero on error" convention, so it should return 1. Reported-by: Pavel Machek <pavel@denx.de> Fixes: 11988499e62b ("KVM: x86: Skip EFER vs. guest CPUID checks for host-initiated writes") Cc: Sean Christopherson <sean.j.christopherson@intel.com> Cc: stable@vger.kernel.org Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-05-24KVM: x86/pmu: do not mask the value that is written to fixed PMUsPaolo Bonzini
According to the SDM, for MSR_IA32_PERFCTR0/1 "the lower-order 32 bits of each MSR may be written with any value, and the high-order 8 bits are sign-extended according to the value of bit 31", but the fixed counters in real hardware are limited to the width of the fixed counters ("bits beyond the width of the fixed-function counter are reserved and must be written as zeros"). Fix KVM to do the same. Reported-by: Nadav Amit <nadav.amit@gmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
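A sketch of the write path (pmc_bitmask() and pmc_is_gp() are existing KVM PMU helpers; the exact placement is illustrative):

    /* GP counters: low 32 bits are written, bit 31 is sign-extended */
    if (pmc_is_gp(pmc))
            pmc->counter = (s64)(s32)data;
    else    /* fixed counters: bits beyond the counter width are reserved */
            pmc->counter = data & pmc_bitmask(pmc);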
2019-05-24KVM: x86/pmu: mask the result of rdpmc according to the width of the countersPaolo Bonzini
This patch simplifies the changes in the next one by enforcing the counter-width masking for both RDPMC and RDMSR. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-05-24x86/kvm/pmu: Set AMD's virt PMU version to 1Borislav Petkov
After commit 672ff6cff80c ("KVM: x86: Raise #GP when guest vCPU do not support PMU") my AMD guests started #GPing like this:

    general protection fault: 0000 [#1] PREEMPT SMP
    CPU: 1 PID: 4355 Comm: bash Not tainted 5.1.0-rc6+ #3
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-1 04/01/2014
    RIP: 0010:x86_perf_event_update+0x3b/0xa0

with Code: pointing to RDPMC. It is RDPMC because the guest has the hardware watchdog CONFIG_HARDLOCKUP_DETECTOR_PERF enabled, which uses perf. Instrumenting kvm_pmu_rdpmc() showed that it fails due to:

    if (!pmu->version)
            return 1;

which the above commit added. Since AMD's PMU leaves the version at 0, that causes the #GP injection into the guest. Set pmu->version arbitrarily to 1 and move it above the non-applicable struct kvm_pmu members.

Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Janakarajan Natarajan <Janakarajan.Natarajan@amd.com>
Cc: kvm@vger.kernel.org
Cc: Liran Alon <liran.alon@oracle.com>
Cc: Mihai Carabas <mihai.carabas@oracle.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: "Radim Krčmář" <rkrcmar@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Cc: x86@kernel.org
Cc: stable@vger.kernel.org
Fixes: 672ff6cff80c ("KVM: x86: Raise #GP when guest vCPU do not support PMU")
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-05-24KVM: x86: do not spam dmesg with VMCS/VMCB dumpsPaolo Bonzini
Userspace can easily set up invalid processor state in such a way that dmesg will be filled with VMCS or VMCB dumps. Disable this by default using a module parameter. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
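A plausible sketch of the knob (the parameter name and dump function are assumptions, not confirmed by the message):

    /* off by default; opt in to noisy VMCS dumps */
    static bool __read_mostly dump_invalid_vmcs = false;
    module_param(dump_invalid_vmcs, bool, 0644);

    void dump_vmcs(void)
    {
            if (!dump_invalid_vmcs)
                    return;
            /* ... dump VMCS fields as before ... */
    }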
2019-05-24kvm: Check irqchip mode before assign irqfdPeter Xu
When assigning a kvm irqfd we didn't check the irqchip mode, so KVM_IRQFD succeeded with all irqchip modes. However, it does not make much sense to create an irqfd without an in-kernel irqchip. Let's provide an arch-dependent helper to check whether a specific irqfd is allowed by the arch. At least for x86, it makes sense to check:

- when the irqchip mode is NONE, all irqfds should be disallowed, and,
- when the irqchip mode is SPLIT, irqfds with a resamplefd should be disallowed.

In either case, we previously ignored the irq or the irq ack event silently if the irqchip mode was incorrect. That can cause mysterious guest behaviors and can be hard to triage. Let's fail KVM_IRQFD earlier to detect these incorrect configurations.

CC: Paolo Bonzini <pbonzini@redhat.com>
CC: Radim Krčmář <rkrcmar@redhat.com>
CC: Alex Williamson <alex.williamson@redhat.com>
CC: Eduardo Habkost <ehabkost@redhat.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-05-24kvm: svm/avic: fix off-by-one in checking host APIC IDSuthikulpanit, Suravee
The current logic does not allow a vCPU to be loaded onto a CPU with APIC ID 255. This should be allowed, since the host physical APIC ID field in the AVIC Physical APIC table entry is an 8-bit value and APIC ID 255 is valid on a system with x2APIC enabled. Instead, disallow the vCPU load only if the host APIC ID cannot be represented as an 8-bit value. Also, use the more appropriate AVIC_PHYSICAL_ID_ENTRY_HOST_PHYSICAL_ID_MASK instead of AVIC_MAX_PHYSICAL_ID_COUNT. Signed-off-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com> Cc: stable@vger.kernel.org Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
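A sketch of the corrected guard (the mask macro is named in the message; the surrounding code is assumed):

    /* the AVIC physical table can address any 8-bit host APIC ID,
     * including 255; only IDs that don't fit in 8 bits are invalid */
    if (WARN_ON(h_physical_id & ~AVIC_PHYSICAL_ID_ENTRY_HOST_PHYSICAL_ID_MASK))
            return;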
2019-05-24KVM: LAPIC: Expose per-vCPU timer_advance_ns to userspaceWanpeng Li
Expose per-vCPU timer_advance_ns to userspace, so it is able to query the auto-adjusted value. Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Radim Krčmář <rkrcmar@redhat.com> Cc: Sean Christopherson <sean.j.christopherson@intel.com> Cc: Liran Alon <liran.alon@oracle.com> Signed-off-by: Wanpeng Li <wanpengli@tencent.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-05-24KVM: LAPIC: Fix lapic_timer_advance_ns parameter overflowWanpeng Li
After commit c3941d9e0 ("KVM: lapic: Allow user to disable adaptive tuning of timer advancement"), '-1' enables adaptive tuning starting from a default advancement of 1000ns. However, the module parameter should be exposed as an int instead of a uint, which overflows:

Before patch:

    /sys/module/kvm/parameters/lapic_timer_advance_ns:4294967295

After patch:

    /sys/module/kvm/parameters/lapic_timer_advance_ns:-1

Fixes: c3941d9e0 ("KVM: lapic: Allow user to disable adaptive tuning of timer advancement")
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: Sean Christopherson <sean.j.christopherson@intel.com>
Cc: Liran Alon <liran.alon@oracle.com>
Reviewed-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
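The corresponding declaration, sketched (storage modifiers assumed):

    /* -1 (auto-tune) must round-trip through sysfs as a signed value */
    static int __read_mostly lapic_timer_advance_ns = -1;
    module_param(lapic_timer_advance_ns, int, 0644);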
2019-05-24kvm: vmx: Fix -Wmissing-prototypes warningsYi Wang
We get a warning when building the kernel with W=1:

    arch/x86/kvm/vmx/vmx.c:6365:6: warning: no previous prototype for 'vmx_update_host_rsp' [-Wmissing-prototypes]
     void vmx_update_host_rsp(struct vcpu_vmx *vmx, unsigned long host_rsp)

Add the missing declaration to fix this.

Signed-off-by: Yi Wang <wang.yi59@zte.com.cn>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-05-24KVM: nVMX: Fix using __this_cpu_read() in preemptible contextWanpeng Li
BUG: using __this_cpu_read() in preemptible [00000000] code: qemu-system-x86/4590
caller is nested_vmx_enter_non_root_mode+0xebd/0x1790 [kvm_intel]
CPU: 4 PID: 4590 Comm: qemu-system-x86 Tainted: G OE 5.1.0-rc4+ #1
Call Trace:
 dump_stack+0x67/0x95
 __this_cpu_preempt_check+0xd2/0xe0
 nested_vmx_enter_non_root_mode+0xebd/0x1790 [kvm_intel]
 nested_vmx_run+0xda/0x2b0 [kvm_intel]
 handle_vmlaunch+0x13/0x20 [kvm_intel]
 vmx_handle_exit+0xbd/0x660 [kvm_intel]
 kvm_arch_vcpu_ioctl_run+0xa2c/0x1e50 [kvm]
 kvm_vcpu_ioctl+0x3ad/0x6d0 [kvm]
 do_vfs_ioctl+0xa5/0x6e0
 ksys_ioctl+0x6d/0x80
 __x64_sys_ioctl+0x1a/0x20
 do_syscall_64+0x6f/0x6c0
 entry_SYSCALL_64_after_hwframe+0x49/0xbe

Accessing a per-cpu variable requires preemption to be disabled; this patch extends the preemption-disabled region to cover the __this_cpu_read().

Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
Fixes: 52017608da33 ("KVM: nVMX: add option to perform early consistency checks via H/W")
Cc: stable@vger.kernel.org
Reviewed-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
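The general pattern being enforced, as a sketch (the per-cpu variable is hypothetical; the real fix widens an existing preempt-off region):

    static DEFINE_PER_CPU(unsigned long, some_percpu_var);  /* hypothetical */

    preempt_disable();
    val = __this_cpu_read(some_percpu_var);
    /* ... use val while still pinned to this CPU ... */
    preempt_enable();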
2019-05-24kvm: x86: Include CPUID leaf 0x8000001e in kvm's supported CPUIDJim Mattson
KVM now supports extended CPUID functions through 0x8000001f. CPUID leaf 0x8000001e is AMD's Processor Topology Information leaf. It contains similar information to CPUID leaf 0xb (Intel's Extended Topology Enumeration leaf) and should be included in the output of KVM_GET_SUPPORTED_CPUID, even though userspace is likely to override some of this information based upon the configuration of the particular VM. Cc: Brijesh Singh <brijesh.singh@amd.com> Cc: Borislav Petkov <bp@suse.de> Fixes: 8765d75329a38 ("KVM: X86: Extend CPUID range to include new leaf") Signed-off-by: Jim Mattson <jmattson@google.com> Reviewed-by: Marc Orr <marcorr@google.com> Reviewed-by: Borislav Petkov <bp@suse.de> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-05-24kvm: x86: Include multiple indices with CPUID leaf 0x8000001dJim Mattson
Per the APM:

    CPUID Fn8000_001D_E[D,C,B,A]X reports cache topology information for the
    cache enumerated by the value passed to the instruction in ECX, referred
    to as Cache n in the following description. To gather information for all
    cache levels, software must repeatedly execute CPUID with 8000_001Dh in
    EAX and ECX set to increasing values beginning with 0 until a value of
    00h is returned in the field CacheType (EAX[4:0]) indicating no more
    cache descriptions are available for this processor.

The termination condition is the same as leaf 4, so we can reuse that code block for leaf 0x8000001d.

Fixes: 8765d75329a38 ("KVM: X86: Extend CPUID range to include new leaf")
Cc: Brijesh Singh <brijesh.singh@amd.com>
Cc: Borislav Petkov <bp@suse.de>
Signed-off-by: Jim Mattson <jmattson@google.com>
Reviewed-by: Marc Orr <marcorr@google.com>
Reviewed-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
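The shared termination loop, sketched (cpuid_count() is the real kernel helper; the rest is illustrative):

    unsigned int eax, ebx, ecx, edx;
    int i;

    for (i = 0; ; i++) {
            cpuid_count(0x8000001d, i, &eax, &ebx, &ecx, &edx);
            if ((eax & 0x1f) == 0)   /* CacheType (EAX[4:0]) == 0: done */
                    break;
            /* record sub-leaf i ... */
    }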
2019-05-24KVM: nVMX: Clear nested_run_pending if setting nested state failsSean Christopherson
VMX's nested_run_pending flag is subtly consumed when stuffing state to enter guest mode, i.e. it needs to be set before KVM knows whether setting guest state will succeed. If setting guest state fails, clear the flag, as a nested run is obviously not pending. Reported-by: Aaron Lewis <aaronlewis@google.com> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-05-24KVM: nVMX: really fix the size checks on KVM_SET_NESTED_STATEPaolo Bonzini
The offset for reading the shadow VMCS is sizeof(*kvm_state)+VMCS12_SIZE, so the correct size must be that plus sizeof(*vmcs12). This could lead to KVM reading garbage data from userspace and not reporting an error, but is otherwise not sensitive. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
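The corrected bound, sketched (field names assumed from the message; error handling simplified):

    /* the shadow VMCS is read from offset sizeof(*kvm_state) + VMCS12_SIZE,
     * so the buffer must be at least that plus sizeof(*vmcs12) */
    if (kvm_state->size < sizeof(*kvm_state) + VMCS12_SIZE + sizeof(*vmcs12))
            return -EINVAL;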
2019-05-17Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvmLinus Torvalds
Pull KVM updates from Paolo Bonzini:

 "ARM:
   - support for SVE and Pointer Authentication in guests
   - PMU improvements

  POWER:
   - support for direct access to the POWER9 XIVE interrupt controller
   - memory and performance optimizations

  x86:
   - support for accessing memory not backed by struct page
   - fixes and refactoring

  Generic:
   - dirty page tracking improvements"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (155 commits)
  kvm: fix compilation on aarch64
  Revert "KVM: nVMX: Expose RDPMC-exiting only when guest supports PMU"
  kvm: x86: Fix L1TF mitigation for shadow MMU
  KVM: nVMX: Disable intercept for FS/GS base MSRs in vmcs02 when possible
  KVM: PPC: Book3S: Remove useless checks in 'release' method of KVM device
  KVM: PPC: Book3S HV: XIVE: Fix spelling mistake "acessing" -> "accessing"
  KVM: PPC: Book3S HV: Make sure to load LPID for radix VCPUs
  kvm: nVMX: Set nested_run_pending in vmx_set_nested_state after checks complete
  tests: kvm: Add tests for KVM_SET_NESTED_STATE
  KVM: nVMX: KVM_SET_NESTED_STATE - Tear down old EVMCS state before setting new state
  tests: kvm: Add tests for KVM_CAP_MAX_VCPUS and KVM_CAP_MAX_CPU_ID
  tests: kvm: Add tests to .gitignore
  KVM: Introduce KVM_CAP_MANUAL_DIRTY_LOG_PROTECT2
  KVM: Fix kvm_clear_dirty_log_protect off-by-(minus-)one
  KVM: Fix the bitmap range to copy during clear dirty
  KVM: arm64: Fix ptrauth ID register masking logic
  KVM: x86: use direct accessors for RIP and RSP
  KVM: VMX: Use accessors for GPRs outside of dedicated caching logic
  KVM: x86: Omit caching logic for always-available GPRs
  kvm, x86: Properly check whether a pfn is an MMIO or not
  ...
2019-05-15Revert "KVM: nVMX: Expose RDPMC-exiting only when guest supports PMU"Sean Christopherson
The RDPMC-exiting control is dependent on the existence of the RDPMC instruction itself, i.e. it is not tied to the "Architectural Performance Monitoring" feature. For all intents and purposes, the control exists on all CPUs with VMX support, since RDPMC also exists on all CPUs with VMX support. Per Intel's SDM: The RDPMC instruction was introduced into the IA-32 Architecture in the Pentium Pro processor and the Pentium processor with MMX technology. The earlier Pentium processors have performance-monitoring counters, but they must be read with the RDMSR instruction. Because RDPMC-exiting always exists, KVM requires the control and refuses to load if it's not available. As a result, hiding the PMU from a guest breaks nested virtualization if the guest attempts to use KVM. While it's not explicitly stated in the RDPMC pseudocode, the VM-Exit check for RDPMC-exiting follows standard fault vs. VM-Exit prioritization for privileged instructions, e.g. it occurs after the CPL/CR0.PE/CR4.PCE checks, but before the counter referenced in ECX is checked for validity. In other words, the original KVM behavior of injecting a #GP was correct, and the KVM unit test needs to be adjusted accordingly, e.g. eat the #GP when the unit test guest (L3 in this case) executes RDPMC without RDPMC-exiting set in the unit test host (L2). This reverts commit e51bfdb68725dc052d16241ace40ea3140f938aa. Fixes: e51bfdb68725 ("KVM: nVMX: Expose RDPMC-exiting only when guest supports PMU") Reported-by: David Hill <hilld@binarystorm.net> Cc: Saar Amar <saaramar@microsoft.com> Cc: Mihai Carabas <mihai.carabas@oracle.com> Cc: Jim Mattson <jmattson@google.com> Cc: Liran Alon <liran.alon@oracle.com> Cc: stable@vger.kernel.org Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-05-15kvm: x86: Fix L1TF mitigation for shadow MMUKai Huang
Currently KVM sets the 5 most significant bits of the physical address bits reported by CPUID (boot_cpu_data.x86_phys_bits) in nonpresent or reserved-bits SPTEs to mitigate the L1TF attack from the guest when using shadow MMU. However, for some particular Intel CPUs the physical address bits of the internal cache are greater than the physical address bits reported by CPUID. Use the kernel's existing boot_cpu_data.x86_cache_bits to determine the five most significant bits. Doing so improves KVM's L1TF mitigation in the unlikely scenario that system RAM overlaps the high order bits of the "real" physical address space as reported by CPUID. This aligns with the kernel's warnings regarding L1TF mitigation, e.g. in the above scenario the kernel won't warn the user about lack of L1TF mitigation if x86_cache_bits is greater than x86_phys_bits. Also initialize shadow_nonpresent_or_rsvd_mask explicitly to make it consistent with the other 'shadow_{xxx}_mask' variables, and opportunistically add a one-time WARN if KVM's L1TF mitigation cannot be applied on a system that is marked as being susceptible to L1TF. Reviewed-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Kai Huang <kai.huang@linux.intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
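A sketch of the mask computation (rsvd_bits() is the existing MMU helper; the hardcoded 5 follows the message, and bounds checking is omitted):

    /* set the 5 MSBs below the cache's physical width in the mask used
     * for nonpresent/reserved SPTEs */
    shadow_nonpresent_or_rsvd_mask =
            rsvd_bits(boot_cpu_data.x86_cache_bits - 5,
                      boot_cpu_data.x86_cache_bits - 1);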
2019-05-15KVM: nVMX: Disable intercept for FS/GS base MSRs in vmcs02 when possibleSean Christopherson
If L1 is using an MSR bitmap, unconditionally merge the MSR bitmaps from L0 and L1 for MSR_{KERNEL,}_{FS,GS}_BASE. KVM unconditionally exposes these MSRs to L1. If KVM is also running in L1, then it is highly likely that L1 is exposing the MSRs to L2 as well, i.e. KVM doesn't need to intercept L2 accesses. Based on code from Jintack Lim. Cc: Jintack Lim <jintack@xxxxxxxxxxxxxxx> Signed-off-by: Sean Christopherson <sean.j.christopherson@xxxxxxxxx> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-05-15Merge tag 'pm-5.2-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pmLinus Torvalds
Pull more power management updates from Rafael Wysocki:

 "These fix a recent regression causing kernels built with CONFIG_PM unset to crash on systems that support the Performance and Energy Bias Hint (EPB), clean up the cpufreq core and some users of transition notifiers and introduce a new power domain flag into the generic power domains framework (genpd).

  Specifics:

   - Fix recent regression causing kernels built with CONFIG_PM unset to crash on systems that support the Performance and Energy Bias Hint (EPB) by avoiding to compile the EPB-related code depending on CONFIG_PM when it is unset (Rafael Wysocki).

   - Clean up the transition notifier invocation code in the cpufreq core and change some users of cpufreq transition notifiers accordingly (Viresh Kumar).

   - Change MAINTAINERS to cover the schedutil governor as part of cpufreq (Viresh Kumar).

   - Simplify cpufreq_init_policy() to avoid redundant computations (Yue Hu).

   - Add explanatory comment to the cpufreq core (Rafael Wysocki).

   - Introduce a new flag, GENPD_FLAG_RPM_ALWAYS_ON, to the generic power domains (genpd) framework along with the first user of it (Leonard Crestez)"

* tag 'pm-5.2-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
  soc: imx: gpc: Use GENPD_FLAG_RPM_ALWAYS_ON for ERR009619
  PM / Domains: Add GENPD_FLAG_RPM_ALWAYS_ON flag
  cpufreq: Update MAINTAINERS to include schedutil governor
  cpufreq: Don't find governor for setpolicy drivers in cpufreq_init_policy()
  cpufreq: Explain the kobject_put() in cpufreq_policy_alloc()
  cpufreq: Call transition notifier only once for each policy
  x86: intel_epb: Take CONFIG_PM into account
2019-05-14Merge branch 'akpm' (patches from Andrew)Linus Torvalds
Merge misc updates from Andrew Morton:

 - a few misc things and hotfixes
 - ocfs2
 - almost all of MM

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (139 commits)
  kernel/memremap.c: remove the unused device_private_entry_fault() export
  mm: delete find_get_entries_tag
  mm/huge_memory.c: make __thp_get_unmapped_area static
  mm/mprotect.c: fix compilation warning because of unused 'mm' variable
  mm/page-writeback: introduce tracepoint for wait_on_page_writeback()
  mm/vmscan: simplify trace_reclaim_flags and trace_shrink_flags
  mm/Kconfig: update "Memory Model" help text
  mm/vmscan.c: don't disable irq again when count pgrefill for memcg
  mm: memblock: make keeping memblock memory opt-in rather than opt-out
  hugetlbfs: always use address space in inode for resv_map pointer
  mm/z3fold.c: support page migration
  mm/z3fold.c: add structure for buddy handles
  mm/z3fold.c: improve compression by extending search
  mm/z3fold.c: introduce helper functions
  mm/page_alloc.c: remove unnecessary parameter in rmqueue_pcplist
  mm/hmm: add ARCH_HAS_HMM_MIRROR ARCH_HAS_HMM_DEVICE Kconfig
  mm/vmscan.c: simplify shrink_inactive_list()
  fs/sync.c: sync_file_range(2) may use WB_SYNC_ALL writeback
  xen/privcmd-buf.c: convert to use vm_map_pages_zero()
  xen/gntdev.c: convert to use vm_map_pages()
  ...
2019-05-14mm/gup: change GUP fast to use flags rather than a write 'bool'Ira Weiny
To facilitate additional options to get_user_pages_fast() change the singular write parameter to be gup_flags. This patch does not change any functionality. New functionality will follow in subsequent patches. Some of the get_user_pages_fast() call sites were unchanged because they already passed FOLL_WRITE or 0 for the write parameter. NOTE: It was suggested to change the ordering of the get_user_pages_fast() arguments to ensure that callers were converted. This breaks the current GUP call site convention of having the returned pages be the final parameter. So the suggestion was rejected. Link: http://lkml.kernel.org/r/20190328084422.29911-4-ira.weiny@intel.com Link: http://lkml.kernel.org/r/20190317183438.2057-4-ira.weiny@intel.com Signed-off-by: Ira Weiny <ira.weiny@intel.com> Reviewed-by: Mike Marshall <hubcap@omnibond.com> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Dan Williams <dan.j.williams@intel.com> Cc: "David S. Miller" <davem@davemloft.net> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: James Hogan <jhogan@kernel.org> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: John Hubbard <jhubbard@nvidia.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: Rich Felker <dalias@libc.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
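An illustrative call-site conversion (FOLL_WRITE is the real GUP flag; the call site itself is hypothetical):

    /* before: write expressed as a bool */
    ret = get_user_pages_fast(addr, 1, 1, &page);

    /* after: write expressed as a flag, leaving room for more flags */
    ret = get_user_pages_fast(addr, 1, FOLL_WRITE, &page);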
2019-05-14Merge branch 'x86-mds-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tipLinus Torvalds
Pull x86 MDS mitigations from Thomas Gleixner:

 "Microarchitectural Data Sampling (MDS) is a hardware vulnerability which allows unprivileged speculative access to data which is available in various CPU internal buffers. This new set of misfeatures has the following CVEs assigned:

   CVE-2018-12126  MSBDS  Microarchitectural Store Buffer Data Sampling
   CVE-2018-12130  MFBDS  Microarchitectural Fill Buffer Data Sampling
   CVE-2018-12127  MLPDS  Microarchitectural Load Port Data Sampling
   CVE-2019-11091  MDSUM  Microarchitectural Data Sampling Uncacheable Memory

  MDS attacks target microarchitectural buffers which speculatively forward data under certain conditions. Disclosure gadgets can expose this data via cache side channels. Contrary to other speculation based vulnerabilities the MDS vulnerability does not allow the attacker to control the memory target address. As a consequence the attacks are purely sampling based, but as demonstrated with the TLBleed attack samples can be postprocessed successfully.

  The mitigation is to flush the microarchitectural buffers on return to user space and before entering a VM. It's bolted on the VERW instruction and requires a microcode update. As some of the attacks exploit data structures shared between hyperthreads, full protection requires to disable hyperthreading. The kernel does not do that by default to avoid breaking unattended updates.

  The mitigation set comes with documentation for administrators and a deeper technical view"

* 'x86-mds-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (23 commits)
  x86/speculation/mds: Fix documentation typo
  Documentation: Correct the possible MDS sysfs values
  x86/mds: Add MDSUM variant to the MDS documentation
  x86/speculation/mds: Add 'mitigations=' support for MDS
  x86/speculation/mds: Print SMT vulnerable on MSBDS with mitigations off
  x86/speculation/mds: Fix comment
  x86/speculation/mds: Add SMT warning message
  x86/speculation: Move arch_smt_update() call to after mitigation decisions
  x86/speculation/mds: Add mds=full,nosmt cmdline option
  Documentation: Add MDS vulnerability documentation
  Documentation: Move L1TF to separate directory
  x86/speculation/mds: Add mitigation mode VMWERV
  x86/speculation/mds: Add sysfs reporting for MDS
  x86/speculation/mds: Add mitigation control for MDS
  x86/speculation/mds: Conditionally clear CPU buffers on idle entry
  x86/kvm/vmx: Add MDS protection when L1D Flush is not active
  x86/speculation/mds: Clear CPU buffers on exit to user
  x86/speculation/mds: Add mds_clear_cpu_buffers()
  x86/kvm: Expose X86_FEATURE_MD_CLEAR to guests
  x86/speculation/mds: Add BUG_MSBDS_ONLY
  ...
2019-05-10cpufreq: Call transition notifier only once for each policyViresh Kumar
Currently, the notifiers are called once for each CPU in the policy->cpus cpumask. It would be more optimal if the notifier could be called only once, with all the relevant information provided to it. Of the 23 drivers that register for the transition notifiers today, only 4 do per-cpu updates; the callback for the rest can be called once per policy without any impact. This also avoids multiple function calls to the notifier callbacks and reduces repeated iterations through the notifier core's code (which takes locks as well). This patch adds a pointer to the cpufreq policy to struct cpufreq_freqs, so the notifier callback has all the information available to it with a single call. The drivers which perform per-cpu updates are updated to use the cpufreq policy. The freqs->cpu field is redundant now and is removed. Acked-by: David S. Miller <davem@davemloft.net> (sparc) Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
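The resulting structure, sketched from the description (the exact field set may differ slightly from the tree):

    struct cpufreq_freqs {
            struct cpufreq_policy *policy;  /* added: one callback per policy */
            unsigned int old;               /* old frequency, kHz */
            unsigned int new;               /* new frequency, kHz */
            u8 flags;
            /* the per-notification 'cpu' field is gone */
    };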
2019-05-08kvm: nVMX: Set nested_run_pending in vmx_set_nested_state after checks completeAaron Lewis
nested_run_pending=1 implies we have successfully entered guest mode. Move the setting of this flag from the external state in vmx_set_nested_state() until after all other checks are complete. Based on a patch by Aaron Lewis. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-05-08KVM: nVMX: KVM_SET_NESTED_STATE - Tear down old EVMCS state before setting new stateAaron Lewis
Move the call to nested_enable_evmcs until after free_nested() is complete. Signed-off-by: Aaron Lewis <aaronlewis@google.com> Reviewed-by: Marc Orr <marcorr@google.com> Reviewed-by: Peter Shier <pshier@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>