path: root/arch/x86/kvm/x86.c
Age | Commit message | Author
2020-10-21 | Merge branch 'kvm-fixes' into 'next' | Paolo Bonzini
Pick up bugfixes from 5.9, otherwise various tests fail.
2020-10-21 | KVM: x86: allow kvm_x86_ops.set_efer to return an error value | Maxim Levitsky
This will be used to signal an error to userspace in case the vendor code fails while handling this MSR (e.g. -ENOMEM). Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com> Message-Id: <20201001112954.6258-4-mlevitsk@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
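A minimal sketch of the new contract: kvm_x86_ops.set_efer() returns an int, and a negative value aborts the MSR write instead of being ignored. The helper below is illustrative, not the exact upstream code.

    static int kvm_set_efer_checked(struct kvm_vcpu *vcpu, u64 efer)
    {
            int r = kvm_x86_ops.set_efer(vcpu, efer);

            if (r) {
                    WARN_ON(r > 0);
                    return r;       /* e.g. -ENOMEM, bubbled up to userspace */
            }

            vcpu->arch.efer = efer;
            return 0;
    }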
2020-10-21 | KVM: x86: report negative values from wrmsr emulation to userspace | Maxim Levitsky
This will allow KVM to report such errors (e.g. -ENOMEM) to userspace. Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com> Message-Id: <20201001112954.6258-3-mlevitsk@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-10-21 | KVM: x86: xen_hvm_config: cleanup return values | Maxim Levitsky
Return 1 on errors that are caused by wrong guest behavior (which will inject #GP into the guest), and return a negative error value on issues that are the kernel's fault (e.g. -ENOMEM). Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com> Message-Id: <20201001112954.6258-2-mlevitsk@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-10-21 | KVM: x86: allocate vcpu->arch.cpuid_entries dynamically | Vitaly Kuznetsov
The current limit for guest CPUID leaves (KVM_MAX_CPUID_ENTRIES, 80) is reported to be insufficient, but before we bump it let's switch to allocating the vcpu->arch.cpuid_entries[] array dynamically. Currently, 'struct kvm_cpuid_entry2' is 40 bytes, so vcpu->arch.cpuid_entries is 3200 bytes, which accounts for 1/4 of the whole 'struct kvm_vcpu_arch'; having it pre-allocated (for all vCPUs, which we also pre-allocate) gives us no real benefit. Another plus of the dynamic allocation is that we now do the kvm_check_cpuid() check before we assign anything to vcpu->arch.cpuid_nent/cpuid_entries, so no changes are made in case the check fails. Opportunistically remove unneeded 'out' labels from kvm_vcpu_ioctl_set_cpuid()/kvm_vcpu_ioctl_set_cpuid2() and return directly whenever possible. Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Message-Id: <20201001130541.1398392-3-vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
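A minimal sketch of the check-then-commit pattern described above; kvm_check_cpuid() is named in the commit, the rest of the helper is illustrative:

    /*
     * Illustrative only: build the candidate array first, validate it, and
     * only then swap it into the vCPU so a failed check leaves old state intact.
     */
    static int set_cpuid_entries(struct kvm_vcpu *vcpu,
                                 struct kvm_cpuid_entry2 *e2, int nent)
    {
            int r;

            r = kvm_check_cpuid(e2, nent);  /* assumed to take the candidate array */
            if (r)
                    return r;               /* vcpu->arch.cpuid_* untouched */

            kvfree(vcpu->arch.cpuid_entries);
            vcpu->arch.cpuid_entries = e2;  /* now a dynamically allocated pointer */
            vcpu->arch.cpuid_nent = nent;
            return 0;
    }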
2020-10-21 | kvm: x86: only provide PV features if enabled in guest's CPUID | Oliver Upton
KVM unconditionally provides PV features to the guest, regardless of the configured CPUID. An unwitting guest that doesn't check KVM_CPUID_FEATURES before use could access paravirt features that userspace did not intend to provide. Fix this by checking the guest's CPUID before performing any paravirtual operations. Introduce a capability, KVM_CAP_ENFORCE_PV_FEATURE_CPUID, to gate the aforementioned enforcement. Migrating a VM from a host without this patch to a host with this patch could silently change the ABI exposed to the guest, warranting that we default to the old behavior and opt in to the new one. Reviewed-by: Jim Mattson <jmattson@google.com> Reviewed-by: Peter Shier <pshier@google.com> Signed-off-by: Oliver Upton <oupton@google.com> Change-Id: I202a0926f65035b872bfe8ad15307c026de59a98 Message-Id: <20200818152429.1923996-4-oupton@google.com> Reviewed-by: Wanpeng Li <wanpengli@tencent.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
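A hedged sketch of the gating idea; the field and helper names below are illustrative rather than the exact upstream ones:

    /*
     * Only honor a paravirt MSR/hypercall if the corresponding bit was exposed
     * in the guest's KVM_CPUID_FEATURES leaf, and only when the new capability
     * has been enabled by userspace (otherwise keep the legacy behavior).
     */
    static bool pv_feature_usable(struct kvm_vcpu *vcpu, unsigned int feature)
    {
            if (!vcpu->arch.pv_cpuid.enforce)       /* cap not enabled: old ABI */
                    return true;

            return guest_pv_has(vcpu, feature);     /* bit present in guest CPUID */
    }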
2020-10-21 | kvm: x86: set wall_clock in kvm_write_wall_clock() | Oliver Upton
Small change to avoid meaningless duplication in the subsequent patch. No functional change intended. Reviewed-by: Jim Mattson <jmattson@google.com> Reviewed-by: Peter Shier <pshier@google.com> Signed-off-by: Oliver Upton <oupton@google.com> Change-Id: I77ab9cdad239790766b7a49d5cbae5e57a3005ea Message-Id: <20200818152429.1923996-3-oupton@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-10-21 | kvm: x86: encapsulate wrmsr(MSR_KVM_SYSTEM_TIME) emulation in helper fn | Oliver Upton
No functional change intended. Reviewed-by: Jim Mattson <jmattson@google.com> Reviewed-by: Peter Shier <pshier@google.com> Reviewed-by: Wanpeng Li <wanpengli@tencent.com> Signed-off-by: Oliver Upton <oupton@google.com> Change-Id: I7cbe71069db98d1ded612fd2ef088b70e7618426 Message-Id: <20200818152429.1923996-2-oupton@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-10-21 | KVM: VMX: Forbid userspace MSR filters for x2APIC | Paolo Bonzini
Allowing userspace to intercept reads to x2APIC MSRs when APICV is fully enabled for the guest simply can't work. But more generally, the LAPIC could be set to in-kernel after the MSR filter is set up, and allowing accesses by userspace would be very confusing. We could in principle allow userspace to intercept reads and writes to TPR, and writes to EOI and SELF_IPI, but while that could be made to work, it would still be silly. Cc: Alexander Graf <graf@amazon.com> Cc: Aaron Lewis <aaronlewis@google.com> Cc: Peter Xu <peterx@redhat.com> Cc: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-10-21 | KVM: VMX: Ignore userspace MSR filters for x2APIC | Sean Christopherson
Rework the resetting of the MSR bitmap for x2APIC MSRs to ignore userspace filtering. Allowing userspace to intercept reads to x2APIC MSRs when APICV is fully enabled for the guest simply can't work; the LAPIC and thus virtual APIC is in-kernel and cannot be directly accessed by userspace. To keep things simple we will in fact forbid intercepting x2APIC MSRs altogether, independent of the default_allow setting. Cc: Alexander Graf <graf@amazon.com> Cc: Aaron Lewis <aaronlewis@google.com> Cc: Peter Xu <peterx@redhat.com> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20201005195532.8674-3-sean.j.christopherson@intel.com> [Modified to operate even if APICv is disabled, adjust documentation. - Paolo] Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-09-28 | KVM: x86: do not attempt TSC synchronization on guest writes | Paolo Bonzini
KVM special-cases writes to MSR_IA32_TSC so that all CPUs have the same base for the TSC. This logic is complicated, and we do not want it to have any effect once the VM is started. In particular, if any guest started to synchronize its TSCs with writes to MSR_IA32_TSC rather than MSR_IA32_TSC_ADJUST, the additional effect of kvm_write_tsc code would be uncharted territory. Therefore, this patch makes writes to MSR_IA32_TSC behave essentially the same as writes to MSR_IA32_TSC_ADJUST when they come from the guest. A new selftest (which passes both before and after the patch) checks the current semantics of writes to MSR_IA32_TSC and MSR_IA32_TSC_ADJUST originating from both the host and the guest. Upcoming work to remove the special side effects of host-initiated writes to MSR_IA32_TSC and MSR_IA32_TSC_ADJUST will be able to build onto this test, adjusting the host side to use the new APIs and achieve the same effect. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-09-28 | KVM: x86: rename KVM_REQ_GET_VMCS12_PAGES | Paolo Bonzini
We are going to use it for SVM too, so use a more generic name. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-09-28 | KVM: x86: Introduce MSR filtering | Alexander Graf
It's not desirable to have all MSRs always handled by KVM kernel space. Some MSRs would be useful to handle in user space to either emulate behavior (like uCode updates) or differentiate whether they are valid based on the CPU model. To allow user space to specify which MSRs it wants to see handled by KVM, this patch introduces a new ioctl to push filter rules with bitmaps into KVM. Based on these bitmaps, KVM can then decide whether to reject MSR access. With the addition of KVM_CAP_X86_USER_SPACE_MSR it can also deflect the denied MSR events to user space to operate on. If no filter is populated, MSR handling stays identical to before. Signed-off-by: Alexander Graf <graf@amazon.com> Message-Id: <20200925143422.21718-8-graf@amazon.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
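The filter is pushed as bitmap-backed ranges; the uapi is roughly of the shape sketched below (my recollection of the KVM_X86_SET_MSR_FILTER interface, so treat the field names as a sketch rather than authoritative):

    /* One range of MSR indices covered by a permission bitmap. */
    struct kvm_msr_filter_range {
            __u32 flags;    /* KVM_MSR_FILTER_READ and/or KVM_MSR_FILTER_WRITE */
            __u32 nmsrs;    /* number of MSRs described by the bitmap */
            __u32 base;     /* first MSR index the bitmap covers */
            __u8 *bitmap;   /* one bit per MSR; a set bit allows the access */
    };

    #define KVM_MSR_FILTER_MAX_RANGES 16

    struct kvm_msr_filter {
            __u32 flags;    /* default allow vs. default deny for unlisted MSRs */
            struct kvm_msr_filter_range ranges[KVM_MSR_FILTER_MAX_RANGES];
    };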
2020-09-28 | KVM: x86: Add infrastructure for MSR filtering | Alexander Graf
In the following commits we will add pieces of MSR filtering. To ensure that code compiles even with the feature half-merged, let's add a few stubs and struct definitions before the real patches start. Signed-off-by: Alexander Graf <graf@amazon.com> Message-Id: <20200925143422.21718-4-graf@amazon.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-09-28 | KVM: x86: Allow deflecting unknown MSR accesses to user space | Alexander Graf
MSRs are weird. Some of them are normal control registers, such as EFER. Some, however, are registers that really are model specific, not very interesting to virtualization workloads, and not performance critical. Others again are really just windows into package configuration. Out of these MSRs, only the first category is necessary to implement in kernel space. Rarely accessed MSRs, MSRs that should be fine-tuned against certain CPU models, and MSRs that contain information on the package level are much better suited for user space to process. However, over time we have accumulated a lot of MSRs that are not in the first category but are still handled by in-kernel KVM code. This patch adds a generic interface to handle WRMSR and RDMSR from user space. With this, any future MSR that is part of the latter categories can be handled in user space. Furthermore, it allows us to replace the existing "ignore_msrs" logic with something that applies per-VM rather than to the full system. That way you can run productive VMs in parallel with experimental ones where you don't care about proper MSR handling. Signed-off-by: Alexander Graf <graf@amazon.com> Reviewed-by: Jim Mattson <jmattson@google.com> Message-Id: <20200925143422.21718-3-graf@amazon.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
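When an access is deflected, the vCPU exits to userspace with a dedicated exit reason; the kvm_run payload looks roughly like the sketch below (my recollection of the layout, given here only as an illustration):

    /* Filled in for KVM_EXIT_X86_RDMSR / KVM_EXIT_X86_WRMSR exits (sketch). */
    struct kvm_msr_exit {
            __u8  error;    /* userspace -> kernel: non-zero injects #GP */
            __u8  pad[7];
            __u32 reason;   /* kernel -> userspace: why the exit happened */
            __u32 index;    /* kernel -> userspace: the MSR being accessed */
            __u64 data;     /* WRMSR value in, RDMSR result out */
    };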
2020-09-28 | KVM: x86: Return -ENOENT on unimplemented MSRs | Alexander Graf
When we find an MSR that we cannot handle, bubble that error code up as the MSR error return code. Follow-up patches will use that to expose to user space the fact that an MSR is not handled by KVM. Suggested-by: Aaron Lewis <aaronlewis@google.com> Signed-off-by: Alexander Graf <graf@amazon.com> Message-Id: <20200925143422.21718-2-graf@amazon.com> Reviewed-by: Jim Mattson <jmattson@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-09-28 | KVM: x86: Rename "shared_msrs" to "user_return_msrs" | Sean Christopherson
Rename the "shared_msrs" mechanism, which is used to defer restoring MSRs that are only consumed when running in userspace, to a more banal but less likely to be confusing "user_return_msrs". The "shared" nomenclature is confusing as it's not obvious who is sharing what, e.g. reasonable interpretations are that the guest value is shared by vCPUs in a VM, or that the MSR value is shared/common to guest and host, both of which are wrong. "shared" is also misleading as the MSR value (in hardware) is not guaranteed to be shared/reused between VMs (if that's indeed the correct interpretation of the name), as the ability to share values between VMs is simply a side effect (albeit a very nice side effect) of deferring restoration of the host value until returning from userspace. "user_return" avoids the above confusion by describing the mechanism itself instead of its effects. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200923180409.32255-2-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-09-28 | KVM: x86: Add RIP to the kvm_entry, i.e. VM-Enter, tracepoint | Sean Christopherson
Add RIP to the kvm_entry tracepoint to help debug if the kvm_exit tracepoint is disabled or if VM-Enter fails, in which case the kvm_exit tracepoint won't be hit. Read RIP from within the tracepoint itself to avoid a potential VMREAD and retpoline if the guest's RIP isn't available. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200923201349.16097-2-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-09-28 | KVM: x86: Add kvm_x86_ops hook to short circuit emulation | Sean Christopherson
Replace the existing kvm_x86_ops.need_emulation_on_page_fault() with a more generic is_emulatable(), and unconditionally call the new function in x86_emulate_instruction(). KVM will use the generic hook to support multiple security related technologies that prevent emulation in one way or another. Similar to the existing AMD #NPF case where emulation of the current instruction is not possible due to lack of information, AMD's SEV-ES and Intel's SGX and TDX will introduce scenarios where emulation is impossible due to the guest's register state being inaccessible. And again similar to the existing #NPF case, emulation can be initiated by kvm_mmu_page_fault(), i.e. outside of the control of vendor-specific code. While the cause and architecturally visible behavior of the various cases are different, e.g. SGX will inject a #UD, AMD #NPF is a clean resume or complete shutdown, and SEV-ES and TDX "return" an error, the impact on the common emulation code is identical: KVM must stop emulation immediately and resume the guest. Query is_emulatable() in handle_ud() as well so that the force_emulation_prefix code doesn't incorrectly modify RIP before calling emulate_instruction() in the absurdly unlikely scenario that KVM encounters forced emulation in conjunction with "do not emulate". Cc: Tom Lendacky <thomas.lendacky@amd.com> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200915232702.15945-1-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-09-28 | KVM: x86: fix MSR_IA32_TSC read for nested migration | Maxim Levitsky
MSR reads/writes should always access the L1 state, since the (nested) hypervisor should intercept all the MSRs it wants to adjust, and those that it doesn't intercept should be read by the guest as if the host had read them. However, IA32_TSC is an exception: even when not intercepted, the guest still reads the value plus the TSC offset, while the write does not take any TSC offset into account. This is documented in Intel's SDM and seems to happen on AMD as well. This creates a problem when userspace wants to read the IA32_TSC value and then write it back (e.g. for migration): it reads the L2 value, but the write is interpreted as an L1 value. To fix this, make userspace-initiated reads of IA32_TSC return the L1 value as well. Huge thanks to Dave Gilbert for helping me understand this very confusing semantic of MSR writes. Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com> Message-Id: <20200921103805.9102-2-mlevitsk@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
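A minimal sketch of the resulting read path (illustrative; the real change lives in kvm_get_msr_common() and may differ in detail):

    /*
     * Host-initiated reads see the L1 TSC offset even while L2 is running,
     * so a read/write round-trip during migration stays an L1 value.
     */
    static u64 kvm_read_ia32_tsc(struct kvm_vcpu *vcpu, bool host_initiated)
    {
            u64 offset = host_initiated ? vcpu->arch.l1_tsc_offset :
                                          vcpu->arch.tsc_offset;

            return kvm_scale_tsc(vcpu, rdtsc()) + offset;
    }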
2020-09-28 | KVM: X86: Move handling of INVPCID types to x86 | Babu Moger
INVPCID instruction handling is mostly the same across both VMX and SVM, so move the code to common x86.c. Signed-off-by: Babu Moger <babu.moger@amd.com> Reviewed-by: Jim Mattson <jmattson@google.com> Message-Id: <159985255212.11252.10322694343971983487.stgit@bmoger-ubuntu> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-09-28 | KVM: X86: Rename and move the function vmx_handle_memory_failure to x86.c | Babu Moger
Handling of kvm_read/write_guest_virt*() errors can be moved to common code. The same code can be used by both VMX and SVM. Signed-off-by: Babu Moger <babu.moger@amd.com> Reviewed-by: Jim Mattson <jmattson@google.com> Message-Id: <159985254493.11252.6603092560732507607.stgit@bmoger-ubuntu> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-09-28 | KVM: LAPIC: Narrow down the kick target vCPU | Wanpeng Li
The kick after setting KVM_REQ_PENDING_TIMER is used to handle the case where the timer fires on a different pCPU from the one the vCPU is running on. This kick costs about 1000 clock cycles, and we don't need it when injecting an already-expired timer or when using the VMX preemption timer, because kvm_lapic_expired_hv_timer() is called from the target vCPU. Reviewed-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Wanpeng Li <wanpengli@tencent.com> Message-Id: <1599731444-3525-6-git-send-email-wanpengli@tencent.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-09-25 | KVM: x86: Reset MMU context if guest toggles CR4.SMAP or CR4.PKE | Sean Christopherson
Reset the MMU context during kvm_set_cr4() if SMAP or PKE is toggled. Recent commits to (correctly) not reload PDPTRs when SMAP/PKE are toggled inadvertently skipped the MMU context reset, due to the mask of bits that triggers PDPTR loads also being used to trigger MMU context resets. Fixes: 427890aff855 ("kvm: x86: Toggling CR4.SMAP does not load PDPTEs in PAE mode") Fixes: cb957adb4ea4 ("kvm: x86: Toggling CR4.PKE does not load PDPTEs in PAE mode") Cc: Jim Mattson <jmattson@google.com> Cc: Peter Shier <pshier@google.com> Cc: Oliver Upton <oupton@google.com> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200923215352.17756-1-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
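A hedged sketch of the distinction the fix restores: one bit mask decides when PDPTRs must be reloaded, a wider one decides when the MMU context must be reset (the macro names here are illustrative):

    /* CR4 bits that force a PDPTE reload under PAE paging (SDM vol. 3, 4.4.1). */
    #define CR4_PDPTR_RELOAD_BITS \
            (X86_CR4_PAE | X86_CR4_PGE | X86_CR4_PSE | X86_CR4_SMEP)

    /* CR4 bits whose toggling must also reset the MMU context/role. */
    #define CR4_MMU_ROLE_BITS \
            (CR4_PDPTR_RELOAD_BITS | X86_CR4_SMAP | X86_CR4_PKE)

    static bool cr4_change_resets_mmu(unsigned long old_cr4, unsigned long new_cr4)
    {
            return (old_cr4 ^ new_cr4) & CR4_MMU_ROLE_BITS;
    }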
2020-09-24 | KVM: x86: fix MSR_IA32_TSC read for nested migration | Maxim Levitsky
MSR reads/writes should always access the L1 state, since the (nested) hypervisor should intercept all the MSRs it wants to adjust, and those that it doesn't intercept should be read by the guest as if the host had read them. However, IA32_TSC is an exception: even when not intercepted, the guest still reads the value plus the TSC offset, while the write does not take any TSC offset into account. This is documented in Intel's SDM and seems to happen on AMD as well. This creates a problem when userspace wants to read the IA32_TSC value and then write it back (e.g. for migration): it reads the L2 value, but the write is interpreted as an L1 value. To fix this, make userspace-initiated reads of IA32_TSC return the L1 value as well. Huge thanks to Dave Gilbert for helping me understand this very confusing semantic of MSR writes. Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com> Message-Id: <20200921103805.9102-2-mlevitsk@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-09-23 | KVM: x86: VMX: Make smaller physical guest address space support user-configurable | Mohammed Gamal
This patch exposes allow_smaller_maxphyaddr to the user as a module parameter. Since smaller physical address spaces are only supported on VMX, the parameter is only exposed in the kvm_intel module. For now disable support by default, and let the user decide if they want to enable it. Modifications to VMX page fault and EPT violation handling will depend on whether that parameter is enabled. Signed-off-by: Mohammed Gamal <mgamal@redhat.com> Message-Id: <20200903141122.72908-1-mgamal@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-09-22 | Merge branch 'x86-seves-for-paolo' of https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip into HEAD | Paolo Bonzini
2020-09-11 | KVM: x86: always allow writing '0' to MSR_KVM_ASYNC_PF_EN | Vitaly Kuznetsov
Even without in-kernel LAPIC we should allow writing '0' to MSR_KVM_ASYNC_PF_EN as we're not enabling the mechanism. In particular, QEMU with 'kernel-irqchip=off' fails to start a guest with qemu-system-x86_64: error: failed to set MSR 0x4b564d02 to 0x0 Fixes: 9d3c447c72fb2 ("KVM: X86: Fix async pf caused null-ptr-deref") Reported-by: Dr. David Alan Gilbert <dgilbert@redhat.com> Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Message-Id: <20200911093147.484565-1-vkuznets@redhat.com> [Actually commit the version proposed by Sean Christopherson. - Paolo] Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
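A hedged sketch of the relaxed check (the real fix lives in the MSR_KVM_ASYNC_PF_EN write handler; the helper below is illustrative):

    /*
     * Writing 0 disables async page faults and must succeed even without an
     * in-kernel LAPIC; the LAPIC is only required when actually enabling the
     * delivery mechanism.
     */
    static int set_async_pf_en(struct kvm_vcpu *vcpu, u64 data)
    {
            bool enabling = data & KVM_ASYNC_PF_ENABLED;

            if (enabling && !lapic_in_kernel(vcpu))
                    return 1;       /* #GP: notifications cannot be delivered */

            vcpu->arch.apf.msr_en_val = data;
            return 0;
    }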
2020-09-11 | Merge tag 'kvmarm-fixes-5.9-1' of git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD | Paolo Bonzini
KVM/arm64 fixes for Linux 5.9, take #1 - Multiple stolen time fixes, with a new capability to match x86 - Fix for hugetlbfs mappings when PUD and PMD are the same level - Fix for hugetlbfs mappings when PTE mappings are enforced (dirty logging, for example) - Fix tracing output of 64bit values
2020-08-23 | treewide: Use fallthrough pseudo-keyword | Gustavo A. R. Silva
Replace the existing /* fall through */ comments and their variants with the new pseudo-keyword macro fallthrough[1]. Also, remove fall-through markings where they are unnecessary. [1] https://www.kernel.org/doc/html/v5.7/process/deprecated.html?highlight=fallthrough#implicit-switch-case-fall-through Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
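A small before/after illustration of the conversion (the case labels and helpers are made up for the example):

    /* Before: intent expressed only as a comment the compiler cannot check. */
    switch (event) {
    case TIMER_ARMED:
            start_timer();
            /* fall through */
    case TIMER_PENDING:
            poll_timer();
            break;
    }

    /* After: the fallthrough pseudo-keyword is checked by -Wimplicit-fallthrough. */
    switch (event) {
    case TIMER_ARMED:
            start_timer();
            fallthrough;
    case TIMER_PENDING:
            poll_timer();
            break;
    }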
2020-08-22 | Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm | Linus Torvalds
Pull kvm fixes from Paolo Bonzini: - PAE and PKU bugfixes for x86 - selftests fix for new binutils - MMU notifier fix for arm64 * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: KVM: arm64: Only reschedule if MMU_NOTIFIER_RANGE_BLOCKABLE is not set KVM: Pass MMU notifier range flags to kvm_unmap_hva_range() kvm: x86: Toggling CR4.PKE does not load PDPTEs in PAE mode kvm: x86: Toggling CR4.SMAP does not load PDPTEs in PAE mode KVM: x86: fix access code passed to gva_to_gpa selftests: kvm: Use a shorter encoding to clear RAX
2020-08-21 | arm64/x86: KVM: Introduce steal-time cap | Andrew Jones
arm64 requires a vcpu fd (KVM_HAS_DEVICE_ATTR vcpu ioctl) to probe support for steal-time. However this is unnecessary, as only a KVM fd is required, and it complicates userspace (userspace may prefer delaying vcpu creation until after feature probing). Introduce a cap that can be checked instead. While x86 can already probe steal-time support with a kvm fd (KVM_GET_SUPPORTED_CPUID), we add the cap there too for consistency. Signed-off-by: Andrew Jones <drjones@redhat.com> Signed-off-by: Marc Zyngier <maz@kernel.org> Reviewed-by: Steven Price <steven.price@arm.com> Link: https://lore.kernel.org/r/20200804170604.42662-7-drjones@redhat.com
2020-08-17 | kvm: x86: Toggling CR4.PKE does not load PDPTEs in PAE mode | Jim Mattson
See the SDM, volume 3, section 4.4.1: If PAE paging would be in use following an execution of MOV to CR0 or MOV to CR4 (see Section 4.1.1) and the instruction is modifying any of CR0.CD, CR0.NW, CR0.PG, CR4.PAE, CR4.PGE, CR4.PSE, or CR4.SMEP; then the PDPTEs are loaded from the address in CR3. Fixes: b9baba8614890 ("KVM, pkeys: expose CPUID/CR4 to guest") Cc: Huaitong Han <huaitong.han@intel.com> Signed-off-by: Jim Mattson <jmattson@google.com> Reviewed-by: Peter Shier <pshier@google.com> Reviewed-by: Oliver Upton <oupton@google.com> Message-Id: <20200817181655.3716509-1-jmattson@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-08-17 | kvm: x86: Toggling CR4.SMAP does not load PDPTEs in PAE mode | Jim Mattson
See the SDM, volume 3, section 4.4.1: If PAE paging would be in use following an execution of MOV to CR0 or MOV to CR4 (see Section 4.1.1) and the instruction is modifying any of CR0.CD, CR0.NW, CR0.PG, CR4.PAE, CR4.PGE, CR4.PSE, or CR4.SMEP; then the PDPTEs are loaded from the address in CR3. Fixes: 0be0226f07d14 ("KVM: MMU: fix SMAP virtualization") Cc: Xiao Guangrong <guangrong.xiao@linux.intel.com> Signed-off-by: Jim Mattson <jmattson@google.com> Reviewed-by: Peter Shier <pshier@google.com> Reviewed-by: Oliver Upton <oupton@google.com> Message-Id: <20200817181655.3716509-2-jmattson@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-08-17 | KVM: x86: fix access code passed to gva_to_gpa | Paolo Bonzini
The PK bit of the error code is computed dynamically in permission_fault and therefore need not be passed to gva_to_gpa: only the access bits (fetch, user, write) need to be passed down. Not doing so causes a splat in the pku test: WARNING: CPU: 25 PID: 5465 at arch/x86/kvm/mmu.h:197 paging64_walk_addr_generic+0x594/0x750 [kvm] Hardware name: Intel Corporation WilsonCity/WilsonCity, BIOS WLYDCRB1.SYS.0014.D62.2001092233 01/09/2020 RIP: 0010:paging64_walk_addr_generic+0x594/0x750 [kvm] Code: <0f> 0b e9 db fe ff ff 44 8b 43 04 4c 89 6c 24 30 8b 13 41 39 d0 89 RSP: 0018:ff53778fc623fb60 EFLAGS: 00010202 RAX: 0000000000000001 RBX: ff53778fc623fbf0 RCX: 0000000000000007 RDX: 0000000000000001 RSI: 0000000000000002 RDI: ff4501efba818000 RBP: 0000000000000020 R08: 0000000000000005 R09: 00000000004000e7 R10: 0000000000000001 R11: 0000000000000000 R12: 0000000000000007 R13: ff4501efba818388 R14: 10000000004000e7 R15: 0000000000000000 FS: 00007f2dcf31a700(0000) GS:ff4501f1c8040000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000000000000 CR3: 0000001dea475005 CR4: 0000000000763ee0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 PKRU: 55555554 Call Trace: paging64_gva_to_gpa+0x3f/0xb0 [kvm] kvm_fixup_and_inject_pf_error+0x48/0xa0 [kvm] handle_exception_nmi+0x4fc/0x5b0 [kvm_intel] kvm_arch_vcpu_ioctl_run+0x911/0x1c10 [kvm] kvm_vcpu_ioctl+0x23e/0x5d0 [kvm] ksys_ioctl+0x92/0xb0 __x64_sys_ioctl+0x16/0x20 do_syscall_64+0x3e/0xb0 entry_SYSCALL_64_after_hwframe+0x44/0xa9 ---[ end trace d17eb998aee991da ]--- Reported-by: Sean Christopherson <sean.j.christopherson@intel.com> Fixes: 897861479c064 ("KVM: x86: Add helper functions for illegal GPA checking and page fault injection") Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-08-12 | Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm | Linus Torvalds
Pull more KVM updates from Paolo Bonzini: "PPC: - Improvements and bugfixes for secure VM support, giving reduced startup time and memory hotplug support. - Locking fixes in nested KVM code - Increase number of guests supported by HV KVM to 4094 - Preliminary POWER10 support ARM: - Split the VHE and nVHE hypervisor code bases, build the EL2 code separately, allowing for the VHE code to now be built with instrumentation - Level-based TLB invalidation support - Restructure of the vcpu register storage to accommodate the NV code - Pointer Authentication available for guests on nVHE hosts - Simplification of the system register table parsing - MMU cleanups and fixes - A number of post-32bit cleanups and other fixes MIPS: - compilation fixes x86: - bugfixes - support for the SERIALIZE instruction" * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (70 commits) KVM: MIPS/VZ: Fix build error caused by 'kvm_run' cleanup x86/kvm/hyper-v: Synic default SCONTROL MSR needs to be enabled MIPS: KVM: Convert a fallthrough comment to fallthrough MIPS: VZ: Only include loongson_regs.h for CPU_LOONGSON64 x86: Expose SERIALIZE for supported cpuid KVM: x86: Don't attempt to load PDPTRs when 64-bit mode is enabled KVM: arm64: Move S1PTW S2 fault logic out of io_mem_abort() KVM: arm64: Don't skip cache maintenance for read-only memslots KVM: arm64: Handle data and instruction external aborts the same way KVM: arm64: Rename kvm_vcpu_dabt_isextabt() KVM: arm: Add trace name for ARM_NISV KVM: arm64: Ensure that all nVHE hyp code is in .hyp.text KVM: arm64: Substitute RANDOMIZE_BASE for HARDEN_EL2_VECTORS KVM: arm64: Make nVHE ASLR conditional on RANDOMIZE_BASE KVM: PPC: Book3S HV: Rework secure mem slot dropping KVM: PPC: Book3S HV: Move kvmppc_svm_page_out up KVM: PPC: Book3S HV: Migrate hot plugged memory KVM: PPC: Book3S HV: In H_SVM_INIT_DONE, migrate remaining normal-GFNs to secure-GFNs KVM: PPC: Book3S HV: Track the state GFNs associated with secure VMs KVM: PPC: Book3S HV: Disable page merging in H_SVM_INIT_START ...
2020-08-11 | Merge tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost | Linus Torvalds
Pull virtio updates from Michael Tsirkin: - IRQ bypass support for vdpa and IFC - MLX5 vdpa driver - Endianness fixes for virtio drivers - Misc other fixes * tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost: (71 commits) vdpa/mlx5: fix up endian-ness for mtu vdpa: Fix pointer math bug in vdpasim_get_config() vdpa/mlx5: Fix pointer math in mlx5_vdpa_get_config() vdpa/mlx5: fix memory allocation failure checks vdpa/mlx5: Fix uninitialised variable in core/mr.c vdpa_sim: init iommu lock virtio_config: fix up warnings on parisc vdpa/mlx5: Add VDPA driver for supported mlx5 devices vdpa/mlx5: Add shared memory registration code vdpa/mlx5: Add support library for mlx5 VDPA implementation vdpa/mlx5: Add hardware descriptive header file vdpa: Modify get_vq_state() to return error code net/vdpa: Use struct for set/get vq state vdpa: remove hard coded virtq num vdpasim: support batch updating vhost-vdpa: support IOTLB batching hints vhost-vdpa: support get/set backend features vhost: generialize backend features setting/getting vhost-vdpa: refine ioctl pre-processing vDPA: dont change vq irq after DRIVER_OK ...
2020-08-09 | KVM: x86: Don't attempt to load PDPTRs when 64-bit mode is enabled | Sean Christopherson
Don't attempt to load PDPTRs if EFER.LME=1, i.e. if 64-bit mode is enabled. A recent change to reload the PDPTRs when CR0.CD or CR0.NW is toggled botched the EFER.LME handling and sends KVM down the PDPTR path when is_paging() is true, i.e. when the guest toggles CD/NW in 64-bit mode. Split the CR0 checks for 64-bit vs. 32-bit PAE into separate paths. The 64-bit path is specifically checking state when paging is toggled on, i.e. CR0.PG transitions from 0->1. The PDPTR path now needs to run if the new CR0 state has paging enabled, irrespective of whether paging was already enabled. Trying to shave a few cycles to make the PDPTR path an "else if" case is a mess. Fixes: d42e3fae6faed ("kvm: x86: Read PDPTEs on CR0.CD and CR0.NW changes") Cc: Jim Mattson <jmattson@google.com> Cc: Oliver Upton <oupton@google.com> Cc: Peter Shier <pshier@google.com> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Reviewed-by: Jim Mattson <jmattson@google.com> Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com> Message-Id: <20200714015732.32426-1-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
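A hedged sketch of the split described above, roughly as it would sit in kvm_set_cr0() (simplified; long-mode entry and other checks are omitted, and the two paths are deliberately separate ifs rather than an "else if"):

    static int check_cr0_pdptrs(struct kvm_vcpu *vcpu, unsigned long cr0,
                                unsigned long old_cr0)
    {
            unsigned long pdptr_bits = X86_CR0_CD | X86_CR0_NW | X86_CR0_PG;

    #ifdef CONFIG_X86_64
            /* 64-bit path: only relevant when paging is being turned on. */
            if ((vcpu->arch.efer & EFER_LME) && !is_paging(vcpu) &&
                (cr0 & X86_CR0_PG)) {
                    /* enter long mode; PDPTRs are never loaded here */
                    return 0;
            }
    #endif
            /* 32-bit PAE path: runs whenever the new CR0 has paging enabled. */
            if (!(vcpu->arch.efer & EFER_LME) && (cr0 & X86_CR0_PG) &&
                is_pae(vcpu) && ((cr0 ^ old_cr0) & pdptr_bits) &&
                !load_pdptrs(vcpu, vcpu->arch.walk_mmu, kvm_read_cr3(vcpu)))
                    return 1;       /* PDPTRs could not be loaded: fail the CR0 write */

            return 0;
    }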
2020-08-06 | Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm | Linus Torvalds
Pull KVM updates from Paolo Bonzini: "s390: - implement diag318 x86: - Report last CPU for debugging - Emulate smaller MAXPHYADDR in the guest than in the host - .noinstr and tracing fixes from Thomas - nested SVM page table switching optimization and fixes Generic: - Unify shadow MMU cache data structures across architectures" * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (127 commits) KVM: SVM: Fix sev_pin_memory() error handling KVM: LAPIC: Set the TDCR settable bits KVM: x86: Specify max TDP level via kvm_configure_mmu() KVM: x86/mmu: Rename max_page_level to max_huge_page_level KVM: x86: Dynamically calculate TDP level from max level and MAXPHYADDR KVM: VXM: Remove temporary WARN on expected vs. actual EPTP level mismatch KVM: x86: Pull the PGD's level from the MMU instead of recalculating it KVM: VMX: Make vmx_load_mmu_pgd() static KVM: x86/mmu: Add separate helper for shadow NPT root page role calc KVM: VMX: Drop a duplicate declaration of construct_eptp() KVM: nSVM: Correctly set the shadow NPT root level in its MMU role KVM: Using macros instead of magic values MIPS: KVM: Fix build error caused by 'kvm_run' cleanup KVM: nSVM: remove nonsensical EXITINFO1 adjustment on nested NPF KVM: x86: Add a capability for GUEST_MAXPHYADDR < HOST_MAXPHYADDR support KVM: VMX: optimize #PF injection when MAXPHYADDR does not match KVM: VMX: Add guest physical address check in EPT violation and misconfig KVM: VMX: introduce vmx_need_pf_intercept KVM: x86: update exception bitmap on CPUID changes KVM: x86: rename update_bp_intercept to update_exception_bitmap ...
2020-08-05 | kvm: detect assigned device via irqbypass manager | Zhu Lingshan
vDPA devices have dedicated backing hardware, like passed-through devices, so it is possible to set up IRQ offloading to a vCPU for vDPA devices. Thus this patch manipulates the assigned-device counters via kvm_arch_start/end_assignment() in the irqbypass manager, so that assigned devices can be detected in update_pi_irte(). We will increase/decrease the assigned device counter in kvm/x86. Both vDPA and VFIO go through this code path. Only x86 uses these counters and kvm_arch_start/end_assignment(), so this code path only affects x86 for now. Signed-off-by: Zhu Lingshan <lingshan.zhu@intel.com> Suggested-by: Jason Wang <jasowang@redhat.com> Link: https://lore.kernel.org/r/20200731065533.4144-3-lingshan.zhu@intel.com Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2020-08-04 | Merge tag 'x86-entry-2020-08-04' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip | Linus Torvalds
Pull x86 conversion to generic entry code from Thomas Gleixner: "The conversion of X86 syscall, interrupt and exception entry/exit handling to the generic code. Pretty much a straightforward 1:1 conversion plus the consolidation of the KVM handling of pending work before entering guest mode" * tag 'x86-entry-2020-08-04' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/kvm: Use __xfer_to_guest_mode_work_pending() in kvm_run_vcpu() x86/kvm: Use generic xfer to guest work function x86/entry: Cleanup idtentry_enter/exit x86/entry: Use generic interrupt entry/exit code x86/entry: Cleanup idtentry_entry/exit_user x86/entry: Use generic syscall exit functionality x86/entry: Use generic syscall entry function x86/ptrace: Provide pt_regs helper for entry/exit x86/entry: Move user return notifier out of loop x86/entry: Consolidate 32/64 bit syscall entry x86/entry: Consolidate check_user_regs() x86: Correct noinstr qualifiers x86/idtentry: Remove stale comment
2020-08-04 | Merge tag 'uninit-macro-v5.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux | Linus Torvalds
Pull uninitialized_var() macro removal from Kees Cook: "This is long overdue, and has hidden too many bugs over the years. The series has several "by hand" fixes, and then a trivial treewide replacement. - Clean up non-trivial uses of uninitialized_var() - Update documentation and checkpatch for uninitialized_var() removal - Treewide removal of uninitialized_var()" * tag 'uninit-macro-v5.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux: compiler: Remove uninitialized_var() macro treewide: Remove uninitialized_var() usage checkpatch: Remove awareness of uninitialized_var() macro mm/debug_vm_pgtable: Remove uninitialized_var() usage f2fs: Eliminate usage of uninitialized_var() macro media: sur40: Remove uninitialized_var() usage KVM: PPC: Book3S PR: Remove uninitialized_var() usage clk: spear: Remove uninitialized_var() usage clk: st: Remove uninitialized_var() usage spi: davinci: Remove uninitialized_var() usage ide: Remove uninitialized_var() usage rtlwifi: rtl8192cu: Remove uninitialized_var() usage b43: Remove uninitialized_var() usage drbd: Remove uninitialized_var() usage x86/mm/numa: Remove uninitialized_var() usage docs: deprecated.rst: Add uninitialized_var()
2020-07-30 | KVM: x86: Specify max TDP level via kvm_configure_mmu() | Sean Christopherson
Capture the max TDP level during kvm_configure_mmu() instead of using a kvm_x86_ops hook to do it at every vCPU creation. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200716034122.5998-10-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-07-30 | KVM: x86: Dynamically calculate TDP level from max level and MAXPHYADDR | Sean Christopherson
Calculate the desired TDP level on the fly using the max TDP level and MAXPHYADDR instead of doing the same when CPUID is updated. This avoids the hidden dependency on cpuid_maxphyaddr() in vmx_get_tdp_level() and also standardizes the "use 5-level paging iff MAXPHYADDR > 48" behavior across x86. Suggested-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200716034122.5998-8-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
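A minimal sketch of the on-the-fly calculation (names follow the commit message; treat the body as illustrative rather than the exact upstream helper):

    /* Use 5-level TDP only when the guest can actually address more than 48 bits. */
    static int kvm_mmu_get_tdp_level(struct kvm_vcpu *vcpu)
    {
            if (max_tdp_level == 5 && cpuid_maxphyaddr(vcpu) <= 48)
                    return 4;

            return max_tdp_level;
    }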
2020-07-30 | x86/kvm: Use __xfer_to_guest_mode_work_pending() in kvm_run_vcpu() | Thomas Gleixner
The comments explicitly explain that the work flags check and handling in kvm_run_vcpu() is done with preemption and interrupts enabled, as KVM invokes the check again right before entering guest mode with interrupts disabled, which guarantees that the work flags are observed and handled before VMENTER. Nevertheless, the pending-flag check in kvm_run_vcpu() uses the helper variant which requires interrupts to be disabled, triggering an instant lockdep splat. This was caught in testing before and then not fixed up in the patch before applying. :( Use the relaxed and intentionally racy __xfer_to_guest_mode_work_pending() instead. Fixes: 72c3c0fe54a3 ("x86/kvm: Use generic xfer to guest work function") Reported-by: Qian Cai <cai@lca.pw> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lkml.kernel.org/r/87bljxa2sa.fsf@nanos.tec.linutronix.de
2020-07-24 | x86/kvm: Use generic xfer to guest work function | Thomas Gleixner
Use the generic infrastructure to check for and handle pending work before transitioning into guest mode. This now handles TIF_NOTIFY_RESUME as well which was ignored so far. Handling it is important as this covers task work and task work will be used to offload the heavy lifting of POSIX CPU timers to thread context. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lkml.kernel.org/r/20200722220520.979724969@linutronix.de
2020-07-16 | treewide: Remove uninitialized_var() usage | Kees Cook
Using uninitialized_var() is dangerous as it papers over real bugs[1] (or can in the future), and suppresses unrelated compiler warnings (e.g. "unused variable"). If the compiler thinks it is uninitialized, either simply initialize the variable or make compiler changes. In preparation for removing[2] the[3] macro[4], remove all remaining needless uses with the following script: git grep '\buninitialized_var\b' | cut -d: -f1 | sort -u | \ xargs perl -pi -e \ 's/\buninitialized_var\(([^\)]+)\)/\1/g; s:\s*/\* (GCC be quiet|to make compiler happy) \*/$::g;' drivers/video/fbdev/riva/riva_hw.c was manually tweaked to avoid pathological white-space. No outstanding warnings were found building allmodconfig with GCC 9.3.0 for x86_64, i386, arm64, arm, powerpc, powerpc64le, s390x, mips, sparc64, alpha, and m68k. [1] https://lore.kernel.org/lkml/20200603174714.192027-1-glider@google.com/ [2] https://lore.kernel.org/lkml/CA+55aFw+Vbj0i=1TGqCR5vQkCzWJ0QxK6CernOU6eedsudAixw@mail.gmail.com/ [3] https://lore.kernel.org/lkml/CA+55aFwgbgqhbp1fkxvRKEpzyR5J8n1vKT1VZdz9knmPuXhOeg@mail.gmail.com/ [4] https://lore.kernel.org/lkml/CA+55aFz2500WfbKXAx8s67wrm9=yVJu65TpLgN_ybYNv0VEOKA@mail.gmail.com/ Reviewed-by: Leon Romanovsky <leonro@mellanox.com> # drivers/infiniband and mlx4/mlx5 Acked-by: Jason Gunthorpe <jgg@mellanox.com> # IB Acked-by: Kalle Valo <kvalo@codeaurora.org> # wireless drivers Reviewed-by: Chao Yu <yuchao0@huawei.com> # erofs Signed-off-by: Kees Cook <keescook@chromium.org>
2020-07-10 | KVM: x86: Add a capability for GUEST_MAXPHYADDR < HOST_MAXPHYADDR support | Mohammed Gamal
This patch adds a new capability KVM_CAP_SMALLER_MAXPHYADDR which allows userspace to query if the underlying architecture would support GUEST_MAXPHYADDR < HOST_MAXPHYADDR and hence act accordingly (e.g. qemu can decide if it should warn for -cpu ..,phys-bits=X) The complications in this patch are due to unexpected (but documented) behaviour we see with NPF vmexit handling in AMD processors. If SVM is modified to add guest physical address checks in the NPF and guest #PF paths, we see the following error multiple times in the 'access' test in kvm-unit-tests: test pte.p pte.36 pde.p: FAIL: pte 2000021 expected 2000001 Dump mapping: address: 0x123400000000 ------L4: 24c3027 ------L3: 24c4027 ------L2: 24c5021 ------L1: 1002000021 This is because the PTE's accessed bit is set by the CPU hardware before the NPF vmexit. This is handled completely by hardware and cannot be fixed in software. Therefore, availability of the new capability depends on a boolean variable allow_smaller_maxphyaddr which is set individually by VMX and SVM init routines. On VMX it's always set to true, on SVM it's only set to true when NPT is not enabled. CC: Tom Lendacky <thomas.lendacky@amd.com> CC: Babu Moger <babu.moger@amd.com> Signed-off-by: Mohammed Gamal <mgamal@redhat.com> Message-Id: <20200710154811.418214-10-mgamal@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
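A hedged userspace-side sketch of how the capability is meant to be consumed (the capability name comes from the commit; the surrounding program is illustrative):

    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/ioctl.h>
    #include <linux/kvm.h>

    int main(void)
    {
            int kvm = open("/dev/kvm", O_RDWR);

            if (kvm < 0)
                    return 1;

            /* Non-zero means the host can emulate GUEST_MAXPHYADDR < HOST_MAXPHYADDR. */
            if (ioctl(kvm, KVM_CHECK_EXTENSION, KVM_CAP_SMALLER_MAXPHYADDR) <= 0)
                    fprintf(stderr, "guest phys-bits smaller than host not supported\n");

            return 0;
    }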
2020-07-10 | KVM: x86: rename update_bp_intercept to update_exception_bitmap | Paolo Bonzini
We would like to introduce a callback to update the #PF intercept when CPUID changes. Just reuse update_bp_intercept since VMX is already using update_exception_bitmap instead of a bespoke function. While at it, remove an unnecessary assignment in the SVM version, which is already done in the caller (kvm_arch_vcpu_ioctl_set_guest_debug) and has nothing to do with the exception bitmap. Reviewed-by: Jim Mattson <jmattson@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-07-10 | KVM: x86: Add helper functions for illegal GPA checking and page fault injection | Mohammed Gamal
This patch adds two helper functions that will be used to support virtualizing MAXPHYADDR in both kvm-intel.ko and kvm.ko. kvm_fixup_and_inject_pf_error() injects a page fault for a user-specified GVA, while kvm_mmu_is_illegal_gpa() checks whether a GPA exceeds vCPU address limits. Signed-off-by: Mohammed Gamal <mgamal@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-Id: <20200710154811.418214-2-mgamal@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
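A minimal sketch of what the GPA-limit helper boils down to (illustrative; the upstream helper may differ in detail):

    /* A GPA is illegal if it sets any bit above the vCPU's MAXPHYADDR. */
    static inline bool kvm_mmu_is_illegal_gpa(struct kvm_vcpu *vcpu, gpa_t gpa)
    {
            return !!(gpa >> cpuid_maxphyaddr(vcpu));
    }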