summaryrefslogtreecommitdiff
path: root/arch/x86
AgeCommit message (Collapse)Author
2020-06-01KVM: rename kvm_arch_can_inject_async_page_present() to ↵Vitaly Kuznetsov
kvm_arch_can_dequeue_async_page_present() An innocent reader of the following x86 KVM code: bool kvm_arch_can_inject_async_page_present(struct kvm_vcpu *vcpu) { if (!(vcpu->arch.apf.msr_val & KVM_ASYNC_PF_ENABLED)) return true; ... may get very confused: if APF mechanism is not enabled, why do we report that we 'can inject async page present'? In reality, upon injection kvm_arch_async_page_present() will check the same condition again and, in case APF is disabled, will just drop the item. This is fine as the guest which deliberately disabled APF doesn't expect to get any APF notifications. Rename kvm_arch_can_inject_async_page_present() to kvm_arch_can_dequeue_async_page_present() to make it clear what we are checking: if the item can be dequeued (meaning either injected or just dropped). On s390 kvm_arch_can_inject_async_page_present() always returns 'true' so the rename doesn't matter much. Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Message-Id: <20200525144125.143875-4-vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-06-01KVM: x86: extend struct kvm_vcpu_pv_apf_data with token infoVitaly Kuznetsov
Currently, APF mechanism relies on the #PF abuse where the token is being passed through CR2. If we switch to using interrupts to deliver page-ready notifications we need a different way to pass the data. Extent the existing 'struct kvm_vcpu_pv_apf_data' with token information for page-ready notifications. While on it, rename 'reason' to 'flags'. This doesn't change the semantics as we only have reasons '1' and '2' and these can be treated as bit flags but KVM_PV_REASON_PAGE_READY is going away with interrupt based delivery making 'reason' name misleading. The newly introduced apf_put_user_ready() temporary puts both flags and token information, this will be changed to put token only when we switch to interrupt based notifications. Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Message-Id: <20200525144125.143875-3-vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-06-01Revert "KVM: async_pf: Fix #DF due to inject "Page not Present" and "Page ↵Vitaly Kuznetsov
Ready" exceptions simultaneously" Commit 9a6e7c39810e (""KVM: async_pf: Fix #DF due to inject "Page not Present" and "Page Ready" exceptions simultaneously") added a protection against 'page ready' notification coming before 'page not present' is delivered. This situation seems to be impossible since commit 2a266f23550b ("KVM MMU: check pending exception before injecting APF) which added 'vcpu->arch.exception.pending' check to kvm_can_do_async_pf. On x86, kvm_arch_async_page_present() has only one call site: kvm_check_async_pf_completion() loop and we only enter the loop when kvm_arch_can_inject_async_page_present(vcpu) which when async pf msr is enabled, translates into kvm_can_do_async_pf(). There is also one problem with the cancellation mechanism. We don't seem to check that the 'page not present' notification we're canceling matches the 'page ready' notification so in theory, we may erroneously drop two valid events. Revert the commit. Reviewed-by: Gavin Shan <gshan@redhat.com> Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Message-Id: <20200525144125.143875-2-vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-06-01KVM: VMX: Replace zero-length array with flexible-arrayGustavo A. R. Silva
The current codebase makes use of the zero-length array language extension to the C90 standard, but the preferred mechanism to declare variable-length types such as these ones is a flexible array member[1][2], introduced in C99: struct foo { int stuff; struct boo array[]; }; By making use of the mechanism above, we will get a compiler warning in case the flexible array does not occur last in the structure, which will help us prevent some kind of undefined behavior bugs from being inadvertently introduced[3] to the codebase from now on. Also, notice that, dynamic memory allocations won't be affected by this change: "Flexible array members have incomplete type, and so the sizeof operator may not be applied. As a quirk of the original implementation of zero-length arrays, sizeof evaluates to zero."[1] sizeof(flexible-array-member) triggers a warning because flexible array members have incomplete type[1]. There are some instances of code in which the sizeof operator is being incorrectly/erroneously applied to zero-length arrays and the result is zero. Such instances may be hiding some bugs. So, this work (flexible-array member conversions) will also help to get completely rid of those sorts of issues. This issue was found with the help of Coccinelle. [1] https://gcc.gnu.org/onlinedocs/gcc/Zero-Length.html [2] https://github.com/KSPP/linux/issues/21 [3] commit 76497732932f ("cxgb3/l2t: Fix undefined behaviour") Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org> Message-Id: <20200507185618.GA14831@embeddedor> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-06-01KVM: nSVM: implement KVM_GET_NESTED_STATE and KVM_SET_NESTED_STATEPaolo Bonzini
Similar to VMX, the state that is captured through the currently available IOCTLs is a mix of L1 and L2 state, dependent on whether the L2 guest was running at the moment when the process was interrupted to save its state. In particular, the SVM-specific state for nested virtualization includes the L1 saved state (including the interrupt flag), the cached L2 controls, and the GIF. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-06-01KVM: MMU: pass arbitrary CR0/CR4/EFER to kvm_init_shadow_mmuPaolo Bonzini
This allows fetching the registers from the hsave area when setting up the NPT shadow MMU, and is needed for KVM_SET_NESTED_STATE (which runs long after the CR0, CR4 and EFER values in vcpu have been switched to hold L2 guest state). Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-06-01KVM: nSVM: leave guest mode when clearing EFER.SVMEPaolo Bonzini
According to the AMD manual, the effect of turning off EFER.SVME while a guest is running is undefined. We make it leave guest mode immediately, similar to the effect of clearing the VMX bit in MSR_IA32_FEAT_CTL. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-06-01KVM: nSVM: split nested_vmcb_check_controlsPaolo Bonzini
The authoritative state does not come from the VMCB once in guest mode, but KVM_SET_NESTED_STATE can still perform checks on L1's provided SVM controls because we get them from userspace. Therefore, split out a function to do them. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-06-01KVM: nSVM: remove HF_HIF_MASKPaolo Bonzini
The L1 flags can be found in the save area of svm->nested.hsave, fish it from there so that there is one fewer thing to migrate. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-06-01KVM: nSVM: remove HF_VINTR_MASKPaolo Bonzini
Now that the int_ctl field is stored in svm->nested.ctl.int_ctl, we can use it instead of vcpu->arch.hflags to check whether L2 is running in V_INTR_MASKING mode. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-06-01KVM: nSVM: synthesize correct EXITINTINFO on vmexitPaolo Bonzini
This bit was added to nested VMX right when nested_run_pending was introduced, but it is not yet there in nSVM. Since we can have pending events that L0 injected directly into L2 on vmentry, we have to transfer them into L1's queue. For this to work, one important change is required: svm_complete_interrupts (which clears the "injected" fields from the previous VMRUN, and updates them from svm->vmcb's EXITINTINFO) must be placed before we inject the vmexit. This is not too scary though; VMX even does it in vmx_vcpu_run. While at it, the nested_vmexit_inject tracepoint is moved towards the end of nested_svm_vmexit. This ensures that the synthesized EXITINTINFO is visible in the trace. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-06-01KVM: SVM: preserve VGIF across VMCB switchPaolo Bonzini
There is only one GIF flag for the whole processor, so make sure it is not clobbered when switching to L2 (in which case we also have to include the V_GIF_ENABLE_MASK, lest we confuse enable_gif/disable_gif/gif_set). When going back, L1 could in theory have entered L2 without issuing a CLGI so make sure the svm_set_gif is done last, after svm->vmcb->control.int_ctl has been copied back from hsave. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-06-01KVM: nSVM: extract svm_set_gifPaolo Bonzini
Extract the code that is needed to implement CLGI and STGI, so that we can run it from VMRUN and vmexit (and in the future, KVM_SET_NESTED_STATE). Skip the request for KVM_REQ_EVENT unless needed, subsuming the evaluate_pending_interrupts optimization that is found in enter_svm_guest_mode. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-06-01KVM: nSVM: remove unnecessary ifPaolo Bonzini
kvm_vcpu_apicv_active must be false when nested virtualization is enabled, so there is no need to check it in clgi_interception. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-06-01KVM: nSVM: synchronize VMCB controls updated by the processor on every vmexitPaolo Bonzini
The control state changes on every L2->L0 vmexit, and we will have to serialize it in the nested state. So keep it up to date in svm->nested.ctl and just copy them back to the nested VMCB in nested_svm_vmexit. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-06-01KVM: nSVM: restore clobbered INT_CTL fields after clearing VINTRPaolo Bonzini
Restore the INT_CTL value from the guest's VMCB once we've stopped using it, so that virtual interrupts can be injected as requested by L1. V_TPR is up-to-date however, and it can change if the guest writes to CR8, so keep it. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-06-01KVM: nSVM: save all control fields in svm->nestedPaolo Bonzini
In preparation for nested SVM save/restore, store all data that matters from the VMCB control area into svm->nested. It will then become part of the nested SVM state that is saved by KVM_SET_NESTED_STATE and restored by KVM_GET_NESTED_STATE, just like the cached vmcs12 for nVMX. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-06-01KVM: nSVM: remove trailing padding for struct vmcb_control_areaPaolo Bonzini
Allow placing the VMCB structs on the stack or in other structs without wasting too much space. Add BUILD_BUG_ON as a quick safeguard against typos. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-06-01KVM: nSVM: pass vmcb_control_area to copy_vmcb_control_areaPaolo Bonzini
This will come in handy when we put a struct vmcb_control_area in svm->nested. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-06-01KVM: nSVM: clean up tsc_offset updatePaolo Bonzini
Use l1_tsc_offset to compute svm->vcpu.arch.tsc_offset and svm->vmcb->control.tsc_offset, instead of relying on hsave. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-06-01KVM: nSVM: move MMU setup to nested_prepare_vmcb_controlPaolo Bonzini
Everything that is needed during nested state restore is now part of nested_prepare_vmcb_control. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-06-01KVM: nSVM: extract preparation of VMCB for nested runPaolo Bonzini
Split out filling svm->vmcb.save and svm->vmcb.control before VMRUN. Only the latter will be useful when restoring nested SVM state. This patch introduces no semantic change, so the MMU setup is still done in nested_prepare_vmcb_save. The next patch will clean up things. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-06-01KVM: nSVM: extract load_nested_vmcb_controlPaolo Bonzini
When restoring SVM nested state, the control state cache in svm->nested will have to be filled, but the save state will not have to be moved into svm->vmcb. Therefore, pull the code that handles the control area into a separate function. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-06-01KVM: nSVM: move map argument out of enter_svm_guest_modePaolo Bonzini
Unmapping the nested VMCB in enter_svm_guest_mode is a bit of a wart, since the map argument is not used elsewhere in the function. There are just two callers, and those are also the place where kvm_vcpu_map is called, so it is cleaner to unmap there. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-05-31Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netDavid S. Miller
xdp_umem.c had overlapping changes between the 64-bit math fix for the calculation of npgs and the removal of the zerocopy memory type which got rid of the chunk_size_nohdr member. The mlx5 Kconfig conflict is a case where we just take the net-next copy of the Kconfig entry dependency as it takes on the ESWITCH dependency by one level of indirection which is what the 'net' conflicting change is trying to ensure. Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-31Merge tag 'x86-urgent-2020-05-31' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 fixes from Thomas Gleixner: "A pile of x86 fixes: - Prevent a memory leak in ioperm which was caused by the stupid assumption that the exit cleanup is always called for current, which is not the case when fork fails after taking a reference on the ioperm bitmap. - Fix an arithmething overflow in the DMA code on 32bit systems - Fill gaps in the xstate copy with defaults instead of leaving them uninitialized - Revert: "Make __X32_SYSCALL_BIT be unsigned long" as it turned out that existing user space fails to build" * tag 'x86-urgent-2020-05-31' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/ioperm: Prevent a memory leak when fork fails x86/dma: Fix max PFN arithmetic overflow on 32 bit systems copy_xstate_to_kernel(): don't leave parts of destination uninitialized x86/syscalls: Revert "x86/syscalls: Make __X32_SYSCALL_BIT be unsigned long"
2020-05-29take the dummy csum_and_copy_from_user() into net/checksum.hAl Viro
now that can be done conveniently - all non-trivial cases have _HAVE_ARCH_COPY_AND_CSUM_FROM_USER defined, so the fallback in net/checksum.h is used only for dummy (copy_from_user, then csum_partial) implementation. Allowing us to get rid of all dummy instances, both of csum_and_copy_from_user() and csum_partial_copy_from_user(). Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-05-29x86: switch 32bit csum_and_copy_to_user() to user_access_{begin,end}()Al Viro
consolidate HAVE_CSUM_COPY_USER for 32bit and 64bit, while are at it Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-05-29x86: switch both 32bit and 64bit to providing csum_and_copy_from_user()Al Viro
... rather than messing with the wrapper. As a side effect, 32bit variant gets access_ok() into it and can be switched to user_access_begin()/user_access_end() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-05-29x86_64: csum_..._copy_..._user(): switch to unsafe_..._user()Al Viro
We already have stac/clac pair around the calls of csum_partial_copy_generic(). Stretch that area back, so that it covers the preceding loop (and convert the loop body from __{get,put}_user() to unsafe_{get,put}_user()). That brings the beginning of the areas to the earlier access_ok(), which allows to convert them into user_access_{begin,end}() ones. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-05-29Merge branch 'fixes' of ↵Ingo Molnar
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs into x86/urgent Pick up FPU register dump fixes from Al Viro. Signed-off-by: Ingo Molnar <mingo@kernel.org>
2020-05-28x86/ioperm: Prevent a memory leak when fork failsJay Lang
In the copy_process() routine called by _do_fork(), failure to allocate a PID (or further along in the function) will trigger an invocation to exit_thread(). This is done to clean up from an earlier call to copy_thread_tls(). Naturally, the child task is passed into exit_thread(), however during the process, io_bitmap_exit() nullifies the parent's io_bitmap rather than the child's. As copy_thread_tls() has been called ahead of the failure, the reference count on the calling thread's io_bitmap is incremented as we would expect. However, io_bitmap_exit() doesn't accept any arguments, and thus assumes it should trash the current thread's io_bitmap reference rather than the child's. This is pretty sneaky in practice, because in all instances but this one, exit_thread() is called with respect to the current task and everything works out. A determined attacker can issue an appropriate ioctl (i.e. KDENABIO) to get a bitmap allocated, and force a clone3() syscall to fail by passing in a zeroed clone_args structure. The kernel handles the erroneous struct and the buggy code path is followed, and even though the parent's reference to the io_bitmap is trashed, the child still holds a reference and thus the structure will never be freed. Fix this by tweaking io_bitmap_exit() and its subroutines to accept a task_struct argument which to operate on. Fixes: ea5f1cd7ab49 ("x86/ioperm: Remove bitmap if all permissions dropped") Signed-off-by: Jay Lang <jaytlang@mit.edu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: stable#@vger.kernel.org Link: https://lkml.kernel.org/r/20200524162742.253727-1-jaytlang@mit.edu
2020-05-28x86/spinlock: Remove obsolete ticket spinlock macros and typesWaiman Long
Even though the x86 ticket spinlock code has been removed with cfd8983f03c7 ("x86, locking/spinlocks: Remove ticket (spin)lock implementation") a while ago, there are still some ticket spinlock specific macros and types left in the asm/spinlock_types.h header file that are no longer used. Remove those as well to avoid confusion. Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Borislav Petkov <bp@suse.de> Acked-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Peter Zijlstra <peterz@infradead.org> Link: https://lkml.kernel.org/r/20200526122014.25241-1-longman@redhat.com
2020-05-28x86/split_lock: Add Icelake microserver and Tigerlake CPU modelsFenghua Yu
Icelake microserver CPU supports split lock detection while it doesn't have the split lock enumeration bit in IA32_CORE_CAPABILITIES. Tigerlake CPUs do enumerate the MSR. [ bp: Merge the two model-adding patches into one. ] Signed-off-by: Fenghua Yu <fenghua.yu@intel.com> Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Tony Luck <tony.luck@intel.com> Link: https://lkml.kernel.org/r/1588290395-2677-1-git-send-email-fenghua.yu@intel.com
2020-05-28x86/dma: Fix max PFN arithmetic overflow on 32 bit systemsAlexander Dahl
The intermediate result of the old term (4UL * 1024 * 1024 * 1024) is 4 294 967 296 or 0x100000000 which is no problem on 64 bit systems. The patch does not change the later overall result of 0x100000 for MAX_DMA32_PFN (after it has been shifted by PAGE_SHIFT). The new calculation yields the same result, but does not require 64 bit arithmetic. On 32 bit systems the old calculation suffers from an arithmetic overflow in that intermediate term in braces: 4UL aka unsigned long int is 4 byte wide and an arithmetic overflow happens (the 0x100000000 does not fit in 4 bytes), the in braces result is truncated to zero, the following right shift does not alter that, so MAX_DMA32_PFN evaluates to 0 on 32 bit systems. That wrong value is a problem in a comparision against MAX_DMA32_PFN in the init code for swiotlb in pci_swiotlb_detect_4gb() to decide if swiotlb should be active. That comparison yields the opposite result, when compiling on 32 bit systems. This was not possible before 1b7e03ef7570 ("x86, NUMA: Enable emulation on 32bit too") when that MAX_DMA32_PFN was first made visible to x86_32 (and which landed in v3.0). In practice this wasn't a problem, unless CONFIG_SWIOTLB is active on x86-32. However if one has set CONFIG_IOMMU_INTEL, since c5a5dc4cbbf4 ("iommu/vt-d: Don't switch off swiotlb if bounce page is used") there's a dependency on CONFIG_SWIOTLB, which was not necessarily active before. That landed in v5.4, where we noticed it in the fli4l Linux distribution. We have CONFIG_IOMMU_INTEL active on both 32 and 64 bit kernel configs there (I could not find out why, so let's just say historical reasons). The effect is at boot time 64 MiB (default size) were allocated for bounce buffers now, which is a noticeable amount of memory on small systems like pcengines ALIX 2D3 with 256 MiB memory, which are still frequently used as home routers. We noticed this effect when migrating from kernel v4.19 (LTS) to v5.4 (LTS) in fli4l and got that kernel messages for example: Linux version 5.4.22 (buildroot@buildroot) (gcc version 7.3.0 (Buildroot 2018.02.8)) #1 SMP Mon Nov 26 23:40:00 CET 2018 … Memory: 183484K/261756K available (4594K kernel code, 393K rwdata, 1660K rodata, 536K init, 456K bss , 78272K reserved, 0K cma-reserved, 0K highmem) … PCI-DMA: Using software bounce buffering for IO (SWIOTLB) software IO TLB: mapped [mem 0x0bb78000-0x0fb78000] (64MB) The initial analysis and the suggested fix was done by user 'sourcejedi' at stackoverflow and explicitly marked as GPLv2 for inclusion in the Linux kernel: https://unix.stackexchange.com/a/520525/50007 The new calculation, which does not suffer from that overflow, is the same as for arch/mips now as suggested by Robin Murphy. The fix was tested by fli4l users on round about two dozen different systems, including both 32 and 64 bit archs, bare metal and virtualized machines. [ bp: Massage commit message. ] Fixes: 1b7e03ef7570 ("x86, NUMA: Enable emulation on 32bit too") Reported-by: Alan Jenkins <alan.christopher.jenkins@gmail.com> Suggested-by: Robin Murphy <robin.murphy@arm.com> Signed-off-by: Alexander Dahl <post@lespocky.de> Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: stable@vger.kernel.org Link: https://unix.stackexchange.com/q/520065/50007 Link: https://web.nettworks.org/bugs/browse/FFL-2560 Link: https://lkml.kernel.org/r/20200526175749.20742-1-post@lespocky.de
2020-05-28x86/mm: Drop deprecated DISCONTIGMEM support for 32-bitMike Rapoport
The DISCONTIGMEM support was marked as deprecated in v5.2 and since there were no complaints about it for almost 5 releases it can be completely removed. Signed-off-by: Mike Rapoport <rppt@linux.ibm.com> Signed-off-by: Borislav Petkov <bp@suse.de> Acked-by: Dave Hansen <dave.hansen@linux.intel.com> Link: https://lkml.kernel.org/r/20200223094322.15206-1-rppt@kernel.org
2020-05-28x86/Kconfig: Update config and kernel doc for MPK feature on AMDBabu Moger
AMD's next generation of EPYC processors support the MPK (Memory Protection Keys) feature. Update the dependency and documentation. Signed-off-by: Babu Moger <babu.moger@amd.com> Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com> Link: https://lkml.kernel.org/r/159068199556.26992.17733929401377275140.stgit@naples-babu.amd.com
2020-05-28KVM: nVMX: always update CR3 in VMCSPaolo Bonzini
vmx_load_mmu_pgd is delaying the write of GUEST_CR3 to prepare_vmcs02 as an optimization, but this is only correct before the nested vmentry. If userspace is modifying CR3 with KVM_SET_SREGS after the VM has already been put in guest mode, the value of CR3 will not be updated. Remove the optimization, which almost never triggers anyway. Fixes: 04f11ef45810 ("KVM: nVMX: Always write vmcs02.GUEST_CR3 during nested VM-Enter") Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-05-28KVM: SVM: always update CR3 in VMCBPaolo Bonzini
svm_load_mmu_pgd is delaying the write of GUEST_CR3 to prepare_vmcs02 as an optimization, but this is only correct before the nested vmentry. If userspace is modifying CR3 with KVM_SET_SREGS after the VM has already been put in guest mode, the value of CR3 will not be updated. Remove the optimization, which almost never triggers anyway. This was was added in commit 689f3bf21628 ("KVM: x86: unify callbacks to load paging root", 2020-03-16) just to keep the two vendor-specific modules closer, but we'll fix VMX too. Fixes: 689f3bf21628 ("KVM: x86: unify callbacks to load paging root") Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-05-28KVM: nSVM: correctly inject INIT vmexitsPaolo Bonzini
The usual drill at this point, except there is no code to remove because this case was not handled at all. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-05-28KVM: nSVM: remove exit_requiredPaolo Bonzini
All events now inject vmexits before vmentry rather than after vmexit. Therefore, exit_required is not set anymore and we can remove it. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-05-28KVM: nSVM: inject exceptions via svm_check_nested_eventsPaolo Bonzini
This allows exceptions injected by the emulator to be properly delivered as vmexits. The code also becomes simpler, because we can just let all L0-intercepted exceptions go through the usual path. In particular, our emulation of the VMX #DB exit qualification is very much simplified, because the vmexit injection path can use kvm_deliver_exception_payload to update DR6. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-05-28KVM: x86: enable event window in inject_pending_eventPaolo Bonzini
In case an interrupt arrives after nested.check_events but before the call to kvm_cpu_has_injectable_intr, we could end up enabling the interrupt window even if the interrupt is actually going to be a vmexit. This is useless rather than harmful, but it really complicates reasoning about SVM's handling of the VINTR intercept. We'd like to never bother with the VINTR intercept if V_INTR_MASKING=1 && INTERCEPT_INTR=1, because in that case there is no interrupt window and we can just exit the nested guest whenever we want. This patch moves the opening of the interrupt window inside inject_pending_event. This consolidates the check for pending interrupt/NMI/SMI in one place, and makes KVM's usage of immediate exits more consistent, extending it beyond just nested virtualization. There are two functional changes here. They only affect corner cases, but overall they simplify the inject_pending_event. - re-injection of still-pending events will also use req_immediate_exit instead of using interrupt-window intercepts. This should have no impact on performance on Intel since it simply replaces an interrupt-window or NMI-window exit for a preemption-timer exit. On AMD, which has no equivalent of the preemption time, it may incur some overhead but an actual effect on performance should only be visible in pathological cases. - kvm_arch_interrupt_allowed and kvm_vcpu_has_events will return true if an interrupt, NMI or SMI is blocked by nested_run_pending. This makes sense because entering the VM will allow it to make progress and deliver the event. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2020-05-28perf/x86/rapl: Add AMD Fam17h RAPL supportStephane Eranian
This patch enables AMD Fam17h RAPL support for the Package level metric. The support is as per AMD Fam17h Model31h (Zen2) and model 00-ffh (Zen1) PPR. The same output is available via the energy-pkg pseudo event: $ perf stat -a -I 1000 --per-socket -e power/energy-pkg/ Signed-off-by: Stephane Eranian <eranian@google.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20200527224659.206129-6-eranian@google.com
2020-05-28perf/x86/rapl: Make perf_probe_msr() more robust and flexibleStephane Eranian
This patch modifies perf_probe_msr() by allowing passing of struct perf_msr array where some entries are not populated, i.e., they have either an msr address of 0 or no attribute_group pointer. This helps with certain call paths, e.g., RAPL. In case the grp is NULL, the default sysfs visibility rule applies which is to make the group visible. Without the patch, you would get a kernel crash with a NULL group. Signed-off-by: Stephane Eranian <eranian@google.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20200527224659.206129-5-eranian@google.com
2020-05-28perf/x86/rapl: Flip logic on default events visibilityStephane Eranian
This patch modifies the default visibility of the attribute_group for each RAPL event. By default if the grp.is_visible field is NULL, sysfs considers that it must display the attribute group. If the field is not NULL (callback function), then the return value of the callback determines the visibility (0 = not visible). The RAPL attribute groups had the field set to NULL, meaning that unless they failed the probing from perf_msr_probe(), they would be visible. We want to avoid having to specify attribute groups that are not supported by the HW in the rapl_msrs[] array, they don't have an MSR address to begin with. Therefore, we intialize the visible field of all RAPL attribute groups to a callback that returns 0. If the RAPL msr goes through probing and succeeds the is_visible field will be set back to NULL (visible). If the probing fails the field is set to a callback that return 0 (not visible). Signed-off-by: Stephane Eranian <eranian@google.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20200527224659.206129-4-eranian@google.com
2020-05-28perf/x86/rapl: Refactor to share the RAPL code between Intel and AMD CPUsStephane Eranian
This patch modifies the rapl_model struct to include architecture specific knowledge in this previously Intel specific structure, and in particular it adds the MSR for POWER_UNIT and the rapl_msrs array. No functional changes. Signed-off-by: Stephane Eranian <eranian@google.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20200527224659.206129-3-eranian@google.com
2020-05-28perf/x86/rapl: Move RAPL support to common x86 codeStephane Eranian
To prepare for support of both Intel and AMD RAPL. As per the AMD PPR, Fam17h support Package RAPL counters to monitor power usage. The RAPL counter operates as with Intel RAPL, and as such it is beneficial to share the code. No change in functionality. Signed-off-by: Stephane Eranian <eranian@google.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20200527224659.206129-2-eranian@google.com
2020-05-28Merge tag 'v5.7-rc7' into perf/core, to pick up fixesIngo Molnar
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2020-05-27x86/PCI: Drop unused xen_register_pirq() gsi_override parameterWei Liu
All callers of xen_register_pirq() pass -1 (no override) for the gsi_override parameter. Remove it and related code. Link: https://lore.kernel.org/r/20200428153640.76476-1-wei.liu@kernel.org Signed-off-by: Wei Liu <wei.liu@kernel.org> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>