summaryrefslogtreecommitdiff
path: root/arch/x86
AgeCommit message (Collapse)Author
2019-09-15x86: bug.h: use asm_inline in _BUG_FLAGS definitionsRasmus Villemoes
This helps preventing a BUG* or WARN* in some static inline from preventing that (or one of its callers) being inlined, so should allow gcc to make better informed inlining decisions. For example, with gcc 9.2, tcp_fastopen_no_cookie() vanishes from net/ipv4/tcp_fastopen.o. It does not itself have any BUG or WARN, but it calls dst_metric() which has a WARN_ON_ONCE - and despite that WARN_ON_ONCE vanishing since the condition is compile-time false, dst_metric() is apparently sufficiently "large" that when it gets inlined into tcp_fastopen_no_cookie(), the latter becomes too large for inlining. Overall, if one asks size(1), .text decreases a little and .data increases by about the same amount (x86-64 defconfig) $ size vmlinux.{before,after} text data bss dec hex filename 19709726 5202600 1630280 26542606 195020e vmlinux.before 19709330 5203068 1630280 26542678 1950256 vmlinux.after while bloat-o-meter says add/remove: 10/28 grow/shrink: 103/51 up/down: 3669/-2854 (815) ... Total: Before=14783683, After=14784498, chg +0.01% Acked-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk> Signed-off-by: Miguel Ojeda <miguel.ojeda.sandonis@gmail.com>
2019-09-15x86: alternative.h: use asm_inline for all alternative variantsRasmus Villemoes
Most, if not all, uses of the alternative* family just provide one or two instructions in .text, but the string literal can be quite large, causing gcc to overestimate the size of the generated code. That in turn affects its decisions about inlining of the function containing the alternative() asm statement. New enough versions of gcc allow one to overrule the estimated size by using "asm inline" instead of just "asm". So replace asm by the helper asm_inline, which for older gccs just expands to asm. Acked-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk> Signed-off-by: Miguel Ojeda <miguel.ojeda.sandonis@gmail.com>
2019-09-14KVM: x86/mmu: Reintroduce fast invalidate/zap for flushing memslotSean Christopherson
James Harvey reported a livelock that was introduced by commit d012a06ab1d23 ("Revert "KVM: x86/mmu: Zap only the relevant pages when removing a memslot""). The livelock occurs because kvm_mmu_zap_all() as it exists today will voluntarily reschedule and drop KVM's mmu_lock, which allows other vCPUs to add shadow pages. With enough vCPUs, kvm_mmu_zap_all() can get stuck in an infinite loop as it can never zap all pages before observing lock contention or the need to reschedule. The equivalent of kvm_mmu_zap_all() that was in use at the time of the reverted commit (4e103134b8623, "KVM: x86/mmu: Zap only the relevant pages when removing a memslot") employed a fast invalidate mechanism and was not susceptible to the above livelock. There are three ways to fix the livelock: - Reverting the revert (commit d012a06ab1d23) is not a viable option as the revert is needed to fix a regression that occurs when the guest has one or more assigned devices. It's unlikely we'll root cause the device assignment regression soon enough to fix the regression timely. - Remove the conditional reschedule from kvm_mmu_zap_all(). However, although removing the reschedule would be a smaller code change, it's less safe in the sense that the resulting kvm_mmu_zap_all() hasn't been used in the wild for flushing memslots since the fast invalidate mechanism was introduced by commit 6ca18b6950f8d ("KVM: x86: use the fast way to invalidate all pages"), back in 2013. - Reintroduce the fast invalidate mechanism and use it when zapping shadow pages in response to a memslot being deleted/moved, which is what this patch does. For all intents and purposes, this is a revert of commit ea145aacf4ae8 ("Revert "KVM: MMU: fast invalidate all pages"") and a partial revert of commit 7390de1e99a70 ("Revert "KVM: x86: use the fast way to invalidate all pages""), i.e. restores the behavior of commit 5304b8d37c2a5 ("KVM: MMU: fast invalidate all pages") and commit 6ca18b6950f8d ("KVM: x86: use the fast way to invalidate all pages") respectively. Fixes: d012a06ab1d23 ("Revert "KVM: x86/mmu: Zap only the relevant pages when removing a memslot"") Reported-by: James Harvey <jamespharvey20@gmail.com> Cc: Alex Willamson <alex.williamson@redhat.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: stable@vger.kernel.org Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-09-14KVM: x86: work around leak of uninitialized stack contentsFuqian Huang
Emulation of VMPTRST can incorrectly inject a page fault when passed an operand that points to an MMIO address. The page fault will use uninitialized kernel stack memory as the CR2 and error code. The right behavior would be to abort the VM with a KVM_EXIT_INTERNAL_ERROR exit to userspace; however, it is not an easy fix, so for now just ensure that the error code and CR2 are zero. Signed-off-by: Fuqian Huang <huangfq.daxian@gmail.com> Cc: stable@vger.kernel.org [add comment] Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-09-14KVM: nVMX: handle page fault in vmreadPaolo Bonzini
The implementation of vmread to memory is still incomplete, as it lacks the ability to do vmread to I/O memory just like vmptrst. Cc: stable@vger.kernel.org Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-09-14KVM: X86: Use IPI shorthands in kvm guest when supportWanpeng Li
IPI shorthand is supported now by linux apic/x2apic driver, switch to IPI shorthand for all excluding self and all including self destination shorthand in kvm guest, to avoid splitting the target mask into several PV IPI hypercalls. This patch removes the kvm_send_ipi_all() and kvm_send_ipi_allbutself() since the callers in APIC codes have already taken care of apic_use_ipi_shorthand and fallback to ->send_IPI_mask and ->send_IPI_mask_allbutself if it is false. Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Radim Krčmář <rkrcmar@redhat.com> Cc: Sean Christopherson <sean.j.christopherson@intel.com> Cc: Nadav Amit <namit@vmware.com> Signed-off-by: Wanpeng Li <wanpengli@tencent.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-09-12Merge branch 'x86-urgent-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 fixes from Ingo Molnar: "A KVM guest fix, and a kdump kernel relocation errors fix" * 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/timer: Force PIT initialization when !X86_FEATURE_ARAT x86/purgatory: Change compiler flags from -mcmodel=kernel to -mcmodel=large to fix kexec relocation errors
2019-09-11KVM: x86: Fix INIT signal handling in various CPU statesLiran Alon
Commit cd7764fe9f73 ("KVM: x86: latch INITs while in system management mode") changed code to latch INIT while vCPU is in SMM and process latched INIT when leaving SMM. It left a subtle remark in commit message that similar treatment should also be done while vCPU is in VMX non-root-mode. However, INIT signals should actually be latched in various vCPU states: (*) For both Intel and AMD, INIT signals should be latched while vCPU is in SMM. (*) For Intel, INIT should also be latched while vCPU is in VMX operation and later processed when vCPU leaves VMX operation by executing VMXOFF. (*) For AMD, INIT should also be latched while vCPU runs with GIF=0 or in guest-mode with intercept defined on INIT signal. To fix this: 1) Add kvm_x86_ops->apic_init_signal_blocked() such that each CPU vendor can define the various CPU states in which INIT signals should be blocked and modify kvm_apic_accept_events() to use it. 2) Modify vmx_check_nested_events() to check for pending INIT signal while vCPU in guest-mode. If so, emualte vmexit on EXIT_REASON_INIT_SIGNAL. Note that nSVM should have similar behaviour but is currently left as a TODO comment to implement in the future because nSVM don't yet implement svm_check_nested_events(). Note: Currently KVM nVMX implementation don't support VMX wait-for-SIPI activity state as specified in MSR_IA32_VMX_MISC bits 6:8 exposed to guest (See nested_vmx_setup_ctls_msrs()). If and when support for this activity state will be implemented, kvm_check_nested_events() would need to avoid emulating vmexit on INIT signal in case activity-state is wait-for-SIPI. In addition, kvm_apic_accept_events() would need to be modified to avoid discarding SIPI in case VMX activity-state is wait-for-SIPI but instead delay SIPI processing to vmx_check_nested_events() that would clear pending APIC events and emulate vmexit on SIPI. Reviewed-by: Joao Martins <joao.m.martins@oracle.com> Co-developed-by: Nikita Leshenko <nikita.leshchenko@oracle.com> Signed-off-by: Nikita Leshenko <nikita.leshchenko@oracle.com> Signed-off-by: Liran Alon <liran.alon@oracle.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-09-11KVM: VMX: Introduce exit reason for receiving INIT signal on guest-modeLiran Alon
According to Intel SDM section 25.2 "Other Causes of VM Exits", When INIT signal is received on a CPU that is running in VMX non-root mode it should cause an exit with exit-reason of 3. (See Intel SDM Appendix C "VMX BASIC EXIT REASONS") This patch introduce the exit-reason definition. Reviewed-by: Bhavesh Davda <bhavesh.davda@oracle.com> Reviewed-by: Joao Martins <joao.m.martins@oracle.com> Co-developed-by: Nikita Leshenko <nikita.leshchenko@oracle.com> Signed-off-by: Nikita Leshenko <nikita.leshchenko@oracle.com> Signed-off-by: Liran Alon <liran.alon@oracle.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-09-11KVM: VMX: Stop the preemption timer during vCPU resetWanpeng Li
The hrtimer which is used to emulate lapic timer is stopped during vcpu reset, preemption timer should do the same. Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Radim Krčmář <rkrcmar@redhat.com> Signed-off-by: Wanpeng Li <wanpengli@tencent.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-09-11KVM: LAPIC: Micro optimize IPI latencyWanpeng Li
This patch optimizes the virtual IPI emulation sequence: write ICR2 write ICR2 write ICR read ICR2 read ICR ==> send virtual IPI read ICR2 write ICR send virtual IPI It can reduce kvm-unit-tests/vmexit.flat IPI testing latency(from sender send IPI to sender receive the ACK) from 3319 cycles to 3203 cycles on SKylake server. Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Radim Krčmář <rkrcmar@redhat.com> Signed-off-by: Wanpeng Li <wanpengli@tencent.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-09-11kvm: Nested KVM MMUs need PAE root tooJiří Paleček
On AMD processors, in PAE 32bit mode, nested KVM instances don't work. The L0 host get a kernel OOPS, which is related to arch.mmu->pae_root being NULL. The reason for this is that when setting up nested KVM instance, arch.mmu is set to &arch.guest_mmu (while normally, it would be &arch.root_mmu). However, the initialization and allocation of pae_root only creates it in root_mmu. KVM code (ie. in mmu_alloc_shadow_roots) then accesses arch.mmu->pae_root, which is the unallocated arch.guest_mmu->pae_root. This fix just allocates (and frees) pae_root in both guest_mmu and root_mmu (and also lm_root if it was allocated). The allocation is subject to previous restrictions ie. it won't allocate anything on 64-bit and AFAIK not on Intel. Fixes: https://bugzilla.kernel.org/show_bug.cgi?id=203923 Fixes: 14c07ad89f4d ("x86/kvm/mmu: introduce guest_mmu") Signed-off-by: Jiri Palecek <jpalecek@web.de> Tested-by: Jiri Palecek <jpalecek@web.de> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-09-11KVM: x86: set ctxt->have_exception in x86_decode_insn()Jan Dakinevich
x86_emulate_instruction() takes into account ctxt->have_exception flag during instruction decoding, but in practice this flag is never set in x86_decode_insn(). Fixes: 6ea6e84309ca ("KVM: x86: inject exceptions produced by x86_decode_insn") Cc: stable@vger.kernel.org Cc: Denis Lunev <den@virtuozzo.com> Cc: Roman Kagan <rkagan@virtuozzo.com> Cc: Denis Plotnikov <dplotnikov@virtuozzo.com> Signed-off-by: Jan Dakinevich <jan.dakinevich@virtuozzo.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-09-11KVM: x86: always stop emulation on page faultJan Dakinevich
inject_emulated_exception() returns true if and only if nested page fault happens. However, page fault can come from guest page tables walk, either nested or not nested. In both cases we should stop an attempt to read under RIP and give guest to step over its own page fault handler. This is also visible when an emulated instruction causes a #GP fault and the VMware backdoor is enabled. To handle the VMware backdoor, KVM intercepts #GP faults; with only the next patch applied, x86_emulate_instruction() injects a #GP but returns EMULATE_FAIL instead of EMULATE_DONE. EMULATE_FAIL causes handle_exception_nmi() (or gp_interception() for SVM) to re-inject the original #GP because it thinks emulation failed due to a non-VMware opcode. This patch prevents the issue as x86_emulate_instruction() will return EMULATE_DONE after injecting the #GP. Fixes: 6ea6e84309ca ("KVM: x86: inject exceptions produced by x86_decode_insn") Cc: stable@vger.kernel.org Cc: Denis Lunev <den@virtuozzo.com> Cc: Roman Kagan <rkagan@virtuozzo.com> Cc: Denis Plotnikov <dplotnikov@virtuozzo.com> Signed-off-by: Jan Dakinevich <jan.dakinevich@virtuozzo.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-09-11cpuidle-haltpoll: Enable kvm guest polling when dedicated physical CPUs are ↵Wanpeng Li
available The downside of guest side polling is that polling is performed even with other runnable tasks in the host. However, even if poll in kvm can aware whether or not other runnable tasks in the same pCPU, it can still incur extra overhead in over-subscribe scenario. Now we can just enable guest polling when dedicated pCPUs are available. Acked-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Wanpeng Li <wanpengli@tencent.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2019-09-11KVM: nVMX: trace nested VM-Enter failures detected by H/WSean Christopherson
Use the recently added tracepoint for logging nested VM-Enter failures instead of spamming the kernel log when hardware detects a consistency check failure. Take the opportunity to print the name of the error code instead of dumping the raw hex number, but limit the symbol table to error codes that can reasonably be encountered by KVM. Add an equivalent tracepoint in nested_vmx_check_vmentry_hw(), e.g. so that tracing of "invalid control field" errors isn't suppressed when nested early checks are enabled. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-09-11KVM: nVMX: add tracepoint for failed nested VM-EnterSean Christopherson
Debugging a failed VM-Enter is often like searching for a needle in a haystack, e.g. there are over 80 consistency checks that funnel into the "invalid control field" error code. One way to expedite debug is to run the buggy code as an L1 guest under KVM (and pray that the failing check is detected by KVM). However, extracting useful debug information out of L0 KVM requires attaching a debugger to KVM and/or modifying the source, e.g. to log which check is failing. Make life a little less painful for VMM developers and add a tracepoint for failed VM-Enter consistency checks. Ideally the tracepoint would capture both what check failed and precisely why it failed, but logging why a checked failed is difficult to do in a generic tracepoint without resorting to invasive techniques, e.g. generating a custom string on failure. That being said, for the vast majority of VM-Enter failures the most difficult step is figuring out exactly what to look at, e.g. figuring out which bit was incorrectly set in a control field is usually not too painful once the guilty field as been identified. To reach a happy medium between precision and ease of use, simply log the code that detected a failed check, using a macro to execute the check and log the trace event on failure. This approach enables tracing arbitrary code, e.g. it's not limited to function calls or specific formats of checks, and the changes to the existing code are minimally invasive. A macro with a two-character name is desirable as usage of the macro doesn't result in overly long lines or confusing alignment, while still retaining some amount of readability. I.e. a one-character name is a little too terse, and a three-character name results in the contents being passed to the macro aligning with an indented line when the macro is used an in if-statement, e.g.: if (VCC(nested_vmx_check_long_line_one(...) && nested_vmx_check_long_line_two(...))) return -EINVAL; And that is the story of how the CC(), a.k.a. Consistency Check, macro got its name. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-09-11x86: KVM: svm: Fix a check in nested_svm_vmrun()Dan Carpenter
We refactored this code a bit and accidentally deleted the "-" character from "-EINVAL". The kvm_vcpu_map() function never returns positive EINVAL. Fixes: c8e16b78c614 ("x86: KVM: svm: eliminate hardcoded RIP advancement from vmrun_interception()") Cc: stable@vger.kernel.org Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com> Reviewed-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-09-11KVM: x86: Return to userspace with internal error on unexpected exit reasonLiran Alon
Receiving an unexpected exit reason from hardware should be considered as a severe bug in KVM. Therefore, instead of just injecting #UD to guest and ignore it, exit to userspace on internal error so that it could handle it properly (probably by terminating guest). In addition, prefer to use vcpu_unimpl() instead of WARN_ONCE() as handling unexpected exit reason should be a rare unexpected event (that was expected to never happen) and we prefer to print a message on it every time it occurs to guest. Furthermore, dump VMCS/VMCB to dmesg to assist diagnosing such cases. Reviewed-by: Mihai Carabas <mihai.carabas@oracle.com> Reviewed-by: Nikita Leshenko <nikita.leshchenko@oracle.com> Reviewed-by: Joao Martins <joao.m.martins@oracle.com> Signed-off-by: Liran Alon <liran.alon@oracle.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-09-11swiotlb-xen: simplify cache maintainanceChristoph Hellwig
Now that we know we always have the dma-noncoherent.h helpers available if we are on an architecture with support for non-coherent devices, we can just call them directly, and remove the calls to the dma-direct routines, including the fact that we call the dma_direct_map_page routines but ignore the value returned from it. Instead we now have Xen wrappers for the arch_sync_dma_for_{device,cpu} helpers that call the special Xen versions of those routines for foreign pages. Note that the new helpers get the physical address passed in addition to the dma address to avoid another translation for the local cache maintainance. The pfn_valid checks remain on the dma address as in the old code, even if that looks a little funny. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Reviewed-by: Stefano Stabellini <sstabellini@kernel.org> Acked-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
2019-09-11xen: remove the exports for xen_{create,destroy}_contiguous_regionChristoph Hellwig
These routines are only used by swiotlb-xen, which cannot be modular. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
2019-09-11Merge branches 'arm/omap', 'arm/exynos', 'arm/smmu', 'arm/mediatek', ↵Joerg Roedel
'arm/qcom', 'arm/renesas', 'x86/amd', 'x86/vt-d' and 'core' into next
2019-09-10KVM: x86: Add kvm_emulate_{rd,wr}msr() to consolidate VXM/SVM codeSean Christopherson
Move RDMSR and WRMSR emulation into common x86 code to consolidate nearly identical SVM and VMX code. Note, consolidating RDMSR introduces an extra indirect call, i.e. retpoline, due to reaching {svm,vmx}_get_msr() via kvm_x86_ops, but a guest kernel likely has bigger problems if increasing the latency of RDMSR VM-Exits by ~70 cycles has a measurable impact on overall VM performance. E.g. the only recurring RDMSR VM-Exits (after booting) on my system running Linux 5.2 in the guest are for MSR_IA32_TSC_ADJUST via arch_cpu_idle_enter(). No functional change intended. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-09-10KVM: x86: Refactor up kvm_{g,s}et_msr() to simplify callersSean Christopherson
Refactor the top-level MSR accessors to take/return the index and value directly instead of requiring the caller to dump them into a msr_data struct. No functional change intended. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-09-10KVM: X86: Tune PLE Window tracepointPeter Xu
The PLE window tracepoint triggers even if the window is not changed, and the wording can be a bit confusing too. One example line: kvm_ple_window: vcpu 0: ple_window 4096 (shrink 4096) It easily let people think of "the window now is 4096 which is shrinked", but the truth is the value actually didn't change (4096). Let's only dump this message if the value really changed, and we make the message even simpler like: kvm_ple_window: vcpu 4 old 4096 new 8192 (growed) Signed-off-by: Peter Xu <peterx@redhat.com> Reviewed-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-09-10KVM: VMX: Change ple_window type to unsigned intPeter Xu
The VMX ple_window is 32 bits wide, so logically it can overflow with an int. The module parameter is declared as unsigned int which is good, however the dynamic variable is not. Switching all the ple_window references to use unsigned int. The tracepoint changes will also affect SVM, but SVM is using an even smaller width (16 bits) so it's always fine. Suggested-by: Sean Christopherson <sean.j.christopherson@intel.com> Reviewed-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Peter Xu <peterx@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-09-10KVM: X86: Remove tailing newline for tracepointsPeter Xu
It's done by TP_printk() already. Reviewed-by: Krish Sadhukhan <krish.sadhukhan@oracle.com> Reviewed-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Peter Xu <peterx@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-09-10KVM: X86: Trace vcpu_id for vmexitPeter Xu
Tracing the ID helps to pair vmenters and vmexits for guests with multiple vCPUs. Reviewed-by: Krish Sadhukhan <krish.sadhukhan@oracle.com> Reviewed-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Peter Xu <peterx@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-09-10Merge tag 'kvmarm-5.4' of ↵Paolo Bonzini
git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD KVM/arm updates for 5.4 - New ITS translation cache - Allow up to 512 CPUs to be supported with GICv3 (for real this time) - Now call kvm_arch_vcpu_blocking early in the blocking sequence - Tidy-up device mappings in S2 when DIC is available - Clean icache invalidation on VMID rollover - General cleanup
2019-09-10Merge tag 'kvm-ppc-next-5.4-1' of ↵Paolo Bonzini
git://git.kernel.org/pub/scm/linux/kernel/git/paulus/powerpc into HEAD PPC KVM update for 5.4 - Some prep for extending the uses of the rmap array - Various minor fixes - Commits from the powerpc topic/ppc-kvm branch, which fix a problem with interrupts arriving after free_irq, causing host hangs and crashes.
2019-09-10KVM: x86: Manually calculate reserved bits when loading PDPTRSSean Christopherson
Manually generate the PDPTR reserved bit mask when explicitly loading PDPTRs. The reserved bits that are being tracked by the MMU reflect the current paging mode, which is unlikely to be PAE paging in the vast majority of flows that use load_pdptrs(), e.g. CR0 and CR4 emulation, __set_sregs(), etc... This can cause KVM to incorrectly signal a bad PDPTR, or more likely, miss a reserved bit check and subsequently fail a VM-Enter due to a bad VMCS.GUEST_PDPTR. Add a one off helper to generate the reserved bits instead of sharing code across the MMU's calculations and the PDPTR emulation. The PDPTR reserved bits are basically set in stone, and pushing a helper into the MMU's calculation adds unnecessary complexity without improving readability. Oppurtunistically fix/update the comment for load_pdptrs(). Note, the buggy commit also introduced a deliberate functional change, "Also remove bit 5-6 from rsvd_bits_mask per latest SDM.", which was effectively (and correctly) reverted by commit cd9ae5fe47df ("KVM: x86: Fix page-tables reserved bits"). A bit of SDM archaeology shows that the SDM from late 2008 had a bug (likely a copy+paste error) where it listed bits 6:5 as AVL and A for PDPTEs used for 4k entries but reserved for 2mb entries. I.e. the SDM contradicted itself, and bits 6:5 are and always have been reserved. Fixes: 20c466b56168d ("KVM: Use rsvd_bits_mask in load_pdptrs()") Cc: stable@vger.kernel.org Cc: Nadav Amit <nadav.amit@gmail.com> Reported-by: Doug Reiland <doug.reiland@intel.com> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Reviewed-by: Peter Xu <peterx@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-09-10KVM: x86: Disable posted interrupts for non-standard IRQs delivery modesAlexander Graf
We can easily route hardware interrupts directly into VM context when they target the "Fixed" or "LowPriority" delivery modes. However, on modes such as "SMI" or "Init", we need to go via KVM code to actually put the vCPU into a different mode of operation, so we can not post the interrupt Add code in the VMX and SVM PI logic to explicitly refuse to establish posted mappings for advanced IRQ deliver modes. This reflects the logic in __apic_accept_irq() which also only ever passes Fixed and LowPriority interrupts as posted interrupts into the guest. This fixes a bug I have with code which configures real hardware to inject virtual SMIs into my guest. Signed-off-by: Alexander Graf <graf@amazon.com> Reviewed-by: Liran Alon <liran.alon@oracle.com> Reviewed-by: Sean Christopherson <sean.j.christopherson@intel.com> Reviewed-by: Wanpeng Li <wanpengli@tencent.com> Cc: stable@vger.kernel.org Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2019-09-10x86/umip: Add emulation (spoofing) for UMIP covered instructions in 64-bit ↵Brendan Shanks
processes as well Add emulation (spoofing) of the SGDT, SIDT, and SMSW instructions for 64-bit processes. Wine users have encountered a number of 64-bit Windows games that use these instructions (particularly SGDT), and were crashing when run on UMIP-enabled systems. Originally-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com> Signed-off-by: Brendan Shanks <bshanks@codeweavers.com> Reviewed-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com> Reviewed-by: H. Peter Anvin (Intel) <hpa@zytor.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: Eric W. Biederman <ebiederm@xmission.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: https://lkml.kernel.org/r/20190905232222.14900-1-bshanks@codeweavers.com [ Minor edits: capitalization, added 'spoofing' wording. ] Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-09-09crypto: x86/aes-ni - use AES library instead of single-use AES cipherArd Biesheuvel
The RFC4106 key derivation code instantiates an AES cipher transform to encrypt only a single block before it is freed again. Switch to the new AES library which is more suitable for such use cases. Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2019-09-08x86/timer: Force PIT initialization when !X86_FEATURE_ARATJan Stancek
KVM guests with commit c8c4076723da ("x86/timer: Skip PIT initialization on modern chipsets") applied to guest kernel have been observed to have unusually higher CPU usage with symptoms of increase in vm exits for HLT and MSW_WRITE (MSR_IA32_TSCDEADLINE). This is caused by older QEMUs lacking support for X86_FEATURE_ARAT. lapic clock retains CLOCK_EVT_FEAT_C3STOP and nohz stays inactive. There's no usable broadcast device either. Do the PIT initialization if guest CPU lacks X86_FEATURE_ARAT. On real hardware it shouldn't matter as ARAT and DEADLINE come together. Fixes: c8c4076723da ("x86/timer: Skip PIT initialization on modern chipsets") Signed-off-by: Jan Stancek <jstancek@redhat.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2019-09-07Revert "x86/apic: Include the LDR when clearing out APIC registers"Linus Torvalds
This reverts commit 558682b5291937a70748d36fd9ba757fb25b99ae. Chris Wilson reports that it breaks his CPU hotplug test scripts. In particular, it breaks offlining and then re-onlining the boot CPU, which we treat specially (and the BIOS does too). The symptoms are that we can offline the CPU, but it then does not come back online again: smpboot: CPU 0 is now offline smpboot: Booting Node 0 Processor 0 APIC 0x0 smpboot: do_boot_cpu failed(-1) to wakeup CPU#0 Thomas says he knows why it's broken (my personal suspicion: our magic handling of the "cpu0_logical_apicid" thing), but for 5.3 the right fix is to just revert it, since we've never touched the LDR bits before, and it's not worth the risk to do anything else at this stage. [ Hotpluging of the boot CPU is special anyway, and should be off by default. See the "BOOTPARAM_HOTPLUG_CPU0" config option and the cpu0_hotplug kernel parameter. In general you should not do it, and it has various known limitations (hibernate and suspend require the boot CPU, for example). But it should work, even if the boot CPU is special and needs careful treatment - Linus ] Link: https://lore.kernel.org/lkml/156785100521.13300.14461504732265570003@skylake-alporthouse-com/ Reported-by: Chris Wilson <chris@chris-wilson.co.uk> Acked-by: Thomas Gleixner <tglx@linutronix.de> Cc: Bandan Das <bsd@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-09-06x86/asm: Make some functions local labelsJiri Slaby
Boris suggests to make a local label (prepend ".L") to these functions to eliminate them from the symbol table. These are functions with very local names and really should not be visible anywhere. Note that objtool won't see these functions anymore (to generate ORC debug info). But all the functions are not annotated with ENDPROC, so they won't have objtool's attention anyway. Signed-off-by: Jiri Slaby <jslaby@suse.cz> Signed-off-by: Borislav Petkov <bp@suse.de> Cc: Andy Lutomirski <luto@kernel.org> Cc: Cao jin <caoj.fnst@cn.fujitsu.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Steve Winslow <swinslow@gmail.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Wei Huang <wei@redhat.com> Cc: x86-ml <x86@kernel.org> Cc: Xiaoyao Li <xiaoyao.li@linux.intel.com> Link: https://lkml.kernel.org/r/20190906075550.23435-2-jslaby@suse.cz
2019-09-06x86/asm/suspend: Get rid of bogus_64_magicJiri Slaby
bogus_64_magic is only a dead-end loop. There is no need for an out-of-order function (and unannotated local label), so just handle it in-place and also store 0xbad-m-a-g-i-c to %rcx beforehand, in case someone is inspecting registers. Here a qemu+gdb example: Remote debugging using localhost:1235 wakeup_long64 () at arch/x86/kernel/acpi/wakeup_64.S:26 26 jmp 1b (gdb) info registers rax 0x123456789abcdef0 1311768467463790320 rbx 0x0 0 rcx 0xbad6d61676963 3286910041024867 ^^^^^^^^^^^^^^^ [ bp: Add the gdb example. ] Signed-off-by: Jiri Slaby <jslaby@suse.cz> Signed-off-by: Borislav Petkov <bp@suse.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Len Brown <lenb@kernel.org> Cc: linux-pm@vger.kernel.org Cc: Pavel Machek <pavel@ucw.cz> Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: x86-ml <x86@kernel.org> Link: https://lkml.kernel.org/r/20190906075550.23435-1-jslaby@suse.cz
2019-09-06x86/purgatory: Change compiler flags from -mcmodel=kernel to -mcmodel=large ↵Steve Wahl
to fix kexec relocation errors The last change to this Makefile caused relocation errors when loading a kdump kernel. Restore -mcmodel=large (not -mcmodel=kernel), -ffreestanding, and -fno-zero-initialized-bsss, without reverting to the former practice of resetting KBUILD_CFLAGS. Purgatory.ro is a standalone binary that is not linked against the rest of the kernel. Its image is copied into an array that is linked to the kernel, and from there kexec relocates it wherever it desires. With the previous change to compiler flags, the error "kexec: Overflow in relocation type 11 value 0x11fffd000" was encountered when trying to load the crash kernel. This is from kexec code trying to relocate the purgatory.ro object. From the error message, relocation type 11 is R_X86_64_32S. The x86_64 ABI says: "The R_X86_64_32 and R_X86_64_32S relocations truncate the computed value to 32-bits. The linker must verify that the generated value for the R_X86_64_32 (R_X86_64_32S) relocation zero-extends (sign-extends) to the original 64-bit value." This type of relocation doesn't work when kexec chooses to place the purgatory binary in memory that is not reachable with 32 bit addresses. The compiler flag -mcmodel=kernel allows those type of relocations to be emitted, so revert to using -mcmodel=large as was done before. Also restore the -ffreestanding and -fno-zero-initialized-bss flags because they are appropriate for a stand alone piece of object code which doesn't explicitly zero the bss, and one other report has said undefined symbols are encountered without -ffreestanding. These identical compiler flag changes need to happen for every object that becomes part of the purgatory.ro object, so gather them together first into PURGATORY_CFLAGS_REMOVE and PURGATORY_CFLAGS, and then apply them to each of the objects that have C source. Do not apply any of these flags to kexec-purgatory.o, which is not part of the standalone object but part of the kernel proper. Tested-by: Vaibhav Rustagi <vaibhavrustagi@google.com> Tested-by: Andreas Smas <andreas@lonelycoder.com> Signed-off-by: Steve Wahl <steve.wahl@hpe.com> Reviewed-by: Nick Desaulniers <ndesaulniers@google.com> Cc: Borislav Petkov <bp@alien8.de> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: None Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: clang-built-linux@googlegroups.com Cc: dimitri.sivanich@hpe.com Cc: mike.travis@hpe.com Cc: russ.anderson@hpe.com Fixes: b059f801a937 ("x86/purgatory: Use CFLAGS_REMOVE rather than reset KBUILD_CFLAGS") Link: https://lkml.kernel.org/r/20190905202346.GA26595@swahl-linux Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-09-06x86/platform/uv: Fix kmalloc() NULL check routineAustin Kim
The result of kmalloc() should have been checked ahead of below statement: pqp = (struct bau_pq_entry *)vp; Move BUG_ON(!vp) before above statement. Signed-off-by: Austin Kim <austindh.kim@gmail.com> Cc: Dimitri Sivanich <dimitri.sivanich@hpe.com> Cc: Hedi Berriche <hedi.berriche@hpe.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Travis <mike.travis@hpe.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Russ Anderson <russ.anderson@hpe.com> Cc: Steve Wahl <steve.wahl@hpe.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: allison@lohutok.net Cc: andy@infradead.org Cc: armijn@tjaldur.nl Cc: bp@alien8.de Cc: dvhart@infradead.org Cc: gregkh@linuxfoundation.org Cc: hpa@zytor.com Cc: kjlu@umn.edu Cc: platform-driver-x86@vger.kernel.org Link: https://lkml.kernel.org/r/20190905232951.GA28779@LGEARND20B15 Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-09-06Merge tag 'v5.3-rc7' into x86/platform, to refresh the branchIngo Molnar
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-09-06x86/cpu: Update init data for new Airmont CPU modelRahul Tanwar
Update properties for newly added Airmont CPU variant. Signed-off-by: Rahul Tanwar <rahul.tanwar@linux.intel.com> Signed-off-by: Tony Luck <tony.luck@intel.com> Cc: Gayatri Kammela <gayatri.kammela@intel.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: https://lkml.kernel.org/r/20190905193020.14707-5-tony.luck@intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-09-06x86/cpu: Add new Airmont variant to Intel familyRahul Tanwar
Add new Airmont variant CPU model to Intel family. Signed-off-by: Rahul Tanwar <rahul.tanwar@linux.intel.com> Signed-off-by: Tony Luck <tony.luck@intel.com> Cc: Gayatri Kammela <gayatri.kammela@intel.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: https://lkml.kernel.org/r/20190905193020.14707-4-tony.luck@intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-09-06x86/cpu: Add Elkhart Lake to Intel familyGayatri Kammela
Add the model number/CPUID of atom based Elkhart Lake to the Intel family. Signed-off-by: Gayatri Kammela <gayatri.kammela@intel.com> Signed-off-by: Tony Luck <tony.luck@intel.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rahul Tanwar <rahul.tanwar@linux.intel.com> Cc: Thomas Gleixner <tglx@linutronix.de> Link: https://lkml.kernel.org/r/20190905193020.14707-3-tony.luck@intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-09-06x86/cpu: Add Tiger Lake to Intel familyGayatri Kammela
Add the model numbers/CPUIDs of Tiger Lake mobile and desktop to the Intel family. Suggested-by: Tony Luck <tony.luck@intel.com> Signed-off-by: Gayatri Kammela <gayatri.kammela@intel.com> Signed-off-by: Tony Luck <tony.luck@intel.com> Reviewed-by: Tony Luck <tony.luck@intel.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rahul Tanwar <rahul.tanwar@linux.intel.com> Cc: Thomas Gleixner <tglx@linutronix.de> Link: https://lkml.kernel.org/r/20190905193020.14707-2-tony.luck@intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-09-06Merge branch 'x86/cleanups' into x86/cpu, to pick up dependent changesIngo Molnar
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-09-05Merge branch 'x86-urgent-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 fixes from Ingo Molnar: "Misc fixes: - EFI boot fix for signed kernels - an AC flags fix related to UBSAN - Hyper-V infinite loop fix" * 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/hyper-v: Fix overflow bug in fill_gva_list() x86/uaccess: Don't leak the AC flags into __get_user() argument evaluation x86/boot: Preserve boot_params.secure_boot from sanitizing
2019-09-05x86/mm: Fix cpumask_of_node() error conditionPeter Zijlstra
When CONFIG_DEBUG_PER_CPU_MAPS=y we validate that the @node argument of cpumask_of_node() is a valid node_id. It however forgets to check for negative numbers. Fix this by explicitly casting to unsigned int. (unsigned)node >= nr_node_ids verifies: 0 <= node < nr_node_ids Also ammend the error message to match the condition. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Borislav Petkov <bp@alien8.de> Cc: Yunsheng Lin <linyunsheng@huawei.com> Link: https://lkml.kernel.org/r/20190903075352.GY2369@hirez.programming.kicks-ass.net Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-09-05crypto: sha256 - Merge crypto/sha256.h into crypto/sha.hHans de Goede
The generic sha256 implementation from lib/crypto/sha256.c uses data structs defined in crypto/sha.h, so lets move the function prototypes there too. Signed-off-by: Hans de Goede <hdegoede@redhat.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2019-09-05crypto: x86 - Rename functions to avoid conflict with crypto/sha256.hHans de Goede
Rename static / file-local functions so that they do not conflict with the functions declared in crypto/sha256.h. This is a preparation patch for folding crypto/sha256.h into crypto/sha.h. Signed-off-by: Hans de Goede <hdegoede@redhat.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>