summaryrefslogtreecommitdiff
path: root/arch/x86
AgeCommit message (Collapse)Author
2017-10-12KVM: SVM: limit kvm_handle_page_fault to #PF handlingPaolo Bonzini
It has always annoyed me a bit how SVM_EXIT_NPF is handled by pf_interception. This is also the only reason behind the under-documented need_unprotect argument to kvm_handle_page_fault. Let NPF go straight to kvm_mmu_page_fault, just like VMX does in handle_ept_violation and handle_ept_misconfig. Reviewed-by: Brijesh Singh <brijesh.singh@amd.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-10-12KVM: SVM: unconditionally wake up VCPU on IOMMU interruptPaolo Bonzini
Checking the mode is unnecessary, and is done without a memory barrier separating the LAPIC write from the vcpu->mode read; in addition, kvm_vcpu_wake_up is already doing a check for waiters on the wait queue that has the same effect. In practice it's safe because spin_lock has full-barrier semantics on x86, but don't be too clever. Reviewed-by: Radim Krčmář <rkrcmar@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-10-12arch/x86: remove redundant null checks before kmem_cache_destroyTim Hansen
Remove redundant null checks before calling kmem_cache_destroy. Found with make coccicheck M=arch/x86/kvm on linux-next tag next-20170929. Signed-off-by: Tim Hansen <devtimhansen@gmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-10-12KVM: VMX: Don't expose unrestricted_guest is enabled if ept is disabledWanpeng Li
SDM mentioned: "If either the “unrestricted guest†VM-execution control or the “mode-based execute control for EPT†VM- execution control is 1, the “enable EPT†VM-execution control must also be 1." However, we can still observe unrestricted_guest is Y after inserting the kvm-intel.ko w/ ept=N. It depends on later starts a guest in order that the function vmx_compute_secondary_exec_control() can be executed, then both the module parameter and exec control fields will be amended. This patch fixes it by amending module parameter immediately during vmcs data setup. Reviewed-by: Jim Mattson <jmattson@google.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Radim Krčmář <rkrcmar@redhat.com> Cc: Jim Mattson <jmattson@google.com> Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-10-12KVM: X86: Processor States following Reset or INITWanpeng Li
- XCR0 is reset to 1 by RESET but not INIT - XSS is zeroed by both RESET and INIT - BNDCFGU, BND0-BND3, BNDCFGS, BNDSTATUS are zeroed by both RESET and INIT This patch does this according to SDM. Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Radim Krčmář <rkrcmar@redhat.com> Cc: Jim Mattson <jmattson@google.com> Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-10-12KVM: x86: thoroughly disarm LAPIC timer around TSC deadline switchRadim Krčmář
Our routines look at tscdeadline and period when deciding state of a timer. The timer is disarmed when switching between TSC deadline and other modes, so we should set everything to disarmed state. Signed-off-by: Radim Krčmář <rkrcmar@redhat.com> Reviewed-by: Wanpeng Li <wanpeng.li@hotmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-10-12KVM: x86: really disarm lapic timer when clearing TMICTRadim Krčmář
preemption timer only looks at tscdeadline and could inject already disarmed timer. Signed-off-by: Radim Krčmář <rkrcmar@redhat.com> Reviewed-by: Wanpeng Li <wanpeng.li@hotmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-10-12KVM: x86: handle 0 write to TSC_DEADLINE MSRRadim Krčmář
0 should disable the timer, but start_hv_timer will recognize it as an expired timer instead. Signed-off-by: Radim Krčmář <rkrcmar@redhat.com> Reviewed-by: Wanpeng Li <wanpeng.li@hotmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-10-12kvm, mm: account kvm related kmem slabs to kmemcgShakeel Butt
The kvm slabs can consume a significant amount of system memory and indeed in our production environment we have observed that a lot of machines are spending significant amount of memory that can not be left as system memory overhead. Also the allocations from these slabs can be triggered directly by user space applications which has access to kvm and thus a buggy application can leak such memory. So, these caches should be accounted to kmemcg. Signed-off-by: Shakeel Butt <shakeelb@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-10-12KVM: VMX: rename RDSEED and RDRAND vmx ctrls to reflect exitingDavid Hildenbrand
Let's just name these according to the SDM. This should make it clearer that the are used to enable exiting and not the feature itself. Signed-off-by: David Hildenbrand <david@redhat.com> Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2017-10-12KVM: x86: allow setting identity map addr with no vcpus onlyDavid Hildenbrand
Changing it afterwards doesn't make too much sense and will only result in inconsistencies. Reviewed-by: Radim Krčmář <rkrcmar@redhat.com> Signed-off-by: David Hildenbrand <david@redhat.com> Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2017-10-12KVM: VMX: cleanup init_rmode_identity_map()David Hildenbrand
No need for another enable_ept check. kvm->arch.ept_identity_map_addr only has to be inititalized once. Having alloc_identity_pagetable() is overkill and dropping BUG_ONs is always nice. Reviewed-by: Radim Krčmář <rkrcmar@redhat.com> Signed-off-by: David Hildenbrand <david@redhat.com> Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2017-10-12KVM: nVMX: no need to set ept/vpid caps to 0David Hildenbrand
They are inititally 0, so no need to reset them to 0. Signed-off-by: David Hildenbrand <david@redhat.com> Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2017-10-12KVM: nVMX: no need to set vcpu->cpu when switching vmcsDavid Hildenbrand
vcpu->cpu is not cleared when doing a vmx_vcpu_put/load, so this can be dropped. Reviewed-by: Radim Krčmář <rkrcmar@redhat.com> Signed-off-by: David Hildenbrand <david@redhat.com> Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2017-10-12KVM: VMX: drop unnecessary function declarationsDavid Hildenbrand
Reviewed-by: Radim Krčmář <rkrcmar@redhat.com> Signed-off-by: David Hildenbrand <david@redhat.com> Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2017-10-12KVM: VMX: require INVEPT GLOBAL for EPTDavid Hildenbrand
Without this, we won't be able to do any flushes, so let's just require it. Should be absent in very strange configurations. Suggested-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: David Hildenbrand <david@redhat.com> Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2017-10-12KVM: VMX: call ept_sync_global() with enable_ept onlyDavid Hildenbrand
ept_* function should only be called with enable_ept being set. Reviewed-by: Radim Krčmář <rkrcmar@redhat.com> Signed-off-by: David Hildenbrand <david@redhat.com> Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2017-10-12KVM: VMX: drop enable_ept check from ept_sync_context()David Hildenbrand
This function is only called with enable_ept. Reviewed-by: Radim Krčmář <rkrcmar@redhat.com> Signed-off-by: David Hildenbrand <david@redhat.com> Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2017-10-12KVM: x86: no need to inititalize vcpu members to 0David Hildenbrand
vmx and svm use zalloc, so this is not necessary. Reviewed-by: Radim Krčmář <rkrcmar@redhat.com> Signed-off-by: David Hildenbrand <david@redhat.com> Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2017-10-12KVM: VMX: vmx_vcpu_setup() cannot failDavid Hildenbrand
Make it a void and drop error handling code. Reviewed-by: Radim Krčmář <rkrcmar@redhat.com> Signed-off-by: David Hildenbrand <david@redhat.com> Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2017-10-12KVM: x86: drop BUG_ON(vcpu->kvm)David Hildenbrand
And also get rid of that superfluous local variable "kvm". Reviewed-by: Radim Krčmář <rkrcmar@redhat.com> Signed-off-by: David Hildenbrand <david@redhat.com> Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2017-10-12KVM: x86: mmu: free_page can handle NULLDavid Hildenbrand
Reviewed-by: Radim Krčmář <rkrcmar@redhat.com> Signed-off-by: David Hildenbrand <david@redhat.com> Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2017-10-12KVM: x86: mmu: returning void in a void function is strangeDavid Hildenbrand
Let's just drop the return. Reviewed-by: Radim Krčmář <rkrcmar@redhat.com> Signed-off-by: David Hildenbrand <david@redhat.com> Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2017-10-12KVM: LAPIC: Apply change to TDCR right away to the timerWanpeng Li
The description in the Intel SDM of how the divide configuration register is used: "The APIC timer frequency will be the processor's bus clock or core crystal clock frequency divided by the value specified in the divide configuration register." Observation of baremetal shown that when the TDCR is change, the TMCCT does not change or make a big jump in value, but the rate at which it count down change. The patch update the emulation to APIC timer to so that a change to the divide configuration would be reflected in the value of the counter and when the next interrupt is triggered. Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Radim Krčmář <rkrcmar@redhat.com> Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com> [Fixed some whitespace and added a check for negative delta and running timer. - Radim] Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2017-10-12KVM: LAPIC: Keep timer running when switching between one-shot and periodic modeWanpeng Li
If we take TSC-deadline mode timer out of the picture, the Intel SDM does not say that the timer is disable when the timer mode is change, either from one-shot to periodic or vice versa. After this patch, the timer is no longer disarmed on change of mode, so the counter (TMCCT) keeps counting down. So what does a write to LVTT changes ? On baremetal, the change of mode is probably taken into account only when the counter reach 0. When this happen, LVTT is use to figure out if the counter should restard counting down from TMICT (so periodic mode) or stop counting (if one-shot mode). This patch is based on observation of the behavior of the APIC timer on baremetal as well as check that they does not go against the description written in the Intel SDM. Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Radim Krčmář <rkrcmar@redhat.com> Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com> [Fixed rate limiting of periodic timer.] Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2017-10-12KVM: LAPIC: Introduce limit_periodic_timer_frequencyWanpeng Li
Extract the logic of limit lapic periodic timer frequency to a new function, this function will be used by later patches. Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Radim Krčmář <rkrcmar@redhat.com> Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com> Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2017-10-12KVM: LAPIC: Fix lapic timer mode transitionWanpeng Li
SDM 10.5.4.1 TSC-Deadline Mode mentioned that "Transitioning between TSC-Deadline mode and other timer modes also disarms the timer". So the APIC Timer Initial Count Register for one-shot/periodic mode should be reset. This patch do it. Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Radim Krčmář <rkrcmar@redhat.com> Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com> [Removed unnecessary definition of APIC_LVT_TIMER_MASK.] Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2017-10-12KVM: VMX: Don't expose PLE enable if there is no hardware supportWanpeng Li
KVM doesn't expose the PLE capability to the L1 hypervisor, however, ple_window still shows the default value on L1 hypervisor. This patch fixes it by clearing all the PLE related module parameter if there is no PLE capability. Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Radim Krčmář <rkrcmar@redhat.com> Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2017-10-12KVM: nVMX: fix guest CR4 loading when emulating L2 to L1 exitHaozhong Zhang
When KVM emulates an exit from L2 to L1, it loads L1 CR4 into the guest CR4. Before this CR4 loading, the guest CR4 refers to L2 CR4. Because these two CR4's are in different levels of guest, we should vmx_set_cr4() rather than kvm_set_cr4() here. The latter, which is used to handle guest writes to its CR4, checks the guest change to CR4 and may fail if the change is invalid. The failure may cause trouble. Consider we start a L1 guest with non-zero L1 PCID in use, (i.e. L1 CR4.PCIDE == 1 && L1 CR3.PCID != 0) and a L2 guest with L2 PCID disabled, (i.e. L2 CR4.PCIDE == 0) and following events may happen: 1. If kvm_set_cr4() is used in load_vmcs12_host_state() to load L1 CR4 into guest CR4 (in VMCS01) for L2 to L1 exit, it will fail because of PCID check. As a result, the guest CR4 recorded in L0 KVM (i.e. vcpu->arch.cr4) is left to the value of L2 CR4. 2. Later, if L1 attempts to change its CR4, e.g., clearing VMXE bit, kvm_set_cr4() in L0 KVM will think L1 also wants to enable PCID, because the wrong L2 CR4 is used by L0 KVM as L1 CR4. As L1 CR3.PCID != 0, L0 KVM will inject GP to L1 guest. Fixes: 4704d0befb072 ("KVM: nVMX: Exiting from L2 to L1") Cc: qemu-stable@nongnu.org Signed-off-by: Haozhong Zhang <haozhong.zhang@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-10-12x86/apic/vector: Ignore set_affinity call for inactive interruptsThomas Gleixner
The core interrupt code can call the affinity setter for inactive interrupts under certain circumstances. For inactive intererupts which use managed or reservation mode this is a pointless exercise as the activation will assign a vector which fits the destination mask. Check for this and return w/o going through the vector assignment. Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2017-10-12Merge branch 'irq/urgent' into x86/apicThomas Gleixner
Pick up core changes which affect the vector rework.
2017-10-12x86/unwinder: Make CONFIG_UNWINDER_ORC=y the default in the 64-bit defconfigIngo Molnar
Increase testing coverage by turning on the primary x86 unwinder for the 64-bit defconfig. Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-10-11x86/mm: Disable various instrumentations of mm/mem_encrypt.c and mm/tlb.cTom Lendacky
Some routines in mem_encrypt.c are called very early in the boot process, e.g. sme_enable(). When CONFIG_KCOV=y is defined the resulting code added to sme_enable() (and others) for KCOV instrumentation results in a kernel crash. Disable the KCOV instrumentation for mem_encrypt.c by adding KCOV_INSTRUMENT_mem_encrypt.o := n to arch/x86/mm/Makefile. In order to avoid other possible early boot issues, model mem_encrypt.c after head64.c in regards to tools. In addition to disabling KCOV as stated above and a previous patch that disables branch profiling, also remove the "-pg" CFLAG if CONFIG_FUNCTION_TRACER is enabled and set KASAN_SANITIZE to "n", each of which are done on a file basis. Reported-by: kernel test robot <lkp@01.org> Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com> Reviewed-by: Borislav Petkov <bp@suse.de> Cc: Borislav Petkov <bp@alien8.de> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/20171010194504.18887.38053.stgit@tlendack-t1.amdoffice.net Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-10-11x86/boot: Remove unnecessary #include <generated/utsrelease.h>Masahiro Yamada
The <generated/utsrelease.h> defines UTS_RELEASE, but I do not see any reference to it in arch/x86/boot/header.S. Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/1505921232-8960-1-git-send-email-yamada.masahiro@socionext.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-10-10KVM: MMU: always terminate page walks at level 1Ladi Prosek
is_last_gpte() is not equivalent to the pseudo-code given in commit 6bb69c9b69c31 ("KVM: MMU: simplify last_pte_bitmap") because an incorrect value of last_nonleaf_level may override the result even if level == 1. It is critical for is_last_gpte() to return true on level == 1 to terminate page walks. Otherwise memory corruption may occur as level is used as an index to various data structures throughout the page walking code. Even though the actual bug would be wherever the MMU is initialized (as in the previous patch), be defensive and ensure here that is_last_gpte() returns the correct value. This patch is also enough to fix CVE-2017-12188. Fixes: 6bb69c9b69c315200ddc2bc79aee14c0184cf5b2 Cc: stable@vger.kernel.org Cc: Andy Honig <ahonig@google.com> Signed-off-by: Ladi Prosek <lprosek@redhat.com> [Panic if walk_addr_generic gets an incorrect level; this is a serious bug and it's not worth a WARN_ON where the recovery path might hide further exploitable issues; suggested by Andrew Honig. - Paolo] Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-10-10KVM: nVMX: update last_nonleaf_level when initializing nested EPTLadi Prosek
The function updates context->root_level but didn't call update_last_nonleaf_level so the previous and potentially wrong value was used for page walks. For example, a zero value of last_nonleaf_level would allow a potential out-of-bounds access in arch/x86/mmu/paging_tmpl.h's walk_addr_generic function (CVE-2017-12188). Fixes: 155a97a3d7c78b46cef6f1a973c831bc5a4f82bb Signed-off-by: Ladi Prosek <lprosek@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-10-10xen/vcpu: Use a unified name about cpu hotplug state for pv and pvhvmZhenzhong Duan
As xen_cpuhp_setup is called by PV and PVHVM, the name of "x86/xen/hvm_guest" is confusing. Signed-off-by: Zhenzhong Duan <zhenzhong.duan@oracle.com> Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
2017-10-10x86/hyperv: Fix hypercalls with extended CPU ranges for TLB flushingMarcelo Henrique Cerri
Do not consider the fixed size of hv_vp_set when passing the variable header size to hv_do_rep_hypercall(). The Hyper-V hypervisor specification states that for a hypercall with a variable header only the size of the variable portion should be supplied via the input control. For HVCALL_FLUSH_VIRTUAL_ADDRESS_SPACE_EX/LIST_EX calls that means the fixed portion of hv_vp_set should not be considered. That fixes random failures of some applications that are unexpectedly killed with SIGBUS or SIGSEGV. Signed-off-by: Marcelo Henrique Cerri <marcelo.cerri@canonical.com> Cc: Dexuan Cui <decui@microsoft.com> Cc: Haiyang Zhang <haiyangz@microsoft.com> Cc: Jork Loeser <Jork.Loeser@microsoft.com> Cc: Josh Poulson <jopoulso@microsoft.com> Cc: K. Y. Srinivasan <kys@microsoft.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Simon Xiao <sixiao@microsoft.com> Cc: Stephen Hemminger <sthemmin@microsoft.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vitaly Kuznetsov <vkuznets@redhat.com> Cc: devel@linuxdriverproject.org Fixes: 628f54cc6451 ("x86/hyper-v: Support extended CPU ranges for TLB flush hypercalls") Link: http://lkml.kernel.org/r/1507210469-29065-1-git-send-email-marcelo.cerri@canonical.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-10-10x86/hyperv: Don't use percpu areas for pcpu_flush/pcpu_flush_ex structuresVitaly Kuznetsov
hv_do_hypercall() does virt_to_phys() translation and with some configs (CONFIG_SLAB) this doesn't work for percpu areas, we pass wrong memory to hypervisor and get #GP. We could use working slow_virt_to_phys() instead but doing so kills the performance. Move pcpu_flush/pcpu_flush_ex structures out of percpu areas and allocate memory on first call. The additional level of indirection gives us a small performance penalty, in future we may consider introducing hypercall functions which avoid virt_to_phys() conversion and cache physical addresses of pcpu_flush/pcpu_flush_ex structures somewhere. Reported-by: Simon Xiao <sixiao@microsoft.com> Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Cc: Dexuan Cui <decui@microsoft.com> Cc: Haiyang Zhang <haiyangz@microsoft.com> Cc: Jork Loeser <Jork.Loeser@microsoft.com> Cc: K. Y. Srinivasan <kys@microsoft.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Stephen Hemminger <sthemmin@microsoft.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: devel@linuxdriverproject.org Link: http://lkml.kernel.org/r/20171005113924.28021-1-vkuznets@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-10-10x86/hyperv: Clear vCPU banks between calls to avoid flushing unneeded vCPUsVitaly Kuznetsov
hv_flush_pcpu_ex structures are not cleared between calls for performance reasons (they're variable size up to PAGE_SIZE each) but we must clear hv_vp_set.bank_contents part of it to avoid flushing unneeded vCPUs. The rest of the structure is formed correctly. To do the clearing in an efficient way stash the maximum possible vCPU number (this may differ from Linux CPU id). Reported-by: Jork Loeser <Jork.Loeser@microsoft.com> Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Cc: Dexuan Cui <decui@microsoft.com> Cc: Haiyang Zhang <haiyangz@microsoft.com> Cc: K. Y. Srinivasan <kys@microsoft.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Stephen Hemminger <sthemmin@microsoft.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: devel@linuxdriverproject.org Link: http://lkml.kernel.org/r/20171006154854.18092-1-vkuznets@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-10-10perf/x86/intel/uncore: Fix memory leaks on allocation failuresColin Ian King
Currently if an allocation fails then the error return paths don't free up any currently allocated pmus[].boxes and pmus causing a memory leak. Add an error clean up exit path that frees these objects. Detected by CoverityScan, CID#711632 ("Resource Leak") Signed-off-by: Colin Ian King <colin.king@canonical.com> Acked-by: Peter Zijlstra <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: kernel-janitors@vger.kernel.org Fixes: 087bfbb03269 ("perf/x86: Add generic Intel uncore PMU support") Link: http://lkml.kernel.org/r/20171009172655.6132-1-colin.king@canonical.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-10-10x86/unwind: Disable unwinder warnings on 32-bitJosh Poimboeuf
x86-32 doesn't have stack validation, so in most cases it doesn't make sense to warn about bad frame pointers. Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Byungchul Park <byungchul.park@lge.com> Cc: Fengguang Wu <fengguang.wu@intel.com> Cc: LKP <lkp@01.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/a69658760800bf281e6353248c23e0fa0acf5230.1507597785.git.jpoimboe@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-10-10x86/unwind: Align stack pointer in unwinder dumpJosh Poimboeuf
When printing the unwinder dump, the stack pointer could be unaligned, for one of two reasons: - stack corruption; or - GCC created an unaligned stack. There's no way for the unwinder to tell the difference between the two, so we have to assume one or the other. GCC unaligned stacks are very rare, and have only been spotted before GCC 5. Presumably, if we're doing an unwinder stack dump, stack corruption is more likely than a GCC unaligned stack. So always align the stack before starting the dump. Reported-and-tested-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Reported-and-tested-by: Fengguang Wu <fengguang.wu@intel.com> Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Byungchul Park <byungchul.park@lge.com> Cc: LKP <lkp@01.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/2f540c515946ab09ed267e1a1d6421202a0cce08.1507597785.git.jpoimboe@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-10-10x86/unwind: Use MSB for frame pointer encoding on 32-bitJosh Poimboeuf
On x86-32, Tetsuo Handa and Fengguang Wu reported unwinder warnings like: WARNING: kernel stack regs at f60bb9c8 in swapper:1 has bad 'bp' value 0ba00000 And also there were some stack dumps with a bunch of unreliable '?' symbols after an apic_timer_interrupt symbol, meaning the unwinder got confused when it tried to read the regs. The cause of those issues is that, with GCC 4.8 (and possibly older), there are cases where GCC misaligns the stack pointer in a leaf function for no apparent reason: c124a388 <acpi_rs_move_data>: c124a388: 55 push %ebp c124a389: 89 e5 mov %esp,%ebp c124a38b: 57 push %edi c124a38c: 56 push %esi c124a38d: 89 d6 mov %edx,%esi c124a38f: 53 push %ebx c124a390: 31 db xor %ebx,%ebx c124a392: 83 ec 03 sub $0x3,%esp ... c124a3e3: 83 c4 03 add $0x3,%esp c124a3e6: 5b pop %ebx c124a3e7: 5e pop %esi c124a3e8: 5f pop %edi c124a3e9: 5d pop %ebp c124a3ea: c3 ret If an interrupt occurs in such a function, the regs on the stack will be unaligned, which breaks the frame pointer encoding assumption. So on 32-bit, use the MSB instead of the LSB to encode the regs. This isn't an issue on 64-bit, because interrupts align the stack before writing to it. Reported-and-tested-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Reported-and-tested-by: Fengguang Wu <fengguang.wu@intel.com> Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Byungchul Park <byungchul.park@lge.com> Cc: LKP <lkp@01.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/279a26996a482ca716605c7dbc7f2db9d8d91e81.1507597785.git.jpoimboe@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-10-10x86/unwind: Fix dereference of untrusted pointerJosh Poimboeuf
Tetsuo Handa and Fengguang Wu reported a panic in the unwinder: BUG: unable to handle kernel NULL pointer dereference at 000001f2 IP: update_stack_state+0xd4/0x340 *pde = 00000000 Oops: 0000 [#1] PREEMPT SMP CPU: 0 PID: 18728 Comm: 01-cpu-hotplug Not tainted 4.13.0-rc4-00170-gb09be67 #592 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.9.3-20161025_171302-gandalf 04/01/2014 task: bb0b53c0 task.stack: bb3ac000 EIP: update_stack_state+0xd4/0x340 EFLAGS: 00010002 CPU: 0 EAX: 0000a570 EBX: bb3adccb ECX: 0000f401 EDX: 0000a570 ESI: 00000001 EDI: 000001ba EBP: bb3adc6b ESP: bb3adc3f DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068 CR0: 80050033 CR2: 000001f2 CR3: 0b3a7000 CR4: 00140690 DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000 DR6: fffe0ff0 DR7: 00000400 Call Trace: ? unwind_next_frame+0xea/0x400 ? __unwind_start+0xf5/0x180 ? __save_stack_trace+0x81/0x160 ? save_stack_trace+0x20/0x30 ? __lock_acquire+0xfa5/0x12f0 ? lock_acquire+0x1c2/0x230 ? tick_periodic+0x3a/0xf0 ? _raw_spin_lock+0x42/0x50 ? tick_periodic+0x3a/0xf0 ? tick_periodic+0x3a/0xf0 ? debug_smp_processor_id+0x12/0x20 ? tick_handle_periodic+0x23/0xc0 ? local_apic_timer_interrupt+0x63/0x70 ? smp_trace_apic_timer_interrupt+0x235/0x6a0 ? trace_apic_timer_interrupt+0x37/0x3c ? strrchr+0x23/0x50 Code: 0f 95 c1 89 c7 89 45 e4 0f b6 c1 89 c6 89 45 dc 8b 04 85 98 cb 74 bc 88 4d e3 89 45 f0 83 c0 01 84 c9 89 04 b5 98 cb 74 bc 74 3b <8b> 47 38 8b 57 34 c6 43 1d 01 25 00 00 02 00 83 e2 03 09 d0 83 EIP: update_stack_state+0xd4/0x340 SS:ESP: 0068:bb3adc3f CR2: 00000000000001f2 ---[ end trace 0d147fd4aba8ff50 ]--- Kernel panic - not syncing: Fatal exception in interrupt On x86-32, after decoding a frame pointer to get a regs address, regs_size() dereferences the regs pointer when it checks regs->cs to see if the regs are user mode. This is dangerous because it's possible that what looks like a decoded frame pointer is actually a corrupt value, and we don't want the unwinder to make things worse. Instead of calling regs_size() on an unsafe pointer, just assume they're kernel regs to start with. Later, once it's safe to access the regs, we can do the user mode check and corresponding safety check for the remaining two regs. Reported-and-tested-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Reported-and-tested-by: Fengguang Wu <fengguang.wu@intel.com> Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Byungchul Park <byungchul.park@lge.com> Cc: LKP <lkp@01.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Fixes: 5ed8d8bb38c5 ("x86/unwind: Move common code into update_stack_state()") Link: http://lkml.kernel.org/r/7f95b9a6993dec7674b3f3ab3dcd3294f7b9644d.1507597785.git.jpoimboe@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-10-10locking/arch: Remove dummy arch_{read,spin,write}_lock_flags() implementationsWill Deacon
The arch_{read,spin,write}_lock_flags() macros are simply mapped to the non-flags versions by the majority of architectures, so do this in core code and remove the dummy implementations. Also remove the implementation in spinlock_up.h, since all callers of do_raw_spin_lock_flags() call local_irq_save(flags) anyway. Signed-off-by: Will Deacon <will.deacon@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: paulmck@linux.vnet.ibm.com Link: http://lkml.kernel.org/r/1507055129-12300-4-git-send-email-will.deacon@arm.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-10-10locking/arch: Remove dummy arch_{read,spin,write}_relax() implementationsWill Deacon
arch_{read,spin,write}_relax() are defined as cpu_relax() by the core code, so architectures that can't do better (i.e. most of them) don't need to bother with the dummy definitions. Signed-off-by: Will Deacon <will.deacon@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: paulmck@linux.vnet.ibm.com Link: http://lkml.kernel.org/r/1507055129-12300-3-git-send-email-will.deacon@arm.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-10-10locking/arch, x86: Add __down_read_killable()Kirill Tkhai
Similar to __down_write_killable(), add read killable primitive: extract current __down_read() code to macros and teach it to get different functions as slow_path argument: store ax register to ret, and add sp register and preserve its value. Add call_rwsem_down_read_failed_killable() assembly entry similar to call_rwsem_down_read_failed(): push dx register to stack in additional to common registers, as it's not declarated as modifiable in ____down_read(). Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: arnd@arndb.de Cc: avagin@virtuozzo.com Cc: davem@davemloft.net Cc: fenghua.yu@intel.com Cc: gorcunov@virtuozzo.com Cc: heiko.carstens@de.ibm.com Cc: hpa@zytor.com Cc: ink@jurassic.park.msu.ru Cc: mattst88@gmail.com Cc: rientjes@google.com Cc: rth@twiddle.net Cc: schwidefsky@de.ibm.com Cc: tony.luck@intel.com Cc: viro@zeniv.linux.org.uk Link: http://lkml.kernel.org/r/150670118802.23930.1316107715255410256.stgit@localhost.localdomain Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-10-10locking/spinlocks, paravirt, xen: Correct the xen_nopvspin caseJuergen Gross
With the boot parameter "xen_nopvspin" specified a Xen guest should not make use of paravirt spinlocks, but behave as if running on bare metal. This is not true, however, as the qspinlock code will fall back to a test-and-set scheme when it is detecting a hypervisor. In order to avoid this disable the virt_spin_lock_key. Signed-off-by: Juergen Gross <jgross@suse.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Waiman Long <longman@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: akataria@vmware.com Cc: boris.ostrovsky@oracle.com Cc: chrisw@sous-sol.org Cc: hpa@zytor.com Cc: jeremy@goop.org Cc: rusty@rustcorp.com.au Cc: virtualization@lists.linux-foundation.org Cc: xen-devel@lists.xenproject.org Link: http://lkml.kernel.org/r/20170906173625.18158-3-jgross@suse.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-10-10locking/paravirt: Use new static key for controlling call of virt_spin_lock()Juergen Gross
There are cases where a guest tries to switch spinlocks to bare metal behavior (e.g. by setting "xen_nopvspin" boot parameter). Today this has the downside of falling back to unfair test and set scheme for qspinlocks due to virt_spin_lock() detecting the virtualized environment. Add a static key controlling whether virt_spin_lock() should be called or not. When running on bare metal set the new key to false. Signed-off-by: Juergen Gross <jgross@suse.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Waiman Long <longman@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: akataria@vmware.com Cc: boris.ostrovsky@oracle.com Cc: chrisw@sous-sol.org Cc: hpa@zytor.com Cc: jeremy@goop.org Cc: rusty@rustcorp.com.au Cc: virtualization@lists.linux-foundation.org Cc: xen-devel@lists.xenproject.org Link: http://lkml.kernel.org/r/20170906173625.18158-2-jgross@suse.com Signed-off-by: Ingo Molnar <mingo@kernel.org>