path: root/arch/x86
Age  Commit message  Author
2015-09-17  arch/x86/intel-mid: Use kmemdup rather than duplicating its implementation  (Andrzej Hajda)
The patch was generated using fixed coccinelle semantic patch scripts/coccinelle/api/memdup.cocci [1]. [1]: http://permalink.gmane.org/gmane.linux.kernel/2014320 Signed-off-by: Andrzej Hajda <a.hajda@samsung.com> Cc: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com> Cc: Marek Szyprowski <m.szyprowski@samsung.com> Cc: Julia Lawall <Julia.Lawall@lip6.fr> Link: http://lkml.kernel.org/r/1438934377-4922-9-git-send-email-a.hajda@samsung.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
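The pattern that memdup.cocci replaces looks roughly like this; a minimal sketch, not the actual intel-mid hunk:

    #include <linux/slab.h>
    #include <linux/string.h>

    /* Before: open-coded allocate-and-copy of a buffer. */
    void *dup_open_coded(const void *src, size_t len)
    {
            void *p = kmalloc(len, GFP_KERNEL);

            if (!p)
                    return NULL;
            memcpy(p, src, len);
            return p;
    }

    /* After: kmemdup() collapses both steps into one call. */
    void *dup_with_kmemdup(const void *src, size_t len)
    {
            return kmemdup(src, len, GFP_KERNEL);
    }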
2015-09-17  Merge branch 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip  (Linus Torvalds)
Pull x86 fixes from Ingo Molnar:
 - misc fixes all around the map
 - block non-root vm86(old) if mmap_min_addr != 0
 - two small debuggability improvements
 - removal of obsolete paravirt op

* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/platform: Fix Geode LX timekeeping in the generic x86 build
  x86/apic: Serialize LVTT and TSC_DEADLINE writes
  x86/ioapic: Force affinity setting in setup_ioapic_dest()
  x86/paravirt: Remove the unused pv_time_ops::get_tsc_khz method
  x86/ldt: Fix small LDT allocation for Xen
  x86/vm86: Fix the misleading CONFIG_VM86 Kconfig help text
  x86/cpu: Print family/model/stepping in hex
  x86/vm86: Block non-root vm86(old) if mmap_min_addr != 0
  x86/alternatives: Make optimize_nops() interrupt safe and synced
  x86/mm/srat: Print non-volatile flag in SRAT
  x86/cpufeatures: Enable cpuid for Intel SHA extensions
2015-09-17  Merge branch 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip  (Linus Torvalds)
Pull perf fixes from Ingo Molnar:
 "Mostly tooling fixes, but also two x86 PMU driver fixes"

* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  perf tests: Fix software clock events test setting maps
  perf tests: Fix task exit test setting maps
  perf evlist: Fix create_syswide_maps() not propagating maps
  perf evlist: Fix add() not propagating maps
  perf evlist: Factor out a function to propagate maps for a single evsel
  perf evlist: Make create_maps() use set_maps()
  perf evlist: Make set_maps() more resilient
  perf evsel: Add own_cpus member
  perf evlist: Fix missing thread_map__put in propagate_maps()
  perf evlist: Fix splice_list_tail() not setting evlist
  perf evlist: Add has_user_cpus member
  perf evlist: Remove redundant validation from propagate_maps()
  perf evlist: Simplify set_maps() logic
  perf evlist: Simplify propagate_maps() logic
  perf top: Fix segfault pressing -> with no hist entries
  perf header: Fixup reading of HEADER_NRCPUS feature
  perf/x86/intel: Fix constraint access
  perf/x86/intel/bts: Set event->hw.itrace_started in pmu::start to match the new logic
  perf tools: Fix use of wrong event when processing exit events
  perf tools: Fix parse_events_add_pmu caller
2015-09-17  Merge branch 'locking-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip  (Linus Torvalds)
Pull locking fixes from Ingo Molnar:
 "Spinlock performance regression fix, plus documentation fixes"

* 'locking-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  locking/static_keys: Fix up the static keys documentation
  locking/qspinlock/x86: Only emit the test-and-set fallback when building guest support
  locking/qspinlock/x86: Fix performance regression under unaccelerated VMs
  locking/static_keys: Fix a silly typo
2015-09-17  x86/pci/dma: Fix gfp flags for coherent DMA memory allocation  (Junichi Nomura)
Commit 6894258eda2f reversed the order of gfp_flags adjustment in dma_alloc_attrs() for x86 [arch/x86/kernel/pci-dma.c]. As a result, relevant flags set by dma_alloc_coherent_gfp_flags() are simply discarded, causing coherent DMA memory allocation failures on some devices. Fixes: 6894258eda2f ("dma-mapping: consolidate dma_{alloc,free}_{attrs,coherent}") Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com> Tested-by: Tony Luck <tony.luck@intel.com> Acked-by: Christoph Hellwig <hch@lst.de> Link: http://lkml.kernel.org/r/20150914073834.GA13077@xzibit.linux.bs1.fc.nec.co.jp Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2015-09-16  x86/platform: Fix Geode LX timekeeping in the generic x86 build  (David Woodhouse)
In 2007, commit 07190a08eef36 ("Mark TSC on GeodeLX reliable") bypassed verification of the TSC on Geode LX. However, this code (now in the check_system_tsc_reliable() function in arch/x86/kernel/tsc.c) was only present if CONFIG_MGEODE_LX was set. OpenWRT has recently started building its generic Geode target for Geode GX, not LX, to include support for additional platforms. This broke the timekeeping on LX-based devices, because the TSC wasn't marked as reliable: https://dev.openwrt.org/ticket/20531 By adding a runtime check on is_geode_lx(), we can also include the fix if CONFIG_MGEODEGX1 or CONFIG_X86_GENERIC are set, thus fixing the problem. Signed-off-by: David Woodhouse <David.Woodhouse@intel.com> Cc: Andres Salomon <dilinger@queued.net> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Marcelo Tosatti <marcelo@kvack.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: stable@vger.kernel.org Link: http://lkml.kernel.org/r/1442409003.131189.87.camel@infradead.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
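The shape of the fix, as a simplified sketch of check_system_tsc_reliable() in arch/x86/kernel/tsc.c (details of the surrounding setup are omitted):

    #include <asm/geode.h>

    static void __init check_system_tsc_reliable(void)
    {
    #if defined(CONFIG_MGEODEGX1) || defined(CONFIG_MGEODE_LX) || \
        defined(CONFIG_X86_GENERIC)
            /*
             * The Geode LX TSC is known reliable; decide at runtime
             * instead of only when CONFIG_MGEODE_LX is the build target.
             */
            if (is_geode_lx())
                    tsc_clocksource_reliable = 1;
    #endif
            if (boot_cpu_has(X86_FEATURE_TSC_RELIABLE))
                    tsc_clocksource_reliable = 1;
    }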
2015-09-16  genirq: Remove irq argument from irq flow handlers  (Thomas Gleixner)
Most interrupt flow handlers do not use the irq argument. Those few which use it can retrieve the irq number from the irq descriptor. Remove the argument. Search and replace was done with coccinelle and some extra helper scripts around it. Thanks to Julia for her help! Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Julia Lawall <Julia.Lawall@lip6.fr> Cc: Jiang Liu <jiang.liu@linux.intel.com>
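The signature change, sketched with a hypothetical handler (irq_desc_get_irq() is the accessor the few remaining users call):

    #include <linux/irq.h>
    #include <linux/irqdesc.h>

    /*
     * Old flow-handler type: void (*)(unsigned int irq, struct irq_desc *desc)
     * New flow-handler type: void (*)(struct irq_desc *desc)
     */
    static void example_flow_handler(struct irq_desc *desc)
    {
            /* the rare handler that needs the number recovers it: */
            unsigned int irq = irq_desc_get_irq(desc);

            pr_debug("handling irq %u\n", irq);
            /* ... demux or ack as before ... */
    }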
2015-09-16  genirq: Move field 'affinity' from irq_data into irq_common_data  (Jiang Liu)
Irq affinity mask is per-irq instead of per irqchip, so move it into struct irq_common_data. Signed-off-by: Jiang Liu <jiang.liu@linux.intel.com> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Cc: Tony Luck <tony.luck@intel.com> Cc: Bjorn Helgaas <bhelgaas@google.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Yinghai Lu <yinghai@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Jason Cooper <jason@lakedaemon.net> Cc: Kevin Cernekee <cernekee@gmail.com> Cc: Arnd Bergmann <arnd@arndb.de> Link: http://lkml.kernel.org/r/1433303281-27688-1-git-send-email-jiang.liu@linux.intel.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2015-09-16  KVM: vmx: fix VPID is 0000H in non-root operation  (Wanpeng Li)
Reference SDM 28.1: The current VPID is 0000H in the following situations:
 - Outside VMX operation. (This includes operation in system-management mode under the default treatment of SMIs and SMM with VMX operation; see Section 34.14.)
 - In VMX root operation.
 - In VMX non-root operation when the “enable VPID” VM-execution control is 0.

The VPID should never be 0000H in non-root operation when "enable VPID" VM-execution control is 1. However, commit 34a1cd60 ("kvm: x86: vmx: move some vmx setting from vmx_init() to hardware_setup()") removed the code which reserved 0000H for VMX root operation. This patch fixes that by again reserving 0000H for VMX root operation. Cc: stable@vger.kernel.org # 3.19+ Fixes: 34a1cd60d17f62c1f077c1478a6c2ca8c3d17af4 Reported-by: Wincy Van <fanwenyi0529@gmail.com> Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
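The restored reservation, sketched after arch/x86/kvm/vmx.c (the helper name is illustrative; the bitmap and its locking belong to the existing VPID allocator):

    static DECLARE_BITMAP(vmx_vpid_bitmap, VMX_NR_VPIDS);

    static void reserve_root_vpid(void)
    {
            /*
             * VPID 0000H is what the hardware uses in VMX root
             * operation, so the guest allocator must never hand it out.
             */
            set_bit(0, vmx_vpid_bitmap);
    }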
2015-09-16  KVM: add halt_attempted_poll to VCPU stats  (Paolo Bonzini)
This new statistic can help diagnose VCPUs that, for any reason, trigger bad behavior of halt_poll_ns autotuning. For example, say halt_poll_ns = 480000, and wakeups are spaced at exactly 479us, 481us, 479us, 481us. Then KVM always fails polling and wastes 10+20+40+80+160+320+480 = 1110 microseconds out of every 479+481+479+481+479+481+479 = 3359 microseconds. The VCPU then is consuming about 30% more CPU than it would use without polling. This would show up as an abnormally high number of attempted polls compared to successful polls. Acked-by: Christian Borntraeger <borntraeger@de.ibm.com> Reviewed-by: David Matlack <dmatlack@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-09-15  PCI: Revert "PCI: Call pci_read_bridge_bases() from core instead of arch code"  (Bjorn Helgaas)
Revert dff22d2054b5 ("PCI: Call pci_read_bridge_bases() from core instead of arch code"). Reading PCI bridge windows is not arch-specific in itself, but there is PCI core code that doesn't work correctly if we read them too early. For example, Hannes found this case on an ARM Freescale i.mx6 board:

 pci_bus 0000:00: root bus resource [mem 0x01000000-0x01efffff]
 pci 0000:00:00.0: PCI bridge to [bus 01-ff]
 pci 0000:00:00.0: BAR 8: no space for [mem size 0x01000000] (mem window)
 pci 0000:01:00.0: BAR 2: failed to assign [mem size 0x00200000]
 pci 0000:01:00.0: BAR 1: failed to assign [mem size 0x00004000]
 pci 0000:01:00.0: BAR 0: failed to assign [mem size 0x00000100]

The 00:00.0 mem window needs to be at least 3MB: the 01:00.0 device needs 0x204100 of space, and mem windows are megabyte-aligned. Bus sizing can increase a bridge window size, but never *decrease* it (see d65245c3297a ("PCI: don't shrink bridge resources")). Prior to dff22d2054b5, ARM didn't read bridge windows at all, so the "original size" was zero, and we assigned a 3MB window. After dff22d2054b5, we read the bridge windows before sizing the bus. The firmware programmed a 16MB window (size 0x01000000) in 00:00.0, and since we never decrease the size, we kept 16MB even though we only needed 3MB. But 16MB doesn't fit in the host bridge aperture, so we failed to assign space for the window and the downstream devices. I think this is a defect in the PCI core: we shouldn't rely on the firmware to assign sensible windows. Ray reported a similar problem, also on ARM, with Broadcom iProc. Issues like this are too hard to fix right now, so revert dff22d2054b5. Reported-by: Hannes <oe5hpm@gmail.com> Reported-by: Ray Jui <rjui@broadcom.com> Link: http://lkml.kernel.org/r/CAAa04yFQEUJm7Jj1qMT57-LG7ZGtnhNDBe=PpSRa70Mj+XhW-A@mail.gmail.com Link: http://lkml.kernel.org/r/55F75BB8.4070405@broadcom.com Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Acked-by: Yinghai Lu <yinghai@kernel.org> Acked-by: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com>
2015-09-15  x86/fpu/math-emu: Remove !NO_UNDOC_CODE  (Denys Vlasenko)
We always want to support all FPU opcodes, including undocumented ones. That define was fully justified ~20 years ago but not today. Let's not complicate the code with it. Signed-off-by: Denys Vlasenko <dvlasenk@redhat.com> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Borislav Petkov <bp@alien8.de> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Kees Cook <keescook@chromium.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/1440699330-1305-1-git-send-email-dvlasenk@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-09-14  x86/apic: Serialize LVTT and TSC_DEADLINE writes  (Shaohua Li)
The APIC LVTT register is MMIO mapped but the TSC_DEADLINE register is an MSR. The write to the TSC_DEADLINE MSR is not serializing, so it's not guaranteed that the write to LVTT has reached the APIC before the TSC_DEADLINE MSR is written. In such a case the write to the MSR is ignored and as a consequence the local timer interrupt never fires. The SDM describes this issue for xAPIC and x2APIC modes. The serialization methods recommended by the SDM differ.

xAPIC:
 "1. Memory-mapped write to LVT Timer Register, setting bits 18:17 to 10b.
  2. WRMSR to the IA32_TSC_DEADLINE MSR a value much larger than current time-stamp counter.
  3. If RDMSR of the IA32_TSC_DEADLINE MSR returns zero, go to step 2.
  4. WRMSR to the IA32_TSC_DEADLINE MSR the desired deadline."

x2APIC:
 "To allow for efficient access to the APIC registers in x2APIC mode, the serializing semantics of WRMSR are relaxed when writing to the APIC registers. Thus, system software should not use 'WRMSR to APIC registers in x2APIC mode' as a serializing instruction. Read and write accesses to the APIC registers will occur in program order. A WRMSR to an APIC register may complete before all preceding stores are globally visible; software can prevent this by inserting a serializing instruction, an SFENCE, or an MFENCE before the WRMSR."

The xAPIC method is to just wait for the memory mapped write to hit the LVTT by checking whether the MSR write has reached the hardware. There is no reason why a proper MFENCE after the memory mapped write would not do the same. Andi Kleen confirmed that MFENCE is sufficient for the xAPIC case as well. Issue MFENCE before writing to the TSC_DEADLINE MSR. This can be done unconditionally as all CPUs which have TSC_DEADLINE also have MFENCE support. [ tglx: Massaged the changelog ] Signed-off-by: Shaohua Li <shli@fb.com> Reviewed-by: Ingo Molnar <mingo@kernel.org> Cc: <Kernel-team@fb.com> Cc: <lenb@kernel.org> Cc: <fenghua.yu@intel.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: stable@vger.kernel.org #v3.7+ Link: http://lkml.kernel.org/r/20150909041352.GA2059853@devbig257.prn2.facebook.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
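The core of the fix, as a simplified sketch of the LVTT setup path (the helper name is illustrative; the real code lives in __setup_APIC_LVTT() in arch/x86/kernel/apic/apic.c):

    #include <asm/apic.h>

    static void setup_lvtt_deadline(unsigned int lvtt_value)
    {
            apic_write(APIC_LVTT, lvtt_value);

            if (lvtt_value & APIC_LVT_TIMER_TSCDEADLINE) {
                    /*
                     * The MMIO write to LVTT and a later WRMSR to
                     * TSC_DEADLINE are not ordered against each other;
                     * fence so the MSR write cannot overtake the LVTT
                     * setup and get ignored.
                     */
                    asm volatile("mfence" : : : "memory");
            }
    }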
2015-09-14  x86/ioapic: Force affinity setting in setup_ioapic_dest()  (Thomas Gleixner)
The recent ioapic cleanups changed the affinity setting in setup_ioapic_dest() from a direct write to the hardware to the delayed affinity setup via irq_set_affinity(). That results in a warning from chained_irq_exit():

 WARNING: CPU: 0 PID: 5 at kernel/irq/migration.c:32 irq_move_masked_irq
 [<ffffffff810a0a88>] irq_move_masked_irq+0xb8/0xc0
 [<ffffffff8103c161>] ioapic_ack_level+0x111/0x130
 [<ffffffff812bbfe8>] intel_gpio_irq_handler+0x148/0x1c0

The reason is that irq_set_affinity() does not write directly to the hardware. It marks the affinity setting as pending and executes it from the next interrupt. The chained handler infrastructure does not take the irq descriptor lock for performance reasons because such a chained interrupt is not visible to any interfaces. So the delayed affinity setting triggers the warning in irq_move_masked_irq(). Restore the old behaviour by calling the set_affinity function of the ioapic chip in setup_ioapic_dest(). This is safe as none of the interrupts can be on the fly at this point. Fixes: aa5cb97f14a2 'x86/irq: Remove x86_io_apic_ops.set_affinity and related interfaces' Reported-and-tested-by: Mika Westerberg <mika.westerberg@linux.intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Jiang Liu <jiang.liu@linux.intel.com> Cc: jarkko.nikula@linux.intel.com
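The shape of the restored behaviour, sketched (the wrapper is hypothetical; in the real patch the calls sit inline in setup_ioapic_dest()):

    #include <linux/irq.h>

    static void force_ioapic_affinity(unsigned int irq,
                                      const struct cpumask *mask)
    {
            struct irq_data *idata = irq_get_irq_data(irq);
            struct irq_chip *chip = irq_data_get_irq_chip(idata);

            /*
             * Call the chip callback directly instead of
             * irq_set_affinity(), which would only mark the change
             * pending. Safe here because no interrupts can be in
             * flight during setup_ioapic_dest().
             */
            if (chip->irq_set_affinity)
                    chip->irq_set_affinity(idata, mask, false);
    }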
2015-09-14  x86/platform/uv: Implement simple dump failover if kdump fails  (Mike Travis)
The ability to trigger a kdump using the system NMI command was added by:

 commit 12ba6c990fab ("x86/UV: Add kdump to UV NMI handler")
 Author: Mike Travis <travis@sgi.com>
 Date: Mon Sep 23 16:25:03 2013 -0500

This is useful because when kdump is working the information gathered is more informative than the original per CPU stack traces or "dump" option. However a number of things can go wrong with kdump and then the stack traces are more useful than nothing. The two most common reasons for kdump to not be available are: 1) a problem occurs during boot before the kdump service is started, or 2) the kdump daemon failed to start. In either case the call to crash_kexec() returns unexpectedly. When this happens uv_nmi_kdump() also sets the uv_nmi_kexec_failed flag which causes the slave CPUs to also return to the NMI handler. Upon this unexpected return, the NMI handler reverts to the "dump" action which uses show_regs() to obtain a process trace dump for all the CPUs. Other minor changes: The "dump" action now generates both the show_regs() stack trace and the instruction pointer information, whereas the "ips" action only shows instruction pointers for non-idle CPUs. This is more like an abbreviated "ps" display. Change printk(KERN_DEFAULT...) --> pr_info() Signed-off-by: Mike Travis <travis@sgi.com> Signed-off-by: George Beshers <gbeshers@sgi.com> Cc: Alex Thorlton <athorlton@sgi.com> Cc: Dimitri Sivanich <sivanich@sgi.com> Cc: Hedi Berriche <hedi@sgi.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Russ Anderson <rja@sgi.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-09-14  x86/paravirt: Remove the unused pv_time_ops::get_tsc_khz method  (Juergen Gross)
It's not used anywhere. Signed-off-by: Juergen Gross <jgross@suse.com> Acked-by: Rusty Russell <rusty@rustcorp.com.au> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: akataria@vmware.com Cc: chrisw@sous-sol.org Cc: jeremy@goop.org Cc: virtualization@lists.linux-foundation.org Link: http://lkml.kernel.org/r/1442227343-403-1-git-send-email-jgross@suse.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-09-14  x86/fpu: Check CPU-provided sizes against struct declarations  (Dave Hansen)
We now have C structures defined for each of the XSAVE state components that we support. This patch adds checks during our verification pass to ensure that the CPU-provided data enumerated in CPUID leaves matches our C structures. If not, we warn and dump all the XSAVE CPUID leaves. Note: this *actually* found an inconsistency with the MPX 'bndcsr' state. The hardware pads it out differently from our C structures. This patch caught it and warned. Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: Fenghua Yu <fenghua.yu@intel.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Tim Chen <tim.c.chen@linux.intel.com> Cc: dave@sr71.net Cc: linux-kernel@vger.kernel.org Link: http://lkml.kernel.org/r/20150902233131.A8DB36DA@viggo.jf.intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-09-14  x86/fpu: Check to ensure increasing-offset xstate offsets  (Dave Hansen)
The xstate CPUID leaves enumerate where each state component is inside the XSAVE buffer, along with the size of the entire buffer. Our new XSAVE sanity-checking code extrapolates an expected _total_ buffer size by looking at the last component that it encounters. That method requires that the highest-numbered component also be the one with the highest offset. This is a pretty safe assumption, but let's add some code to ensure it stays true. To make this check work correctly, we also need to ensure we only consider the offsets from enabled features because the offset register (ebx) will return 0 on unsupported features. This also means that we will preserve the -1's that we initialized xstate_offsets/sizes[] with. That will help find bugs. Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: Fenghua Yu <fenghua.yu@intel.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Tim Chen <tim.c.chen@linux.intel.com> Cc: dave@sr71.net Cc: linux-kernel@vger.kernel.org Link: http://lkml.kernel.org/r/20150902233130.0843AB15@viggo.jf.intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
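A minimal sketch of the ordering check (the real code in arch/x86/kernel/fpu/xstate.c differs in detail):

    static int last_good_offset;

    static void check_xstate_offset(int xfeature_nr)
    {
            u32 eax, ebx, ecx, edx;

            /* ebx reads back 0 for disabled features; skip them */
            if (!xfeature_enabled(xfeature_nr))
                    return;

            cpuid_count(XSTATE_CPUID, xfeature_nr, &eax, &ebx, &ecx, &edx);

            /* ebx is this component's offset in the XSAVE buffer */
            WARN_ONCE(ebx < last_good_offset,
                      "x86/fpu: misordered xstate component %d at offset %u\n",
                      xfeature_nr, ebx);
            last_good_offset = ebx;
    }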
2015-09-14  x86/fpu: Correct and check XSAVE xstate size calculations  (Dave Hansen)
Note: our xsaves support is currently broken and disabled. This patch does not fix it, but it is an incremental improvement. This might be useful to someone backporting the entire set of XSAVES patches at some point, but it should not be backported alone. Ingo said he wanted something like this (bullets 2 and 3): http://lkml.kernel.org/r/20150808091508.GB32641@gmail.com There are currently two xsave buffer formats: standard and compacted. The standard format is what 'XSAVE' and 'XSAVEOPT' produce, while 'XSAVES' and 'XSAVEC' produce a compacted-format buffer. (The kernel never uses XSAVEC.) But, the XSAVES buffer *ALSO* contains "system state components" which are never saved by a plain XSAVE. So, XSAVES has two things that might make its buffer differently-sized from an XSAVE-produced one. The current code assumes that an XSAVES buffer's size is simply the sum of the sizes of the (user) states which are supported. This seems to work in most cases, but it is not consistent with what the SDM says, and it breaks if we 'align' a component in the buffer. The calculation is also unnecessary work since the CPU *tells* us the size of the buffer directly. This patch just reads the size of the buffer right out of the CPUID leaf instead of trying to derive it. But, blindly trusting the CPU like this is dangerous. We add a verification pass in do_extra_xstate_size_checks() to ensure that the size we calculate matches what we see from the hardware. When it comes down to it, we trust but verify the CPU. Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: Fenghua Yu <fenghua.yu@intel.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Tim Chen <tim.c.chen@linux.intel.com> Cc: dave@sr71.net Cc: linux-kernel@vger.kernel.org Link: http://lkml.kernel.org/r/20150902233130.234FE1EC@viggo.jf.intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
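Reading the size straight from the hardware amounts to this sketch; with ECX=1, CPUID leaf 0xD reports in EBX the size of the buffer for all components enabled in XCR0 | IA32_XSS:

    static unsigned int get_xsaves_size(void)
    {
            unsigned int eax, ebx, ecx, edx;

            cpuid_count(XSTATE_CPUID, 1, &eax, &ebx, &ecx, &edx);
            return ebx;     /* compacted-format buffer size in bytes */
    }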
2015-09-14  x86/fpu: Add C structures for AVX-512 state components  (Dave Hansen)
AVX-512 has 3 separate state components:
 1. opmask registers
 2. zmm upper half of registers 0-15
 3. new zmm registers (16-31)

This patch adds C structures for the three components along with a few comments mostly lifted from the SDM to explain what they do. This will allow us to check our structures against what the hardware tells us about the sizes of the components. Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: Fenghua Yu <fenghua.yu@intel.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Tim Chen <tim.c.chen@linux.intel.com> Cc: dave@sr71.net Cc: linux-kernel@vger.kernel.org Link: http://lkml.kernel.org/r/20150902233129.F2433B98@viggo.jf.intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
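The three components map onto C types of roughly this shape (abridged; the reg_*_bit container types are sized byte-array helpers like the reg_128_bit introduced for YMM):

    struct reg_256_bit { u8 regbytes[256/8]; } __packed;
    struct reg_512_bit { u8 regbytes[512/8]; } __packed;

    /* 1. the eight 64-bit opmask registers k0-k7 */
    struct avx_512_opmask_state {
            u64 opmask_reg[8];
    } __packed;

    /* 2. upper 256 bits of ZMM0-ZMM15, the halves not covered by YMM */
    struct avx_512_zmm_uppers_state {
            struct reg_256_bit zmm_upper[16];
    } __packed;

    /* 3. the sixteen new full-width registers ZMM16-ZMM31 */
    struct avx_512_hi16_state {
            struct reg_512_bit hi16_zmm[16];
    } __packed;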
2015-09-14  x86/fpu: Rework YMM definition  (Dave Hansen)
We are about to rework all of the "extended state" definitions. This makes the 'ymm' naming consistent with the AVX-512 types we will introduce later. We also add a convenience type: "reg_128_bit" so that we do not have to spell out our arithmetic. Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: Fenghua Yu <fenghua.yu@intel.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Tim Chen <tim.c.chen@linux.intel.com> Cc: dave@sr71.net Cc: linux-kernel@vger.kernel.org Link: http://lkml.kernel.org/r/20150902233129.B4EB045F@viggo.jf.intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-09-14  x86/fpu/mpx: Rework MPX 'xstate' types  (Dave Hansen)
MPX includes two separate "extended state components". There is no real need to have an 'mpx_struct' because we never really manage the states together. We also separate out the actual data in 'mpx_bndcsr_state' from the padding. We will shortly be checking the state sizes against our structures and need them to match. For consistency, we also prefix these types with 'mpx_'. Lastly, we add some comments to mirror some of the descriptions in the Intel documents (SDM) of the various state components. Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: Fenghua Yu <fenghua.yu@intel.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Tim Chen <tim.c.chen@linux.intel.com> Cc: dave@sr71.net Cc: linux-kernel@vger.kernel.org Link: http://lkml.kernel.org/r/20150902233129.384B73EB@viggo.jf.intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-09-14  x86/fpu: Add xfeature_enabled() helper instead of test_bit()  (Dave Hansen)
We currently use test_bit() in a few places to see if an xfeature is enabled. It ends up being a bit ugly because 'xfeatures_mask' is a u64 and test_bit() wants an 'unsigned long', so it requires a cast. The *_bit() functions are also technically atomic, which we have no need for here. So, remove the test_bit()s and replace them with the new xfeature_enabled() helper. This also provides a central place to add a comment about the future need to support 'system xstates'. Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: Fenghua Yu <fenghua.yu@intel.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Tim Chen <tim.c.chen@linux.intel.com> Cc: dave@sr71.net Cc: linux-kernel@vger.kernel.org Link: http://lkml.kernel.org/r/20150902233129.B1534F86@viggo.jf.intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
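The helper itself is a one-liner; a sketch, assuming the u64 xfeatures_mask global from xstate.c:

    static bool xfeature_enabled(enum xfeature xfeature)
    {
            /* no cast and no (pointless) atomicity, unlike test_bit() */
            return !!(xfeatures_mask & BIT_ULL(xfeature));
    }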
2015-09-14  x86/fpu: Remove 'xfeature_nr'  (Dave Hansen)
xfeature_nr ended up being initialized too late for me to use it in the "xsave size sanity check" patch which is later in the series. I tried to move around its initialization but realized that it was just as easy to get rid of it. We only have 9 XFEATURES. Instead of dynamically calculating and storing the last feature, just use the compile-time max: XFEATURES_NR_MAX. Note that even with 'xfeatures_nr' we could have had "holes" in the xfeatures_mask that we had to deal with. We also change a 'leaf' variable to be a plain 'i'. Although it is used to grab a cpuid leaf in this one loop, all of the other loops just use an 'i' and I find it much more obvious to keep the naming consistent across all the similar loops. Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: Fenghua Yu <fenghua.yu@intel.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Tim Chen <tim.c.chen@linux.intel.com> Cc: dave@sr71.net Cc: linux-kernel@vger.kernel.org Link: http://lkml.kernel.org/r/20150902233128.3F30DF5A@viggo.jf.intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-09-14  x86/fpu: Rework XSTATE_* macros to remove magic '2'  (Dave Hansen)
The 'xstate.c' code has a bunch of references to '2'. This is because we have a lot more work to do for the "extended" xstates than the "legacy" ones and state component 2 is the first "extended" state. This patch replaces all of the instances of '2' with FIRST_EXTENDED_XFEATURE, which clearly explains what is going on. Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: Fenghua Yu <fenghua.yu@intel.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Tim Chen <tim.c.chen@linux.intel.com> Cc: dave@sr71.net Cc: linux-kernel@vger.kernel.org Link: http://lkml.kernel.org/r/20150902233128.A8C0BF51@viggo.jf.intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-09-14  x86/fpu: Rename XFEATURES_NR_MAX  (Dave Hansen)
This is a logical follow-on to the last patch. It makes the XFEATURE_MAX naming consistent with the other enum values. This is what Ingo suggested. Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: Fenghua Yu <fenghua.yu@intel.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Tim Chen <tim.c.chen@linux.intel.com> Cc: dave@sr71.net Cc: linux-kernel@vger.kernel.org Link: http://lkml.kernel.org/r/20150902233127.A541448F@viggo.jf.intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-09-14  x86/fpu: Rename XSAVE macros  (Dave Hansen)
There are two concepts that have some confusing naming:
 1. Extended State Component numbers (currently called XFEATURE_BIT_*)
 2. Extended State Component masks (currently called XSTATE_*)

The numbers are (currently) from 0-9. State component 3 is the bounds registers for MPX, for instance. But when we want to enable "state component 3", we go set a bit in XCR0. The bit we set is 1<<3. We can check to see if a state component feature is enabled by looking at its bit. The current 'xfeature_bit's are at best xfeature bit _numbers_. Calling them bits is at best inconsistent with ending the enum list with 'XFEATURES_NR_MAX'. This patch renames the enum to be 'xfeature'. These also happen to be what the Intel documentation calls a "state component". We also want to differentiate these from the "XSTATE_*" macros. The "XSTATE_*" macros are a mask, and we rename them to match. These macros are reasonably widely used so this patch is a wee bit big, but this really is just a rename. The only non-mechanical part of this is the s/XSTATE_EXTEND_MASK/XFEATURE_MASK_EXTEND/ rename. We need a better name for it, but that's another patch. Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: Fenghua Yu <fenghua.yu@intel.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Tim Chen <tim.c.chen@linux.intel.com> Cc: dave@sr71.net Cc: linux-kernel@vger.kernel.org Link: http://lkml.kernel.org/r/20150902233126.38653250@viggo.jf.intel.com [ Ported to v4.3-rc1. ] Signed-off-by: Ingo Molnar <mingo@kernel.org>
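Abridged, the renamed definitions pair each component number with a mask derived from it:

    enum xfeature {
            XFEATURE_FP,
            XFEATURE_SSE,
            XFEATURE_YMM,
            XFEATURE_BNDREGS,
            XFEATURE_BNDCSR,
            XFEATURE_OPMASK,
            XFEATURE_ZMM_Hi256,
            XFEATURE_Hi16_ZMM,

            XFEATURE_MAX,
    };

    /* masks: the bit actually set in XCR0 for each component */
    #define XFEATURE_MASK_FP        (1 << XFEATURE_FP)
    #define XFEATURE_MASK_SSE       (1 << XFEATURE_SSE)
    #define XFEATURE_MASK_BNDREGS   (1 << XFEATURE_BNDREGS)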
2015-09-14  x86/ldt: Fix small LDT allocation for Xen  (Jan Beulich)
While the following commit: 37868fe113 ("x86/ldt: Make modify_ldt synchronous") added a nice comment explaining that Xen needs page-aligned whole page chunks for guest descriptor tables, it then nevertheless used kzalloc() on the small size path. As I'm unaware of guarantees for kmalloc(PAGE_SIZE, ) to return page-aligned memory blocks, I believe this needs to be switched back to __get_free_page() (or better get_zeroed_page()). Signed-off-by: Jan Beulich <jbeulich@suse.com> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Andy Lutomirski <luto@kernel.org> Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: David Vrabel <david.vrabel@citrix.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/55E735D6020000780009F1E6@prv-mh.provo.novell.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
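A sketch of the corrected small-size path in alloc_ldt_struct() (simplified):

    if (alloc_size > PAGE_SIZE)
            new_ldt->entries = vzalloc(alloc_size);
    else
            /*
             * kzalloc(PAGE_SIZE, ...) carries no alignment guarantee;
             * get_zeroed_page() hands back a whole, page-aligned page
             * as Xen requires for guest descriptor tables.
             */
            new_ldt->entries = (void *)get_zeroed_page(GFP_KERNEL);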
2015-09-14  x86/fpu: Remove partial LWP support definitions  (Dave Hansen)
LightWeight Profiling was evidently an AMD profiling feature that we never got around to implementing. Remove the references to it. Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: Fenghua Yu <fenghua.yu@intel.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Tim Chen <tim.c.chen@linux.intel.com> Cc: dave@sr71.net Cc: linux-kernel@vger.kernel.org Link: http://lkml.kernel.org/r/20150902233125.7E602284@viggo.jf.intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-09-14  x86/fpu: Remove XSTATE_RESERVE  (Dave Hansen)
The original purpose of XSTATE_RESERVE was to carve out space to store all of the possible extended state components that get saved with the XSAVE instruction(s). However, we are now almost entirely dynamically allocating the buffers we use for XSAVE by placing them at the end of the task_struct and sizing them at boot. The one exception to that is the init_task. The maximum extended state component size that we have today is on systems with space for AVX-512 and Memory Protection Keys: 2696 bytes. We have reserved a PAGE_SIZE buffer in the init_task via fpregs_state->__padding. This check ensures that even if the component sizes or layout were changed (which we do not expect), we will still not overflow the init_task's buffer. In the case that we detect we might overflow the buffer, we completely disable XSAVE support in the kernel and try to boot as if we had 'legacy x87 FPU' support in place. This is a crippled state without any of the XSAVE-enabled features (MPX, AVX, etc...). But, it at least lets us boot safely. Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: Fenghua Yu <fenghua.yu@intel.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Tim Chen <tim.c.chen@linux.intel.com> Cc: dave@sr71.net Cc: linux-kernel@vger.kernel.org Link: http://lkml.kernel.org/r/20150902233125.D948D475@viggo.jf.intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-09-14  x86/fpu: Move XSAVE-disabling code to a helper  (Dave Hansen)
When we want to _completely_ disable XSAVE support as far as the kernel is concerned, we have a big set of feature flags to clear. We currently only do this in cases where the user asks for it to be disabled, but we are about to expand the places where we do it to handle errors too. Move the code into xstate.c, and declare it in the xstate.h header. We will use it in the next patch too. Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: Fenghua Yu <fenghua.yu@intel.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Tim Chen <tim.c.chen@linux.intel.com> Cc: dave@sr71.net Cc: linux-kernel@vger.kernel.org Link: http://lkml.kernel.org/r/20150902233124.EA9A70E5@viggo.jf.intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-09-14  x86/fpu: Print xfeature buffer size in decimal  (Dave Hansen)
This is utterly a personal taste thing, but I find it way easier to read structure sizes in decimal than in hex. Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: Fenghua Yu <fenghua.yu@intel.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Tim Chen <tim.c.chen@linux.intel.com> Cc: dave@sr71.net Cc: linux-kernel@vger.kernel.org Link: http://lkml.kernel.org/r/20150902233124.1A8B04A8@viggo.jf.intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-09-14  x86/headers: Clean up too long lines  (Peter Zijlstra)
Signed-off-by: Peter Zijlstra <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: bp@alien8.de Cc: brgerst@gmail.com Cc: dvlasenk@redhat.com Cc: luto@amacapital.net Cc: mikko.rapeli@iki.fi Cc: oleg@redhat.com Link: http://lkml.kernel.org/r/20150909071244.GM3644@twins.programming.kicks-ass.net Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-09-14  x86/vm86: Fix the misleading CONFIG_VM86 Kconfig help text  (Ingo Molnar)
The CONFIG_VM86 Kconfig help text is actively misleading, so fix it:
 - Don't mark it 'obsolete' in the text as we'll support the ABI as long as CPUs support it.
 - Qualify the part about software emulation and mention that for some apps you want a real vm86 mode.
 - Don't scare users away from the option, instead explain what it does.

Reported-by: Stas Sergeev <stsp@list.ru> Cc: Andy Lutomirski <luto@kernel.org> Cc: Arjan van de Ven <arjan@linux.intel.com> Cc: Austin S Hemmelgarn <ahferroin7@gmail.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Josh Boyer <jwboyer@fedoraproject.org> Cc: Kees Cook <keescook@chromium.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Matthew Garrett <mjg59@srcf.ucam.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-09-13  perf/core: Drop PERF_EVENT_TXN  (Sukadev Bhattiprolu)
We currently use PERF_EVENT_TXN flag to determine if we are in the middle of a transaction. If in a transaction, we defer the schedulability checks from pmu->add() operation to the pmu->commit() operation. Now that we have "transaction types" (PERF_PMU_TXN_ADD, PERF_PMU_TXN_READ) we can use the type to determine if we are in a transaction and drop the PERF_EVENT_TXN flag. When PERF_EVENT_TXN is dropped, the cpuhw->group_flag on some architectures becomes unused, so drop that field as well. This is an extension of the Powerpc patch from Peter Zijlstra to s390, Sparc and x86 architectures. Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Stephane Eranian <eranian@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vince Weaver <vincent.weaver@maine.edu> Link: http://lkml.kernel.org/r/1441336073-22750-11-git-send-email-sukadev@linux.vnet.ibm.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-09-13  perf/core: Add a 'flags' parameter to the PMU transactional interfaces  (Sukadev Bhattiprolu)
Currently, the PMU interface allows reading only one counter at a time. But some PMUs, like the 24x7 counters in Power, support reading several counters at once. To leverage this functionality, extend the transaction interface to support a "transaction type". The first type, PERF_PMU_TXN_ADD, refers to the existing transactions, i.e. used to _schedule_ all the events on the PMU as a group. A second transaction type, PERF_PMU_TXN_READ, will be used in a follow-on patch, by the 24x7 counters to read several counters at once. Extend the transaction interfaces to the PMU to accept a 'txn_flags' parameter and use this parameter to ignore any transactions that are not of type PERF_PMU_TXN_ADD. Thanks to Peter Zijlstra for his input. Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com> [peterz: s390 compile fix] Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Michael Ellerman <mpe@ellerman.id.au> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Stephane Eranian <eranian@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vince Weaver <vincent.weaver@maine.edu> Link: http://lkml.kernel.org/r/1441336073-22750-3-git-send-email-sukadev@linux.vnet.ibm.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-09-13  perf/x86/intel/pt: Fix KVM warning due to doing rdmsr() before the CPUID test  (Huaitong Han)
If KVM does not support INTEL_PT, guest MSR_IA32_RTIT_CTL reading will produce a host warning like "kvm [2469]: vcpu0 unhandled rdmsr: 0x570". The guest can determine whether the CPU supports Intel PT from CPUID, so a test_cpu_cap() check is added before the rdmsr; this is also more in line with the code style. Signed-off-by: Huaitong Han <huaitong.han@intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Stephane Eranian <eranian@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vince Weaver <vincent.weaver@maine.edu> Cc: acme@kernel.org Link: http://lkml.kernel.org/r/1441009262-9792-1-git-send-email-huaitong.han@intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
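The guard amounts to checking the capability before touching the MSR; a sketch with a hypothetical function name:

    static int pt_check_hw(void)
    {
            u64 ctl;

            /*
             * On a hypervisor that does not virtualize PT, reading
             * MSR_IA32_RTIT_CTL triggers "unhandled rdmsr" warnings,
             * so consult the CPUID-derived capability first.
             */
            if (!test_cpu_cap(&boot_cpu_data, X86_FEATURE_INTEL_PT))
                    return -ENODEV;

            rdmsrl(MSR_IA32_RTIT_CTL, ctl);
            /* ... inspect ctl ... */
            return 0;
    }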
2015-09-13  perf/x86/intel: Fix LBR callstack issue caused by FREEZE_LBRS_ON_PMI  (Kan Liang)
This patch fixes an issue introduced by commit 1a78d93750bb5f61abdc59a91fc3bd06a214542a ("perf/x86/intel: Streamline LBR MSR handling in PMI"). The old patch not only avoids writing the LBR_SELECT MSR in the PMI, but also avoids updating the lbr_select variable. So in the PMI, the FREEZE_LBRS_ON_PMI bit is always mistakenly set in the IA32_DEBUGCTLMSR MSR, which causes superfluous increases/decreases of LBR_TOS when collecting the LBR callstack. Reported-by: Milian Wolff <mail@milianw.de> Signed-off-by: Kan Liang <kan.liang@intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Stephane Eranian <eranian@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vince Weaver <vincent.weaver@maine.edu> Link: http://lkml.kernel.org/r/1439815051-8616-1-git-send-email-kan.liang@intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-09-13  perf/x86/intel/bts: Disallow use by unprivileged users on paranoid systems  (Alexander Shishkin)
BTS leaks kernel addresses even in userspace-only mode due to imprecise IP sampling, so sometimes syscall entry points or page fault handler addresses end up in a userspace trace. Now that the intel_bts driver exports trace data zero-copy, it does not scan through the data to filter out kernel addresses, and doing so would be an O(n) job. To work around this situation, this patch forbids the use of the intel_bts driver by unprivileged users on systems with the paranoid setting above the (kernel's) default "1", which still allows kernel profiling. In other words, using the intel_bts driver implies kernel tracing, regardless of the "exclude_kernel" attribute setting. Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Arnaldo Carvalho de Melo <acme@infradead.org> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Stephane Eranian <eranian@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vince Weaver <vincent.weaver@maine.edu> Cc: hpa@zytor.com Link: http://lkml.kernel.org/r/1441030168-6853-3-git-send-email-alexander.shishkin@linux.intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-09-13  perf/x86/intel/ds: Work around BTS leaking kernel addresses  (Alexander Shishkin)
BTS leaks kernel addresses even in userspace-only mode due to imprecise IP sampling, so sometimes syscall entry points or page fault handler addresses end up in a userspace trace. Since this driver uses a relatively small buffer for BTS records and it has to iterate through them anyway, it can also take on the additional job of filtering out the records that contain kernel addresses when kernel space tracing is not enabled. This patch changes the bts code to skip the offending records from perf output. In order to request the exact amount of space on the ring buffer, we need to do an extra pass through the records to know how many there are of the valid ones, but considering the small size of the buffer, this extra pass adds very little overhead to the nmi handler. This way we won't end up with awkward IP samples with zero IPs in the perf stream. Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Arnaldo Carvalho de Melo <acme@infradead.org> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Stephane Eranian <eranian@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vince Weaver <vincent.weaver@maine.edu> Cc: hpa@zytor.com Link: http://lkml.kernel.org/r/1441030168-6853-2-git-send-email-alexander.shishkin@linux.intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-09-13  perf/x86: Improve accuracy of perf/sched clock  (Adrian Hunter)
When the TSC is stable, the perf/sched clock is based on it. However the conversion from cycles to nanoseconds is not as accurate as it could be: because CYC2NS_SCALE_FACTOR is 10, the accuracy is +/- 1/2048. The change is to calculate the maximum shift that results in a multiplier that is still a 32-bit number. For example, all frequencies over 1 GHz will have a shift of 32, making the accuracy of the conversion +/- 1/(2^33). That is achieved by using the 'clocks_calc_mult_shift()' function. Signed-off-by: Adrian Hunter <adrian.hunter@intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Stephane Eranian <eranian@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vince Weaver <vincent.weaver@maine.edu> Link: http://lkml.kernel.org/r/1440147918-22250-1-git-send-email-adrian.hunter@intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
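The conversion then becomes ns = (cycles * mult) >> shift; a sketch with a hypothetical helper:

    #include <linux/clocksource.h>
    #include <linux/math64.h>

    static u64 tsc_cycles_to_ns(u64 cycles, unsigned long tsc_khz)
    {
            u32 mult, shift;

            /*
             * Pick the largest shift whose multiplier still fits in
             * 32 bits; for TSCs above 1 GHz this yields shift == 32,
             * i.e. an accuracy of +/- 1/(2^33).
             */
            clocks_calc_mult_shift(&mult, &shift,
                                   tsc_khz * 1000, NSEC_PER_SEC, 0);

            return mul_u64_u32_shr(cycles, mult, shift);
    }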
2015-09-13  Merge branch 'perf/urgent' into perf/core, to pick up fixes before applying new changes  (Ingo Molnar)
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-09-13  Merge tag 'v4.3-rc1' into perf/core, to refresh the tree  (Ingo Molnar)
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-09-13  perf/x86/intel: Fix constraint access  (Peter Zijlstra)
Sasha reported that we can get here with .idx==-1, and cpuc->event_constraints unallocated. Suggested-by: Stephane Eranian <eranian@google.com> Reported-by: Sasha Levin <sasha.levin@oracle.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: <stable@vger.kernel.org> Fixes: b371b5943178 ("perf/x86: Fix event/group validation") Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-09-13  x86/cpu: Print family/model/stepping in hex  (Borislav Petkov)
924e101a7ab6 ("x86/debug: Dump family, model, stepping of the boot CPU") had good intentions: dump the exact F/M/S as an aid during debugging sessions. But its output can be ambiguous. Fix that:

 -smpboot: CPU0: Intel Core Processor (Broadwell) (fam: 06, model: 47, stepping: 02)
 +smpboot: CPU0: Intel Core Processor (Broadwell) (family: 0x6, model: 0x47, stepping: 0x2)

Also, spell out "family". Signed-off-by: Borislav Petkov <bp@suse.de> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/1441914927-32037-1-git-send-email-bp@alien8.de Signed-off-by: Ingo Molnar <mingo@kernel.org>
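The change boils down to the printk format; a sketch, using the field names of that era (x86_mask held the stepping):

    static void print_cpu_fms(struct cpuinfo_x86 *c)
    {
            /* was: "(fam: %02d, model: %02d, stepping: %02d)" */
            pr_cont("(family: 0x%x, model: 0x%x, stepping: 0x%x)\n",
                    c->x86, c->x86_model, c->x86_mask);
    }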
2015-09-13  x86/platform/uv: Insert per_cpu accessor function on uv_hub_nmi  (George Beshers)
UV: NMI: insert this_cpu_read accessor function on uv_hub_nmi. On SGI UV systems a 'power nmi' command from the CMC causes all processors to drop into uv_handle_nmi(). With the 4.0 kernel this results in:

 BUG: unable to handle kernel paging request

The bug is caused by the current code trying to use the PER_CPU variable uv_cpu_nmi.hub without an appropriate accessor function. That oversight occurred in:

 commit e16321709c82 ("uv: Replace __get_cpu_var")
 Author: Christoph Lameter <cl@linux.com>
 Date: Sun Aug 17 12:30:41 2014 -0500

This patch inserts this_cpu_read() in the uv_hub_nmi macro, restoring the intended functionality. Signed-off-by: George Beshers <gbeshers@sgi.com> Acked-by: Mike Travis <travis@sgi.com> Cc: Alex Thorlton <athorlton@sgi.com> Cc: Dimitri Sivanich <sivanich@sgi.com> Cc: Hedi Berriche <hedi@sgi.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Russ Anderson <rja@sgi.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-09-11  sys_membarrier(): system-wide memory barrier (generic, x86)  (Mathieu Desnoyers)
Here is an implementation of a new system call, sys_membarrier(), which executes a memory barrier on all threads running on the system. It is implemented by calling synchronize_sched(). It can be used to distribute the cost of user-space memory barriers asymmetrically by transforming pairs of memory barriers into pairs consisting of sys_membarrier() and a compiler barrier. For synchronization primitives that distinguish between read-side and write-side (e.g. userspace RCU [1], rwlocks), the read-side can be accelerated significantly by moving the bulk of the memory barrier overhead to the write-side. The existing applications of which I am aware that would be improved by this system call are as follows:

* Through the Userspace RCU library (http://urcu.so)
 - DNS server (Knot DNS) https://www.knot-dns.cz/
 - Network sniffer (http://netsniff-ng.org/)
 - Distributed object storage (https://sheepdog.github.io/sheepdog/)
 - User-space tracing (http://lttng.org)
 - Network storage system (https://www.gluster.org/)
 - Virtual routers (https://events.linuxfoundation.org/sites/events/files/slides/DPDK_RCU_0MQ.pdf)
 - Financial software (https://lkml.org/lkml/2015/3/23/189)

Those projects use RCU in userspace to increase read-side speed and scalability compared to locking. Especially in the case of RCU used by libraries, sys_membarrier can speed up the read-side by moving the bulk of the memory barrier cost to synchronize_rcu().

* Direct users of sys_membarrier
 - core dotnet garbage collector (https://github.com/dotnet/coreclr/issues/198)

Microsoft core dotnet GC developers are planning to use the mprotect() side-effect of issuing memory barriers through IPIs as a way to implement Windows FlushProcessWriteBuffers() on Linux. They are referring to sys_membarrier in their github thread, specifically stating that sys_membarrier() is what they are looking for. To explain the benefit of this scheme, let's introduce two example threads:

 Thread A (non-frequent, e.g. executing liburcu synchronize_rcu())
 Thread B (frequent, e.g. executing liburcu rcu_read_lock()/rcu_read_unlock())

In a scheme where all smp_mb() in thread A are ordering memory accesses with respect to smp_mb() present in Thread B, we can change each smp_mb() within Thread A into calls to sys_membarrier() and each smp_mb() within Thread B into compiler barriers "barrier()". Before the change, we had, for each smp_mb() pair:

 Thread A                   Thread B
 previous mem accesses      previous mem accesses
 smp_mb()                   smp_mb()
 following mem accesses     following mem accesses

After the change, these pairs become:

 Thread A                   Thread B
 prev mem accesses          prev mem accesses
 sys_membarrier()           barrier()
 follow mem accesses        follow mem accesses

As we can see, there are two possible scenarios: either Thread B memory accesses do not happen concurrently with Thread A accesses (1), or they do (2).

1) Non-concurrent Thread A vs Thread B accesses:

 Thread A                   Thread B
 prev mem accesses
 sys_membarrier()
 follow mem accesses
                            prev mem accesses
                            barrier()
                            follow mem accesses

In this case, thread B accesses will be weakly ordered. This is OK, because at that point, thread A is not particularly interested in ordering them with respect to its own accesses.

2) Concurrent Thread A vs Thread B accesses:

 Thread A                   Thread B
 prev mem accesses          prev mem accesses
 sys_membarrier()           barrier()
 follow mem accesses        follow mem accesses

In this case, thread B accesses, which are ensured to be in program order thanks to the compiler barrier, will be "upgraded" to full smp_mb() by synchronize_sched().
* Benchmarks

On Intel Xeon E5405 (8 cores), with one thread calling sys_membarrier and the other 7 threads busy looping: 1000 non-expedited sys_membarrier calls in 33s = 33 milliseconds/call.

* User-space user of this system call: Userspace RCU library

Both the signal-based and the sys_membarrier userspace RCU schemes permit us to remove the memory barrier from the userspace RCU rcu_read_lock() and rcu_read_unlock() primitives, thus significantly accelerating them. These memory barriers are replaced by compiler barriers on the read-side, and all matching memory barriers on the write-side are turned into an invocation of a memory barrier on all active threads in the process. By letting the kernel perform this synchronization rather than dumbly sending a signal to every process thread (as we currently do), we diminish the number of unnecessary wake ups and only issue the memory barriers on active threads. Non-running threads do not need to execute such a barrier anyway, because these are implied by the scheduler context switches.

Results in liburcu (operations in 10s, 6 readers, 2 writers):

 memory barriers in reader:   1701557485 reads, 2202847 writes
 signal-based scheme:         9830061167 reads,    6700 writes
 sys_membarrier:              9952759104 reads,     425 writes
 sys_membarrier (dyn. check): 7970328887 reads,     425 writes

The dynamic sys_membarrier availability check adds some overhead to the read-side compared to the signal-based scheme, but besides that, sys_membarrier slightly outperforms the signal-based scheme. However, this non-expedited sys_membarrier implementation has a much slower grace period than signal and memory barrier schemes. Besides diminishing the number of wake-ups, one major advantage of the membarrier system call over the signal-based scheme is that it does not need to reserve a signal. This plays much more nicely with libraries, and with processes injected into for tracing purposes, for which we cannot expect that signals will be unused by the application. An expedited version of this system call can be added later on to speed up the grace period. Its implementation will likely depend on reading the cpu_curr()->mm without holding each CPU's rq lock. This patch adds the system call to x86 and to asm-generic. [1] http://urcu.so

membarrier(2) man page:

MEMBARRIER(2)              Linux Programmer's Manual             MEMBARRIER(2)

NAME
       membarrier - issue memory barriers on a set of threads

SYNOPSIS
       #include <linux/membarrier.h>

       int membarrier(int cmd, int flags);

DESCRIPTION
       The cmd argument is one of the following:

       MEMBARRIER_CMD_QUERY
              Query the set of supported commands. It returns a bitmask of
              supported commands.

       MEMBARRIER_CMD_SHARED
              Execute a memory barrier on all threads running on the system.
              Upon return from system call, the caller thread is ensured that
              all running threads have passed through a state where all
              memory accesses to user-space addresses match program order
              between entry to and return from the system call (non-running
              threads are de facto in such a state). This covers threads
              from all processes running on the system. This command
              returns 0.

       The flags argument needs to be 0. For future extensions.

       All memory accesses performed in program order from each targeted
       thread is guaranteed to be ordered with respect to sys_membarrier().
If we use the semantic "barrier()" to represent a compiler barrier forcing memory accesses to be performed in program order across the barrier, and smp_mb() to represent explicit memory barriers forcing full memory ordering across the barrier, we have the following ordering table for each pair of barrier(), sys_membarrier() and smp_mb(). The pair ordering is detailed as (O: ordered, X: not ordered):

                   barrier()   smp_mb()   sys_membarrier()
 barrier()             X           X              O
 smp_mb()              X           O              O
 sys_membarrier()      O           O              O

RETURN VALUE
       On success, these system calls return zero. On error, -1 is returned,
       and errno is set appropriately. For a given command, with flags
       argument set to 0, this system call is guaranteed to always return
       the same value until reboot.

ERRORS
       ENOSYS System call is not implemented.

       EINVAL Invalid arguments.

Linux                            2015-04-15                      MEMBARRIER(2)

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Reviewed-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Reviewed-by: Josh Triplett <josh@joshtriplett.org> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Nicholas Miell <nmiell@comcast.net> Cc: Ingo Molnar <mingo@redhat.com> Cc: Alan Cox <gnomes@lxorguk.ukuu.org.uk> Cc: Lai Jiangshan <laijs@cn.fujitsu.com> Cc: Stephen Hemminger <stephen@networkplumber.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: David Howells <dhowells@redhat.com> Cc: Pranith Kumar <bobby.prani@gmail.com> Cc: Michael Kerrisk <mtk.manpages@gmail.com> Cc: Shuah Khan <shuahkh@osg.samsung.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
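A minimal userspace usage sketch; it assumes headers that define __NR_membarrier and linux/membarrier.h (glibc provided no wrapper at the time, hence the raw syscall(2)):

    #include <linux/membarrier.h>
    #include <sys/syscall.h>
    #include <unistd.h>
    #include <stdio.h>

    static int membarrier(int cmd, int flags)
    {
            return syscall(__NR_membarrier, cmd, flags);
    }

    int main(void)
    {
            int mask = membarrier(MEMBARRIER_CMD_QUERY, 0);

            if (mask < 0 || !(mask & MEMBARRIER_CMD_SHARED)) {
                    fprintf(stderr, "membarrier not available\n");
                    return 1;
            }
            /* acts as smp_mb() on every running thread in the system */
            membarrier(MEMBARRIER_CMD_SHARED, 0);
            return 0;
    }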
2015-09-11  perf/x86/intel/bts: Set event->hw.itrace_started in pmu::start to match the new logic  (Alexander Shishkin)
Since event->hw.itrace_started is now set in pmu::start() to signal the beginning of the trace, do so also in the intel_bts driver. Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: acme@infradead.org Cc: adrian.hunter@intel.com Cc: hpa@zytor.com Link: http://lkml.kernel.org/r/1437140050-23363-4-git-send-email-alexander.shishkin@linux.intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-09-11  locking/qspinlock/x86: Only emit the test-and-set fallback when building guest support  (Peter Zijlstra)
Only emit the test-and-set fallback for hypervisors lacking PARAVIRT_SPINLOCKS support when building for guests. Suggested-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Cc: stable@vger.kernel.org # 4.2 Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-09-11  locking/qspinlock/x86: Fix performance regression under unaccelerated VMs  (Peter Zijlstra)
Dave ran into horrible performance on a VM without PARAVIRT_SPINLOCKS set and Linus noted that the test-and-set implementation was retarded. One should spin on the variable with a load, not a RMW. While there, remove 'queued' from the name, as the lock isn't queued at all, but a simple test-and-set. Suggested-by: Linus Torvalds <torvalds@linux-foundation.org> Reported-by: Dave Chinner <david@fromorbit.com> Tested-by: Dave Chinner <david@fromorbit.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Waiman Long <Waiman.Long@hp.com> Cc: stable@vger.kernel.org # v4.2+ Link: http://lkml.kernel.org/r/20150904152523.GR18673@twins.programming.kicks-ass.net Signed-off-by: Ingo Molnar <mingo@kernel.org>
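The fixed fallback spins with a plain load and only attempts the atomic when the lock looks free; simplified from arch/x86/include/asm/qspinlock.h:

    static inline bool virt_spin_lock(struct qspinlock *lock)
    {
            if (!static_cpu_has(X86_FEATURE_HYPERVISOR))
                    return false;

            /*
             * Spin on a read, not on the cmpxchg: an RMW in the wait
             * loop bounces the cacheline between waiters, which is
             * what caused the regression.
             */
            do {
                    while (atomic_read(&lock->val) != 0)
                            cpu_relax();
            } while (atomic_cmpxchg(&lock->val, 0, _Q_LOCKED_VAL) != 0);

            return true;
    }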