summaryrefslogtreecommitdiff
path: root/arch/x86/kernel
AgeCommit message (Collapse)Author
2016-01-06perf/x86/intel/uncore: Remove hard coding of PMON box control MSR offsetHarish Chegondi
Call uncore_pci_box_ctl() function to get the PMON box control MSR offset instead of hard coding the offset. This would allow us to use this snbep_uncore_pci_init_box() function for other PCI PMON devices whose box control MSR offset is different from SNBEP_PCI_PMON_BOX_CTL. Signed-off-by: Harish Chegondi <harish.chegondi@intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Andi Kleen <andi.kleen@intel.com> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Harish Chegondi <harish.chegondi@gmail.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Kan Liang <kan.liang@intel.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Lukasz Anaczkowski <lukasz.anaczkowski@intel.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Stephane Eranian <eranian@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vince Weaver <vincent.weaver@maine.edu> Link: http://lkml.kernel.org/r/872e8ef16cfc38e5ff3b45fac1094e6f1722e4ad.1449470704.git.harish.chegondi@intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-01-06perf/x86/intel: Add perf core PMU support for Intel Knights LandingHarish Chegondi
Knights Landing core is based on Silvermont core with several differences. Like Silvermont, Knights Landing has 8 pairs of LBR MSRs. However, the LBR MSRs addresses match those of the Xeon cores' first 8 pairs of LBR MSRs Unlike Silvermont, Knights Landing supports hyperthreading. Knights Landing offcore response events config register mask is different from that of the Silvermont. This patch was developed based on a patch from Andi Kleen. For more details, please refer to the public document: https://software.intel.com/sites/default/files/managed/15/8d/IntelXeonPhi%E2%84%A2x200ProcessorPerformanceMonitoringReferenceManual_Volume1_Registers_v0%206.pdf Signed-off-by: Harish Chegondi <harish.chegondi@intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Andi Kleen <andi.kleen@intel.com> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Harish Chegondi <harish.chegondi@gmail.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Kan Liang <kan.liang@intel.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Lukasz Anaczkowski <lukasz.anaczkowski@intel.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Stephane Eranian <eranian@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vince Weaver <vincent.weaver@maine.edu> Link: http://lkml.kernel.org/r/d14593c7311f78c93c9cf6b006be843777c5ad5c.1449517401.git.harish.chegondi@intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-01-06perf/x86/intel/uncore: Add Broadwell-EP uncore supportKan Liang
The uncore subsystem for Broadwell-EP is similar to Haswell-EP. There are some differences in pci device IDs, box number and constraints. This patch extends the Broadwell-DE codes to support Broadwell-EP. Signed-off-by: Kan Liang <kan.liang@intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Stephane Eranian <eranian@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vince Weaver <vincent.weaver@maine.edu> Link: http://lkml.kernel.org/r/1449176411-9499-1-git-send-email-kan.liang@intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-01-06perf/x86/rapl: Use unified perf_event_sysfs_show instead of special interfaceHuang Rui
Actually, rapl_sysfs_show is a duplicate of perf_event_sysfs_show. We prefer to use the unified interface. Signed-off-by: Huang Rui <ray.huang@amd.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Borislav Petkov <bp@suse.de> Cc: Dasaratharaman Chandramouli<dasaratharaman.chandramouli@intel.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Robert Richter <rric@kernel.org> Cc: Stephane Eranian <eranian@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vince Weaver <vincent.weaver@maine.edu> Link: http://lkml.kernel.org/r/1449223661-2437-1-git-send-email-ray.huang@amd.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-01-06perf/x86: Enable cycles:pp for Intel AtomStephane Eranian
This patch updates the PEBS support for Intel Atom to provide an alias for the cycles:pp event used by perf record/top by default nowadays. On Atom, only INST_RETIRED:ANY supports PEBS, so we use this event instead with a large cmask to count cycles. Given that Core2 has the same issue, we use the intel_pebs_aliases_core2() function for Atom as well. Signed-off-by: Stephane Eranian <eranian@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vince Weaver <vincent.weaver@maine.edu> Cc: kan.liang@intel.com Link: http://lkml.kernel.org/r/1449172990-30183-3-git-send-email-eranian@google.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-01-06perf/x86: fix PEBS issues on Intel Atom/Core2Stephane Eranian
This patch fixes broken PEBS support on Intel Atom and Core2 due to wrong pointer arithmetic in intel_pmu_drain_pebs_core(). The get_next_pebs_record_by_bit() was called on PEBS format fmt0 which does not use the pebs_record_nhm layout. Signed-off-by: Stephane Eranian <eranian@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vince Weaver <vincent.weaver@maine.edu> Cc: kan.liang@intel.com Fixes: 21509084f999 ("perf/x86/intel: Handle multiple records in the PEBS buffer") Link: http://lkml.kernel.org/r/1449182000-31524-3-git-send-email-eranian@google.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-01-06perf/x86: Fix LBR related crashes on Intel AtomStephane Eranian
This patches fixes the LBR kernel crashes on Intel Atom. The kernel was assuming that if the CPU supports 64-bit format LBR, then it has an LBR_SELECT MSR. Atom uses 64-bit LBR format but does not have LBR_SELECT. That was causing NULL pointer dereferences in a couple of places. Signed-off-by: Stephane Eranian <eranian@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vince Weaver <vincent.weaver@maine.edu> Cc: kan.liang@intel.com Fixes: 96f3eda67fcf ("perf/x86/intel: Fix static checker warning in lbr enable") Link: http://lkml.kernel.org/r/1449182000-31524-2-git-send-email-eranian@google.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-01-06perf/x86: Fix filter_events() bug with event mappingsStephane Eranian
This patch fixes a bug in the filter_events() function. The patch fixes the bug whereby if some mappings did not exist, e.g., STALLED_CYCLES_FRONTEND, then any event after it in the attrs array would disappear from the published list of events in /sys/devices/cpu/events. This could be verified easily on any system post SNB (which do not publish STALLED_CYCLES_FRONTEND): $ ./perf stat -e cycles,ref-cycles true Performance counter stats for 'true': 1,217,348 cycles <not supported> ref-cycles The problem is that in filter_events() there is an assumption that the argument (attrs) is organized in increasing continuous event indexes related to the event_map(). But if we remove the non-supported events by shifing the position in the array, then the lookup x86_pmu.event_map() needs to compensate for it, otherwise we are looking up the wrong index. This patch corrects this problem by compensating for the deleted events and with that ref-cycles reappears (here shown on Haswell): $ perf stat -e ref-cycles,cycles true Performance counter stats for 'true': 4,525,910 ref-cycles 1,064,920 cycles 0.002943888 seconds time elapsed Signed-off-by: Stephane Eranian <eranian@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vince Weaver <vincent.weaver@maine.edu> Cc: jolsa@kernel.org Cc: kan.liang@intel.com Fixes: 8300daa26755 ("perf/x86: Filter out undefined events from sysfs events attribute") Link: http://lkml.kernel.org/r/1449516805-6637-1-git-send-email-eranian@google.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-01-06perf/x86: Use INST_RETIRED.PREC_DIST for cycles: pppAndi Kleen
Add a new 'three-p' precise level, that uses INST_RETIRED.PREC_DIST as base. The basic mechanism of abusing the inverse cmask to get all cycles works the same as before. PREC_DIST is available on Sandy Bridge or later. It had some problems on Sandy Bridge, so we only use it on IvyBridge and later. I tested it on Broadwell and Skylake. PREC_DIST has special support for avoiding shadow effects, which can give better results compare to UOPS_RETIRED. The drawback is that PREC_DIST can only schedule on counter 1, but that is ok for cycle sampling, as there is normally no need to do multiple cycle sampling runs in parallel. It is still possible to run perf top in parallel, as that doesn't use precise mode. Also of course the multiplexing can still allow parallel operation. :pp stays with the previous event. Example: Sample a loop with 10 sqrt with old cycles:pp 0.14 │10: sqrtps %xmm1,%xmm0 <-------------- 9.13 │ sqrtps %xmm1,%xmm0 11.58 │ sqrtps %xmm1,%xmm0 11.51 │ sqrtps %xmm1,%xmm0 6.27 │ sqrtps %xmm1,%xmm0 10.38 │ sqrtps %xmm1,%xmm0 12.20 │ sqrtps %xmm1,%xmm0 12.74 │ sqrtps %xmm1,%xmm0 5.40 │ sqrtps %xmm1,%xmm0 10.14 │ sqrtps %xmm1,%xmm0 10.51 │ ↑ jmp 10 We expect all 10 sqrt to get roughly the sample number of samples. But you can see that the instruction directly after the JMP is systematically underestimated in the result, due to sampling shadow effects. With the new PREC_DIST based sampling this problem is gone and all instructions show up roughly evenly: 9.51 │10: sqrtps %xmm1,%xmm0 11.74 │ sqrtps %xmm1,%xmm0 11.84 │ sqrtps %xmm1,%xmm0 6.05 │ sqrtps %xmm1,%xmm0 10.46 │ sqrtps %xmm1,%xmm0 12.25 │ sqrtps %xmm1,%xmm0 12.18 │ sqrtps %xmm1,%xmm0 5.26 │ sqrtps %xmm1,%xmm0 10.13 │ sqrtps %xmm1,%xmm0 10.43 │ sqrtps %xmm1,%xmm0 0.16 │ ↑ jmp 10 Even with PREC_DIST there is still sampling skid and the result is not completely even, but systematic shadow effects are significantly reduced. The improvements are mainly expected to make a difference in high IPC code. With low IPC it should be similar. Signed-off-by: Andi Kleen <ak@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Stephane Eranian <eranian@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vince Weaver <vincent.weaver@maine.edu> Cc: hpa@zytor.com Link: http://lkml.kernel.org/r/1448929689-13771-2-git-send-email-andi@firstfloor.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-01-06perf/x86: Use INST_RETIRED.TOTAL_CYCLES_PS for cycles:pp for SkylakeAndi Kleen
I added UOPS_RETIRED.ALL by mistake to the Skylake PEBS event list for cycles:pp. But the event is not documented for Skylake, and has some issues. The recommended replacement for cycles:pp is to use INST_RETIRED.ANY+pebs as a base, similar to what CPUs before Sandy Bridge did. This new event is called INST_RETIRED.TOTAL_CYCLES_PS. The event is not really new, but has been already used by perf before Sandy Bridge for the original cycles:p Note the SDM doesn't document that event either, but it's being documented in the latest version of the event list on: https://download.01.org/perfmon/SKL This patch does: - Remove UOPS_RETIRED.ALL from the Skylake PEBS event list - Add INST_RETIRED.ANY to the Skylake PEBS event list, and an table entry to allow cmask=16,inv=1 for cycles:pp - We don't need an extra entry for the base INST_RETIRED event, because it is already covered by the catch-all PEBS table entry. - Switch Skylake to use the Core2 PEBS alias (which is INST_RETIRED.TOTAL_CYCLES_PS) Signed-off-by: Andi Kleen <ak@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Stephane Eranian <eranian@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vince Weaver <vincent.weaver@maine.edu> Cc: hpa@zytor.com Link: http://lkml.kernel.org/r/1448929689-13771-1-git-send-email-andi@firstfloor.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-01-06perf/x86: Allow zero PEBS status with only single active eventAndi Kleen
Normally we drop PEBS events with a zero status field. But when there is only a single PEBS event active we can assume the PEBS record is for that event. The PEBS buffer is always flushed when PEBS events are disabled, so there is no risk of mishandling state PEBS records this way. Signed-off-by: Andi Kleen <ak@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Stephane Eranian <eranian@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vince Weaver <vincent.weaver@maine.edu> Link: http://lkml.kernel.org/r/1449177740-5422-2-git-send-email-andi@firstfloor.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-01-06perf/x86: Remove warning for zero PEBS statusAndi Kleen
The recent commit: 75f80859b130 ("perf/x86/intel/pebs: Robustify PEBS buffer drain") causes lots of warnings on different CPUs before Skylake when running PEBS intensive workloads. They can have a zero status field in the PEBS record when PEBS is racing with clearing of GLOBAl_STATUS. This also can cause hangs (it seems there are still problems with printk in NMI). Disable the warning, but still ignore the record. Signed-off-by: Andi Kleen <ak@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Stephane Eranian <eranian@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vince Weaver <vincent.weaver@maine.edu> Link: http://lkml.kernel.org/r/1449177740-5422-1-git-send-email-andi@firstfloor.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-01-06x86/fpu: Properly align size in CHECK_MEMBER_AT_END_OF() macroJiri Olsa
The CHECK_MEMBER_AT_END_OF(TYPE, MEMBER) checks whether MEMBER is last member of TYPE by evaluating: offsetof(TYPE::MEMBER) + sizeof(TYPE::MEMBER) == sizeof(TYPE) and ensuring TYPE::MEMBER is the last member of the TYPE. This condition breaks on structs that are padded to be aligned. This patch ensures the TYPE alignment is taken into account. This bug was revealed after adding cacheline alignment into struct sched_entity, which broke task_struct::thread check: CHECK_MEMBER_AT_END_OF(struct task_struct, thread); Signed-off-by: Jiri Olsa <jolsa@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Dave Hansen <dave@sr71.net> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/1450707930-3445-1-git-send-email-jolsa@kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-01-04x86: ftrace: Fix the comments for ftrace_modify_code_direct()Li Bin
There is no need to worry about module and __init text disappearing case, because that ftrace has a module notifier that is called when a module is being unloaded and before the text goes away and this code grabs the ftrace_lock mutex and removes the module functions from the ftrace list, such that it will no longer do any modifications to that module's text, the update to make functions be traced or not is done under the ftrace_lock mutex as well. And by now, __init section codes should not been modified by ftrace, because it is black listed in recordmcount.c and ignored by ftrace. Link: http://lkml.kernel.org/r/1449367378-29430-6-git-send-email-huawei.libin@huawei.com Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Suggested-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Li Bin <huawei.libin@huawei.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2016-01-04Merge branch 'memdup_user_nul' into work.miscAl Viro
2015-12-30x86/numachip: Fix NumaConnect2 MMCFG PCI accessDaniel J Blueman
The MMCFG PCI accessors weren't being setup for NumacConnect2 correctly due to over-early assignment; this would create the potential for the wrong PCI domain to be accessed. Fix this by using the correct arch-specific PCI init function. Signed-off-by: Daniel J Blueman <daniel@numascale.com> Acked-by: Steffen Persvold <sp@numascale.com> Cc: Daniel Lezcano <daniel.lezcano@linaro.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/r/1451498807-15920-1-git-send-email-daniel@numascale.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2015-12-29x86/LDT: Print the real LDT base addressJan Beulich
This was meant to print base address and entry count; make it do so again. Fixes: 37868fe113ff "x86/ldt: Make modify_ldt synchronous" Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Andy Lutomirski <luto@kernel.org> Link: http://lkml.kernel.org/r/56797D8402000078000C24F0@prv-mh.provo.novell.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2015-12-29arch/x86/kernel/ptrace.c: Remove unused arg_offs_tablechengang@emindsoft.com.cn
The related warning from gcc 6.0: arch/x86/kernel/ptrace.c:127:18: warning: ‘arg_offs_table’ defined but not used [-Wunused-const-variable] static const int arg_offs_table[] = { ^~~~~~~~~~~~~~ Signed-off-by: Chen Gang <gang.chen.5i5j@gmail.com> Link: http://lkml.kernel.org/r/1451137798-28701-1-git-send-email-chengang@emindsoft.com.cn Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2015-12-23new helpers: no_seek_end_llseek{,_size}()Al Viro
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-12-23Merge branch 'for-linus' into for-nextTakashi Iwai
Conflicts: drivers/gpu/drm/i915/intel_pm.c
2015-12-20x86/irq: Export functions to allow MSI domains in modulesJake Oshins
The Linux kernel already has the concept of IRQ domain, wherein a component can expose a set of IRQs which are managed by a particular interrupt controller chip or other subsystem. The PCI driver exposes the notion of an IRQ domain for Message-Signaled Interrupts (MSI) from PCI Express devices. This patch exposes the functions which are necessary for creating a MSI IRQ domain within a module. [ tglx: Split it into x86 and core irq parts ] Signed-off-by: Jake Oshins <jakeo@microsoft.com> Cc: gregkh@linuxfoundation.org Cc: kys@microsoft.com Cc: devel@linuxdriverproject.org Cc: olaf@aepfle.de Cc: apw@canonical.com Cc: vkuznets@redhat.com Cc: haiyangz@microsoft.com Cc: marc.zyngier@arm.com Cc: bhelgaas@google.com Link: http://lkml.kernel.org/r/1449769983-12948-4-git-send-email-jakeo@microsoft.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2015-12-19x86/paravirt: Prevent rtc_cmos platform device init on PV guestsDavid Vrabel
Adding the rtc platform device in non-privileged Xen PV guests causes an IRQ conflict because these guests do not have legacy PIC and may allocate irqs in the legacy range. In a single VCPU Xen PV guest we should have: /proc/interrupts: CPU0 0: 4934 xen-percpu-virq timer0 1: 0 xen-percpu-ipi spinlock0 2: 0 xen-percpu-ipi resched0 3: 0 xen-percpu-ipi callfunc0 4: 0 xen-percpu-virq debug0 5: 0 xen-percpu-ipi callfuncsingle0 6: 0 xen-percpu-ipi irqwork0 7: 321 xen-dyn-event xenbus 8: 90 xen-dyn-event hvc_console ... But hvc_console cannot get its interrupt because it is already in use by rtc0 and the console does not work. genirq: Flags mismatch irq 8. 00000000 (hvc_console) vs. 00000000 (rtc0) We can avoid this problem by realizing that unprivileged PV guests (both Xen and lguests) are not supposed to have rtc_cmos device and so adding it is not necessary. Privileged guests (i.e. Xen's dom0) do use it but they should not have irq conflicts since they allocate irqs above legacy range (above gsi_top, in fact). Instead of explicitly testing whether the guest is privileged we can extend pv_info structure to include information about guest's RTC support. Reported-and-tested-by: Sander Eikelenboom <linux@eikelenboom.it> Signed-off-by: David Vrabel <david.vrabel@citrix.com> Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Cc: vkuznets@redhat.com Cc: xen-devel@lists.xenproject.org Cc: konrad.wilk@oracle.com Cc: stable@vger.kernel.org # 4.2+ Link: http://lkml.kernel.org/r/1449842873-2613-1-git-send-email-boris.ostrovsky@oracle.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2015-12-19x86/cpufeature: Remove unused and seldomly used cpu_has_xx macrosBorislav Petkov
Those are stupid and code should use static_cpu_has_safe() or boot_cpu_has() instead. Kill the least used and unused ones. The remaining ones need more careful inspection before a conversion can happen. On the TODO. Signed-off-by: Borislav Petkov <bp@suse.de> Link: http://lkml.kernel.org/r/1449481182-27541-4-git-send-email-bp@alien8.de Cc: David Sterba <dsterba@suse.com> Cc: Herbert Xu <herbert@gondor.apana.org.au> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Matt Mackall <mpm@selenic.com> Cc: Chris Mason <clm@fb.com> Cc: Josef Bacik <jbacik@fb.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2015-12-19x86/cpufeature: Cleanup get_cpu_cap()Borislav Petkov
Add an enum for the ->x86_capability array indices and cleanup get_cpu_cap() by killing some redundant local vars. Signed-off-by: Borislav Petkov <bp@suse.de> Link: http://lkml.kernel.org/r/1449481182-27541-3-git-send-email-bp@alien8.de Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2015-12-19x86/cpufeature: Move some of the scattered feature bits to x86_capabilityBorislav Petkov
Turn the CPUID leafs which are proper CPUID feature bit leafs into separate ->x86_capability words. Signed-off-by: Borislav Petkov <bp@suse.de> Link: http://lkml.kernel.org/r/1449481182-27541-2-git-send-email-bp@alien8.de Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2015-12-19Merge branch 'linus' into x86/cleanupsThomas Gleixner
Pull in upstream changes so we can apply depending patches.
2015-12-19x86/nmi: Save regs in crash dump on external NMIHidehiro Kawai
Now, multiple CPUs can receive an external NMI simultaneously by specifying the "apic_extnmi=all" command line parameter. When we take a crash dump by using external NMI with this option, we fail to save registers into the crash dump. This happens as follows: CPU 0 CPU 1 ================================ ============================= receive an external NMI default_do_nmi() receive an external NMI spin_lock(&nmi_reason_lock) default_do_nmi() io_check_error() spin_lock(&nmi_reason_lock) panic() busy loop ... kdump_nmi_shootdown_cpus() issue NMI IPI -----------> blocked until IRET busy loop... Here, since CPU 1 is in NMI context, an additional NMI from CPU 0 remains unhandled until CPU 1 IRETs. However, CPU 1 will never execute IRET so the NMI is not handled and the callback function to save registers is never called. To solve this issue, we check if the IPI for crash dumping was issued while waiting for nmi_reason_lock to be released, and if so, call its callback function directly. If the IPI is not issued (e.g. kdump is disabled), the actual behavior doesn't change. Signed-off-by: Hidehiro Kawai <hidehiro.kawai.ez@hitachi.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Baoquan He <bhe@redhat.com> Cc: Dave Young <dyoung@redhat.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Jiang Liu <jiang.liu@linux.intel.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: kexec@lists.infradead.org Cc: linux-doc@vger.kernel.org Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Stefan Lippers-Hollmann <s.l-h@gmx.de> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vivek Goyal <vgoyal@redhat.com> Cc: x86-ml <x86@kernel.org> Link: http://lkml.kernel.org/r/20151210065245.4587.39316.stgit@softrs Signed-off-by: Borislav Petkov <bp@suse.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2015-12-19x86/apic: Introduce apic_extnmi command line parameterHidehiro Kawai
This patch introduces a command line parameter apic_extnmi: apic_extnmi=( bsp|all|none ) The default value is "bsp" and this is the current behavior: only the Boot-Strapping Processor receives an external NMI. "all" allows external NMIs to be broadcast to all CPUs. This would raise the success rate of panic on NMI when BSP hangs in NMI context or the external NMI is swallowed by other NMI handlers on the BSP. If you specify "none", no CPUs receive external NMIs. This is useful for the dump capture kernel so that it cannot be shot down by accidentally pressing the external NMI button (on platforms which have it) while saving a crash dump. Signed-off-by: Hidehiro Kawai <hidehiro.kawai.ez@hitachi.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Bandan Das <bsd@redhat.com> Cc: Baoquan He <bhe@redhat.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Jiang Liu <jiang.liu@linux.intel.com> Cc: Joerg Roedel <joro@8bytes.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: kexec@lists.infradead.org Cc: linux-doc@vger.kernel.org Cc: "Maciej W. Rozycki" <macro@linux-mips.org> Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Ricardo Ribalda Delgado <ricardo.ribalda@gmail.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Viresh Kumar <viresh.kumar@linaro.org> Cc: Vivek Goyal <vgoyal@redhat.com> Cc: x86-ml <x86@kernel.org> Link: http://lkml.kernel.org/r/20151210014632.25437.43778.stgit@softrs Signed-off-by: Borislav Petkov <bp@suse.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2015-12-19panic, x86: Allow CPUs to save registers even if looping in NMI contextHidehiro Kawai
Currently, kdump_nmi_shootdown_cpus(), a subroutine of crash_kexec(), sends an NMI IPI to CPUs which haven't called panic() to stop them, save their register information and do some cleanups for crash dumping. However, if such a CPU is infinitely looping in NMI context, we fail to save its register information into the crash dump. For example, this can happen when unknown NMIs are broadcast to all CPUs as follows: CPU 0 CPU 1 =========================== ========================== receive an unknown NMI unknown_nmi_error() panic() receive an unknown NMI spin_trylock(&panic_lock) unknown_nmi_error() crash_kexec() panic() spin_trylock(&panic_lock) panic_smp_self_stop() infinite loop kdump_nmi_shootdown_cpus() issue NMI IPI -----------> blocked until IRET infinite loop... Here, since CPU 1 is in NMI context, the second NMI from CPU 0 is blocked until CPU 1 executes IRET. However, CPU 1 never executes IRET, so the NMI is not handled and the callback function to save registers is never called. In practice, this can happen on some servers which broadcast NMIs to all CPUs when the NMI button is pushed. To save registers in this case, we need to: a) Return from NMI handler instead of looping infinitely or b) Call the callback function directly from the infinite loop Inherently, a) is risky because NMI is also used to prevent corrupted data from being propagated to devices. So, we chose b). This patch does the following: 1. Move the infinite looping of CPUs which haven't called panic() in NMI context (actually done by panic_smp_self_stop()) outside of panic() to enable us to refer pt_regs. Please note that panic_smp_self_stop() is still used for normal context. 2. Call a callback of kdump_nmi_shootdown_cpus() directly to save registers and do some cleanups after setting waiting_for_crash_ipi which is used for counting down the number of CPUs which handled the callback Signed-off-by: Hidehiro Kawai <hidehiro.kawai.ez@hitachi.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Aaron Tomlin <atomlin@redhat.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Baoquan He <bhe@redhat.com> Cc: Chris Metcalf <cmetcalf@ezchip.com> Cc: Dave Young <dyoung@redhat.com> Cc: David Hildenbrand <dahi@linux.vnet.ibm.com> Cc: Don Zickus <dzickus@redhat.com> Cc: Eric Biederman <ebiederm@xmission.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Gobinda Charan Maji <gobinda.cemk07@gmail.com> Cc: HATAYAMA Daisuke <d.hatayama@jp.fujitsu.com> Cc: Hidehiro Kawai <hidehiro.kawai.ez@hitachi.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Javi Merino <javi.merino@arm.com> Cc: Jiang Liu <jiang.liu@linux.intel.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: kexec@lists.infradead.org Cc: linux-doc@vger.kernel.org Cc: lkml <linux-kernel@vger.kernel.org> Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com> Cc: Michal Nazarewicz <mina86@mina86.com> Cc: Nicolas Iooss <nicolas.iooss_linux@m4x.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Prarit Bhargava <prarit@redhat.com> Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk> Cc: Seth Jennings <sjenning@redhat.com> Cc: Stefan Lippers-Hollmann <s.l-h@gmx.de> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ulrich Obergfell <uobergfe@redhat.com> Cc: Vitaly Kuznetsov <vkuznets@redhat.com> Cc: Vivek Goyal <vgoyal@redhat.com> Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com> Link: http://lkml.kernel.org/r/20151210014628.25437.75256.stgit@softrs [ Cleanup comments, fixup formatting. ] Signed-off-by: Borislav Petkov <bp@suse.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2015-12-19panic, x86: Fix re-entrance problem due to panic on NMIHidehiro Kawai
If panic on NMI happens just after panic() on the same CPU, panic() is recursively called. Kernel stalls, as a result, after failing to acquire panic_lock. To avoid this problem, don't call panic() in NMI context if we've already entered panic(). For that, introduce nmi_panic() macro to reduce code duplication. In the case of panic on NMI, don't return from NMI handlers if another CPU already panicked. Signed-off-by: Hidehiro Kawai <hidehiro.kawai.ez@hitachi.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Aaron Tomlin <atomlin@redhat.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Baoquan He <bhe@redhat.com> Cc: Chris Metcalf <cmetcalf@ezchip.com> Cc: David Hildenbrand <dahi@linux.vnet.ibm.com> Cc: Don Zickus <dzickus@redhat.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Gobinda Charan Maji <gobinda.cemk07@gmail.com> Cc: HATAYAMA Daisuke <d.hatayama@jp.fujitsu.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Javi Merino <javi.merino@arm.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: kexec@lists.infradead.org Cc: linux-doc@vger.kernel.org Cc: lkml <linux-kernel@vger.kernel.org> Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com> Cc: Michal Nazarewicz <mina86@mina86.com> Cc: Nicolas Iooss <nicolas.iooss_linux@m4x.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Prarit Bhargava <prarit@redhat.com> Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk> Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: Seth Jennings <sjenning@redhat.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ulrich Obergfell <uobergfe@redhat.com> Cc: Vitaly Kuznetsov <vkuznets@redhat.com> Cc: Vivek Goyal <vgoyal@redhat.com> Link: http://lkml.kernel.org/r/20151210014626.25437.13302.stgit@softrs [ Cleanup comments, fixup formatting. ] Signed-off-by: Borislav Petkov <bp@suse.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2015-12-19Merge branch 'linus' into x86/apicThomas Gleixner
Pull in update changes so we can apply conflicting patches
2015-12-19x86/mce: Ensure offline CPUs don't participate in rendezvous processAshok Raj
Intel's MCA implementation broadcasts MCEs to all CPUs on the node. This poses a problem for offlined CPUs which cannot participate in the rendezvous process: Kernel panic - not syncing: Timeout: Not all CPUs entered broadcast exception handler Kernel Offset: disabled Rebooting in 100 seconds.. More specifically, Linux does a soft offline of a CPU when writing a 0 to /sys/devices/system/cpu/cpuX/online, which doesn't prevent the #MC exception from being broadcasted to that CPU. Ensure that offline CPUs don't participate in the MCE rendezvous and clear the RIP valid status bit so that a second MCE won't cause a shutdown. Without the patch, mce_start() will increment mce_callin and wait for all CPUs. Offlined CPUs should avoid participating in the rendezvous process altogether. Signed-off-by: Ashok Raj <ashok.raj@intel.com> [ Massage commit message. ] Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Tony Luck <tony.luck@intel.com> Cc: <stable@vger.kernel.org> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-edac <linux-edac@vger.kernel.org> Link: http://lkml.kernel.org/r/1449742346-21470-2-git-send-email-bp@alien8.de Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2015-12-14Merge tag 'v4.4-rc5' into perf/core, to pick up fixesIngo Molnar
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-12-11x86/vdso: Remove pvclock fixmap machineryAndy Lutomirski
Signed-off-by: Andy Lutomirski <luto@kernel.org> Reviewed-by: Paolo Bonzini <pbonzini@redhat.com> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-mm@kvack.org Link: http://lkml.kernel.org/r/4933029991103ae44672c82b97a20035f5c1fe4f.1449702533.git.luto@kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-12-11x86/vdso: Get pvclock data from the vvar VMA instead of the fixmapAndy Lutomirski
Signed-off-by: Andy Lutomirski <luto@kernel.org> Reviewed-by: Paolo Bonzini <pbonzini@redhat.com> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-mm@kvack.org Link: http://lkml.kernel.org/r/9d37826fdc7e2d2809efe31d5345f97186859284.1449702533.git.luto@kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-12-08Merge branch 'perf-urgent-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull perf fixes from Ingo Molnar: "This tree includes four core perf fixes for misc bugs, three fixes to x86 PMU drivers, and two updates to old email addresses" * 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: perf: Do not send exit event twice perf/x86/intel: Fix INTEL_FLAGS_UEVENT_CONSTRAINT_DATALA_NA macro perf/x86/intel: Make L1D_PEND_MISS.FB_FULL not constrained on Haswell perf: Fix PERF_EVENT_IOC_PERIOD deadlock treewide: Remove old email address perf/x86: Fix LBR call stack save/restore perf: Update email address in MAINTAINERS perf/core: Robustify the perf_cgroup_from_task() RCU checks perf/core: Fix RCU problem with cgroup context switching code
2015-12-08Back merge tag 'v4.4-rc4' into drm-nextDave Airlie
We've picked up a few conflicts and it would be nice to resolve them before we move onwards.
2015-12-06Merge branch 'x86-urgent-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 fixes from Thoma Gleixner: "Another round of fixes for x86: - Move the initialization of the microcode driver to late_initcall to make sure everything that init function needs is available. - Make sure that lockdep knows about interrupts being off in the entry code before calling into c-code. - Undo the cpu hotplug init delay regression. - Use the proper conditionals in the mpx instruction decoder. - Fixup restart_syscall for x32 tasks. - Fix the hugepage regression on PAE kernels which was introduced with the latest PAT changes" * 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/signal: Fix restart_syscall number for x32 tasks x86/mpx: Fix instruction decoder condition x86/mm: Fix regression with huge pages on PAE x86 smpboot: Re-enable init_udelay=0 by default on modern CPUs x86/entry/64: Fix irqflag tracing wrt context tracking x86/microcode: Initialize the driver late when facilities are up
2015-12-06perf/x86: Remove old MSR perf tracing codeAndi Kleen
Now that we have generic MSR trace points we can remove the old hackish perf MSR read tracing code. Signed-off-by: Andi Kleen <ak@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Stephane Eranian <eranian@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vince Weaver <vincent.weaver@maine.edu> Cc: rostedt@goodmis.org Link: http://lkml.kernel.org/r/1449018060-1742-4-git-send-email-andi@firstfloor.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-12-06perf/x86/intel: Fix __initconst declaration in the RAPL perf driverAndi Kleen
Fix a definition in the perf rapl driver. __initconst must be applied to a const object, but to declare a const pointer you need to use * const ..., not const ... * This fixes a section attribute conflict with LTO builds. Signed-off-by: Andi Kleen <ak@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Stephane Eranian <eranian@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vince Weaver <vincent.weaver@maine.edu> Link: http://lkml.kernel.org/r/1448905722-2767-1-git-send-email-andi@firstfloor.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-12-06Merge branch 'perf/urgent' into perf/core, to pick up fixesIngo Molnar
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-12-06perf/x86/intel: Fix INTEL_FLAGS_UEVENT_CONSTRAINT_DATALA_NA macroJiri Olsa
We need to add rest of the flags to the constraint mask instead of another INTEL_ARCH_EVENT_MASK, fixing a typo. Signed-off-by: Jiri Olsa <jolsa@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Stephane Eranian <eranian@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vince Weaver <vincent.weaver@maine.edu> Link: http://lkml.kernel.org/r/1447061071-28085-1-git-send-email-jolsa@kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-12-06perf/x86/intel: Make L1D_PEND_MISS.FB_FULL not constrained on HaswellYuanfang Chen
There was a mistake in the Haswell constraints table. Signed-off-by: Yuanfang Chen <cheny@udel.edu> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Andi Kleen <ak@linux.intel.com> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Stephane Eranian <eranian@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vince Weaver <vincent.weaver@maine.edu> Link: http://lkml.kernel.org/r/1448384701-9110-1-git-send-email-cheny@udel.edu Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-12-06x86/mm/64: Enable SWIOTLB if system has SRAT memory regions above MAX_DMA32_PFNIgor Mammedov
when memory hotplug enabled system is booted with less than 4GB of RAM and then later more RAM is hotplugged 32-bit devices stop functioning with following error: nommu_map_single: overflow 327b4f8c0+1522 of device mask ffffffff the reason for this is that if x86_64 system were booted with RAM less than 4GB, it doesn't enable SWIOTLB and when memory is hotplugged beyond MAX_DMA32_PFN, devices that expect 32-bit addresses can't handle 64-bit addresses. Fix it by tracking max possible PFN when parsing memory affinity structures from SRAT ACPI table and enable SWIOTLB if there is hotpluggable memory regions beyond MAX_DMA32_PFN. It fixes KVM guests when they use emulated devices (reproduces with ata_piix, e1000 and usb devices, RHBZ: 1275941, 1275977, 1271527) It also fixes the HyperV, VMWare with emulated devices which are affected by this issue as well. Signed-off-by: Igor Mammedov <imammedo@redhat.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: akataria@vmware.com Cc: fujita.tomonori@lab.ntt.co.jp Cc: konrad.wilk@oracle.com Cc: pbonzini@redhat.com Cc: revers@redhat.com Cc: riel@redhat.com Link: http://lkml.kernel.org/r/1449234426-273049-3-git-send-email-imammedo@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-12-06x86/mm: Introduce max_possible_pfnIgor Mammedov
max_possible_pfn will be used for tracking max possible PFN for memory that isn't present in E820 table and could be hotplugged later. By default max_possible_pfn is initialized with max_pfn, but later it could be updated with highest PFN of hotpluggable memory ranges declared in ACPI SRAT table if any present. Signed-off-by: Igor Mammedov <imammedo@redhat.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: akataria@vmware.com Cc: fujita.tomonori@lab.ntt.co.jp Cc: konrad.wilk@oracle.com Cc: pbonzini@redhat.com Cc: revers@redhat.com Cc: riel@redhat.com Link: http://lkml.kernel.org/r/1449234426-273049-2-git-send-email-imammedo@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-12-05x86/signal: Fix restart_syscall number for x32 tasksDmitry V. Levin
When restarting a syscall with regs->ax == -ERESTART_RESTARTBLOCK, regs->ax is assigned to a restart_syscall number. For x32 tasks, this syscall number must have __X32_SYSCALL_BIT set, otherwise it will be an x86_64 syscall number instead of a valid x32 syscall number. This issue has been there since the introduction of x32. Reported-by: strace/tests/restart_syscall.test Reported-and-tested-by: Elvira Khabirova <lineprinter0@gmail.com> Signed-off-by: Dmitry V. Levin <ldv@altlinux.org> Cc: Elvira Khabirova <lineprinter0@gmail.com> Cc: stable@vger.kernel.org Link: http://lkml.kernel.org/r/20151130215436.GA25996@altlinux.org Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2015-12-04livepatch: Cleanup module page permission changesJosh Poimboeuf
Calling set_memory_rw() and set_memory_ro() for every iteration of the loop in klp_write_object_relocations() is messy, inefficient, and error-prone. Change all the read-only pages to read-write before the loop and convert them back to read-only again afterwards. Suggested-by: Miroslav Benes <mbenes@suse.cz> Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com> Reviewed-by: Petr Mladek <pmladek@suse.com> Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2015-12-04module: use a structure to encapsulate layout.Rusty Russell
Makes it easier to handle init vs core cleanly, though the change is fairly invasive across random architectures. It simplifies the rbtree code immediately, however, while keeping the core data together in the same cachline (now iff the rbtree code is enabled). Acked-by: Peter Zijlstra <peterz@infradead.org> Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com> Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2015-12-04x86/mm/mtrr: Mark the 'range_new' static variable in mtrr_calc_range_state() ↵Rasmus Villemoes
as __initdata 'range_new' doesn't seem to be used after init. It is only passed to memset(), sum_ranges(), memcmp() and x86_get_mtrr_mem_range(), the latter of which also only passes it on to various *range* library functions. So mark it __initdata to free up an extra page after init. Its contents are wiped at every call to mtrr_calc_range_state(), so it being static is not about preserving state between calls, but simply to avoid a 4k+ stack frame. While there, add a comment explaining this and why it's safe. We could also mark nr_range_new as __initdata, but since it's just a single int and also doesn't carry state between calls (it is unconditionally assigned to before it is read), we might as well make it an ordinary automatic variable. Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Toshi Kani <toshi.kani@hp.com> Link: http://lkml.kernel.org/r/1449002691-20783-1-git-send-email-linux@rasmusvillemoes.dk Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-11-30libnvdimm, e820: skip module loading when no type-12Dan Williams
If there are no persistent memory ranges present then don't bother creating the platform device. Otherwise, it loads the full libnvdimm sub-system only to discover no resources present. Reported-by: Andy Lutomirski <luto@amacapital.net> Acked-by: Andy Lutomirski <luto@amacapital.net> Signed-off-by: Dan Williams <dan.j.williams@intel.com>