path: root/arch/x86/kernel
Age    Commit message    Author
2012-03-08x86/32: Print control and debug registers for kernel contextJan Beulich
While for a user mode register dump it may be reasonable to skip those (albeit x86-64 doesn't do so), for kernel mode dumps these should be printed to make sure all information possibly necessary for analysis is available. Signed-off-by: Jan Beulich <jbeulich@suse.com> Link: http://lkml.kernel.org/r/4F58889202000078000770E7@nat28.tlf.novell.com Signed-off-by: Ingo Molnar <mingo@elte.hu>
2012-03-07x86, mce: Fix rcu splat in drain_mcelog_buffer()Srivatsa S. Bhat
While booting, the following message is seen:

 [   21.665087] ===============================
 [   21.669439] [ INFO: suspicious RCU usage. ]
 [   21.673798] 3.2.0-0.0.0.28.36b5ec9-default #2 Not tainted
 [   21.681353] -------------------------------
 [   21.685864] arch/x86/kernel/cpu/mcheck/mce.c:194 suspicious rcu_dereference_index_check() usage!
 [   21.695013]
 [   21.695014] other info that might help us debug this:
 [   21.695016]
 [   21.703488]
 [   21.703489] rcu_scheduler_active = 1, debug_locks = 1
 [   21.710426] 3 locks held by modprobe/2139:
 [   21.714754] #0:  (&__lockdep_no_validate__){......}, at: [<ffffffff8133afd3>] __driver_attach+0x53/0xa0
 [   21.725020] #1:
 [   21.725323] ioatdma: Intel(R) QuickData Technology Driver 4.00
 [   21.733206]  (&__lockdep_no_validate__){......}, at: [<ffffffff8133afe1>] __driver_attach+0x61/0xa0
 [   21.743015] #2:  (i7core_edac_lock){+.+.+.}, at: [<ffffffffa01cfa5f>] i7core_probe+0x1f/0x5c0 [i7core_edac]
 [   21.753708]
 [   21.753709] stack backtrace:
 [   21.758429] Pid: 2139, comm: modprobe Not tainted 3.2.0-0.0.0.28.36b5ec9-default #2
 [   21.768253] Call Trace:
 [   21.770838]  [<ffffffff810977cd>] lockdep_rcu_suspicious+0xcd/0x100
 [   21.777366]  [<ffffffff8101aa41>] drain_mcelog_buffer+0x191/0x1b0
 [   21.783715]  [<ffffffff8101aa78>] mce_register_decode_chain+0x18/0x20
 [   21.790430]  [<ffffffffa01cf8db>] i7core_register_mci+0x2fb/0x3e4 [i7core_edac]
 [   21.798003]  [<ffffffffa01cfb14>] i7core_probe+0xd4/0x5c0 [i7core_edac]
 [   21.804809]  [<ffffffff8129566b>] local_pci_probe+0x5b/0xe0
 [   21.810631]  [<ffffffff812957c9>] __pci_device_probe+0xd9/0xe0
 [   21.816650]  [<ffffffff813362e4>] ? get_device+0x14/0x20
 [   21.822178]  [<ffffffff81296916>] pci_device_probe+0x36/0x60
 [   21.828061]  [<ffffffff8133ac8a>] really_probe+0x7a/0x2b0
 [   21.833676]  [<ffffffff8133af23>] driver_probe_device+0x63/0xc0
 [   21.839868]  [<ffffffff8133b01b>] __driver_attach+0x9b/0xa0
 [   21.845718]  [<ffffffff8133af80>] ? driver_probe_device+0xc0/0xc0
 [   21.852027]  [<ffffffff81339168>] bus_for_each_dev+0x68/0x90
 [   21.857876]  [<ffffffff8133aa3c>] driver_attach+0x1c/0x20
 [   21.863462]  [<ffffffff8133a64d>] bus_add_driver+0x16d/0x2b0
 [   21.869377]  [<ffffffff8133b6dc>] driver_register+0x7c/0x160
 [   21.875220]  [<ffffffff81296bda>] __pci_register_driver+0x6a/0xf0
 [   21.881494]  [<ffffffffa01fe000>] ? 0xffffffffa01fdfff
 [   21.886846]  [<ffffffffa01fe047>] i7core_init+0x47/0x1000 [i7core_edac]
 [   21.893737]  [<ffffffff810001ce>] do_one_initcall+0x3e/0x180
 [   21.899670]  [<ffffffff810a9b95>] sys_init_module+0xc5/0x220
 [   21.905542]  [<ffffffff8149bc39>] system_call_fastpath+0x16/0x1b

Fix this by using ACCESS_ONCE() instead of rcu_dereference_check_mce() over mcelog.next. Since the access to each entry is controlled by the ->finished field, ACCESS_ONCE() should work just fine. An rcu_dereference is unnecessary here.

Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Suggested-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Tony Luck <tony.luck@intel.com>
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
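A minimal sketch of the fix described above, assuming the mce.c structures of that era (mcelog.next, per-entry ->finished flags, the x86_mce_decoder_chain notifier); this is an illustration, not the verbatim diff, and the real code also bounds the wait:

  static void drain_mcelog_buffer(void)
  {
  	unsigned int next, i, prev = 0;

  	/* Plain ACCESS_ONCE() replaces rcu_dereference_check_mce():
  	 * no RCU read-side protection is needed because each entry
  	 * is only consumed once its ->finished flag is set. */
  	next = ACCESS_ONCE(mcelog.next);

  	do {
  		for (i = prev; i < next; i++) {
  			/* wait until the entry is fully written
  			 * (the real code also times out here) */
  			while (!mcelog.entry[i].finished)
  				cpu_relax();

  			/* feed the record to registered decoders */
  			atomic_notifier_call_chain(&x86_mce_decoder_chain,
  						   0, &mcelog.entry[i]);
  		}
  		prev = next;
  		next = ACCESS_ONCE(mcelog.next);
  	} while (prev != next);
  }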
2012-03-06x86/kprobes: Split out optprobe related code to kprobes-opt.cMasami Hiramatsu
Split out optprobe-related code to arch/x86/kernel/kprobes-opt.c for maintainability.

Signed-off-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Suggested-by: Ingo Molnar <mingo@elte.hu>
Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Cc: yrl.pp-manager.tt@hitachi.com
Cc: systemtap@sourceware.org
Cc: anderson@redhat.com
Link: http://lkml.kernel.org/r/20120305133222.5982.54794.stgit@localhost.localdomain
[ Tidied up the code a tiny bit ]
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2012-03-06x86/kprobes: Fix a bug which can modify kernel code permanentlyMasami Hiramatsu
Fix a bug in kprobes which can modify kernel code permanently at run-time. As a result, the kernel can crash when it executes the modified code.

This bug can happen when we put two probes close enough together and the first probe is optimized. When the second probe is set up, it copies a byte which has already been modified by the first probe, and executes that byte when the probe is hit. Even worse, when the first and the second probe are later removed one after the other, the second probe writes back the copied (modified) instruction.

To fix this bug, kprobes now always recovers the original code and copies the first byte from the recovered instruction.

Signed-off-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Cc: yrl.pp-manager.tt@hitachi.com
Cc: systemtap@sourceware.org
Cc: anderson@redhat.com
Link: http://lkml.kernel.org/r/20120305133215.5982.31991.stgit@localhost.localdomain
Signed-off-by: Ingo Molnar <mingo@elte.hu>
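A sketch of the resulting copy logic, under the assumption that __copy_instruction() is the helper that recovers the original bytes (its exact signature and error handling may differ from the actual patch):

  static int arch_copy_kprobe(struct kprobe *p)
  {
  	/* __copy_instruction() recovers the original instruction
  	 * even if another probe has already patched the live text. */
  	if (!__copy_instruction(p->ainsn.insn, p->addr))
  		return -EINVAL;

  	/* Save the first byte from the *recovered* copy, not from
  	 * the (possibly already-modified) live kernel text, so that
  	 * removing the probes in any order writes the true original
  	 * byte back. */
  	p->opcode = p->ainsn.insn[0];

  	return 0;
  }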
2012-03-06x86/kprobes: Fix instruction recovery on optimized pathMasami Hiramatsu
Current probed-instruction recovery expects that only the breakpoint instruction modifies instructions. However, since kprobes jump optimization can replace original instructions with a jump, that expectation is not enough, and it may cause instruction decoding failure on a function where an optimized probe already exists.

This bug can be reproduced easily as below:

1) find a target function address (any kprobe-able function is OK)

 $ grep __secure_computing /proc/kallsyms
 ffffffff810c19d0 T __secure_computing

2) decode the function

 $ objdump -d vmlinux --start-address=0xffffffff810c19d0 --stop-address=0xffffffff810c19eb

 vmlinux:     file format elf64-x86-64

 Disassembly of section .text:

 ffffffff810c19d0 <__secure_computing>:
 ffffffff810c19d0:	55                    	push   %rbp
 ffffffff810c19d1:	48 89 e5              	mov    %rsp,%rbp
 ffffffff810c19d4:	e8 67 8f 72 00        	callq  ffffffff817ea940 <mcount>
 ffffffff810c19d9:	65 48 8b 04 25 40 b8  	mov    %gs:0xb840,%rax
 ffffffff810c19e0:	00 00
 ffffffff810c19e2:	83 b8 88 05 00 00 01  	cmpl   $0x1,0x588(%rax)
 ffffffff810c19e9:	74 05                 	je     ffffffff810c19f0 <__secure_computing+0x20>

3) put a kprobe-event at an optimize-able place, where no call/jump instruction lies within the 5 bytes.

 $ su -
 # cd /sys/kernel/debug/tracing
 # echo p __secure_computing+0x9 > kprobe_events

4) enable it and check it is optimized.

 # echo 1 > events/kprobes/p___secure_computing_9/enable
 # cat ../kprobes/list
 ffffffff810c19d9  k  __secure_computing+0x9    [OPTIMIZED]

5) put another kprobe on an instruction after the previous probe in the same function.

 # echo p __secure_computing+0x12 >> kprobe_events
 bash: echo: write error: Invalid argument
 # dmesg | tail -n 1
 [ 1666.500016] Probing address(0xffffffff810c19e2) is not an instruction boundary.

6) however, if the kprobes optimization is disabled, it works.

 # echo 0 > /proc/sys/debug/kprobes-optimization
 # cat ../kprobes/list
 ffffffff810c19d9  k  __secure_computing+0x9
 # echo p __secure_computing+0x12 >> kprobe_events
 (no error)

This is because kprobes does not recover an instruction which has been overwritten with a relative jump by another kprobe when determining instruction boundaries; it only recovers breakpoint instructions. This patch fixes kprobes to recover such instructions as well.

With this fix:

 # echo p __secure_computing+0x9 > kprobe_events
 # echo 1 > events/kprobes/p___secure_computing_9/enable
 # cat ../kprobes/list
 ffffffff810c1aa9  k  __secure_computing+0x9    [OPTIMIZED]
 # echo p __secure_computing+0x12 >> kprobe_events
 # cat ../kprobes/list
 ffffffff810c1aa9  k  __secure_computing+0x9    [OPTIMIZED]
 ffffffff810c1ab2  k  __secure_computing+0x12    [DISABLED]

Changes in v4:
 - Fix a bug to ensure optimized probe is really optimized by jump.
 - Remove kprobe_optready() dependency.
 - Clean up code for preparing optprobe separation.

Changes in v3:
 - Fix a build error when CONFIG_OPTPROBE=n. (Thanks, Ingo!)
   To fix the error, split the optprobe instruction recovery path from the kprobes path.
 - Clean up comments/styles.

Changes in v2:
 - Fix a bug to recover the original instruction address in RIP-relative instruction fixup.
 - Moved on to tip/master.

Signed-off-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Cc: yrl.pp-manager.tt@hitachi.com
Cc: systemtap@sourceware.org
Cc: anderson@redhat.com
Link: http://lkml.kernel.org/r/20120305133209.5982.36568.stgit@localhost.localdomain
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2012-03-05x32: Add ptrace for x32H.J. Lu
X32 ptrace is a hybrid of 64-bit ptrace and compat ptrace, with 32-bit addresses and longs. It uses 64-bit ptrace to access the full 64-bit registers. PTRACE_PEEKUSR and PTRACE_POKEUSR are only allowed to access segment and debug registers. PTRACE_PEEKUSR returns the lower 32 bits and PTRACE_POKEUSR zero-extends the 32-bit value to 64 bits. This works since the upper 32 bits of the segment and debug registers of an x32 process are always zero. GDB only uses PTRACE_PEEKUSR and PTRACE_POKEUSR to access segment and debug registers.

[ hpa: changed TIF_X32 test to use !is_ia32_task() instead, and moved the system call number to the now-unused 521 slot. ]

Signed-off-by: "H.J. Lu" <hjl.tools@gmail.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Cc: Roland McGrath <roland@hack.frob.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Link: http://lkml.kernel.org/r/1329696488-16970-1-git-send-email-hpa@zytor.com
2012-03-05perf: Add callback to flush branch_stack on context switchStephane Eranian
With branch stack sampling, it is possible to filter by priv levels. In system-wide mode, that means it is possible to capture only user level branches. The builtin SW LBR filter needs to disassemble code based on LBR captured addresses. For that, it needs to know the task the addresses are associated with. Because of context switches, the content of the branch stack buffer may contain addresses from different tasks. We need a callback on context switch to either flush the branch stack or save it. This patch adds a new callback in struct pmu which is called during context switches. The callback is called only when necessary. That is when a system-wide context has, at least, one event which uses PERF_SAMPLE_BRANCH_STACK. The callback is never called for per-thread context. In this version, the Intel x86 code simply flushes (resets) the LBR on context switches (fills it with zeroes). Those zeroed branches are then filtered out by the SW filter. Signed-off-by: Stephane Eranian <eranian@google.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/1328826068-11713-11-git-send-email-eranian@google.com Signed-off-by: Ingo Molnar <mingo@elte.hu>
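A sketch of the shape this takes; the callback name matches the description above, while the Intel-side helper names (x86_pmu.lbr_nr, intel_pmu_lbr_reset) are assumptions based on the existing LBR code:

  struct pmu {
  	/* ... existing state and callbacks elided ... */

  	/*
  	 * New: invoked on context switch, and only when a system-wide
  	 * context has at least one event using PERF_SAMPLE_BRANCH_STACK;
  	 * never called for per-thread contexts.
  	 */
  	void (*flush_branch_stack)(void);
  };

  /* Intel x86 implementation: reset the LBR so stale branches from the
   * previous task are zero-filled and later dropped by the SW filter. */
  static void intel_pmu_flush_branch_stack(void)
  {
  	if (x86_pmu.lbr_nr)
  		intel_pmu_lbr_reset();
  }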
2012-03-05perf: Disable PERF_SAMPLE_BRANCH_* when not supportedStephane Eranian
PERF_SAMPLE_BRANCH_* is disabled for:
- SW events (sw counters, tracepoints)
- HW breakpoints
- ALL but Intel x86 architecture
- AMD64 processors

Signed-off-by: Stephane Eranian <eranian@google.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1328826068-11713-10-git-send-email-eranian@google.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2012-03-05perf/x86: Add LBR software filter support for Intel CPUsStephane Eranian
This patch adds an internal software filter to complement the (optional) LBR hardware filter.

The software filter is necessary:
- as a substitute when there is no HW LBR filter (e.g., Atom, Core)
- to complement the HW LBR filter in case of errata (e.g., Nehalem/Westmere)
- to provide finer-grained filtering (e.g., all processors)

Sometimes the LBR HW filter cannot distinguish between two types of branches. For instance, to capture syscalls as CALLs, it is necessary to enable the LBR_FAR filter, which will also capture JMP instructions. Thus, a second pass is necessary to filter those out; this is what the SW filter can do.

The SW filter is built on top of the internal x86 disassembler. It is a best-effort filter, especially for user level code. It is subject to the availability of the text page of the program.

The SW filter is enabled on all Intel processors. It is bypassed when the user is capturing all branches at all priv levels.

Signed-off-by: Stephane Eranian <eranian@google.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1328826068-11713-9-git-send-email-eranian@google.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2012-03-05perf/x86: Implement PERF_SAMPLE_BRANCH for Intel CPUsStephane Eranian
This patch implements PERF_SAMPLE_BRANCH support for Intel x86 processors. It connects PERF_SAMPLE_BRANCH to the actual LBR. The patch adds the hooks in the PMU irq handler to save the LBR on counter overflow for both regular and PEBS modes.

Signed-off-by: Stephane Eranian <eranian@google.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1328826068-11713-8-git-send-email-eranian@google.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2012-03-05perf/x86: Disable LBR support for older Intel Atom processorsStephane Eranian
The patch adds a restriction for Intel Atom LBR support. Only stepping 10 (PineView) and more recent models are supported. Older models do not have a functional LBR: their LBR does not freeze on PMU interrupt, which makes the LBR unusable in the context of perf_events.

Signed-off-by: Stephane Eranian <eranian@google.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1328826068-11713-7-git-send-email-eranian@google.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2012-03-05perf/x86: Add Intel LBR mappings for PERF_SAMPLE_BRANCH filtersStephane Eranian
This patch adds the mappings from the generic PERF_SAMPLE_BRANCH_* filters to the actual Intel x86 LBR filters, wherever they exist.

Signed-off-by: Stephane Eranian <eranian@google.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1328826068-11713-6-git-send-email-eranian@google.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2012-03-05perf/x86: Sync branch stack sampling with precise_samplingStephane Eranian
If precise sampling is enabled on Intel x86, then perf_event uses PEBS. To correct for the off-by-one error of PEBS, perf_event uses the LBR when precise_sample > 1.

On Intel x86, PERF_SAMPLE_BRANCH_STACK is implemented using the LBR, therefore both features must be coordinated as they may not configure the LBR the same way. For PEBS, the LBR needs to capture all branches at the priv level of the associated event.

This patch checks that the branch type and priv level of BRANCH_STACK are compatible with those of the PEBS LBR requirement, thereby allowing:

 $ perf record -b any,u -e instructions:upp ....

But:

 $ perf record -b any_call,u -e instructions:upp

is not possible.

Signed-off-by: Stephane Eranian <eranian@google.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1328826068-11713-5-git-send-email-eranian@google.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2012-03-05perf/x86: Add Intel LBR sharing logicStephane Eranian
The Intel LBR on some recent processors is capable of filtering branches by type. The filter is configurable via the LBR_SELECT MSR. There are limitations on how this register can be used.

On Nehalem/Westmere, LBR_SELECT is shared by the two HT threads when HT is on. It is private to each core when HT is off. On Sandy Bridge, the LBR_SELECT register is private to each thread when HT is on, and private to each core when HT is off.

The kernel must manage the sharing of LBR_SELECT. It allows multiple users on the same logical CPU to use LBR_SELECT as long as they program it with the same value. Across sibling CPUs (HT threads), the same restriction applies on NHM/WSM.

This patch implements this sharing logic by leveraging the mechanism put in place for managing the offcore_response shared MSR. We modify __intel_shared_reg_get_constraints() to cause x86_get_event_constraint() to be called, because the LBR may be associated with events that may be counter-constrained.

Signed-off-by: Stephane Eranian <eranian@google.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1328826068-11713-4-git-send-email-eranian@google.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2012-03-05perf/x86: Add Intel LBR MSR definitionsStephane Eranian
This patch adds the LBR definitions for NHM/WSM/SNB and Core. It also adds the definitions for the architected LBR MSRs: LBR_SELECT, LBR_TOS.

Signed-off-by: Stephane Eranian <eranian@google.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1328826068-11713-3-git-send-email-eranian@google.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2012-03-05perf: Add generic taken branch sampling supportStephane Eranian
This patch adds the ability to sample taken branches to the perf_event interface.

The ability to capture taken branches is very useful for all sorts of analysis. For instance: basic block profiling, call counts, statistical call graphs.

This new capability requires hardware assist and as such may not be available on all HW platforms. On Intel x86 it is implemented on top of the Last Branch Record (LBR) facility.

To enable taken branch sampling, the PERF_SAMPLE_BRANCH_STACK bit must be set in attr->sample_type.

Sampled taken branches may be filtered by type and/or priv levels. The patch adds a new field, called branch_sample_type, to the perf_event_attr structure. It contains a bitmask of filters to apply to the sampled taken branches.

Filters may be implemented in HW. If the HW filter does not exist or is not good enough, some arches may also implement a SW filter.

The following generic filters are currently defined:
- PERF_SAMPLE_USER
  only branches whose targets are at the user level
- PERF_SAMPLE_KERNEL
  only branches whose targets are at the kernel level
- PERF_SAMPLE_HV
  only branches whose targets are at the hypervisor level
- PERF_SAMPLE_ANY
  any type of branches (subject to priv level filters)
- PERF_SAMPLE_ANY_CALL
  any call branches (may incl. syscall on some arch)
- PERF_SAMPLE_ANY_RET
  any return branches (may incl. syscall returns on some arch)
- PERF_SAMPLE_IND_CALL
  indirect call branches

Obviously, filters may be combined. The priv level bits are optional. If not provided, the priv level of the associated event is used. It is possible to collect branches at a priv level different from that of the associated event. Use of the kernel and hv priv levels is subject to permissions and availability (hv).

The number of taken branch records present in each sample may vary based on HW, the type of sampled branches, and the executed code; therefore each sample carries the number of taken branch records it holds.

Signed-off-by: Stephane Eranian <eranian@google.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1328826068-11713-2-git-send-email-eranian@google.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
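For illustration, a user-space attr setup using this interface; the constant names follow the eventually-merged uapi (PERF_SAMPLE_BRANCH_* with priv-level bits), which is an assumption relative to the names quoted in this message:

  #include <linux/perf_event.h>
  #include <string.h>

  static void setup_branch_sampling(struct perf_event_attr *attr)
  {
  	memset(attr, 0, sizeof(*attr));
  	attr->type          = PERF_TYPE_HARDWARE;
  	attr->config        = PERF_COUNT_HW_CPU_CYCLES;
  	attr->sample_period = 100000;

  	/* request taken-branch records in each sample */
  	attr->sample_type = PERF_SAMPLE_IP | PERF_SAMPLE_BRANCH_STACK;

  	/* filter: any branch type, user-level targets only */
  	attr->branch_sample_type = PERF_SAMPLE_BRANCH_ANY |
  				   PERF_SAMPLE_BRANCH_USER;
  }

The attr is then passed to perf_event_open() as usual; each overflow sample carries a variable-length array of branch records.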
2012-03-05x86: Introduce x86_cpuinit.early_percpu_clock_init hookIgor Mammedov
When a kvm guest uses kvmclock, it may hang on vcpu hot-plug. This is caused by an overflow in pvclock_get_nsec_offset,

 u64 delta = tsc - shadow->tsc_timestamp;

which in turn is caused by undefined values from the percpu hv_clock that hasn't been initialized yet.

The uninitialized clock on the cpu being booted is accessed from:

 start_secondary
  -> smp_callin
    -> smp_store_cpu_info
      -> identify_secondary_cpu
        -> mtrr_ap_init
          -> mtrr_restore
            -> stop_machine_from_inactive_cpu
              -> queue_stop_cpus_work
              ...
                -> sched_clock
                  -> kvm_clock_read

which is well before the x86_cpuinit.setup_percpu_clockev call in start_secondary, where the percpu clock is initialized.

This patch introduces a hook that allows setting up/initializing the per_cpu clock early, avoiding an overflow due to reading:
 - undefined values
 - old values, if the cpu was offlined and then onlined again

Another possible early user of this clock source is ftrace, which accesses it to get timestamps for ring buffer entries. So if mtrr_ap_init is moved from identify_secondary_cpu to past x86_cpuinit.setup_percpu_clockev in start_secondary, ftrace may cause the same overflow/hang on cpu hot-plug anyway.

More complete description of the problem:

 https://lkml.org/lkml/2012/2/2/101

Credits to Marcelo Tosatti <mtosatti@redhat.com> for the hook idea.

Acked-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Igor Mammedov <imammedo@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
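A sketch of the hook and its KVM user; the struct layout and the kvm_setup_secondary_clock() initializer name are assumptions based on the description above, not quoted from the patch:

  struct x86_cpuinit_ops {
  	/* new hook: runs early in start_secondary, before anything can
  	 * reach sched_clock()/kvm_clock_read() on the booting cpu */
  	void (*early_percpu_clock_init)(void);
  	void (*setup_percpu_clockev)(void);
  };

  static void __init kvmclock_register_hooks(void)
  {
  	/* make sure the per-cpu hv_clock area is valid before the
  	 * first sched_clock() call on a hot-plugged vcpu */
  	x86_cpuinit.early_percpu_clock_init = kvm_setup_secondary_clock;
  }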
2012-03-05Merge branch 'perf/urgent' into perf/coreIngo Molnar
Conflicts:
	tools/perf/builtin-record.c
	tools/perf/builtin-top.c
	tools/perf/perf.h
	tools/perf/util/top.h

Merge reason: resolve these cherry-picking conflicts.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2012-03-04Merge branch 'pm-sleep'Rafael J. Wysocki
* pm-sleep:
  PM / Freezer: Remove references to TIF_FREEZE in comments
  PM / Sleep: Add more wakeup source initialization routines
  PM / Hibernate: Enable usermodehelpers in hibernate() error path
  PM / Sleep: Make __pm_stay_awake() delete wakeup source timers
  PM / Sleep: Fix race conditions related to wakeup source timer function
  PM / Sleep: Fix possible infinite loop during wakeup source destruction
  PM / Hibernate: print physical addresses consistently with other parts of kernel
  PM: Add comment describing relationships between PM callbacks to pm.h
  PM / Sleep: Drop suspend_stats_update()
  PM / Sleep: Make enter_state() in kernel/power/suspend.c static
  PM / Sleep: Unify kerneldoc comments in kernel/power/suspend.c
  PM / Sleep: Remove unnecessary label from suspend_freeze_processes()
  PM / Sleep: Do not check wakeup too often in try_to_freeze_tasks()
  PM / Sleep: Initialize wakeup source locks in wakeup_source_add()
  PM / Hibernate: Refactor and simplify freezer_test_done
  PM / Hibernate: Thaw kernel threads in hibernation_snapshot() in error/test path
  PM / Freezer / Docs: Document the beauty of freeze/thaw semantics
  PM / Suspend: Avoid code duplication in suspend statistics update
  PM / Sleep: Introduce generic callbacks for new device PM phases
  PM / Sleep: Introduce "late suspend" and "early resume" of devices
2012-03-02perf/x86/kvm: Fix Host-Only/Guest-Only counting with SVM disabledJoerg Roedel
It turned out that a performance counter on AMD does not count at all when the GO or HO bit is set in the control register and SVM is disabled in EFER. This patch works around this issue by masking out the HO bit in the performance counter control register when SVM is not enabled. The GO bit is not touched because it is only set when the user wants to count in guest-mode only. So when SVM is disabled the counter should not run at all and the not-counting is the intended behaviour. Signed-off-by: Joerg Roedel <joerg.roedel@amd.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Avi Kivity <avi@redhat.com> Cc: Stephane Eranian <eranian@google.com> Cc: David Ahern <dsahern@gmail.com> Cc: Gleb Natapov <gleb@redhat.com> Cc: Robert Richter <robert.richter@amd.com> Cc: stable@vger.kernel.org # v3.2 Link: http://lkml.kernel.org/r/1330523852-19566-1-git-send-email-joerg.roedel@amd.com Signed-off-by: Ingo Molnar <mingo@elte.hu>
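A sketch of the masking, assuming the AMD64_EVENTSEL_HOSTONLY/GUESTONLY bit definitions; the helper name here is hypothetical:

  static u64 amd_pmu_adjust_config(u64 config)
  {
  	u64 efer;

  	rdmsrl(MSR_EFER, efer);

  	/* With SVM disabled in EFER, a counter with the Host-Only bit
  	 * set would never count, so mask HO out.  The Guest-Only bit
  	 * is deliberately left alone: counting only in guest mode
  	 * while SVM is off should indeed count nothing. */
  	if (!(efer & EFER_SVME))
  		config &= ~AMD64_EVENTSEL_HOSTONLY;

  	return config;
  }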
2012-03-01x86, mtrr: Use explicit sizing and padding for the 64-bit ioctlsH. Peter Anvin
Specify the data structures for the 64-bit ioctls with explicit sizing and padding so that the x32 kernel will correctly use the 64-bit forms of these ioctls. Note that these ioctls are bogus in both forms on both 32 and 64 bits; even on 64 bits the maximum MTRR size is only 44 bits long. Note that nothing really is supposed to use these ioctls and that the preferred interface is text strings on /proc/mtrr, or better yet, nothing at all (use /sys/bus/pci/devices/*/resource*_wc for write combining; that uses PAT not MTRRs.) Signed-off-by: H. Peter Anvin <hpa@linux.intel.com> Cc: H. J. Lu <hjl.tools@gmail.com> Tested-by: Nitin A. Kamble <nitin.a.kamble@intel.com> Link: http://lkml.kernel.org/n/tip-vwvnlu3hjmtkwvij4qxtm90l@git.kernel.org Signed-off-by: H. Peter Anvin <hpa@zytor.com>
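The explicitly-sized 64-bit layout looks roughly like this (a sketch in the spirit of the patch; the __u64/__u32 pairing and the explicit trailing pad leave no implicit compiler padding, so the x32 and 64-bit ABIs coincide):

  #include <linux/types.h>

  struct mtrr_sentry {
  	__u64 base;	/* Base address                     */
  	__u32 size;	/* Size of region                   */
  	__u32 type;	/* Type of region                   */
  };

  struct mtrr_gentry {
  	__u64 base;	/* Base address                     */
  	__u32 size;	/* Size of region                   */
  	__u32 regnum;	/* Register number                  */
  	__u32 type;	/* Type of region                   */
  	__u32 _pad;	/* Explicit pad, previously implicit */
  };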
2012-03-01sched/rt: Use schedule_preempt_disabled()Thomas Gleixner
Coccinelle-based conversion.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/n/tip-24swm5zut3h9c4a6s46x8rws@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2012-02-29bug.h: add include of it to various implicit C usersPaul Gortmaker
With bug.h currently living right in linux/kernel.h, there are files that use BUG_ON and friends but are not including the header explicitly. Fix them up so we can remove its presence from the kernel.h file.

Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2012-02-28x86: relocate get/set debugreg fcns to include/asm/debugreg.hPaul Gortmaker
Since we already have a debugreg.h header file, move the assoc. get/set functions to it. In addition to it being the logical home for them, it has a secondary advantage. The functions that are moved use BUG(). So we really need to have linux/bug.h in scope. But asm/processor.h is used about 600 times, vs. only about 15 for debugreg.h -- so adding bug.h to the latter reduces the amount of time we'll be processing it during a compile. Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com> Acked-by: Ingo Molnar <mingo@elte.hu> CC: Thomas Gleixner <tglx@linutronix.de> CC: "H. Peter Anvin" <hpa@zytor.com>
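For context, the moved accessors are essentially of this shape (a sketch; the BUG() in the default case is what makes linux/bug.h necessary in debugreg.h):

  #include <linux/bug.h>

  static inline unsigned long native_get_debugreg(int regno)
  {
  	unsigned long val = 0;

  	switch (regno) {
  	case 0:
  		asm("mov %%db0, %0" : "=r" (val));
  		break;
  	case 7:
  		asm("mov %%db7, %0" : "=r" (val));
  		break;
  	/* cases 1-3 and 6 follow the same pattern */
  	default:
  		BUG();	/* invalid debug register number */
  	}
  	return val;
  }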
2012-02-28Merge branch 'tip/x86/urgent' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace into x86/asmIngo Molnar
2012-02-28Merge branch 'linus' into x86/asmIngo Molnar
Sync up the latest NMI fixes. Signed-off-by: Ingo Molnar <mingo@elte.hu>
2012-02-27Merge branch 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tipLinus Torvalds
* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/mce/AMD: Fix UP build error
  x86: Specify a size for the cmp in the NMI handler
  x86/nmi: Test saved %cs in NMI to determine nested NMI case
  x86/amd: Fix L1i and L2 cache sharing information for AMD family 15h processors
  x86/microcode: Remove noisy AMD microcode warning
2012-02-27x86-64: Fix CFI data for common_interrupt()Mark Wielaard
Commit eab9e6137f23 ("x86-64: Fix CFI data for interrupt frames") introduced a DW_CFA_def_cfa_expression in the SAVE_ARGS_IRQ macro. To later define the CFA using a simple register+offset rule, both register and offset need to be supplied. Just using CFI_DEF_CFA_REGISTER leaves the offset undefined. So use CFI_DEF_CFA with reg+off explicitly at the end of common_interrupt.

Signed-off-by: Mark Wielaard <mjw@redhat.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
Link: http://lkml.kernel.org/r/1330079527-30711-1-git-send-email-mjw@redhat.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2012-02-27x86/time: Eliminate unused irq0_irqs counterJan Beulich
As of v2.6.38 this counter is being maintained without ever being read. Signed-off-by: Jan Beulich <jbeulich@suse.com> Link: http://lkml.kernel.org/r/4F4787930200007800074A10@nat28.tlf.novell.com Signed-off-by: Ingo Molnar <mingo@elte.hu>
2012-02-27x86: Properly _init-annotate NMI selftest codeJan Beulich
After all, this code is being run once at boot only (if configured in at all). Signed-off-by: Jan Beulich <jbeulich@suse.com> Acked-by: Don Zickus <dzickus@redhat.com> Link: http://lkml.kernel.org/r/4F478C010200007800074A3D@nat28.tlf.novell.com Signed-off-by: Ingo Molnar <mingo@elte.hu>
2012-02-26x86_64: Record stack pointer before task execution beginsSiddhesh Poyarekar
task->thread.usersp is unusable immediately after a binary is exec()'d, until it undergoes a context switch cycle. The start_thread() function called during execve() saves the stack pointer into pt_regs and into old_rsp, but fails to record it into task->thread.usersp.

Because of this, KSTK_ESP(task) returns an incorrect value for a 64-bit program until the task is switched out and back in, since switch_to swaps the %rsp values in and out of task->thread.usersp.

Signed-off-by: Siddhesh Poyarekar <siddhesh.poyarekar@gmail.com>
Link: http://lkml.kernel.org/r/1330273075-2949-1-git-send-email-siddhesh.poyarekar@gmail.com
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
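The fix amounts to one extra store in start_thread_common(), sketched here under the v3.3-era process_64.c shape (percpu_write() was the per-cpu accessor of the day):

  static void start_thread_common(struct pt_regs *regs, unsigned long new_ip,
  				unsigned long new_sp,
  				unsigned int _cs, unsigned int _ss, unsigned int _ds)
  {
  	loadsegment(fs, 0);
  	loadsegment(es, _ds);
  	loadsegment(ds, _ds);
  	load_gs_index(0);

  	/* new: record the stack pointer so KSTK_ESP() is correct
  	 * even before the first context switch */
  	current->thread.usersp	= new_sp;

  	regs->ip		= new_ip;
  	regs->sp		= new_sp;
  	percpu_write(old_rsp, new_sp);
  	regs->cs		= _cs;
  	regs->ss		= _ss;
  	regs->flags		= X86_EFLAGS_IF;
  }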
2012-02-25x32: Only clear TIF_X32 flag onceBobby Powers
Commits bb212724 and d1a797f3 both added a call to clear_thread_flag(TIF_X32) under set_personality_64bit() - only one is needed. Signed-off-by: Bobby Powers <bobbypowers@gmail.com> Link: http://lkml.kernel.org/r/1330228774-24223-1-git-send-email-bobbypowers@gmail.com Signed-off-by: H. Peter Anvin <hpa@zytor.com>
2012-02-25x32: Make sure TS_COMPAT is cleared for x32 tasksBobby Powers
If a process has a non-x32 ia32 personality and changes to x32, the process would keep its TS_COMPAT flag. x32 uses the presence of the x32 flag on a syscall to determine compat status, so make sure TS_COMPAT is cleared. Signed-off-by: Bobby Powers <bobbypowers@gmail.com> Link: http://lkml.kernel.org/r/1330230338-25077-1-git-send-email-bobbypowers@gmail.com Signed-off-by: H. Peter Anvin <hpa@zytor.com>
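A sketch of the personality switch with the fix applied; the helper follows the kernel's set_personality_ia32(), which also covers x32, but treat the surrounding details as assumptions:

  static void set_personality_ia32(bool x32)
  {
  	/* common 32-bit-address setup elided */

  	if (x32) {
  		clear_thread_flag(TIF_IA32);
  		set_thread_flag(TIF_X32);
  		/* the fix: a TS_COMPAT left over from a prior ia32
  		 * personality would misclassify x32 syscalls */
  		current_thread_info()->status &= ~TS_COMPAT;
  	} else {
  		set_thread_flag(TIF_IA32);
  		clear_thread_flag(TIF_X32);
  		current_thread_info()->status |= TS_COMPAT;
  	}
  }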
2012-02-24PCI: Use class for quirk for via_no_dacYinghai Lu
Signed-off-by: Yinghai Lu <yinghai@kernel.org> Signed-off-by: Jesse Barnes <jbarnes@virtuousgeek.org>
2012-02-24x86: Fix the NMI nesting commentsSteven Rostedt
Some of the comments for the nesting NMI algorithm were stale and still referred to prototypes that were tried first. I also updated the comments to make the flow of the code a little easier to understand. It definitely needs the documentation.

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2012-02-24x86-64: Improve insn scheduling in SAVE_ARGS_IRQJan Beulich
In one case, use an address register that was computed earlier (and with a simpler instruction), thus reducing the risk of a stall. In the second case, eliminate a branch by using a conditional move (as is already done in call_softirq and xen_do_hypervisor_callback). Signed-off-by: Jan Beulich <jbeulich@suse.com> Link: http://lkml.kernel.org/r/4F4788A50200007800074A26@nat28.tlf.novell.com Reviewed-by: Andi Kleen <ak@linux.intel.com> Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2012-02-24x86-64: Fix CFI annotations for NMI nesting codeJan Beulich
The saving and restoring of %rdx wasn't annotated at all, and the jumping over sections where state gets partly restored wasn't handled either.

Further, by folding the pushing of the previous frame in repeat_nmi into that which so far was immediately preceding restart_nmi (after moving the restore of %rdx ahead of that, since it doesn't get used anymore when pushing prior frames), annotations of the replicated frame creations can be made consistent too.

v2: Fully fold repeat_nmi into the normal code flow (adding a single redundant instruction to the "normal" code path), thus retaining the special protection of all instructions between repeat_nmi and end_repeat_nmi.

Link: http://lkml.kernel.org/r/4F478B630200007800074A31@nat28.tlf.novell.com
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2012-02-24Merge tag 'mce-recovery-for-tip' of git://git.kernel.org/pub/scm/linux/kernel/git/ras/ras into x86/mceIngo Molnar
Add symbolic defines for architectural MCACOD constants
2012-02-24static keys: Introduce 'struct static_key', static_key_true()/false() and static_key_slow_[inc|dec]()Ingo Molnar
So here's a boot-tested patch on top of Jason's series that does all the cleanups I talked about and turns jump labels into a more intuitive to use facility. It should also address the various misconceptions and confusions that surround jump labels.

Typical usage scenarios:

  #include <linux/static_key.h>

  struct static_key key = STATIC_KEY_INIT_TRUE;

  if (static_key_false(&key))
          do unlikely code
  else
          do likely code

Or:

  if (static_key_true(&key))
          do likely code
  else
          do unlikely code

The static key is modified via:

  static_key_slow_inc(&key);
  ...
  static_key_slow_dec(&key);

The 'slow' prefix makes it abundantly clear that this is an expensive operation.

I've updated all in-kernel code to use this everywhere. Note that I (intentionally) have not pushed the rename blindly through to the lowest levels: the actual jump-label patching arch facility should be named like that, so we want to decouple jump labels from the static-key facility a bit.

On non-jump-label enabled architectures static keys default to likely()/unlikely() branches.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Acked-by: Jason Baron <jbaron@redhat.com>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
Cc: a.p.zijlstra@chello.nl
Cc: mathieu.desnoyers@efficios.com
Cc: davem@davemloft.net
Cc: ddaney.cavm@gmail.com
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/20120222085809.GA26397@elte.hu
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2012-02-23x86, efi: Allow basic init with mixed 32/64-bit efi/kernelOlof Johansson
Traditionally the kernel has refused to set up EFI at all if there's been a mismatch in 32/64-bit mode between EFI and the kernel.

On some platforms that boot natively through EFI (Chrome OS being one), we still need to get at least some of the static data such as memory configuration out of EFI. Runtime services aren't as critical, and it's a significant amount of work to implement switching between the operating modes to call between kernel and firmware for these cases. So I'm ignoring it for now.

v5:
* Fixed some printk strings based on feedback
* Renamed 32/64-bit specific types to not have _ prefix
* Fixed bug in printout of efi runtime disablement

v4:
* Some of the earlier cleanup was accidentally reverted by this patch, fixed.
* Reworded some messages to not have to line wrap printk strings

v3:
* Reorganized to a series of patches to make it easier to review, and do some of the cleanups I had left out before.

v2:
* Added graceful error handling for 32-bit kernel that gets passed EFI data above 4GB.
* Removed some warnings that were missed in first version.

Signed-off-by: Olof Johansson <olof@lixom.net>
Link: http://lkml.kernel.org/r/1329081869-20779-6-git-send-email-olof@lixom.net
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
2012-02-23irq_domain/x86: Convert x86 (embedded) to use common irq_domainGrant Likely
This patch removes the x86-specific definition of irq_domain and replaces it with the common implementation. Signed-off-by: Grant Likely <grant.likely@secretlab.ca> Acked-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Rob Herring <rob.herring@calxeda.com> Cc: Thomas Gleixner <tglx@linutronix.de>
2012-02-22x86/mce: Fix return value of mce_chrdev_read() when erst is disabledNaoya Horiguchi
Current kernel MCE code reads ERST at the first read of /dev/mcelog (for example when mcelogd starts), even if the system does not support ERST, which results in a fake "no such device" message (as described in [1]). This problem is not critical, but can confuse system admins.

This patch fixes it by filtering the return value from the lower (ACPI) layer.

[1] http://thread.gmane.org/gmane.linux.kernel/1060250

Reported-by: Jon Masters <jonathan@jonmasters.org>
Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Huang Ying <ying.huang@intel.com>
Link: https://lkml.org/lkml/2012/1/23/299
Signed-off-by: Tony Luck <tony.luck@intel.com>
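A sketch of the filtering, assuming the apei_read_mce() helper from mce-apei.c; the wrapper name is hypothetical:

  static int mce_read_apei_filtered(struct mce *m, u64 *record_id)
  {
  	int rc = apei_read_mce(m, record_id);

  	/* -ENODEV from the ACPI layer only means the platform has no
  	 * ERST; report "nothing to read" instead of surfacing a
  	 * confusing "no such device" error to /dev/mcelog readers. */
  	if (rc == -ENODEV)
  		return 0;

  	return rc;
  }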
2012-02-22x86/mce: Convert static array of pointers to per-cpu variablesGreg Kroah-Hartman
When I previously fixed up the mce_device code, I used a static array of the pointers. It was (rightfully) pointed out to me that I should be using the per_cpu code instead. This patch converts the code over to that structure, moving the variable back into the per_cpu area, like it used to be for 3.2 and earlier. Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de> Reviewed-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com> Link: https://lkml.org/lkml/2012/1/27/165 Signed-off-by: Tony Luck <tony.luck@intel.com>
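The conversion, sketched with the names suggested by the description above:

  /* before (illustrative): static struct device *mce_device[NR_CPUS]; */

  /* after: a per-cpu variable, as in v3.2 and earlier */
  static DEFINE_PER_CPU(struct device *, mce_device);

  static void mce_device_example(unsigned int cpu)
  {
  	/* look up this CPU's mce device */
  	struct device *dev = per_cpu(mce_device, cpu);

  	if (dev)
  		dev_info(dev, "mce device present for cpu %u\n", cpu);
  }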
2012-02-22x86: Remove some noise from boot log when starting cpusLuck, Tony
Printing the "start_ip" for every secondary cpu is very noisy on a large system - and doesn't add any value. Drop this message. Console log before: Booting Node 0, Processors #1 smpboot cpu 1: start_ip = 96000 #2 smpboot cpu 2: start_ip = 96000 #3 smpboot cpu 3: start_ip = 96000 #4 smpboot cpu 4: start_ip = 96000 ... #31 smpboot cpu 31: start_ip = 96000 Brought up 32 CPUs Console log after: Booting Node 0, Processors #1 #2 #3 #4 #5 #6 #7 Ok. Booting Node 1, Processors #8 #9 #10 #11 #12 #13 #14 #15 Ok. Booting Node 0, Processors #16 #17 #18 #19 #20 #21 #22 #23 Ok. Booting Node 1, Processors #24 #25 #26 #27 #28 #29 #30 #31 Brought up 32 CPUs Acked-by: Borislav Petkov <bp@amd64.org> Signed-off-by: Tony Luck <tony.luck@intel.com> Link: http://lkml.kernel.org/r/4f452eb42507460426@agluck-desktop.sc.intel.com Signed-off-by: H. Peter Anvin <hpa@zytor.com>
2012-02-22x86/mce/AMD: Fix UP build errorBorislav Petkov
141168c36cde ("x86: Simplify code by removing a !SMP #ifdefs from 'struct cpuinfo_x86'") removed a bunch of CONFIG_SMP ifdefs around code touching struct cpuinfo_x86 members, but also caused the following build error with Randy's randconfigs:

 mce_amd.c:(.cpuinit.text+0x4723): undefined reference to `cpu_llc_shared_map'

Restore the #ifdef in threshold_create_bank() which creates symlinks on the non-BSP CPUs.

There's a better patch series being worked on by Kevin Winchester which will solve this in a cleaner fashion, but that series is too ambitious for v3.3 merging - so we first queue up this trivial fix and then do the rest for v3.4.

Signed-off-by: Borislav Petkov <bp@alien8.de>
Acked-by: Kevin Winchester <kjwinchester@gmail.com>
Cc: Randy Dunlap <rdunlap@xenotime.net>
Cc: Nick Bowler <nbowler@elliptictech.com>
Link: http://lkml.kernel.org/r/20120203191801.GA2846@x1.osrc.amd.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2012-02-22x86/tsc: Reduce the TSC sync check time for core-siblingsSuresh Siddha
For each logical CPU that is coming online, we spend 20msec checking the TSC synchronization. And as this is done sequentially for each logical CPU boot, this time adds up depending on the number of logical CPUs supported by the platform.

Minimize this by using the socket topology information. If the target CPU coming online doesn't have any of its core-siblings online, a timeout of 20msec will be used for the TSC-warp measurement loop. Otherwise a smaller timeout of 2msec will be used, as we already have some information about this socket (and this information grows as we have more and more logical siblings in that socket).

Ideally we should be able to skip the TSC sync check on the other core-siblings if the first logical CPU in a socket passed the sync test. But as the TSC is per-logical CPU and can potentially be modified wrongly by the BIOS before the OS boot, a TSC sync test of shorter duration should still be able to catch such errors. It will also catch the condition where not all the cores in the socket get reset at the same time.

For example, with this modification, time spent in TSC sync checks on a 4-socket 10-core with HT system gets reduced from 1580msec to 212msec.

Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Acked-by: Arjan van de Ven <arjan@linux.intel.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Jack Steiner <steiner@sgi.com>
Cc: venki@google.com
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: http://lkml.kernel.org/r/1328581940.29790.20.camel@sbsiddha-desk.sc.intel.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
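A sketch of the timeout selection (assumed shape; the real patch plumbs this into the warp-check loop in arch/x86/kernel/tsc_sync.c):

  static inline unsigned int tsc_sync_check_timeout_ms(int cpu)
  {
  	/*
  	 * If any core-sibling of this CPU is already online, its
  	 * socket has been checked before: a short 2ms warp test still
  	 * catches BIOS-modified TSCs and partial-socket resets.
  	 * Otherwise fall back to the full 20ms.
  	 */
  	if (cpumask_weight(cpu_core_mask(cpu)) > 1)
  		return 2;

  	return 20;
  }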
2012-02-21i387: Split up <asm/i387.h> into exported and internal interfacesLinus Torvalds
While various modules include <asm/i387.h> to get access to things we actually *intend* for them to use, most of that header file was really pretty low-level internal stuff that we really don't want to expose to others. So split the header file into two: the small exported interfaces remain in <asm/i387.h>, while the internal definitions that are only used by core architecture code are now in <asm/fpu-internal.h>. The guiding principle for this was to expose functions that we export to modules, and leave them in <asm/i387.h>, while stuff that is used by task switching or was marked GPL-only is in <asm/fpu-internal.h>. The fpu-internal.h file could be further split up too, especially since arch/x86/kvm/ uses some of the remaining stuff for its module. But that kvm usage should probably be abstracted out a bit, and at least now the internal FPU accessor functions are much more contained. Even if it isn't perhaps as contained as it _could_ be. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/r/alpine.LFD.2.02.1202211340330.5354@i5.linux-foundation.org Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2012-02-21i387: Uninline the generic FP helpers that we expose to kernel modulesLinus Torvalds
Instead of exporting the very low-level internals of the FPU state save/restore code (ie things like 'fpu_owner_task'), we should export the higher-level interfaces.

Inlining these things is pointless anyway: sure, sometimes the end result is small, but while 'stts()' can result in just three x86 instructions, those are not cheap instructions (writing %cr0 is a serializing instruction and a very slow one at that).

So the overhead of a function call is not noticeable, and we really don't want random modules mucking about with our internal state save logic anyway.

So this unexports 'fpu_owner_task', and instead uninlines and exports the actual functions that modules can use: kernel_fpu_begin/end() and unlazy_fpu().

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/alpine.LFD.2.02.1202211339590.5354@i5.linux-foundation.org
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
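Module-side usage of the now-exported helpers, as an illustrative sketch (not taken from the patch):

  #include <asm/i387.h>

  static void module_fpu_section(void)
  {
  	kernel_fpu_begin();	/* may write %cr0: slow and serializing */

  	/* ... SSE/AVX work goes here ... */

  	kernel_fpu_end();
  }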
2012-02-20i387: export 'fpu_owner_task' per-cpu variableLinus Torvalds
(And define it properly for x86-32, which had its 'current_task' declaration separate from x86-64.)

Bitten by my dislike for modules on the machines I use, and the fact that apparently nobody else actually wanted to test the patches I sent out. Snif. Nobody else cares.

Anyway, we probably should uninline the 'kernel_fpu_begin()' function that is what modules actually use and that references this, but this is the minimal fix for now.

Reported-by: Josh Boyer <jwboyer@gmail.com>
Reported-and-tested-by: Jongman Heo <jongman.heo@samsung.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-02-20x86: Specify a size for the cmp in the NMI handlerSteven Rostedt
Linus noticed that the cmp used to check if the code segment is __KERNEL_CS or not did not specify a size. Perhaps it does not matter, as H. Peter Anvin noted that user space can not set the bottom two bits of the %cs register. But it's best not to let the assembler choose and change things between different versions of gas; instead, just pick the size. Four bytes are used to compare the saved code segment against __KERNEL_CS.

Perhaps this might mess up Xen, but we can fix that when the time comes.

Also I noticed that there was another non-specified cmp that checks the special stack variable if it is 1 or 0. It too probably doesn't matter which cmp is used, but this patch uses cmpl just to make it unambiguous.

Link: http://lkml.kernel.org/r/CA+55aFxfAn9MWRgS3O5k2tqN5ys1XrhSFVO5_9ZAoZKDVgNfGA@mail.gmail.com
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>