summaryrefslogtreecommitdiff
path: root/arch/x86/kernel/nmi.c
AgeCommit message (Collapse)Author
2020-01-11x86/nmi: Remove irq_work from the long duration NMI handlerChangbin Du
First, printk() is NMI-context safe now since the safe printk() has been implemented and it already has an irq_work to make NMI-context safe. Second, this NMI irq_work actually does not work if a NMI handler causes panic by watchdog timeout. It has no chance to run in such case, while the safe printk() will flush its per-cpu buffers before panicking. While at it, repurpose the irq_work callback into a function which concentrates the NMI duration checking and makes the code easier to follow. [ bp: Massage. ] Signed-off-by: Changbin Du <changbin.du@gmail.com> Signed-off-by: Borislav Petkov <bp@suse.de> Acked-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lkml.kernel.org/r/20200111125427.15662-1-changbin.du@gmail.com
2019-07-25x86/hotplug: Silence APIC and NMI when CPU is deadThomas Gleixner
In order to support IPI/NMI broadcasting via the shorthand mechanism side effects of shorthands need to be mitigated: Shorthand IPIs and NMIs hit all CPUs including unplugged CPUs Neither of those can be handled on unplugged CPUs for obvious reasons. It would be trivial to just fully disable the APIC via the enable bit in MSR_APICBASE. But that's not possible because clearing that bit on systems based on the 3 wire APIC bus would require a hardware reset to bring it back as the APIC would lose track of bus arbitration. On systems with FSB delivery APICBASE could be disabled, but it has to be guaranteed that no interrupt is sent to the APIC while in that state and it's not clear from the SDM whether it still responds to INIT/SIPI messages. Therefore stay on the safe side and switch the APIC into soft disabled mode so it won't deliver any regular vector to the CPU. NMIs are still propagated to the 'dead' CPUs. To mitigate that add a check for the CPU being offline on early nmi entry and if so bail. Note, this cannot use the stop/restart_nmi() magic which is used in the alternatives code. A dead CPU cannot invoke nmi_enter() or anything else due to RCU and other reasons. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/alpine.DEB.2.21.1907241723290.1791@nanos.tec.linutronix.de
2019-05-21treewide: Add SPDX license identifier for missed filesThomas Gleixner
Add SPDX license identifiers to all files which: - Have no license information of any form - Have EXPORT_.*_SYMBOL_GPL inside which was used in the initial scan/conversion to ignore the file These files fall under the project license, GPL v2 only. The resulting SPDX license identifier is: GPL-2.0-only Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-05-14Merge branch 'x86-mds-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 MDS mitigations from Thomas Gleixner: "Microarchitectural Data Sampling (MDS) is a hardware vulnerability which allows unprivileged speculative access to data which is available in various CPU internal buffers. This new set of misfeatures has the following CVEs assigned: CVE-2018-12126 MSBDS Microarchitectural Store Buffer Data Sampling CVE-2018-12130 MFBDS Microarchitectural Fill Buffer Data Sampling CVE-2018-12127 MLPDS Microarchitectural Load Port Data Sampling CVE-2019-11091 MDSUM Microarchitectural Data Sampling Uncacheable Memory MDS attacks target microarchitectural buffers which speculatively forward data under certain conditions. Disclosure gadgets can expose this data via cache side channels. Contrary to other speculation based vulnerabilities the MDS vulnerability does not allow the attacker to control the memory target address. As a consequence the attacks are purely sampling based, but as demonstrated with the TLBleed attack samples can be postprocessed successfully. The mitigation is to flush the microarchitectural buffers on return to user space and before entering a VM. It's bolted on the VERW instruction and requires a microcode update. As some of the attacks exploit data structures shared between hyperthreads, full protection requires to disable hyperthreading. The kernel does not do that by default to avoid breaking unattended updates. The mitigation set comes with documentation for administrators and a deeper technical view" * 'x86-mds-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (23 commits) x86/speculation/mds: Fix documentation typo Documentation: Correct the possible MDS sysfs values x86/mds: Add MDSUM variant to the MDS documentation x86/speculation/mds: Add 'mitigations=' support for MDS x86/speculation/mds: Print SMT vulnerable on MSBDS with mitigations off x86/speculation/mds: Fix comment x86/speculation/mds: Add SMT warning message x86/speculation: Move arch_smt_update() call to after mitigation decisions x86/speculation/mds: Add mds=full,nosmt cmdline option Documentation: Add MDS vulnerability documentation Documentation: Move L1TF to separate directory x86/speculation/mds: Add mitigation mode VMWERV x86/speculation/mds: Add sysfs reporting for MDS x86/speculation/mds: Add mitigation control for MDS x86/speculation/mds: Conditionally clear CPU buffers on idle entry x86/kvm/vmx: Add MDS protection when L1D Flush is not active x86/speculation/mds: Clear CPU buffers on exit to user x86/speculation/mds: Add mds_clear_cpu_buffers() x86/kvm: Expose X86_FEATURE_MD_CLEAR to guests x86/speculation/mds: Add BUG_MSBDS_ONLY ...
2019-04-17x86/exceptions: Split debug IST stackThomas Gleixner
The debug IST stack is actually two separate debug stacks to handle #DB recursion. This is required because the CPU starts always at top of stack on exception entry, which means on #DB recursion the second #DB would overwrite the stack of the first. The low level entry code therefore adjusts the top of stack on entry so a secondary #DB starts from a different stack page. But the stack pages are adjacent without a guard page between them. Split the debug stack into 3 stacks which are separated by guard pages. The 3rd stack is never mapped into the cpu_entry_area and is only there to catch triple #DB nesting: --- top of DB_stack <- Initial stack --- end of DB_stack guard page --- top of DB1_stack <- Top of stack after entering first #DB --- end of DB1_stack guard page --- top of DB2_stack <- Top of stack after entering second #DB --- end of DB2_stack guard page If DB2 would not act as the final guard hole, a second #DB would point the top of #DB stack to the stack below #DB1 which would be valid and not catch the not so desired triple nesting. The backing store does not allocate any memory for DB2 and its guard page as it is not going to be mapped into the cpu_entry_area. - Adjust the low level entry code so it adjusts top of #DB with the offset between the stacks instead of exception stack size. - Make the dumpstack code aware of the new stacks. - Adjust the in_debug_stack() implementation and move it into the NMI code where it belongs. As this is NMI hotpath code, it just checks the full area between top of DB_stack and bottom of DB1_stack without checking for the guard page. That's correct because the NMI cannot hit a stackpointer pointing to the guard page between DB and DB1 stack. Even if it would, then the NMI operation still is unaffected, but the resume of the debug exception on the topmost DB stack will crash by touching the guard page. [ bp: Make exception_stack_names static const char * const ] Suggested-by: Andy Lutomirski <luto@kernel.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Sean Christopherson <sean.j.christopherson@intel.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Baoquan He <bhe@redhat.com> Cc: "Chang S. Bae" <chang.seok.bae@intel.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Dominik Brodowski <linux@dominikbrodowski.net> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Joerg Roedel <jroedel@suse.de> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Juergen Gross <jgross@suse.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Cc: linux-doc@vger.kernel.org Cc: Masahiro Yamada <yamada.masahiro@socionext.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Qian Cai <cai@lca.pw> Cc: Sean Christopherson <sean.j.christopherson@intel.com> Cc: x86-ml <x86@kernel.org> Link: https://lkml.kernel.org/r/20190414160145.439944544@linutronix.de
2019-03-06x86/speculation/mds: Clear CPU buffers on exit to userThomas Gleixner
Add a static key which controls the invocation of the CPU buffer clear mechanism on exit to user space and add the call into prepare_exit_to_usermode() and do_nmi() right before actually returning. Add documentation which kernel to user space transition this covers and explain why some corner cases are not mitigated. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Reviewed-by: Borislav Petkov <bp@suse.de> Reviewed-by: Frederic Weisbecker <frederic@kernel.org> Reviewed-by: Jon Masters <jcm@redhat.com> Tested-by: Jon Masters <jcm@redhat.com>
2017-10-25locking/atomics: COCCINELLE/treewide: Convert trivial ACCESS_ONCE() patterns ↵Mark Rutland
to READ_ONCE()/WRITE_ONCE() Please do not apply this to mainline directly, instead please re-run the coccinelle script shown below and apply its output. For several reasons, it is desirable to use {READ,WRITE}_ONCE() in preference to ACCESS_ONCE(), and new code is expected to use one of the former. So far, there's been no reason to change most existing uses of ACCESS_ONCE(), as these aren't harmful, and changing them results in churn. However, for some features, the read/write distinction is critical to correct operation. To distinguish these cases, separate read/write accessors must be used. This patch migrates (most) remaining ACCESS_ONCE() instances to {READ,WRITE}_ONCE(), using the following coccinelle script: ---- // Convert trivial ACCESS_ONCE() uses to equivalent READ_ONCE() and // WRITE_ONCE() // $ make coccicheck COCCI=/home/mark/once.cocci SPFLAGS="--include-headers" MODE=patch virtual patch @ depends on patch @ expression E1, E2; @@ - ACCESS_ONCE(E1) = E2 + WRITE_ONCE(E1, E2) @ depends on patch @ expression E; @@ - ACCESS_ONCE(E) + READ_ONCE(E) ---- Signed-off-by: Mark Rutland <mark.rutland@arm.com> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: davem@davemloft.net Cc: linux-arch@vger.kernel.org Cc: mpe@ellerman.id.au Cc: shuah@kernel.org Cc: snitzer@redhat.com Cc: thor.thayer@linux.intel.com Cc: tj@kernel.org Cc: viro@zeniv.linux.org.uk Cc: will.deacon@arm.com Link: http://lkml.kernel.org/r/1508792849-3115-19-git-send-email-paulmck@linux.vnet.ibm.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-08-16x86/nmi: Use raw lockScott Wood
register_nmi_handler() can be called from PREEMPT_RT atomic context (e.g. wakeup_cpu_via_init_nmi() or native_stop_other_cpus()), and thus ordinary spinlocks cannot be used. Signed-off-by: Scott Wood <swood@redhat.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Don Zickus <dzickus@redhat.com> Link: http://lkml.kernel.org/r/20170724213242.27598-1-swood@redhat.com
2017-05-01Merge tag 'edac_for_4.12' of git://git.kernel.org/pub/scm/linux/kernel/git/bp/bpLinus Torvalds
Pull EDAC updates from Borislav Petkov: - an EDAC driver for Cavium ThunderX RAS IP (Sergey Temerkhanov) - removal of DRAM error reporting through PCI SERR NMI (Borislav Petkov) - misc small fixes (Jan Glauber, Thor Thayer) * tag 'edac_for_4.12' of git://git.kernel.org/pub/scm/linux/kernel/git/bp/bp: EDAC, ghes: Do not enable it by default EDAC: Rename report status accessors EDAC: Delete edac_stub.c EDAC: Update Kconfig help text EDAC: Remove EDAC_MM_EDAC EDAC: Issue tracepoint only when it is defined ACPI/extlog: Add EDAC dependency EDAC: Move edac_op_state to edac_mc.c EDAC: Remove edac_err_assert EDAC: Get rid of edac_handlers x86/nmi, EDAC: Get rid of DRAM error reporting thru PCI SERR NMI EDAC, highbank: Align Makefile directives EDAC, thunderx: Remove unused code EDAC, thunderx: Change LMC index calculation EDAC, altera: Fix peripheral warnings for Cyclone5 EDAC, thunderx: Fix L2C MCI interrupt disable EDAC, thunderx: Add Cavium ThunderX EDAC driver
2017-04-10x86/nmi, EDAC: Get rid of DRAM error reporting thru PCI SERR NMIBorislav Petkov
Apparently, some machines used to report DRAM errors through a PCI SERR NMI. This is why we have a call into EDAC in the NMI handler. See c0d121720220 ("drivers/edac: add new nmi rescan"). From looking at the patch above, that's two drivers: e752x_edac.c and e7xxx_edac.c. Now, I wanna say those are old machines which are probably decommissioned already. Tony says that "[t]the newest CPU supported by either of those drivers is the Xeon E7520 (a.k.a. "Nehalem") released in Q1'2010. Possibly some folks are still using these ... but people that hold onto h/w for 7 years generally cling to old s/w too ... so I'd guess it unlikely that we will get complaints for breaking these in upstream." So even if there is a small number still in use, we did load EDAC with edac_op_state == EDAC_OPSTATE_POLL by default (we still do, in fact) which means a default EDAC setup without any parameters supplied on the command line or otherwise would never even log the error in the NMI handler because we're polling by default: inline int edac_handler_set(void) { if (edac_op_state == EDAC_OPSTATE_POLL) return 0; return atomic_read(&edac_handlers); } So, long story short, I'd like to get rid of that nastiness called edac_stub.c and confine all the EDAC drivers solely to drivers/edac/. If we ever have to do stuff like that again, it should be notifiers we're using and not some insanity like this one. Signed-off-by: Borislav Petkov <bp@suse.de> Acked-by: Thomas Gleixner <tglx@linutronix.de> Cc: Tony Luck <tony.luck@intel.com>
2017-03-13x86/platform: Remove warning message for duplicate NMI handlersMike Travis
Remove the WARNING message associated with multiple NMI handlers as there are at least two that are legitimate. These are the KGDB and the UV handlers and both want to be called if the NMI has not been claimed by any other NMI handler. Use of the UNKNOWN NMI call chain dramatically lowers the NMI call rate when high frequency NMI tools are in use, notably the perf tools. It is required on systems that cannot sustain a high NMI call rate without adversely affecting the system operation. Signed-off-by: Mike Travis <mike.travis@hpe.com> Reviewed-by: Dimitri Sivanich <dimitri.sivanich@hpe.com> Cc: Don Zickus <dzickus@redhat.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Russ Anderson <russ.anderson@hpe.com> Cc: Frank Ramsay <frank.ramsay@hpe.com> Cc: Tony Ernst <tony.ernst@hpe.com> Link: http://lkml.kernel.org/r/20170307210841.730959611@asylum.americas.sgi.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2017-03-02sched/headers: Prepare for new header dependencies before moving code to ↵Ingo Molnar
<linux/sched/debug.h> We are going to split <linux/sched/debug.h> out of <linux/sched.h>, which will have to be picked up from other headers and a couple of .c files. Create a trivial placeholder <linux/sched/debug.h> file that just maps to <linux/sched.h> to make this patch obviously correct and bisectable. Include the new header in the files that are going to need it. Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-03-02sched/headers: Prepare for new header dependencies before moving code to ↵Ingo Molnar
<linux/sched/clock.h> We are going to split <linux/sched/clock.h> out of <linux/sched.h>, which will have to be picked up from other headers and .c files. Create a trivial placeholder <linux/sched/clock.h> file that just maps to <linux/sched.h> to make this patch obviously correct and bisectable. Include the new header in the files that are going to need it. Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-06-06x86: include linux/ratelimit.h in nmi.cArnd Bergmann
When building random configurations, we now occasionally get a new build error: In file included from include/linux/kernel.h:13:0, from include/linux/list.h:8, from include/linux/preempt.h:10, from include/linux/spinlock.h:50, from arch/x86/kernel/nmi.c:13: arch/x86/kernel/nmi.c: In function 'nmi_max_handler': include/linux/printk.h:375:9: error: type defaults to 'int' in declaration of 'DEFINE_RATELIMIT_STATE' [-Werror=implicit-int] static DEFINE_RATELIMIT_STATE(_rs, \ ^ arch/x86/kernel/nmi.c:110:2: note: in expansion of macro 'printk_ratelimited' printk_ratelimited(KERN_INFO ^~~~~~~~~~~~~~~~~~ This was working before the rtc rework series because linux/ratelimit.h was included implictly through asm/mach_traps.h -> asm/mc146818rtc.h -> linux/mc146818rtc.h -> linux/rtc.h -> linux/device.h. We clearly shouldn't rely on this indirect inclusion, so this adds an explicit #include in the file that needs it. Signed-off-by: Arnd Bergmann <arnd@arndb.de> Reported-by: kbuild test robot <fengguang.wu@intel.com> Acked-by: Thomas Gleixner <tglx@linutronix.de> Fixes: 5ab788d73832 ("rtc: cmos: move mc146818rtc code out of asm-generic/rtc.h") Signed-off-by: Alexandre Belloni <alexandre.belloni@free-electrons.com>
2016-03-08x86/nmi: Mark 'ignore_nmis' as __read_mostlyKostenzer Felix
ignore_nmis is used in two distinct places: 1. modified through {stop,restart}_nmi by alternative_instructions 2. read by do_nmi to determine if default_do_nmi should be called or not thus the access pattern conforms to __read_mostly and do_nmi() is a fastpath. Signed-off-by: Kostenzer Felix <fkostenzer@live.at> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-12-19x86/nmi: Save regs in crash dump on external NMIHidehiro Kawai
Now, multiple CPUs can receive an external NMI simultaneously by specifying the "apic_extnmi=all" command line parameter. When we take a crash dump by using external NMI with this option, we fail to save registers into the crash dump. This happens as follows: CPU 0 CPU 1 ================================ ============================= receive an external NMI default_do_nmi() receive an external NMI spin_lock(&nmi_reason_lock) default_do_nmi() io_check_error() spin_lock(&nmi_reason_lock) panic() busy loop ... kdump_nmi_shootdown_cpus() issue NMI IPI -----------> blocked until IRET busy loop... Here, since CPU 1 is in NMI context, an additional NMI from CPU 0 remains unhandled until CPU 1 IRETs. However, CPU 1 will never execute IRET so the NMI is not handled and the callback function to save registers is never called. To solve this issue, we check if the IPI for crash dumping was issued while waiting for nmi_reason_lock to be released, and if so, call its callback function directly. If the IPI is not issued (e.g. kdump is disabled), the actual behavior doesn't change. Signed-off-by: Hidehiro Kawai <hidehiro.kawai.ez@hitachi.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Baoquan He <bhe@redhat.com> Cc: Dave Young <dyoung@redhat.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Jiang Liu <jiang.liu@linux.intel.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: kexec@lists.infradead.org Cc: linux-doc@vger.kernel.org Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Stefan Lippers-Hollmann <s.l-h@gmx.de> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vivek Goyal <vgoyal@redhat.com> Cc: x86-ml <x86@kernel.org> Link: http://lkml.kernel.org/r/20151210065245.4587.39316.stgit@softrs Signed-off-by: Borislav Petkov <bp@suse.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2015-12-19panic, x86: Allow CPUs to save registers even if looping in NMI contextHidehiro Kawai
Currently, kdump_nmi_shootdown_cpus(), a subroutine of crash_kexec(), sends an NMI IPI to CPUs which haven't called panic() to stop them, save their register information and do some cleanups for crash dumping. However, if such a CPU is infinitely looping in NMI context, we fail to save its register information into the crash dump. For example, this can happen when unknown NMIs are broadcast to all CPUs as follows: CPU 0 CPU 1 =========================== ========================== receive an unknown NMI unknown_nmi_error() panic() receive an unknown NMI spin_trylock(&panic_lock) unknown_nmi_error() crash_kexec() panic() spin_trylock(&panic_lock) panic_smp_self_stop() infinite loop kdump_nmi_shootdown_cpus() issue NMI IPI -----------> blocked until IRET infinite loop... Here, since CPU 1 is in NMI context, the second NMI from CPU 0 is blocked until CPU 1 executes IRET. However, CPU 1 never executes IRET, so the NMI is not handled and the callback function to save registers is never called. In practice, this can happen on some servers which broadcast NMIs to all CPUs when the NMI button is pushed. To save registers in this case, we need to: a) Return from NMI handler instead of looping infinitely or b) Call the callback function directly from the infinite loop Inherently, a) is risky because NMI is also used to prevent corrupted data from being propagated to devices. So, we chose b). This patch does the following: 1. Move the infinite looping of CPUs which haven't called panic() in NMI context (actually done by panic_smp_self_stop()) outside of panic() to enable us to refer pt_regs. Please note that panic_smp_self_stop() is still used for normal context. 2. Call a callback of kdump_nmi_shootdown_cpus() directly to save registers and do some cleanups after setting waiting_for_crash_ipi which is used for counting down the number of CPUs which handled the callback Signed-off-by: Hidehiro Kawai <hidehiro.kawai.ez@hitachi.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Aaron Tomlin <atomlin@redhat.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Baoquan He <bhe@redhat.com> Cc: Chris Metcalf <cmetcalf@ezchip.com> Cc: Dave Young <dyoung@redhat.com> Cc: David Hildenbrand <dahi@linux.vnet.ibm.com> Cc: Don Zickus <dzickus@redhat.com> Cc: Eric Biederman <ebiederm@xmission.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Gobinda Charan Maji <gobinda.cemk07@gmail.com> Cc: HATAYAMA Daisuke <d.hatayama@jp.fujitsu.com> Cc: Hidehiro Kawai <hidehiro.kawai.ez@hitachi.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Javi Merino <javi.merino@arm.com> Cc: Jiang Liu <jiang.liu@linux.intel.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: kexec@lists.infradead.org Cc: linux-doc@vger.kernel.org Cc: lkml <linux-kernel@vger.kernel.org> Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com> Cc: Michal Nazarewicz <mina86@mina86.com> Cc: Nicolas Iooss <nicolas.iooss_linux@m4x.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Prarit Bhargava <prarit@redhat.com> Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk> Cc: Seth Jennings <sjenning@redhat.com> Cc: Stefan Lippers-Hollmann <s.l-h@gmx.de> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ulrich Obergfell <uobergfe@redhat.com> Cc: Vitaly Kuznetsov <vkuznets@redhat.com> Cc: Vivek Goyal <vgoyal@redhat.com> Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com> Link: http://lkml.kernel.org/r/20151210014628.25437.75256.stgit@softrs [ Cleanup comments, fixup formatting. ] Signed-off-by: Borislav Petkov <bp@suse.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2015-12-19panic, x86: Fix re-entrance problem due to panic on NMIHidehiro Kawai
If panic on NMI happens just after panic() on the same CPU, panic() is recursively called. Kernel stalls, as a result, after failing to acquire panic_lock. To avoid this problem, don't call panic() in NMI context if we've already entered panic(). For that, introduce nmi_panic() macro to reduce code duplication. In the case of panic on NMI, don't return from NMI handlers if another CPU already panicked. Signed-off-by: Hidehiro Kawai <hidehiro.kawai.ez@hitachi.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Aaron Tomlin <atomlin@redhat.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Baoquan He <bhe@redhat.com> Cc: Chris Metcalf <cmetcalf@ezchip.com> Cc: David Hildenbrand <dahi@linux.vnet.ibm.com> Cc: Don Zickus <dzickus@redhat.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Gobinda Charan Maji <gobinda.cemk07@gmail.com> Cc: HATAYAMA Daisuke <d.hatayama@jp.fujitsu.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Javi Merino <javi.merino@arm.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: kexec@lists.infradead.org Cc: linux-doc@vger.kernel.org Cc: lkml <linux-kernel@vger.kernel.org> Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com> Cc: Michal Nazarewicz <mina86@mina86.com> Cc: Nicolas Iooss <nicolas.iooss_linux@m4x.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Prarit Bhargava <prarit@redhat.com> Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk> Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: Seth Jennings <sjenning@redhat.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ulrich Obergfell <uobergfe@redhat.com> Cc: Vitaly Kuznetsov <vkuznets@redhat.com> Cc: Vivek Goyal <vgoyal@redhat.com> Link: http://lkml.kernel.org/r/20151210014626.25437.13302.stgit@softrs [ Cleanup comments, fixup formatting. ] Signed-off-by: Borislav Petkov <bp@suse.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2015-07-31Merge branch 'x86/urgent' into x86/asm, before applying dependent patchesIngo Molnar
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-07-21x86/nmi: Remove the 'b2b' parameter from nmi_handle()Andy Lutomirski
It has never had any effect. Remove it for comprehensibility. Signed-off-by: Andy Lutomirski <luto@kernel.org> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/c91fa38507760d9e54a4b8737fa6409bde896b33.1437418322.git.luto@kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-07-17x86/nmi/64: Improve nested NMI commentsAndy Lutomirski
I found the nested NMI documentation to be difficult to follow. Improve the comments. Signed-off-by: Andy Lutomirski <luto@kernel.org> Reviewed-by: Steven Rostedt <rostedt@goodmis.org> Cc: Borislav Petkov <bp@suse.de> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: stable@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-07-17x86/nmi: Enable nested do_nmi() handling for 64-bit kernelsAndy Lutomirski
32-bit kernels handle nested NMIs in C. Enable the exact same handling on 64-bit kernels as well. This isn't currently necessary, but it will become necessary once the asm code starts allowing limited nesting. Signed-off-by: Andy Lutomirski <luto@kernel.org> Reviewed-by: Steven Rostedt <rostedt@goodmis.org> Cc: Borislav Petkov <bp@suse.de> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: stable@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-04-24kprobes, x86: Use NOKPROBE_SYMBOL() instead of __kprobes annotationMasami Hiramatsu
Use NOKPROBE_SYMBOL macro for protecting functions from kprobes instead of __kprobes annotation under arch/x86. This applies nokprobe_inline annotation for some cases, because NOKPROBE_SYMBOL() will inhibit inlining by referring the symbol address. This just folds a bunch of previous NOKPROBE_SYMBOL() cleanup patches for x86 to one patch. Signed-off-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com> Link: http://lkml.kernel.org/r/20140417081814.26341.51656.stgit@ltc230.yrl.intra.hitachi.co.jp Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Borislav Petkov <bp@suse.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Fernando Luis Vázquez Cao <fernando_b1@lab.ntt.co.jp> Cc: Gleb Natapov <gleb@redhat.com> Cc: Jason Wang <jasowang@redhat.com> Cc: Jesper Nilsson <jesper.nilsson@axis.com> Cc: Jiri Kosina <jkosina@suse.cz> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Jiri Slaby <jslaby@suse.cz> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Jonathan Lebon <jlebon@redhat.com> Cc: Kees Cook <keescook@chromium.org> Cc: Matt Fleming <matt.fleming@intel.com> Cc: Michel Lespinasse <walken@google.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Paul Gortmaker <paul.gortmaker@windriver.com> Cc: Paul Mackerras <paulus@samba.org> Cc: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com> Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: Seiji Aguchi <seiji.aguchi@hds.com> Cc: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com> Cc: Tejun Heo <tj@kernel.org> Cc: Vineet Gupta <vgupta@synopsys.com> Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-02-09x86/nmi: Push duration printk() to irq contextPeter Zijlstra
Calling printk() from NMI context is bad (TM), so move it to IRQ context. In doing so we slightly change (probably wreck) the debugfs nmi_longest_ns thingy, in that it doesn't update to reflect the longest, nor does writing to it reset the count. Signed-off-by: Peter Zijlstra <peterz@infradead.org> Cc: Don Zickus <dzickus@redhat.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Link: http://lkml.kernel.org/n/tip-rdw0au56a5ymis1u8p48c12d@git.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-10-29perf/x86: Fix NMI measurementsPeter Zijlstra
OK, so what I'm actually seeing on my WSM is that sched/clock.c is 'broken' for the purpose we're using it for. What triggered it is that my WSM-EP is broken :-( [ 0.001000] tsc: Fast TSC calibration using PIT [ 0.002000] tsc: Detected 2533.715 MHz processor [ 0.500180] TSC synchronization [CPU#0 -> CPU#6]: [ 0.505197] Measured 3 cycles TSC warp between CPUs, turning off TSC clock. [ 0.004000] tsc: Marking TSC unstable due to check_tsc_sync_source failed For some reason it consistently detects TSC skew, even though NHM+ should have a single clock domain for 'reasonable' systems. This marks sched_clock_stable=0, which means that we do fancy stuff to try and get a 'sane' clock. Part of this fancy stuff relies on the tick, clearly that's gone when NOHZ=y. So for idle cpus time gets stuck, until it either wakes up or gets kicked by another cpu. While this is perfectly fine for the scheduler -- it only cares about actually running stuff, and when we're running stuff we're obviously not idle. This does somewhat break down for perf which can trigger events just fine on an otherwise idle cpu. So I've got NMIs get get 'measured' as taking ~1ms, which actually don't last nearly that long: <idle>-0 [013] d.h. 886.311970: rcu_nmi_enter <-do_nmi ... <idle>-0 [013] d.h. 886.311997: perf_sample_event_took: HERE!!! : 1040990 So ftrace (which uses sched_clock(), not the fancy bits) only sees ~27us, but we measure ~1ms !! Now since all this measurement stuff lives in x86 code, we can actually fix it. Signed-off-by: Peter Zijlstra <peterz@infradead.org> Cc: mingo@kernel.org Cc: dave.hansen@linux.intel.com Cc: eranian@google.com Cc: Don Zickus <dzickus@redhat.com> Cc: jmario@redhat.com Cc: acme@infradead.org Link: http://lkml.kernel.org/r/20131017133350.GG3364@laptop.programming.kicks-ass.net Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-07-12perf/x86: Fix incorrect use of do_div() in NMI warningDave Hansen
I completely botched understanding the calling conventions of do_div(). I assumed that do_div() returned the result instead of realizing that it modifies its argument and returns a remainder. The side-effect from this would be bogus numbers for the "msecs" value in the warning messages: INFO: NMI handler (perf_event_nmi_handler) took too long to run: 0.114 msecs Note, there was a second fix posted by Stephane Eranian for a separate patch which I also botched: http://lkml.kernel.org/r/20130704223010.GA30625@quad Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Cc: Dave Hansen <dave@sr71.net> Link: http://lkml.kernel.org/r/20130708214404.B0B6EA66@viggo.jf.intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-06-23x86: Add NMI duration tracepointsDave Hansen
This patch has been invaluable in my adventures finding issues in the perf NMI handler. I'm as big a fan of printk() as anybody is, but using printk() in NMIs is deadly when they're happening frequently. Even hacking in trace_printk() ended up eating enough CPU to throw off some of the measurements I was making. Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: paulus@samba.org Cc: acme@ghostprotocols.net Cc: Dave Hansen <dave@sr71.net> Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-06-23x86: Warn when NMI handlers take large amounts of timeDave Hansen
I have a system which is causing all kinds of problems. It has 8 NUMA nodes, and lots of cores that can fight over cachelines. If things are not working _perfectly_, then NMIs can take longer than expected. If we get too many of them backed up to each other, we can easily end up in a situation where we are doing nothing *but* running NMIs. The biggest problem, though, is that this happens _silently_. You might be lucky to get an hrtimer warning, but most of the time system simply hangs. This patch should at least give us some warning before we fall off the cliff. the warnings look like this: nmi_handle: perf_event_nmi_handler() took: 26095071 ns The message is triggered whenever we notice the longest NMI we've seen to date. You can always view and reset this value via the debugfs interface if you like. Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: paulus@samba.org Cc: acme@ghostprotocols.net Cc: Dave Hansen <dave@sr71.net> Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-01-17x86/nmi: export local_touch_nmi() symbol for modulesJacob Pan
Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com> Signed-off-by: Zhang Rui <rui.zhang@intel.com>
2012-06-08x86: Save cr2 in NMI in case NMIs take a page fault (for i386)Steven Rostedt
Avi Kivity reported that page faults in NMIs could cause havic if the NMI preempted another page fault handler: The recent changes to NMI allow exceptions to take place in NMI handlers, but I think that a #PF (say, due to access to vmalloc space) is still problematic. Consider the sequence #PF (cr2 set by processor) NMI ... #PF (cr2 clobbered) do_page_fault() IRET ... IRET do_page_fault() address = read_cr2() The last line reads the overwritten cr2 value. This is the i386 version, which has the luxury of doing the work in C code. Link: http://lkml.kernel.org/r/4FBB8C40.6080304@redhat.com Reported-by: Avi Kivity <avi@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2012-06-08x86: Remove cmpxchg from i386 NMI nesting codeSteven Rostedt
I've been informed by someone on LWN called 'slashdot' that some i386 machines do not support a true cmpxchg. The cmpxchg used by the i386 NMI nesting code must be a true cmpxchg as disabling interrupts will not work for NMIs (which is the work around for i386s that do not have a true cmpxchg). This 'slashdot' character also suggested a fix to the issue. As the state of the nesting NMIs goes as follows: NOT_RUNNING -> EXECUTING EXECUTING -> NOT_RUNNING EXECUTING -> LATCHED LATCHED -> EXECUTING Having these states as enum values of: NOT_RUNNING = 0 EXECUTING = 1 LATCHED = 2 Instead of a cmpxchg to make EXECUTING -> NOT_RUNNING a dec_and_test() would work as well. If the dec_and_test brings the state to NOT_RUNNING, that is the same as a cmpxchg succeeding to change EXECUTING to NOT_RUNNING. If a nested NMI were to come in and change it to LATCHED, the dec_and_test() would convert the state to EXECUTING (what we want it to be in such a case anyway). I asked 'slashdot' to post this as a patch, but it never came to be. I decided to do the work instead. Thanks to H. Peter Anvin for suggesting to use this_cpu_dec_and_return() instead of local_dec_and_test(&__get_cpu_var()). Link: http://lwn.net/Articles/484932/ Cc: H. Peter Anvin <hpa@zytor.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2012-05-31x86: Reset the debug_stack update counterSteven Rostedt
When an NMI goes off and it sees that it preempted the debug stack, to keep the debug stack safe, it changes the IDT to point to one that does not modify the stack on breakpoint (to allow breakpoints in NMIs). But the variable that gets set to know to undo it on exit never gets cleared on exit. Thus every NMI will reset it on exit the first time it is done even if it does not need to be reset. [ Added H. Peter Anvin's suggestion to use this_cpu_read/write ] Cc: <stable@vger.kernel.org> # v3.3 Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2012-05-23Merge branch 'delete-mca' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/paulg/linux Pull the MCA deletion branch from Paul Gortmaker: "It was good that we could support MCA machines back in the day, but realistically, nobody is using them anymore. They were mostly limited to 386-sx 16MHz CPU and some 486 class machines and never more than 64MB of RAM. Even the enthusiast hobbyist community seems to have dried up close to ten years ago, based on what you can find searching various websites dedicated to the relatively short lived hardware. So lets remove the support relating to CONFIG_MCA. There is no point carrying this forward, wasting cycles doing routine maintenance on it; wasting allyesconfig build time on validating it, wasting I/O on git grep'ping over it, and so on." Let's see if anybody screams. It generally has compiled, and James Bottomley pointed out that there was a MCA extension from NCR that allowed for up to 4GB of memory and PPro-class machines. So in *theory* there may be users out there. But even James (technically listed as a maintainer) doesn't actually have a system, and while Alan Cox claims to have a machine in his cellar that he offered to anybody who wants to take it off his hands, he didn't argue for keeping MCA support either. So we could bring it back. But somebody had better speak up and talk about how they have actually been using said MCA hardware with modern kernels for us to do that. And David already took the patch to delete all the networking driver code (commit a5e371f61ad3: "drivers/net: delete all code/drivers depending on CONFIG_MCA"). * 'delete-mca' of git://git.kernel.org/pub/scm/linux/kernel/git/paulg/linux: MCA: delete all remaining traces of microchannel bus support. scsi: delete the MCA specific drivers and driver code serial: delete the MCA specific 8250 support. arm: remove ability to select CONFIG_MCA
2012-05-23Merge branches 'x86-asm-for-linus', 'x86-cleanups-for-linus', ↵Linus Torvalds
'x86-cpu-for-linus', 'x86-debug-for-linus' and 'x86-microcode-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull initial trivial x86 stuff from Ingo Molnar. Various random cleanups and trivial fixes. * 'x86-asm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86-64: Eliminate dead ia32 syscall handlers * 'x86-cleanups-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/pci-calgary_64.c: Remove obsoleted simple_strtoul() usage x86: Don't continue booting if we can't load the specified initrd x86: kernel/dumpstack.c simple_strtoul cleanup x86: kernel/check.c simple_strtoul cleanup debug: Add CONFIG_READABLE_ASM x86: spinlock.h: Remove REG_PTR_MODE * 'x86-cpu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/cache_info: Fix setup of l2/l3 ids * 'x86-debug-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86: Avoid double stack traces with show_regs() * 'x86-microcode-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86, microcode: microcode_core.c simple_strtoul cleanup
2012-05-22Merge branch 'perf-core-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull perf changes from Ingo Molnar: "Lots of changes: - (much) improved assembly annotation support in perf report, with jump visualization, searching, navigation, visual output improvements and more. - kernel support for AMD IBS PMU hardware features. Notably 'perf record -e cycles:p' and 'perf top -e cycles:p' should work without skid now, like PEBS does on the Intel side, because it takes advantage of IBS transparently. - the libtracevents library: it is the first step towards unifying tracing tooling and perf, and it also gives a tracing library for external tools like powertop to rely on. - infrastructure: various improvements and refactoring of the UI modules and related code - infrastructure: cleanup and simplification of the profiling targets code (--uid, --pid, --tid, --cpu, --all-cpus, etc.) - tons of robustness fixes all around - various ftrace updates: speedups, cleanups, robustness improvements. - typing 'make' in tools/ will now give you a menu of projects to build and a short help text to explain what each does. - ... and lots of other changes I forgot to list. The perf record make bzImage + perf report regression you reported should be fixed." * 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (166 commits) tracing: Remove kernel_lock annotations tracing: Fix initial buffer_size_kb state ring-buffer: Merge separate resize loops perf evsel: Create events initially disabled -- again perf tools: Split term type into value type and term type perf hists: Fix callchain ip printf format perf target: Add uses_mmap field ftrace: Remove selecting FRAME_POINTER with FUNCTION_TRACER ftrace/x86: Have x86 ftrace use the ftrace_modify_all_code() ftrace: Make ftrace_modify_all_code() global for archs to use ftrace: Return record ip addr for ftrace_location() ftrace: Consolidate ftrace_location() and ftrace_text_reserved() ftrace: Speed up search by skipping pages by address ftrace: Remove extra helper functions ftrace: Sort all function addresses, not just per page tracing: change CPU ring buffer state from tracing_cpumask tracing: Check return value of tracing_dentry_percpu() ring-buffer: Reset head page before running self test ring-buffer: Add integrity check at end of iter read ring-buffer: Make addition of pages in ring buffer atomic ...
2012-05-17MCA: delete all remaining traces of microchannel bus support.Paul Gortmaker
Hardware with MCA bus is limited to 386 and 486 class machines that are now 20+ years old and typically with less than 32MB of memory. A quick search on the internet, and you see that even the MCA hobbyist/enthusiast community has lost interest in the early 2000 era and never really even moved ahead from the 2.4 kernels to the 2.6 series. This deletes anything remaining related to CONFIG_MCA from core kernel code and from the x86 architecture. There is no point in carrying this any further into the future. One complication to watch for is inadvertently scooping up stuff relating to machine check, since there is overlap in the TLA name space (e.g. arch/x86/boot/mca.c). Cc: Thomas Gleixner <tglx@linutronix.de> Cc: James Bottomley <JBottomley@Parallels.com> Cc: x86@kernel.org Acked-by: Ingo Molnar <mingo@elte.hu> Acked-by: H. Peter Anvin <hpa@zytor.com> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2012-05-09x86: Avoid double stack traces with show_regs()Jan Beulich
What was called show_registers() so far already showed a stack trace for kernel faults, and kernel_stack_pointer() isn't even valid to be used for faults from user mode, hence it was pointless for show_regs() to call show_trace() after show_registers(). Simply rename show_registers() to show_regs() and eliminate the old definition. Signed-off-by: Jan Beulich <jbeulich@suse.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Arjan van de Ven <arjan@linux.intel.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Link: http://lkml.kernel.org/r/4FAA3D3902000078000826E1@nat28.tlf.novell.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-04-27ftrace/x86: Remove the complex ftrace NMI handling codeSteven Rostedt
As ftrace function tracing would require modifying code that could be executed in NMI context, which is not stopped with stop_machine(), ftrace had to do a complex algorithm with various stages of setup and memory barriers to make it work. With the new breakpoint method, this is no longer required. The changes to the code can be done without any problem in NMI context, as well as without stop machine altogether. Remove the complex code as it is no longer needed. Also, a lot of the notrace annotations could be removed from the NMI code as it is now safe to trace them. With the exception of do_nmi itself, which does some special work to handle running in the debug stack. The breakpoint method can cause NMIs to double nest the debug stack if it's not setup properly, and that is done in do_nmi(), thus that function must not be traced. (Note the arch sh may want to do the same) Cc: Paul Mundt <lethal@linux-sh.org> Cc: H. Peter Anvin <hpa@zytor.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2012-04-25x86/nmi: Fix page faults by nmiaction if kmemcheck is enabledLi Zhong
This patch tries to fix the problem of page fault exception caused by accessing nmiaction structure in nmi if kmemcheck is enabled. If kmemcheck is enabled, the memory allocated through slab are in pages that are marked non-present, so that some checks could be done in the page fault handling code ( e.g. whether the memory is read before written to ). As nmiaction is allocated in this way, so it resides in a non-present page. Then there is a page fault while the nmi code accessing the nmiaction structure, which would then cause a warning by WARN_ON_ONCE(in_nmi()) in kmemcheck_fault(), called by do_page_fault(). This significantly simplifies the code as well, as the whole dynamic allocation dance goes away. v2: as Peter suggested, changed the nmiaction to use static storage. v3: as Peter suggested, use macro to shorten the codes. Also keep the original usage of register_nmi_handler, so users of this call doesn't need change. Tested-by: Seiji Aguchi <seiji.aguchi@hds.com> Fixes: https://lkml.org/lkml/2012/3/2/356 Signed-off-by: Li Zhong <zhong@linux.vnet.ibm.com> [ simplified the wrappers ] Signed-off-by: Don Zickus <dzickus@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: thomas.mingarelli@hp.com Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andrew Morton <akpm@linux-foundation.org> Link: http://lkml.kernel.org/r/1333051877-15755-4-git-send-email-dzickus@redhat.com [ tidied the patch a bit ] Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-04-25x86/nmi: Add new NMI queues to deal with IO_CHK and SERRDon Zickus
In discussions with Thomas Mingarelli about hpwdt, he explained to me some issues they were some when using their virtual NMI button to test the hpwdt driver. It turns out the virtual NMI button used on HP's machines do no send unknown NMIs but instead send IO_CHK NMIs. The way the kernel code is written, the hpwdt driver can not register itself against that type of NMI and therefore can not successfully capture system information before panic'ing. To solve this I created two new NMI queues to allow driver to register against the IO_CHK and SERR NMIs. Or in the hpwdt all three (if you include unknown NMIs too). The change is straightforward and just mimics what the unknown NMI does. Reported-and-tested-by: Thomas Mingarelli <thomas.mingarelli@hp.com> Signed-off-by: Don Zickus <dzickus@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andrew Morton <akpm@linux-foundation.org> Link: http://lkml.kernel.org/r/1333051877-15755-3-git-send-email-dzickus@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2011-12-21x86: Allow NMIs to hit breakpoints in i386Steven Rostedt
With i386, NMIs and breakpoints use the current stack and they do not reset the stack pointer to a fix point that might corrupt a previous NMI or breakpoint (as it does in x86_64). But NMIs are still not made to be re-entrant, and need to prevent the case that an NMI hitting a breakpoint (which does an iret), doesn't allow another NMI to run. The fix is to let the NMI be in 3 different states: 1) not running 2) executing 3) latched When no NMI is executing on a given CPU, the state is "not running". When the first NMI comes in, the state is switched to "executing". On exit of that NMI, a cmpxchg is performed to switch the state back to "not running" and if that fails, the NMI is restarted. If a breakpoint is hit and does an iret, which re-enables NMIs, and another NMI comes in before the first NMI finished, it will detect that the state is not in the "not running" state and the current NMI is nested. In this case, the state is switched to "latched" to let the interrupted NMI know to restart the NMI handler, and the nested NMI exits without doing anything. Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: H. Peter Anvin <hpa@linux.intel.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Paul Turner <pjt@google.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2011-12-21x86: Keep current stack in NMI breakpointsSteven Rostedt
We want to allow NMI handlers to have breakpoints to be able to remove stop_machine from ftrace, kprobes and jump_labels. But if an NMI interrupts a current breakpoint, and then it triggers a breakpoint itself, it will switch to the breakpoint stack and corrupt the data on it for the breakpoint processing that it interrupted. Instead, have the NMI check if it interrupted breakpoint processing by checking if the stack that is currently used is a breakpoint stack. If it is, then load a special IDT that changes the IST for the debug exception to keep the same stack in kernel context. When the NMI is done, it puts it back. This way, if the NMI does trigger a breakpoint, it will keep using the same stack and not stomp on the breakpoint data for the breakpoint it interrupted. Suggested-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2011-11-10x86, ioapic: Only print ioapic debug information for IRQs belonging to an ↵Mathias Nyman
ioapic chip with "apic=verbose" the print_IO_APIC() function tries to print IRQ to pin mappings for every active irq. It assumes chip_data is of type irq_cfg and may cause an oops if not. As the print_IO_APIC() is called from a late_initcall other chained irq chips may already be registered with custom chip_data information, causing an oops. This is the case with intel MID SoC devices with gpio demuxers registered as irq_chips. Signed-off-by: Mathias Nyman <mathias.nyman@linux.intel.com> Signed-off-by: Alan Cox <alan@linux.intel.com> [ -v2: fixed build failure ] Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-11-10x86/mrst: Avoid reporting wrong nmi statusJacob Pan
Moorestown/Medfield platform does not have port 0x61 to report NMI status, nor does it have external NMI sources. The only NMI sources are from lapic, as results of perf counter overflow or IPI, e.g. NMI watchdog or spin lock debug. Reading port 0x61 on Moorestown will return 0xff which misled NMI handlers to false critical errors such memory parity error. The subsequent ioport access for NMI handling can also cause undefined behavior on Moorestown. This patch allows kernel process NMI due to watchdog or backrace dump without unnecessary hangs. Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com> Signed-off-by: Ingo Molnar <mingo@elte.hu> [hand applied] Signed-off-by: Alan Cox <alan@linux.intel.com>
2011-10-31x86: Fix files explicitly requiring export.h for EXPORT_SYMBOL/THIS_MODULEPaul Gortmaker
These files were implicitly getting EXPORT_SYMBOL via device.h which was including module.h, but that will be fixed up shortly. By fixing these now, we can avoid seeing things like: arch/x86/kernel/rtc.c:29: warning: type defaults to ‘int’ in declaration of ‘EXPORT_SYMBOL’ arch/x86/kernel/pci-dma.c:20: warning: type defaults to ‘int’ in declaration of ‘EXPORT_SYMBOL’ arch/x86/kernel/e820.c:69: warning: type defaults to ‘int’ in declaration of ‘EXPORT_SYMBOL_GPL’ [ with input from Randy Dunlap <rdunlap@xenotime.net> and also from Stephen Rothwell <sfr@canb.auug.org.au> ] Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2011-10-10x86, nmi, drivers: Fix nmi splitup build bugIngo Molnar
nmi.c needs an #include <linux/mca.h>: arch/x86/kernel/nmi.c: In function ‘unknown_nmi_error’: arch/x86/kernel/nmi.c:286:6: error: ‘MCA_bus’ undeclared (first use in this function) arch/x86/kernel/nmi.c:286:6: note: each undeclared identifier is reported only once for each function it appears in Another one is the hpwdt driver: drivers/watchdog/hpwdt.c:507:9: error: ‘NMI_DONE’ undeclared (first use in this function) Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-10-10x86, nmi: Track NMI usage statsDon Zickus
Now that the NMI handler are broken into lists, increment the appropriate stats for each list. This allows us to see what is going on when they get printed out in the next patch. Signed-off-by: Don Zickus <dzickus@redhat.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/1317409584-23662-6-git-send-email-dzickus@redhat.com Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-10-10x86, nmi: Add in logic to handle multiple events and unknown NMIsDon Zickus
Previous patches allow the NMI subsystem to process multipe NMI events in one NMI. As previously discussed this can cause issues when an event triggered another NMI but is processed in the current NMI. This causes the next NMI to go unprocessed and become an 'unknown' NMI. To handle this, we first have to flag whether or not the NMI handler handled more than one event or not. If it did, then there exists a chance that the next NMI might be already processed. Once the NMI is flagged as a candidate to be swallowed, we next look for a back-to-back NMI condition. This is determined by looking at the %rip from pt_regs. If it is the same as the previous NMI, it is assumed the cpu did not have a chance to jump back into a non-NMI context and execute code and instead handled another NMI. If both of those conditions are true then we will swallow any unknown NMI. There still exists a chance that we accidentally swallow a real unknown NMI, but for now things seem better. An optimization has also been added to the nmi notifier rountine. Because x86 can latch up to one NMI while currently processing an NMI, we don't have to worry about executing _all_ the handlers in a standalone NMI. The idea is if multiple NMIs come in, the second NMI will represent them. For those back-to-back NMI cases, we have the potentail to drop NMIs. Therefore only execute all the handlers in the second half of a detected back-to-back NMI. Signed-off-by: Don Zickus <dzickus@redhat.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/1317409584-23662-5-git-send-email-dzickus@redhat.com Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-10-10x86, nmi: Wire up NMI handlers to new routinesDon Zickus
Just convert all the files that have an nmi handler to the new routines. Most of it is straight forward conversion. A couple of places needed some tweaking like kgdb which separates the debug notifier from the nmi handler and mce removes a call to notify_die. [Thanks to Ying for finding out the history behind that mce call https://lkml.org/lkml/2010/5/27/114 And Boris responding that he would like to remove that call because of it https://lkml.org/lkml/2011/9/21/163] The things that get converted are the registeration/unregistration routines and the nmi handler itself has its args changed along with code removal to check which list it is on (most are on one NMI list except for kgdb which has both an NMI routine and an NMI Unknown routine). Signed-off-by: Don Zickus <dzickus@redhat.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Acked-by: Corey Minyard <minyard@acm.org> Cc: Jason Wessel <jason.wessel@windriver.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: Robert Richter <robert.richter@amd.com> Cc: Huang Ying <ying.huang@intel.com> Cc: Corey Minyard <minyard@acm.org> Cc: Jack Steiner <steiner@sgi.com> Link: http://lkml.kernel.org/r/1317409584-23662-4-git-send-email-dzickus@redhat.com Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-10-10x86, nmi: Create new NMI handler routinesDon Zickus
The NMI handlers used to rely on the notifier infrastructure. This worked great until we wanted to support handling multiple events better. One of the key ideas to the nmi handling is to process _all_ the handlers for each NMI. The reason behind this switch is because NMIs are edge triggered. If enough NMIs are triggered, then they could be lost because the cpu can only latch at most one NMI (besides the one currently being processed). In order to deal with this we have decided to process all the NMI handlers for each NMI. This allows the handlers to determine if they recieved an event or not (the ones that can not determine this will be left to fend for themselves on the unknown NMI list). As a result of this change it is now possible to have an extra NMI that was destined to be received for an already processed event. Because the event was processed in the previous NMI, this NMI gets dropped and becomes an 'unknown' NMI. This of course will cause printks that scare people. However, we prefer to have extra NMIs as opposed to losing NMIs and as such are have developed a basic mechanism to catch most of them. That will be a later patch. To accomplish this idea, I unhooked the nmi handlers from the notifier routines and created a new mechanism loosely based on doIRQ. The reason for this is the notifier routines have a couple of shortcomings. One we could't guarantee all future NMI handlers used NOTIFY_OK instead of NOTIFY_STOP. Second, we couldn't keep track of the number of events being handled in each routine (most only handle one, perf can handle more than one). Third, I wanted to eventually display which nmi handlers are registered in the system in /proc/interrupts to help see who is generating NMIs. The patch below just implements the new infrastructure but doesn't wire it up yet (that is the next patch). Its design is based on doIRQ structs and the atomic notifier routines. So the rcu stuff in the patch isn't entirely untested (as the notifier routines have soaked it) but it should be double checked in case I copied the code wrong. Signed-off-by: Don Zickus <dzickus@redhat.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/1317409584-23662-3-git-send-email-dzickus@redhat.com Signed-off-by: Ingo Molnar <mingo@elte.hu>