summaryrefslogtreecommitdiff
path: root/arch/x86/kernel/cpu/mcheck/mce.c
AgeCommit message (Collapse)Author
2018-02-01Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/pmladek/printk Pull printk updates from Petr Mladek: - Add a console_msg_format command line option: The value "default" keeps the old "[time stamp] text\n" format. The value "syslog" allows to see the syslog-like "<log level>[timestamp] text" format. This feature was requested by people doing regression tests, for example, 0day robot. They want to have both filtered and full logs at hands. - Reduce the risk of softlockup: Pass the console owner in a busy loop. This is a new approach to the old problem. It was first proposed by Steven Rostedt on Kernel Summit 2017. It marks a context in which the console_lock owner calls console drivers and could not sleep. On the other side, printk() callers could detect this state and use a busy wait instead of a simple console_trylock(). Finally, the console_lock owner checks if there is a busy waiter at the end of the special context and eventually passes the console_lock to the waiter. The hand-off works surprisingly well and helps in many situations. Well, there is still a possibility of the softlockup, for example, when the flood of messages stops and the last owner still has too much to flush. There is increasing number of people having problems with printk-related softlockups. We might eventually need to get better solution. Anyway, this looks like a good start and promising direction. - Do not allow to schedule in console_unlock() called from printk(): This reverts an older controversial commit. The reschedule helped to avoid softlockups. But it also slowed down the console output. This patch is obsoleted by the new console waiter logic described above. In fact, the reschedule made the hand-off less effective. - Deprecate "%pf" and "%pF" format specifier: It was needed on ia64, ppc64 and parisc64 to dereference function descriptors and show the real function address. It is done transparently by "%ps" and "pS" format specifier now. Sergey Senozhatsky found that all the function descriptors were in a special elf section and could be easily detected. - Remove printk_symbol() API: It has been obsoleted by "%pS" format specifier, and this change helped to remove few continuous lines and a less intuitive old API. - Remove redundant memsets: Sergey removed unnecessary memset when processing printk.devkmsg command line option. * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/pmladek/printk: (27 commits) printk: drop redundant devkmsg_log_str memsets printk: Never set console_may_schedule in console_trylock() printk: Hide console waiter logic into helpers printk: Add console owner and waiter logic to load balance console writes kallsyms: remove print_symbol() function checkpatch: add pF/pf deprecation warning symbol lookup: introduce dereference_symbol_descriptor() parisc64: Add .opd based function descriptor dereference powerpc64: Add .opd based function descriptor dereference ia64: Add .opd based function descriptor dereference sections: split dereference_function_descriptor() openrisc: Fix conflicting types for _exext and _stext lib: do not use print_symbol() irq debug: do not use print_symbol() sysfs: do not use print_symbol() drivers: do not use print_symbol() x86: do not use print_symbol() unicore32: do not use print_symbol() sh: do not use print_symbol() mn10300: do not use print_symbol() ...
2018-01-30Merge branch 'siginfo-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace Pull siginfo cleanups from Eric Biederman: "Long ago when 2.4 was just a testing release copy_siginfo_to_user was made to copy individual fields to userspace, possibly for efficiency and to ensure initialized values were not copied to userspace. Unfortunately the design was complex, it's assumptions unstated, and humans are fallible and so while it worked much of the time that design failed to ensure unitialized memory is not copied to userspace. This set of changes is part of a new design to clean up siginfo and simplify things, and hopefully make the siginfo handling robust enough that a simple inspection of the code can be made to ensure we don't copy any unitializied fields to userspace. The design is to unify struct siginfo and struct compat_siginfo into a single definition that is shared between all architectures so that anyone adding to the set of information shared with struct siginfo can see the whole picture. Hopefully ensuring all future si_code assignments are arch independent. The design is to unify copy_siginfo_to_user32 and copy_siginfo_from_user32 so that those function are complete and cope with all of the different cases documented in signinfo_layout. I don't think there was a single implementation of either of those functions that was complete and correct before my changes unified them. The design is to introduce a series of helpers including force_siginfo_fault that take the values that are needed in struct siginfo and build the siginfo structure for their callers. Ensuring struct siginfo is built correctly. The remaining work for 4.17 (unless someone thinks it is post -rc1 material) is to push usage of those helpers down into the architectures so that architecture specific code will not need to deal with the fiddly work of intializing struct siginfo, and then when struct siginfo is guaranteed to be fully initialized change copy siginfo_to_user into a simple wrapper around copy_to_user. Further there is work in progress on the issues that have been documented requires arch specific knowledge to sort out. The changes below fix or at least document all of the issues that have been found with siginfo generation. Then proceed to unify struct siginfo the 32 bit helpers that copy siginfo to and from userspace, and generally clean up anything that is not arch specific with regards to siginfo generation. It is a lot but with the unification you can of siginfo you can already see the code reduction in the kernel" * 'siginfo-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (45 commits) signal/memory-failure: Use force_sig_mceerr and send_sig_mceerr mm/memory_failure: Remove unused trapno from memory_failure signal/ptrace: Add force_sig_ptrace_errno_trap and use it where needed signal/powerpc: Remove unnecessary signal_code parameter of do_send_trap signal: Helpers for faults with specialized siginfo layouts signal: Add send_sig_fault and force_sig_fault signal: Replace memset(info,...) with clear_siginfo for clarity signal: Don't use structure initializers for struct siginfo signal/arm64: Better isolate the COMPAT_TASK portion of ptrace_hbptriggered ptrace: Use copy_siginfo in setsiginfo and getsiginfo signal: Unify and correct copy_siginfo_to_user32 signal: Remove the code to clear siginfo before calling copy_siginfo_from_user32 signal: Unify and correct copy_siginfo_from_user32 signal/blackfin: Remove pointless UID16_SIGINFO_COMPAT_NEEDED signal/blackfin: Move the blackfin specific si_codes to asm-generic/siginfo.h signal/tile: Move the tile specific si_codes to asm-generic/siginfo.h signal/frv: Move the frv specific si_codes to asm-generic/siginfo.h signal/ia64: Move the ia64 specific si_codes to asm-generic/siginfo.h signal/powerpc: Remove redefinition of NSIGTRAP on powerpc signal: Move addr_lsb into the _sigfault union for clarity ...
2018-01-30Merge branch 'ras-core-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 RAS updates from Ingo Molnar: - various AMD SMCA error parsing/reporting improvements (Yazen Ghannam) - extend Intel CMCI error reporting to more cases (Xie XiuQi) * 'ras-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/MCE: Make correctable error detection look at the Deferred bit x86/MCE: Report only DRAM ECC as memory errors on AMD systems x86/MCE/AMD: Define a function to get SMCA bank type x86/mce/AMD: Don't set DEF_INT_TYPE in MSR_CU_DEF_ERR on SMCA systems x86/MCE: Extend table to report action optional errors through CMCI too
2018-01-23mm/memory_failure: Remove unused trapno from memory_failureEric W. Biederman
Today 4 architectures set ARCH_SUPPORTS_MEMORY_FAILURE (arm64, parisc, powerpc, and x86), while 4 other architectures set __ARCH_SI_TRAPNO (alpha, metag, sparc, and tile). These two sets of architectures do not interesect so remove the trapno paramater to remove confusion. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2018-01-21Merge branch 'x86-pti-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 pti fixes from Thomas Gleixner: "A small set of fixes for the meltdown/spectre mitigations: - Make kprobes aware of retpolines to prevent probes in the retpoline thunks. - Make the machine check exception speculation protected. MCE used to issue an indirect call directly from the ASM entry code. Convert that to a direct call into a C-function and issue the indirect call from there so the compiler can add the retpoline protection, - Make the vmexit_fill_RSB() assembly less stupid - Fix a typo in the PTI documentation" * 'x86-pti-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/retpoline: Optimize inline assembler for vmexit_fill_RSB x86/pti: Document fix wrong index kprobes/x86: Disable optimizing on the function jumps to indirect thunk kprobes/x86: Blacklist indirect thunk functions for kprobes retpoline: Introduce start/end markers of indirect thunk x86/mce: Make machine check speculation protected
2018-01-19x86/mce: Make machine check speculation protectedThomas Gleixner
The machine check idtentry uses an indirect branch directly from the low level code. This evades the speculation protection. Replace it by a direct call into C code and issue the indirect call there so the compiler can apply the proper speculation protection. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by:Borislav Petkov <bp@alien8.de> Reviewed-by: David Woodhouse <dwmw@amazon.co.uk> Niced-by: Peter Zijlstra <peterz@infradead.org> Cc: stable@vger.kernel.org Link: https://lkml.kernel.org/r/alpine.DEB.2.20.1801181626290.1847@nanos
2018-01-05x86: do not use print_symbol()Sergey Senozhatsky
print_symbol() is a very old API that has been obsoleted by %pS format specifier in a normal printk() call. Replace print_symbol() with a direct printk("%pS") call and correctly handle continuous lines. Link: http://lkml.kernel.org/r/20171211125025.2270-9-sergey.senozhatsky@gmail.com To: Andrew Morton <akpm@linux-foundation.org> To: Russell King <linux@armlinux.org.uk> To: Catalin Marinas <catalin.marinas@arm.com> To: Mark Salter <msalter@redhat.com> To: Tony Luck <tony.luck@intel.com> To: David Howells <dhowells@redhat.com> To: Yoshinori Sato <ysato@users.sourceforge.jp> To: Guan Xuetao <gxt@mprc.pku.edu.cn> To: Borislav Petkov <bp@alien8.de> To: Greg Kroah-Hartman <gregkh@linuxfoundation.org> To: Thomas Gleixner <tglx@linutronix.de> To: Peter Zijlstra <peterz@infradead.org> To: Vineet Gupta <vgupta@synopsys.com> To: Fengguang Wu <fengguang.wu@intel.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Petr Mladek <pmladek@suse.com> Cc: LKML <linux-kernel@vger.kernel.org> Cc: linux-arm-kernel@lists.infradead.org Cc: linux-c6x-dev@linux-c6x.org Cc: linux-ia64@vger.kernel.org Cc: linux-am33-list@redhat.com Cc: linux-sh@vger.kernel.org Cc: linux-edac@vger.kernel.org Cc: x86@kernel.org Cc: linux-snps-arc@lists.infradead.org Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Acked-by: Borislav Petkov <bp@suse.de> # mce.c part [pmladek@suse.com: updated commit message] Signed-off-by: Petr Mladek <pmladek@suse.com>
2017-12-18x86/MCE: Make correctable error detection look at the Deferred bitYazen Ghannam
AMD systems may log Deferred errors. These are errors that are uncorrected but which do not need immediate action. The MCA_STATUS[UC] bit may not be set for Deferred errors. Flag the error as not correctable when MCA_STATUS[Deferred] is set and do not feed it into the Correctable Errors Collector. [ bp: Massage commit message. ] Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com> Signed-off-by: Borislav Petkov <bp@suse.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/20171212165143.27475-1-Yazen.Ghannam@amd.com
2017-12-18x86/MCE: Report only DRAM ECC as memory errors on AMD systemsYazen Ghannam
The MCA_STATUS[ErrorCodeExt] field is very bank type specific. We currently check if the ErrorCodeExt value is 0x0 or 0x8 in mce_is_memory_error(), but we don't check the bank number. This means that we could flag non-memory errors as memory errors. We know that we want to flag DRAM ECC errors as memory errors, so let's do those cases first. We can add more cases later when needed. Define a wrapper function in mce_amd.c so we can use SMCA enums. [ bp: Remove brackets around return statements. ] Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com> Signed-off-by: Borislav Petkov <bp@suse.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/20171207203955.118171-2-Yazen.Ghannam@amd.com
2017-10-05x86/mce: Convert timers to use timer_setup()Kees Cook
In preparation for unconditionally passing the struct timer_list pointer to all timer callbacks, switch to using the new timer_setup() and from_timer() to pass the timer pointer explicitly. Adjust sanity-check WARN to make sure the triggering timer matches the current CPU timer. Signed-off-by: Kees Cook <keescook@chromium.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Borislav Petkov <bp@alien8.de> Cc: Tony Luck <tony.luck@intel.com> Cc: linux-edac@vger.kernel.org Link: https://lkml.kernel.org/r/20171005005425.GA23950@beast
2017-08-17x86/mm, mm/hwpoison: Clear PRESENT bit for kernel 1:1 mappings of poison pagesTony Luck
Speculative processor accesses may reference any memory that has a valid page table entry. While a speculative access won't generate a machine check, it will log the error in a machine check bank. That could cause escalation of a subsequent error since the overflow bit will be then set in the machine check bank status register. Code has to be double-plus-tricky to avoid mentioning the 1:1 virtual address of the page we want to map out otherwise we may trigger the very problem we are trying to avoid. We use a non-canonical address that passes through the usual Linux table walking code to get to the same "pte". Thanks to Dave Hansen for reviewing several iterations of this. Also see: http://marc.info/?l=linux-mm&m=149860136413338&w=2 Signed-off-by: Tony Luck <tony.luck@intel.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Borislav Petkov <bp@suse.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: Elliott, Robert (Persistent Memory) <elliott@hpe.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-mm@kvack.org Cc: stable@vger.kernel.org Link: http://lkml.kernel.org/r/20170816171803.28342-1-tony.luck@intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-06-26x86/mce: Always save severity in machine_check_poll()Yazen Ghannam
The MCE severity gives a hint as to how to handle the error. The notifier blocks can then use the severity to decide on an action. It's not necessary for machine_check_poll() to filter errors for the notifier chain, since each block will check its own set of conditions before handling an error. Also, there isn't any urgency for machine_check_poll() to make decisions based on severity like in do_machine_check(). If we can assume that a severity is set then we can use it in more notifier blocks. For example, the CEC block could check for a "KEEP" severity rather than checking bits in the status. This isn't possible now since the severity is not set except for "DEFFRRED/UCNA" errors with a valid address. Save the severity since we have it, and let the notifier blocks decide if they want to do anything. Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com> Signed-off-by: Borislav Petkov <bp@suse.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/1498074402-98633-1-git-send-email-Yazen.Ghannam@amd.com
2017-06-14x86/mce: Update bootlog description to reflect behavior on AMDYazen Ghannam
The bootlog option is only disabled by default on AMD Fam10h and older systems. Update bootlog description to say this. Change the family value to hex to avoid confusion. Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com> Signed-off-by: Borislav Petkov <bp@suse.de> Cc: Borislav Petkov <bp@alien8.de> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Tony Luck <tony.luck@intel.com> Cc: linux-edac <linux-edac@vger.kernel.org> Link: http://lkml.kernel.org/r/20170613162835.30750-9-bp@alien8.de Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-06-14x86/mce: Don't disable MCA banks when offlining a CPU on AMDYazen Ghannam
AMD systems have non-core, shared MCA banks within a die. These banks are controlled by a master CPU per die. If this CPU is offlined then all the shared banks are disabled in addition to the CPU's core banks. Also, Fam17h systems may have SMT enabled. The MCA_CTL register is shared between SMT thread siblings. If a CPU is offlined then all its sibling's MCA banks are also disabled. Extend the existing vendor check to AMD too. Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com> [ Fix up comment. ] Signed-off-by: Borislav Petkov <bp@suse.de> Cc: Borislav Petkov <bp@alien8.de> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Tony Luck <tony.luck@intel.com> Cc: linux-edac <linux-edac@vger.kernel.org> Link: http://lkml.kernel.org/r/20170613162835.30750-8-bp@alien8.de Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-05-21x86/MCE: Export memory_error()Borislav Petkov
Export the function which checks whether an MCE is a memory error to other users so that we can reuse the logic. Drop the boot_cpu_data use, while at it, as mce.cpuvendor already has the CPU vendor in there. Integrate a piece from a patch from Vishal Verma <vishal.l.verma@intel.com> to export it for modules (nfit). The main reason we're exporting it is that the nfit handler nfit_handle_mce() needs to detect a memory error properly before doing its recovery actions. Signed-off-by: Borislav Petkov <bp@suse.de> Cc: Tony Luck <tony.luck@intel.com> Cc: Vishal Verma <vishal.l.verma@intel.com> Cc: <stable@vger.kernel.org> Link: http://lkml.kernel.org/r/20170519093915.15413-2-bp@alien8.de Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2017-05-01Merge branch 'ras-core-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull RAS updates from Ingo Molnar: "The main changes in this cycle were: - add the 'Corrected Errors Collector' kernel feature which collect and monitor correctable errors statistics and will preemptively (soft-)offline physical pages that have a suspiciously high error count. - handle MCE errors during kexec() more gracefully - factor out and deprecate the /dev/mcelog driver - ... plus misc fixes and cleanpus" * 'ras-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/mce: Check MCi_STATUS[MISCV] for usable addr on Intel only ACPI/APEI: Use setup_deferrable_timer() x86/mce: Update notifier priority check x86/mce: Enable PPIN for Knights Landing/Mill x86/mce: Do not register notifiers with invalid prio x86/mce: Factor out and deprecate the /dev/mcelog driver RAS: Add a Corrected Errors Collector x86/mce: Rename mce_log to mce_log_buffer x86/mce: Rename mce_log()'s argument x86/mce: Init some CPU features early x86/mce: Handle broadcasted MCE gracefully with kexec
2017-04-19x86/mce: Check MCi_STATUS[MISCV] for usable addr on Intel onlyBorislav Petkov
mce_usable_address() does a bunch of basic sanity checks to verify whether the address reported with the error is usable for further processing. However, we do check MCi_STATUS[MISCV] and that is not needed on AMD as that bit says that there's additional information about the logged error in the MCi_MISCj banks. But we don't need that to know whether the address is usable - we only need to know whether the physical address is valid - i.e., ADDRV. On Intel the MISCV bit is needed to perform additional checks to determine whether the reported address is a physical one, etc. Signed-off-by: Borislav Petkov <bp@suse.de> Cc: Yazen Ghannam <yazen.ghannam@amd.com> Cc: Tony Luck <tony.luck@intel.com> Cc: linux-edac <linux-edac@vger.kernel.org> Link: http://lkml.kernel.org/r/20170418183924.6agjkebilwqj26or@pd.tnic Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2017-04-18x86/mce: Make the MCE notifier a blocking oneVishal Verma
The NFIT MCE handler callback (for handling media errors on NVDIMMs) takes a mutex to add the location of a memory error to a list. But since the notifier call chain for machine checks (x86_mce_decoder_chain) is atomic, we get a lockdep splat like: BUG: sleeping function called from invalid context at kernel/locking/mutex.c:620 in_atomic(): 1, irqs_disabled(): 0, pid: 4, name: kworker/0:0 [..] Call Trace: dump_stack ___might_sleep __might_sleep mutex_lock_nested ? __lock_acquire nfit_handle_mce notifier_call_chain atomic_notifier_call_chain ? atomic_notifier_call_chain mce_gen_pool_process Convert the notifier to a blocking one which gets to run only in process context. Boris: remove the notifier call in atomic context in print_mce(). For now, let's print the MCE on the atomic path so that we can make sure they go out and get logged at least. Fixes: 6839a6d96f4e ("nfit: do an ARS scrub on hitting a latent media error") Reported-by: Ross Zwisler <ross.zwisler@linux.intel.com> Signed-off-by: Vishal Verma <vishal.l.verma@intel.com> Acked-by: Tony Luck <tony.luck@intel.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: linux-edac <linux-edac@vger.kernel.org> Cc: x86-ml <x86@kernel.org> Cc: <stable@vger.kernel.org> Link: http://lkml.kernel.org/r/20170411224457.24777-1-vishal.l.verma@intel.com Signed-off-by: Borislav Petkov <bp@suse.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2017-04-18x86/mce: Update notifier priority checkBorislav Petkov
Update the check which enforces the registration of MCE decoder notifier callbacks with valid priority only, to include mcelog's priority. Reported-by: kernel test robot <xiaolong.ye@intel.com> Signed-off-by: Borislav Petkov <bp@suse.de> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Tony Luck <tony.luck@intel.com> Cc: linux-edac <linux-edac@vger.kernel.org> Cc: lkp@01.org Link: http://lkml.kernel.org/r/20170418073820.i6kl5tggcntwlisa@pd.tnic Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-03-28x86/mce: Do not register notifiers with invalid prioBorislav Petkov
This is just a defensive precaution: do not register notifiers with a priority which would disrupt the error handling in the notifiers with prio higher than MCE_PRIO_EDAC. Signed-off-by: Borislav Petkov <bp@suse.de> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-edac <linux-edac@vger.kernel.org> Link: http://lkml.kernel.org/r/20170327093304.10683-7-bp@alien8.de Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-03-28x86/mce: Factor out and deprecate the /dev/mcelog driverTony Luck
Move all code relating to /dev/mcelog to a separate source file. /dev/mcelog driver can now operate from the machine check notifier with lowest prio. Signed-off-by: Tony Luck <tony.luck@intel.com> [ Move the mce_helper and trigger functionality behind CONFIG_X86_MCELOG_LEGACY. ] Signed-off-by: Borislav Petkov <bp@suse.de> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-edac <linux-edac@vger.kernel.org> Link: http://lkml.kernel.org/r/20170327093304.10683-6-bp@alien8.de [ Renamed CONFIG_X86_MCELOG to CONFIG_X86_MCELOG_LEGACY. ] Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-03-28RAS: Add a Corrected Errors CollectorBorislav Petkov
Introduce a simple data structure for collecting correctable errors along with accessors. More detailed description in the code itself. The error decoding is done with the decoding chain now and mce_first_notifier() gets to see the error first and the CEC decides whether to log it and then the rest of the chain doesn't hear about it - basically the main reason for the CE collector - or to continue running the notifiers. When the CEC hits the action threshold, it will try to soft-offine the page containing the ECC and then the whole decoding chain gets to see the error. Signed-off-by: Borislav Petkov <bp@suse.de> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-edac <linux-edac@vger.kernel.org> Link: http://lkml.kernel.org/r/20170327093304.10683-5-bp@alien8.de Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-03-28x86/mce: Rename mce_log to mce_log_bufferBorislav Petkov
It is confusing when staring at "struct mce_log mcelog" and then there's also a function called mce_log(). So call the buffer what it is. No functionality change. Signed-off-by: Borislav Petkov <bp@suse.de> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-edac <linux-edac@vger.kernel.org> Link: http://lkml.kernel.org/r/20170327093304.10683-4-bp@alien8.de Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-03-28x86/mce: Rename mce_log()'s argumentBorislav Petkov
We call it everywhere "struct mce *m". Adjust that here too to avoid confusion. No functionality change. Signed-off-by: Borislav Petkov <bp@suse.de> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-edac <linux-edac@vger.kernel.org> Link: http://lkml.kernel.org/r/20170327093304.10683-3-bp@alien8.de Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-03-28Merge branch 'ras/urgent' into ras/core, to pick up fixIngo Molnar
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-03-28x86/mce: Don't print MCEs when mcelog is activeAndi Kleen
Since: cd9c57cad3fe ("x86/MCE: Dump MCE to dmesg if no consumers") all MCEs are printed even when mcelog is running. Fix the regression to not print to dmesg when mcelog is running as it is a consumer too. Signed-off-by: Andi Kleen <ak@linux.intel.com> [ Massage commit message. ] Signed-off-by: Borislav Petkov <bp@suse.de> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Tony Luck <tony.luck@intel.com> Cc: linux-edac <linux-edac@vger.kernel.org> Cc: stable@vger.kernel.org # 4.10.. Fixes: cd9c57cad3fe ("x86/MCE: Dump MCE to dmesg if no consumers") Link: http://lkml.kernel.org/r/20170327093304.10683-2-bp@alien8.de Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-03-18x86/mce: Init some CPU features earlyYazen Ghannam
When the MCA banks in __mcheck_cpu_init_generic() are polled for leftover errors logged during boot or from the previous boot, its required to have CPU features detected sufficiently so that the reading out and handling of those early errors is done correctly. If those features are not available, the decoding may miss some information and get incomplete errors logged. For example, on SMCA systems the MCA_IPID and MCA_SYND registers are not logged and MCA_ADDR is not masked appropriately. To cure that, do a subset of the basic feature detection early while the rest happens in its usual place in __mcheck_cpu_init_vendor(). Signed-off-by: Yazen Ghannam <Yazen.Ghannam@amd.com> Cc: Tony Luck <tony.luck@intel.com> Cc: linux-edac <linux-edac@vger.kernel.org> Cc: x86-ml <x86@kernel.org> Link: http://lkml.kernel.org/r/1489599055-20756-1-git-send-email-Yazen.Ghannam@amd.com [ Massage commit message and simplify. ] Signed-off-by: Borislav Petkov <bp@suse.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2017-03-13x86/mce: Handle broadcasted MCE gracefully with kexecXunlei Pang
When we are about to kexec a crash kernel and right then and there a broadcasted MCE fires while we're still in the first kernel and while the other CPUs remain in a holding pattern, the #MC handler of the first kernel will timeout and then panic due to never completing MCE synchronization. Handle this in a similar way as to when the CPUs are offlined when that broadcasted MCE happens. [ Boris: rewrote commit message and comments. ] Suggested-by: Borislav Petkov <bp@alien8.de> Signed-off-by: Xunlei Pang <xlpang@redhat.com> Signed-off-by: Borislav Petkov <bp@suse.de> Acked-by: Tony Luck <tony.luck@intel.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: kexec@lists.infradead.org Cc: linux-edac <linux-edac@vger.kernel.org> Link: http://lkml.kernel.org/r/1487857012-9059-1-git-send-email-xlpang@redhat.com Link: http://lkml.kernel.org/r/20170313095019.19351-1-bp@alien8.de Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2017-02-20Merge branch 'ras-core-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull RAS updates from Ingo Molnar: "The main changes in this cycle were: - Assign notifier chain priorities for all RAS related handlers to make the ordering explicit (Borislav Petkov) - Improve the AMD MCA banks sysfs output (Yazen Ghannam) - Various cleanups and restructuring of the x86 RAS code (Borislav Petkov)" * 'ras-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/ras, EDAC, acpi: Assign MCE notifier handlers a priority x86/ras: Get rid of mce_process_work() EDAC/mce/amd: Dump TSC value EDAC/mce/amd: Unexport amd_decode_mce() x86/ras/amd/inj: Change dependency x86/ras: Flip the TSC-adding logic x86/ras/amd: Make sysfs names of banks more user-friendly x86/ras/therm_throt: Do not log a fake MCE for thermal events x86/ras/inject: Make it depend on X86_LOCAL_APIC=y
2017-01-31x86/mce: Make timer handling more robustThomas Gleixner
Erik reported that on a preproduction hardware a CMCI storm triggers the BUG_ON in add_timer_on(). The reason is that the per CPU MCE timer is started by the CMCI logic before the MCE CPU hotplug callback starts the timer with add_timer_on(). So the timer is already queued which triggers the BUG. Using add_timer_on() is pretty pointless in this code because the timer is strictlty per CPU, initialized as pinned and all operations which arm the timer happen on the CPU to which the timer belongs. Simplify the whole machinery by using mod_timer() instead of add_timer_on() which avoids the problem because mod_timer() can handle already queued timers. Use __start_timer() everywhere so the earliest armed expiry time is preserved. Reported-by: Erik Veijola <erik.veijola@intel.com> Tested-by: Borislav Petkov <bp@alien8.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Borislav Petkov <bp@alien8.de> Cc: Tony Luck <tony.luck@intel.com> Link: http://lkml.kernel.org/r/alpine.DEB.2.20.1701310936080.3457@nanos Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2017-01-24x86/ras, EDAC, acpi: Assign MCE notifier handlers a priorityBorislav Petkov
Assign all notifiers on the MCE decode chain a priority so that they get called in the correct order. Suggested-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Borislav Petkov <bp@suse.de> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Tony Luck <tony.luck@intel.com> Cc: Yazen Ghannam <Yazen.Ghannam@amd.com> Cc: linux-edac <linux-edac@vger.kernel.org> Link: http://lkml.kernel.org/r/20170123183514.13356-10-bp@alien8.de Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-01-24x86/ras: Get rid of mce_process_work()Borislav Petkov
Make mce_gen_pool_process() the workqueue function directly and save us an indirection. Signed-off-by: Borislav Petkov <bp@suse.de> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Tony Luck <tony.luck@intel.com> Cc: Yazen Ghannam <Yazen.Ghannam@amd.com> Cc: linux-edac <linux-edac@vger.kernel.org> Link: http://lkml.kernel.org/r/20170123183514.13356-9-bp@alien8.de Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-01-24x86/ras: Flip the TSC-adding logicBorislav Petkov
Add the TSC value to the MCE record only when the MCE being logged is precise, i.e., it is logged as an exception or an MCE-related interrupt. So it doesn't look particularly easy to do without touching/changing a bunch of places. That's why I'm trying tricks first. For example, the mce-apei.c case I'm addressing by setting ->tsc only for errors of panic severity. The idea there is, that, panic errors will have raised an #MC and not polled. And then instead of propagating a flag to mce_setup(), it seems easier/less code to set ->tsc depending on the call sites, i.e., are we polling or are we preparing an MCE record in an exception handler/thresholding interrupt. Signed-off-by: Borislav Petkov <bp@suse.de> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Tony Luck <tony.luck@intel.com> Cc: Yazen Ghannam <Yazen.Ghannam@amd.com> Cc: linux-edac <linux-edac@vger.kernel.org> Link: http://lkml.kernel.org/r/20170123183514.13356-5-bp@alien8.de Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-01-24x86/ras/therm_throt: Do not log a fake MCE for thermal eventsBorislav Petkov
We log a fake bank 128 MCE to note that we're handling a CPU thermal event. However, this confuses people into thinking that their hardware generates MCEs. Hijacking MCA for logging thermal events is a gross misuse anyway and it shouldn't have been done in the first place. And besides we have other means for dealing with thermal events which are much more suitable. So let's kill the MCE logging part. Signed-off-by: Borislav Petkov <bp@suse.de> Acked-by: Ashok Raj <ashok.raj@intel.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Tony Luck <tony.luck@intel.com> Cc: Yazen Ghannam <Yazen.Ghannam@amd.com> Cc: linux-edac <linux-edac@vger.kernel.org> Link: http://lkml.kernel.org/r/20170105213846.GA12024@gmail.com Link: http://lkml.kernel.org/r/20170123183514.13356-3-bp@alien8.de Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-12-13Merge branch 'for-4.10' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wqLinus Torvalds
Pull workqueue updates from Tejun Heo: "Mostly patches to initialize workqueue subsystem earlier and get rid of keventd_up(). The patches were headed for the last merge cycle but got delayed due to a bug found late minute, which is fixed now. Also, to help debugging, destroy_workqueue() is more chatty now on a sanity check failure." * 'for-4.10' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq: workqueue: move wq_numa_init() to workqueue_init() workqueue: remove keventd_up() debugobj, workqueue: remove keventd_up() usage slab, workqueue: remove keventd_up() usage power, workqueue: remove keventd_up() usage tty, workqueue: remove keventd_up() usage mce, workqueue: remove keventd_up() usage workqueue: make workqueue available early during boot workqueue: dump workqueue state on sanity check failures in destroy_workqueue()
2016-12-12Merge branch 'smp-hotplug-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull smp hotplug updates from Thomas Gleixner: "This is the final round of converting the notifier mess to the state machine. The removal of the notifiers and the related infrastructure will happen around rc1, as there are conversions outstanding in other trees. The whole exercise removed about 2000 lines of code in total and in course of the conversion several dozen bugs got fixed. The new mechanism allows to test almost every hotplug step standalone, so usage sites can exercise all transitions extensively. There is more room for improvement, like integrating all the pointlessly different architecture mechanisms of synchronizing, setting cpus online etc into the core code" * 'smp-hotplug-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (60 commits) tracing/rb: Init the CPU mask on allocation soc/fsl/qbman: Convert to hotplug state machine soc/fsl/qbman: Convert to hotplug state machine zram: Convert to hotplug state machine KVM/PPC/Book3S HV: Convert to hotplug state machine arm64/cpuinfo: Convert to hotplug state machine arm64/cpuinfo: Make hotplug notifier symmetric mm/compaction: Convert to hotplug state machine iommu/vt-d: Convert to hotplug state machine mm/zswap: Convert pool to hotplug state machine mm/zswap: Convert dst-mem to hotplug state machine mm/zsmalloc: Convert to hotplug state machine mm/vmstat: Convert to hotplug state machine mm/vmstat: Avoid on each online CPU loops mm/vmstat: Drop get_online_cpus() from init_cpu_node_state/vmstat_cpu_dead() tracing/rb: Convert to hotplug state machine oprofile/nmi timer: Convert to hotplug state machine net/iucv: Use explicit clean up labels in iucv_init() x86/pci/amd-bus: Convert to hotplug state machine x86/oprofile/nmi: Convert to hotplug state machine ...
2016-11-23x86/mce: Include the PPIN in MCE records when availableTony Luck
Intel Xeons from Ivy Bridge onwards support a processor identification number set in the factory. To the user this is a handy unique number to identify a particular CPU. Intel can decode this to the fab/production run to track errors. On systems that have it, include it in the machine check record. I'm told that this would be helpful for users that run large data centers with multi-socket servers to keep track of which CPUs are seeing errors. Boris: * Add some clarifying comments and spacing. * Mask out [63:2] in the disabled-but-not-locked case * Call the MSR variable "val" for more readability. Signed-off-by: Tony Luck <tony.luck@intel.com> Cc: Ashok Raj <ashok.raj@intel.com> Cc: linux-edac <linux-edac@vger.kernel.org> Cc: x86-ml <x86@kernel.org> Link: http://lkml.kernel.org/r/20161123114855.njguoaygp3qnbkia@pd.tnic Signed-off-by: Borislav Petkov <bp@suse.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-11-16x86/mcheck: Move CPU_DEAD to hotplug state machineSebastian Andrzej Siewior
This moves the last piece of the old hotplug notifier code in MCE to the new hotplug state machine. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Acked-by: Borislav Petkov <bp@alien8.de> Cc: Tony Luck <tony.luck@intel.com> Cc: rt@linutronix.de Cc: linux-edac@vger.kernel.org Link: http://lkml.kernel.org/r/20161110174447.11848-8-bigeasy@linutronix.de Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-11-16x86/mcheck: Move CPU_ONLINE and CPU_DOWN_PREPARE to hotplug state machineSebastian Andrzej Siewior
The CPU_ONLINE and CPU_DOWN_PREPARE look fully symmetrical and could be move to the hotplug state machine. On a failure during registration we have the tear down callback invoked (mce_cpu_pre_down()) so there should be no timer around and so no need to need keep notifier installed (this was the reason according to the comment why the notifier was registered despite of errors). Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Acked-by: Borislav Petkov <bp@alien8.de> Cc: Tony Luck <tony.luck@intel.com> Cc: rt@linutronix.de Cc: linux-edac@vger.kernel.org Link: http://lkml.kernel.org/r/20161110174447.11848-7-bigeasy@linutronix.de Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-11-16x86/mcheck: Reorganize the hotplug callbacksSebastian Andrzej Siewior
Initially I wanted to remove mcheck_cpu_init() from identify_cpu() and let it become an independent early hotplug callback. The main problem here was that the init on the boot CPU may happen too late (device_initcall_sync(mcheck_init_device)) and nobody wanted to risk receiving and MCE event at boot time leading to a shutdown (if the MCE feature is not yet enabled). Here is attempt two: the timming stays as-is but the ordering of the functions is changed: - mcheck_cpu_init() (which is run from identify_cpu()) will setup the timer struct but won't fire the timer. This is moved to CPU_ONLINE since its cleanup part is in CPU_DOWN_PREPARE. So if it is okay to stop the timer early in the shutdown phase, it should be okay to start it late in the bring up phase. - CPU_DOWN_PREPARE disables the MCE feature flags for !INTEL CPUs in mce_disable_cpu(). If a failure occures it would be re-enabled on all vendor CPUs (including Intel where it was not disabled during shutdown). To keep this working I am moving it to CPU_ONLINE. smp_call_function_single() is dropped beause the notifier runs nowdays on the target CPU. - CPU_ONLINE is invoking mce_device_create() + mce_threshold_create_device() but its cleanup part is in CPU_DEAD (mce_threshold_remove_device() and mce_device_remove()). In order to keep this symmetrical I am moving the clean up from CPU_DEAD to CPU_DOWN_PREPARE. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Acked-by: Borislav Petkov <bp@alien8.de> Cc: Tony Luck <tony.luck@intel.com> Cc: rt@linutronix.de Cc: linux-edac@vger.kernel.org Link: http://lkml.kernel.org/r/20161110174447.11848-6-bigeasy@linutronix.de Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-11-16x86/mcheck: Split threshold_cpu_callback into two callbacksSebastian Andrzej Siewior
The threshold_cpu_callback callbacks looks like one of the notifier and its arguments are almost the same. Split this out and have one ONLINE and one DEAD callback. This will come handy later once the main code gets changed to use the callback mechanism. Also, handle threshold_cpu_callback_online() return value so we don't continue if the function fails. Boris Petkov removed the callback pointer and replaced it with proper functions. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Acked-by: Borislav Petkov <bp@alien8.de> Cc: Tony Luck <tony.luck@intel.com> Cc: rt@linutronix.de Cc: linux-edac@vger.kernel.org Link: http://lkml.kernel.org/r/20161110174447.11848-5-bigeasy@linutronix.de Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-11-16x86/mcheck: Be prepared for a rollback back to the ONLINE stateSebastian Andrzej Siewior
If we try a CPU down and fail in the middle then we roll back to the online state. This means we would perform CPU_ONLINE / mce_device_create() without invoking CPU_DEAD / mce_device_remove() for the cleanup of what was allocated in CPU_ONLINE. Be prepared for this and don't allocate the struct if we have it already. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Acked-by: Borislav Petkov <bp@alien8.de> Cc: Tony Luck <tony.luck@intel.com> Cc: rt@linutronix.de Cc: linux-edac@vger.kernel.org Link: http://lkml.kernel.org/r/20161110174447.11848-4-bigeasy@linutronix.de Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-11-11x86/MCE: Correct TSC timestamping of error recordsBorislav Petkov
We did have logic in the MCE code which would TSC-timestamp an error record only when it is exact - i.e., when it wasn't detected by polling. This isn't the case anymore. So let's fix that: We have a valid TSC timestamp in the error record only when it has been a precise detection, i.e., either in the #MC handler or in one of the interrupt handlers (thresholding, deferred, ...). All other error records still have mce.time which contains the wall time in order to be able to place the error record in time at least approximately. Also, this fixes another bug where machine_check_poll() would clear mce.tsc unconditionally even if we requested precise MCP_TIMESTAMP logging. The proper fix would be to generate timestamp only when it has been requested and not always. But that would require a more thorough code audit of all mce_gather_info/mce_setup() users. Add a FIXME for now. Signed-off-by: Borislav Petkov <bp@suse.de> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Tony <tony.luck@intel.com> Cc: Tony Luck <tony.luck@intel.com> Cc: kernel test robot <xiaolong.ye@intel.com> Cc: linux-edac <linux-edac@vger.kernel.org> Cc: lkp@01.org Link: http://lkml.kernel.org/r/20161110131053.kybsijfs5venpjnf@pd.tnic Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-11-08x86/MCE: Dump MCE to dmesg if no consumersBorislav Petkov
When there are no error record consumers registered with the kernel, the only thing that appears in dmesg is something like: [ 300.000326] mce: [Hardware Error]: Machine check events logged and the error records are gone. Which is seriously counterproductive. So let's dump them to dmesg instead, in such a case. Requested-by: Eric Morton <Eric.Morton@amd.com> Signed-off-by: Borislav Petkov <bp@suse.de> Cc: Tony Luck <tony.luck@intel.com> Link: http://lkml.kernel.org/r/20161101120911.13163-4-bp@alien8.de Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-10-19Merge branch 'for-4.9' into for-4.10Tejun Heo
2016-09-17mce, workqueue: remove keventd_up() usageTejun Heo
Now that workqueue can handle work item queueing from very early during boot, there is no need to gate schedule_work() with keventd_up(). Remove it. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Tony Luck <tony.luck@intel.com> Cc: Borislav Petkov <bp@alien8.de> Cc: linux-edac@vger.kernel.org
2016-09-13x86/mce/AMD: Extract the error address on SMCA systemsYazen Ghannam
The MCA_ADDR registers on Scalable MCA systems contain the ErrorAddr in bits [55:0] and the least significant bit of the address in bits [61:56]. We should extract the valid ErrorAddr bits from the MCA_ADDR register rather than saving the raw value to struct mce. Signed-off-by: Yazen Ghannam <Yazen.Ghannam@amd.com> Signed-off-by: Borislav Petkov <bp@suse.de> Link: http://lkml.kernel.org/r/1473275643-1721-1-git-send-email-Yazen.Ghannam@amd.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-09-13x86/mce, EDAC/mce_amd: Print MCA_SYND and MCA_IPID during MCE on SMCA systemsYazen Ghannam
The MCA_SYND and MCA_IPID registers contain valuable information and should be included in MCE output. The MCA_SYND register contains syndrome and other error information, and the MCA_IPID register will uniquely identify the MCA bank's type without having to rely on system software. Signed-off-by: Yazen Ghannam <Yazen.Ghannam@amd.com> Signed-off-by: Borislav Petkov <bp@suse.de> Link: http://lkml.kernel.org/r/1472680624-34221-2-git-send-email-Yazen.Ghannam@amd.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-09-13x86/mce/AMD: Save MCA_IPID in MCE struct on SMCA systemsYazen Ghannam
The MCA_IPID register uniquely identifies a bank's type and instance on Scalable MCA systems. We should save the value of this register in struct mce along with the other relevant error information. This ensures that we can decode errors without relying on system software to correlate the bank to the type. Signed-off-by: Yazen Ghannam <Yazen.Ghannam@amd.com> Signed-off-by: Borislav Petkov <bp@suse.de> Link: http://lkml.kernel.org/r/1472680624-34221-1-git-send-email-Yazen.Ghannam@amd.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-09-13x86/mce: Add support for new MCA_SYND registerYazen Ghannam
Syndrome information is no longer contained in MCA_STATUS for SMCA systems but in a new register - MCA_SYND. Add a synd field to struct mce to hold MCA_SYND register value. Add it to the end of struct mce to maintain compatibility with old versions of mcelog. Also, add it to the respective tracepoint. Signed-off-by: Yazen Ghannam <Yazen.Ghannam@amd.com> Signed-off-by: Borislav Petkov <bp@suse.de> Link: http://lkml.kernel.org/r/1467633035-32080-1-git-send-email-Yazen.Ghannam@amd.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de>