summaryrefslogtreecommitdiff
path: root/arch/x86/ras
diff options
context:
space:
mode:
authorBorislav Petkov <bp@suse.de>2017-03-27 11:33:02 +0200
committerIngo Molnar <mingo@kernel.org>2017-03-28 08:54:48 +0200
commit011d8261117249eab97bc86a8e1ac7731e03e319 (patch)
tree5e4a07f4ac44d81b62344ee3c8dadadf1f77cf66 /arch/x86/ras
parente64edfcce9c738300b4102d0739577d6ecc96d4a (diff)
RAS: Add a Corrected Errors Collector
Introduce a simple data structure for collecting correctable errors along with accessors. More detailed description in the code itself. The error decoding is done with the decoding chain now and mce_first_notifier() gets to see the error first and the CEC decides whether to log it and then the rest of the chain doesn't hear about it - basically the main reason for the CE collector - or to continue running the notifiers. When the CEC hits the action threshold, it will try to soft-offine the page containing the ECC and then the whole decoding chain gets to see the error. Signed-off-by: Borislav Petkov <bp@suse.de> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-edac <linux-edac@vger.kernel.org> Link: http://lkml.kernel.org/r/20170327093304.10683-5-bp@alien8.de Signed-off-by: Ingo Molnar <mingo@kernel.org>
Diffstat (limited to 'arch/x86/ras')
-rw-r--r--arch/x86/ras/Kconfig14
1 files changed, 14 insertions, 0 deletions
diff --git a/arch/x86/ras/Kconfig b/arch/x86/ras/Kconfig
index 0bc60a308730..2a2d89d39af6 100644
--- a/arch/x86/ras/Kconfig
+++ b/arch/x86/ras/Kconfig
@@ -7,3 +7,17 @@ config MCE_AMD_INJ
aspects of the MCE handling code.
WARNING: Do not even assume this interface is staying stable!
+
+config RAS_CEC
+ bool "Correctable Errors Collector"
+ depends on X86_MCE && MEMORY_FAILURE && DEBUG_FS
+ ---help---
+ This is a small cache which collects correctable memory errors per 4K
+ page PFN and counts their repeated occurrence. Once the counter for a
+ PFN overflows, we try to soft-offline that page as we take it to mean
+ that it has reached a relatively high error count and would probably
+ be best if we don't use it anymore.
+
+ Bear in mind that this is absolutely useless if your platform doesn't
+ have ECC DIMMs and doesn't have DRAM ECC checking enabled in the BIOS.
+