summaryrefslogtreecommitdiff
path: root/drivers/edac/ghes_edac.c
AgeCommit message (Collapse)Author
2020-12-15edac: ghes: use krealloc_array()Bartosz Golaszewski
Use the helper that checks for overflows internally instead of manually calculating the size of the new array. Link: https://lkml.kernel.org/r/20201109110654.12547-7-brgl@bgdev.pl Signed-off-by: Bartosz Golaszewski <bgolaszewski@baylibre.com> Acked-by: Borislav Petkov <bp@suse.de> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Christian Knig <christian.koenig@amd.com> Cc: Christoph Lameter <cl@linux.com> Cc: Daniel Vetter <daniel@ffwll.ch> Cc: Daniel Vetter <daniel.vetter@ffwll.ch> Cc: David Airlie <airlied@linux.ie> Cc: David Rientjes <rientjes@google.com> Cc: Gustavo Padovan <gustavo@padovan.org> Cc: James Morse <james.morse@arm.com> Cc: Jaroslav Kysela <perex@perex.cz> Cc: Jason Wang <jasowang@redhat.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Linus Walleij <linus.walleij@linaro.org> Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com> Cc: Mauro Carvalho Chehab <mchehab@kernel.org> Cc: Maxime Ripard <mripard@kernel.org> Cc: "Michael S . Tsirkin" <mst@redhat.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: Robert Richter <rric@kernel.org> Cc: Sumit Semwal <sumit.semwal@linaro.org> Cc: Takashi Iwai <tiwai@suse.com> Cc: Takashi Iwai <tiwai@suse.de> Cc: Thomas Zimmermann <tzimmermann@suse.de> Cc: Tony Luck <tony.luck@intel.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-10-12Merge tag 'efi-core-2020-10-12' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull EFI changes from Ingo Molnar: - Preliminary RISC-V enablement - the bulk of it will arrive via the RISCV tree. - Relax decompressed image placement rules for 32-bit ARM - Add support for passing MOK certificate table contents via a config table rather than a EFI variable. - Add support for 18 bit DIMM row IDs in the CPER records. - Work around broken Dell firmware that passes the entire Boot#### variable contents as the command line - Add definition of the EFI_MEMORY_CPU_CRYPTO memory attribute so we can identify it in the memory map listings. - Don't abort the boot on arm64 if the EFI RNG protocol is available but returns with an error - Replace slashes with exclamation marks in efivarfs file names - Split efi-pstore from the deprecated efivars sysfs code, so we can disable the latter on !x86. - Misc fixes, cleanups and updates. * tag 'efi-core-2020-10-12' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (26 commits) efi: mokvar: add missing include of asm/early_ioremap.h efi: efivars: limit availability to X86 builds efi: remove some false dependencies on CONFIG_EFI_VARS efi: gsmi: fix false dependency on CONFIG_EFI_VARS efi: efivars: un-export efivars_sysfs_init() efi: pstore: move workqueue handling out of efivars efi: pstore: disentangle from deprecated efivars module efi: mokvar-table: fix some issues in new code efi/arm64: libstub: Deal gracefully with EFI_RNG_PROTOCOL failure efivarfs: Replace invalid slashes with exclamation marks in dentries. efi: Delete deprecated parameter comments efi/libstub: Fix missing-prototypes in string.c efi: Add definition of EFI_MEMORY_CPU_CRYPTO and ability to report it cper,edac,efi: Memory Error Record: bank group/address and chip id edac,ghes,cper: Add Row Extension to Memory Error Record efi/x86: Add a quirk to support command line arguments on Dell EFI firmware efi/libstub: Add efi_warn and *_once logging helpers integrity: Load certs from the EFI MOK config table integrity: Move import of MokListRT certs to a separate routine efi: Support for MOK variable config table ...
2020-10-12Merge tag 'edac_updates_for_v5.10' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/ras/ras Pull EDAC updates from Borislav Petkov: - Add Amazon's Annapurna Labs memory controller EDAC driver (Talel Shenhar) - New AMD CPUs support (Yazen Ghannam) - The usual misc fixes and cleanups all over the subsystem * tag 'edac_updates_for_v5.10' of git://git.kernel.org/pub/scm/linux/kernel/git/ras/ras: EDAC/amd64: Set proper family type for Family 19h Models 20h-2Fh EDAC/mc_sysfs: Add missing newlines when printing {max,dimm}_location EDAC/aspeed: Use module_platform_driver() to simplify EDAC, sb_edac: Simplify switch statement EDAC/ti: Fix handling of platform_get_irq() error EDAC/aspeed: Fix handling of platform_get_irq() error EDAC/i5100: Fix error handling order in i5100_init_one() EDAC/highbank: Handover Calxeda Highbank maintenance to Andre Przywara EDAC/socfpga: Transfer SoCFPGA EDAC maintainership EDAC/thunderx: Make symbol lmc_dfs_ents static EDAC/al-mc-edac: Add Amazon's Annapurna Labs Memory Controller driver dt-bindings: EDAC: Add Amazon's Annapurna Labs Memory Controller binding EDAC/mce_amd: Add new error descriptions for existing types EDAC: Replace HTTP links with HTTPS ones
2020-09-17cper,edac,efi: Memory Error Record: bank group/address and chip idAlex Kluver
Updates to the UEFI 2.8 Memory Error Record allow splitting the bank field into bank address and bank group, and using the last 3 bits of the extended field as a chip identifier. When needed, print correct version of bank field, bank group, and chip identification. Based on UEFI 2.8 Table 299. Memory Error Record. Signed-off-by: Alex Kluver <alex.kluver@hpe.com> Reviewed-by: Russ Anderson <russ.anderson@hpe.com> Reviewed-by: Kyle Meyer <kyle.meyer@hpe.com> Reviewed-by: Steve Wahl <steve.wahl@hpe.com> Acked-by: Borislav Petkov <bp@suse.de> Link: https://lore.kernel.org/r/20200819143544.155096-3-alex.kluver@hpe.com Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
2020-09-17edac,ghes,cper: Add Row Extension to Memory Error RecordAlex Kluver
Memory errors could be printed with incorrect row values since the DIMM size has outgrown the 16 bit row field in the CPER structure. UEFI Specification Version 2.8 has increased the size of row by allowing it to use the first 2 bits from a previously reserved space within the structure. When needed, add the extension bits to the row value printed. Based on UEFI 2.8 Table 299. Memory Error Record Signed-off-by: Alex Kluver <alex.kluver@hpe.com> Tested-by: Russ Anderson <russ.anderson@hpe.com> Reviewed-by: Steve Wahl <steve.wahl@hpe.com> Reviewed-by: Kyle Meyer <kyle.meyer@hpe.com> Acked-by: Borislav Petkov <bp@suse.de> Link: https://lore.kernel.org/r/20200819143544.155096-2-alex.kluver@hpe.com Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
2020-09-15EDAC/ghes: Check whether the driver is on the safe list correctlyBorislav Petkov
With CONFIG_DEBUG_TEST_DRIVER_REMOVE=y, a system would try to probe, unregister and probe again a driver. When ghes_edac is attempted to be loaded on a system which is not on the safe platforms list, ghes_edac_register() would return early. The unregister counterpart ghes_edac_unregister() would still attempt to unregister and exit early at the refcount test, leading to the refcount underflow below. In order to not do *anything* on the unregister path too, reuse the force_load parameter and check it on that path too, before fumbling with the refcount. ghes_edac: ghes_edac_register: entry ghes_edac: ghes_edac_register: return -ENODEV ------------[ cut here ]------------ refcount_t: underflow; use-after-free. WARNING: CPU: 10 PID: 1 at lib/refcount.c:28 refcount_warn_saturate+0xb9/0x100 Modules linked in: CPU: 10 PID: 1 Comm: swapper/0 Not tainted 5.9.0-rc4+ #12 Hardware name: GIGABYTE MZ01-CE1-00/MZ01-CE1-00, BIOS F02 08/29/2018 RIP: 0010:refcount_warn_saturate+0xb9/0x100 Code: 82 e8 fb 8f 4d 00 90 0f 0b 90 90 c3 80 3d 55 4c f5 00 00 75 88 c6 05 4c 4c f5 00 01 90 48 c7 c7 d0 8a 10 82 e8 d8 8f 4d 00 90 <0f> 0b 90 90 c3 80 3d 30 4c f5 00 00 0f 85 61 ff ff ff c6 05 23 4c RSP: 0018:ffffc90000037d58 EFLAGS: 00010292 RAX: 0000000000000026 RBX: ffff88840b8da000 RCX: 0000000000000000 RDX: 0000000000000001 RSI: ffffffff8216b24f RDI: 00000000ffffffff RBP: ffff88840c662e00 R08: 0000000000000001 R09: 0000000000000001 R10: 0000000000000001 R11: 0000000000000046 R12: 0000000000000000 R13: 0000000000000001 R14: 0000000000000000 R15: 0000000000000000 FS: 0000000000000000(0000) GS:ffff88840ee80000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000000000000 CR3: 0000800002211000 CR4: 00000000003506e0 Call Trace: ghes_edac_unregister ghes_remove platform_drv_remove really_probe driver_probe_device device_driver_attach __driver_attach ? device_driver_attach ? device_driver_attach bus_for_each_dev bus_add_driver driver_register ? bert_init ghes_init do_one_initcall ? rcu_read_lock_sched_held kernel_init_freeable ? rest_init kernel_init ret_from_fork ... ghes_edac: ghes_edac_unregister: FALSE, refcount: -1073741824 Signed-off-by: Borislav Petkov <bp@suse.de> Link: https://lkml.kernel.org/r/20200911164950.GB19320@zn.tnic
2020-09-15EDAC/ghes: Clear scanned data on unloadBorislav Petkov
Commit b972fdba8665 ("EDAC/ghes: Fix NULL pointer dereference in ghes_edac_register()") didn't clear all the information from the scanned system and, more specifically, left ghes_hw.num_dimms to its previous value. On a second load (CONFIG_DEBUG_TEST_DRIVER_REMOVE=y), the driver would use the leftover num_dimms value which is not 0 and thus the 0 check in enumerate_dimms() will get bypassed and it would go directly to the pointer deref: d = &hw->dimms[hw->num_dimms]; which is, of course, NULL: #PF: supervisor write access in kernel mode #PF: error_code(0x0002) - not-present page PGD 0 P4D 0 Oops: 0002 [#1] PREEMPT SMP CPU: 7 PID: 1 Comm: swapper/0 Not tainted 5.9.0-rc4+ #7 Hardware name: GIGABYTE MZ01-CE1-00/MZ01-CE1-00, BIOS F02 08/29/2018 RIP: 0010:enumerate_dimms.cold+0x7b/0x375 Reset the whole ghes_hw on driver unregister so that no stale values are used on a second system scan. Fixes: b972fdba8665 ("EDAC/ghes: Fix NULL pointer dereference in ghes_edac_register()") Cc: Shiju Jose <shiju.jose@huawei.com> Signed-off-by: Borislav Petkov <bp@suse.de> Link: https://lkml.kernel.org/r/20200911164817.GA19320@zn.tnic
2020-08-27EDAC/ghes: Fix NULL pointer dereference in ghes_edac_register()Shiju Jose
After b9cae27728d1 ("EDAC/ghes: Scan the system once on driver init") and with CONFIG_DEBUG_TEST_DRIVER_REMOVE enabled, ghes_hw.dimms becomes a NULL pointer after the second ->probe() (aka ghes_edac_register()) which the config option causes to be called. This happens because the static variable which holds down whether the system has been scanned already, doesn't get reset in ghes_edac_unregister(). Then, on the second probe, ghes_scan_system() doesn't get to enumerate the DIMMs, leading to ghes_hw.dimms remaining NULL. Clear the variable and rename it to something more descriptive so that a second probe succeeds. [ bp: Rewrite commit message. ] Fixes: b9cae27728d1 ("EDAC/ghes: Scan the system once on driver init") Suggested-by: Borislav Petkov <bp@suse.de> Signed-off-by: Shiju Jose <shiju.jose@huawei.com> Signed-off-by: Borislav Petkov <bp@suse.de> Link: https://lkml.kernel.org/r/20200827140450.1620-1-shiju.jose@huawei.com
2020-08-17EDAC: Replace HTTP links with HTTPS onesAlexander A. Klimov
Rationale: Reduces attack surface on kernel devs opening the links for MITM as HTTPS traffic is much harder to manipulate. Deterministic algorithm: For each file: If not .svg: For each line: If doesn't contain `\bxmlns\b`: For each link, `\bhttp://[^# \t\r\n]*(?:\w|/)`: If neither `\bgnu\.org/license`, nor `\bmozilla\.org/MPL\b`: If both the HTTP and HTTPS versions return 200 OK and serve the same content: Replace HTTP with HTTPS. [ bp: Merge all EDAC patches into a single one. ] Signed-off-by: Alexander A. Klimov <grandmaster@al2klimov.de> Signed-off-by: Borislav Petkov <bp@suse.de> Acked-by: Tero Kristo <t-kristo@ti.com> # ti_edac Link: https://lkml.kernel.org/r/20200708113546.14135-1-grandmaster@al2klimov.de
2020-06-16EDAC/ghes: Scan the system once on driver initBorislav Petkov
Change the hardware scanning and figuring out how many DIMMs a machine has to a single, one-time thing which happens once on driver init. After that scanning completes, struct ghes_hw_desc contains a representation of the hardware which the driver can then use for later initialization. Then, copy the DIMM information into the respective EDAC core representation of those. Get rid of ghes_edac_dimm_fill and use a struct dimm_info array directly. This way, hw detection and further driver initialization is nicely and logically split. Further additions should all be added to ghes_scan_system() and the hw representation extended as needed. There should be no functionality change resulting from this patch. Signed-off-by: Borislav Petkov <bp@suse.de>
2020-06-16EDAC/ghes: Remove unused members of struct ghes_edac_pvt, rename it to ghes_pvtRobert Richter
The struct members list and ghes of struct ghes_edac_pvt are unused, remove them. On that occasion, rename it to the shorter name struct ghes_pvt. Signed-off-by: Robert Richter <rrichter@marvell.com> Signed-off-by: Borislav Petkov <bp@suse.de> Link: https://lkml.kernel.org/r/20200519104443.15673-2-rrichter@marvell.com
2020-06-16EDAC/ghes: Setup DIMM label from DMI and use it in error reportsRobert Richter
The ghes driver reports errors with 'unknown label' even if the actual DIMM label is known, e.g.: EDAC MC0: 1 CE Single-bit ECC on unknown label (node:0 card:0 module:0 rank:1 bank:0 col:13 bit_pos:16 DIMM location:N0 DIMM_A0 page:0x966a9b3 offset:0x0 grain:1 syndrome:0x0 - APEI location: node:0 card:0 module:0 rank:1 bank:0 col:13 bit_pos:16 DIMM location:N0 DIMM_A0 status(0x0000000000000400): Storage error in DRAM memory) Fix this by using struct dimm_info's label string in error reports: EDAC MC0: 1 CE Single-bit ECC on N0 DIMM_A0 (node:0 card:0 module:0 rank:1 bank:515 col:14 bit_pos:16 DIMM location:N0 DIMM_A0 page:0x99223d8 offset:0x0 grain:1 syndrome:0x0 - APEI location: node:0 card:0 module:0 rank:1 bank:515 col:14 bit_pos:16 DIMM location:N0 DIMM_A0 status(0x0000000000000400): Storage error in DRAM memory) The labels are initialized by reading the bank and device strings from DMI. Now, the label information can also read from sysfs. E.g. a ThunderX2 system will show the following: /sys/devices/system/edac/mc/mc0/dimm0/dimm_label:N0 DIMM_A0 /sys/devices/system/edac/mc/mc0/dimm1/dimm_label:N0 DIMM_B0 /sys/devices/system/edac/mc/mc0/dimm2/dimm_label:N0 DIMM_C0 /sys/devices/system/edac/mc/mc0/dimm3/dimm_label:N0 DIMM_D0 /sys/devices/system/edac/mc/mc0/dimm4/dimm_label:N0 DIMM_E0 /sys/devices/system/edac/mc/mc0/dimm5/dimm_label:N0 DIMM_F0 /sys/devices/system/edac/mc/mc0/dimm6/dimm_label:N0 DIMM_G0 /sys/devices/system/edac/mc/mc0/dimm7/dimm_label:N0 DIMM_H0 /sys/devices/system/edac/mc/mc0/dimm8/dimm_label:N1 DIMM_I0 /sys/devices/system/edac/mc/mc0/dimm9/dimm_label:N1 DIMM_J0 /sys/devices/system/edac/mc/mc0/dimm10/dimm_label:N1 DIMM_K0 /sys/devices/system/edac/mc/mc0/dimm11/dimm_label:N1 DIMM_L0 /sys/devices/system/edac/mc/mc0/dimm12/dimm_label:N1 DIMM_M0 /sys/devices/system/edac/mc/mc0/dimm13/dimm_label:N1 DIMM_N0 /sys/devices/system/edac/mc/mc0/dimm14/dimm_label:N1 DIMM_O0 /sys/devices/system/edac/mc/mc0/dimm15/dimm_label:N1 DIMM_P0 Since dimm_labels can be rewritten, that label will be used in a later error report: # echo foobar >/sys/devices/system/edac/mc/mc0/dimm0/dimm_label # # some error injection here # dmesg | grep foobar [ 751.383533] EDAC MC0: 1 CE Single-bit ECC on foobar (node:0 card:0 module:0 rank:1 bank:259 col:3 bit_pos:16 DIMM location:N0 DIMM_A0 page:0x8c8dc74 offset:0x0 grain:1 syndrome:0x0 - APEI location: node:0 card:0 module:0 rank:1 bank:259 col:3 bit_pos:16 DIMM location:N0 DIMM_A0 status(0x0000000000000400): Storage error in DRAM memory) [ bp: Remove curly brackets around a single if-statement in dimm_setup_label(). ] Signed-off-by: Robert Richter <rrichter@marvell.com> Signed-off-by: Borislav Petkov <bp@suse.de> Link: https://lkml.kernel.org/r/20200528101307.23245-1-rrichter@marvell.com
2020-02-17EDAC/mc: Remove enable_per_layer_report function argumentRobert Richter
Many functions carry the enable_per_layer_report argument. This is a bool value indicating the error information contains some location data where the error occurred. This can easily being determined by checking the pos[] array for values. Negative values indicate there is no location available. So if the top layer is negative, the error location is unknown. Just check if the top layer is negative and remove enable_per_layer_report as function argument and also from struct edac_raw_error_desc. [ bp: Reflow comments to 80 columns, while at it. ] Signed-off-by: Robert Richter <rrichter@marvell.com> Signed-off-by: Borislav Petkov <bp@suse.de> Acked-by: Aristeu Rozanski <aris@redhat.com> Link: https://lkml.kernel.org/r/20200123090210.26933-8-rrichter@marvell.com
2020-02-17EDAC/mc: Determine mci pointer from the error descriptorRobert Richter
Each struct mci has its own error descriptor. Create a function error_desc_to_mci() to determine the corresponding mci from an error descriptor. This removes @mci from the parameter list of edac_raw_mc_handle_error() as the mci pointer does not need to be passed any longer. [ bp: Massage commit message. ] Signed-off-by: Robert Richter <rrichter@marvell.com> Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org> Acked-by: Aristeu Rozanski <aris@redhat.com> Link: https://lkml.kernel.org/r/20200123090210.26933-5-rrichter@marvell.com
2020-02-17EDAC: Store error type in struct edac_raw_error_descRobert Richter
Store the error type in struct edac_raw_error_desc. This makes the type parameter of edac_raw_mc_handle_error() obsolete. [ kernel-doc typo ] Reported-by: kbuild test robot <lkp@intel.com> Signed-off-by: Robert Richter <rrichter@marvell.com> Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Mauro Carvalho Chehab <mchehab@kernel.org> Acked-by: Aristeu Rozanski <aris@redhat.com> Link: https://lkml.kernel.org/r/20200123090210.26933-4-rrichter@marvell.com
2019-11-22EDAC/ghes: Do not warn when incrementing refcount on 0Robert Richter
The following warning from the refcount framework is seen during ghes initialization: EDAC MC0: Giving out device to module ghes_edac.c controller ghes_edac: DEV ghes (INTERRUPT) ------------[ cut here ]------------ refcount_t: increment on 0; use-after-free. WARNING: CPU: 36 PID: 1 at lib/refcount.c:156 refcount_inc_checked [...] Call trace: refcount_inc_checked ghes_edac_register ghes_probe ... It warns if the refcount is incremented from zero. This warning is reasonable as a kernel object is typically created with a refcount of one and freed once the refcount is zero. Afterwards the object would be "used-after-free". For GHES, the refcount is initialized with zero, and that is why this message is seen when initializing the first instance. However, whenever the refcount is zero, the device will be allocated and registered. Since the ghes_reg_mutex protects the refcount and serializes allocation and freeing of ghes devices, a use-after-free cannot happen here. Instead of using refcount_inc() for the first instance, use refcount_set(). This can be used here because the refcount is zero at this point and can not change due to its protection by the mutex. Fixes: 23f61b9fc5cc ("EDAC/ghes: Fix locking and memory barrier issues") Reported-by: John Garry <john.garry@huawei.com> Signed-off-by: Robert Richter <rrichter@marvell.com> Signed-off-by: Borislav Petkov <bp@suse.de> Tested-by: John Garry <john.garry@huawei.com> Cc: <huangming23@huawei.com> Cc: James Morse <james.morse@arm.com> Cc: <linuxarm@huawei.com> Cc: linux-edac <linux-edac@vger.kernel.org> Cc: Mauro Carvalho Chehab <mchehab@kernel.org> Cc: <tanxiaofei@huawei.com> Cc: Tony Luck <tony.luck@intel.com> Cc: <wanghuiqiang@huawei.com> Link: https://lkml.kernel.org/r/20191121213628.21244-1-rrichter@marvell.com
2019-11-10EDAC: Unify the mc_event tracepoint callRobert Richter
The code in ghes_edac.c and edac_mc.c for grain_bits calculation and calling trace_mc_event() is now the same. Move it to a single location in edac_raw_mc_handle_error(). The only difference is the missing IS_ENABLED(CONFIG_RAS) switch, but this is needed for ghes too. Signed-off-by: Robert Richter <rrichter@marvell.com> Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org> Cc: "linux-edac@vger.kernel.org" <linux-edac@vger.kernel.org> Cc: James Morse <james.morse@arm.com> Cc: Tony Luck <tony.luck@intel.com> Link: https://lkml.kernel.org/r/20191106093239.25517-13-rrichter@marvell.com
2019-11-10EDAC/ghes: Remove intermediate buffer pvt->detail_locationRobert Richter
detail_location[] is used to collect two location strings so they can be passed as one to trace_mc_event(). Instead of having an extra copy step, assemble the location string in other_detail[] from the beginning. Using other_detail[] to call trace_mc_event() is now the same as in edac_mc.c and code can be unified. Signed-off-by: Robert Richter <rrichter@marvell.com> Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: James Morse <james.morse@arm.com> Reviewed-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org> Cc: "linux-edac@vger.kernel.org" <linux-edac@vger.kernel.org> Cc: Tony Luck <tony.luck@intel.com> Link: https://lkml.kernel.org/r/20191106093239.25517-12-rrichter@marvell.com
2019-11-10EDAC/ghes: Fix grain calculationRobert Richter
The current code to convert a physical address mask to a grain (defined as granularity in bytes) is: e->grain = ~(mem_err->physical_addr_mask & ~PAGE_MASK); This is broken in several ways: 1) It calculates to wrong grain values. E.g., a physical address mask of ~0xfff should give a grain of 0x1000. Without considering PAGE_MASK, there is an off-by-one. Things are worse when also filtering it with ~PAGE_MASK. This will calculate to a grain with the upper bits set. In the example it even calculates to ~0. 2) The grain does not depend on and is unrelated to the kernel's page-size. The page-size only matters when unmapping memory in memory_failure(). Smaller grains are wrongly rounded up to the page-size, on architectures with a configurable page-size (e.g. arm64) this could round up to the even bigger page-size of the hypervisor. Fix this with: e->grain = ~mem_err->physical_addr_mask + 1; The grain_bits are defined as: grain = 1 << grain_bits; Change also the grain_bits calculation accordingly, it is the same formula as in edac_mc.c now and the code can be unified. The value in ->physical_addr_mask coming from firmware is assumed to be contiguous, but this is not sanity-checked. However, in case the mask is non-contiguous, a conversion to grain_bits effectively converts the grain bit mask to a power of 2 by rounding it up. Suggested-by: James Morse <james.morse@arm.com> Signed-off-by: Robert Richter <rrichter@marvell.com> Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org> Cc: "linux-edac@vger.kernel.org" <linux-edac@vger.kernel.org> Cc: Tony Luck <tony.luck@intel.com> Link: https://lkml.kernel.org/r/20191106093239.25517-11-rrichter@marvell.com
2019-11-10EDAC/ghes: Use standard kernel macros for page calculationsRobert Richter
Use standard macros for page calculations. Signed-off-by: Robert Richter <rrichter@marvell.com> Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: James Morse <james.morse@arm.com> Reviewed-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org> Cc: "linux-edac@vger.kernel.org" <linux-edac@vger.kernel.org> Cc: Tony Luck <tony.luck@intel.com> Link: https://lkml.kernel.org/r/20191106093239.25517-10-rrichter@marvell.com
2019-11-10EDAC: Introduce an mci_for_each_dimm() iteratorRobert Richter
Introduce an mci_for_each_dimm() iterator. It returns a pointer to a struct dimm_info. This makes the declaration and use of an index obsolete and avoids access to internal data of struct mci (direct array access etc). [ bp: push the struct dimm_info *dimm; declaration into the CONFIG_EDAC_DEBUG block. ] Signed-off-by: Robert Richter <rrichter@marvell.com> Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org> Cc: "linux-edac@vger.kernel.org" <linux-edac@vger.kernel.org> Cc: James Morse <james.morse@arm.com> Cc: Tony Luck <tony.luck@intel.com> Link: https://lkml.kernel.org/r/20191106093239.25517-4-rrichter@marvell.com
2019-11-09EDAC: Replace EDAC_DIMM_PTR() macro with edac_get_dimm() functionRobert Richter
The EDAC_DIMM_PTR() macro takes 3 arguments from struct mem_ctl_info. Clean up this interface to only pass the mci struct and replace this macro with a new function edac_get_dimm(). Also introduce an edac_get_dimm_by_index() function for later use. This allows it to get a DIMM pointer only by a given index. This can be useful if the DIMM's position within the layers of the memory controller or the exact size of the layers are unknown. Small style changes made for some hunks after applying the semantic patch. Semantic patch used: @@ expression mci, a, b,c; @@ -EDAC_DIMM_PTR(mci->layers, mci->dimms, mci->n_layers, a, b, c) +edac_get_dimm(mci, a, b, c) [ bp: Touchups. ] Signed-off-by: Robert Richter <rrichter@marvell.com> Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Mauro Carvalho Chehab <mchehab@kernel.org> Cc: "linux-edac@vger.kernel.org" <linux-edac@vger.kernel.org> Cc: James Morse <james.morse@arm.com> Cc: Jason Baron <jbaron@akamai.com> Cc: Qiuxu Zhuo <qiuxu.zhuo@intel.com> Cc: Tero Kristo <t-kristo@ti.com> Cc: Tony Luck <tony.luck@intel.com> Link: https://lkml.kernel.org/r/20191106093239.25517-2-rrichter@marvell.com
2019-11-08EDAC/ghes: Fix locking and memory barrier issuesRobert Richter
The ghes registration and refcount is broken in several ways: * ghes_edac_register() returns with success for a 2nd instance even if a first instance's registration is still running. This is not correct as the first instance may fail later. A subsequent registration may not finish before the first. Parallel registrations must be avoided. * The refcount was increased even if a registration failed. This leads to stale counters preventing the device from being released. * The ghes refcount may not be decremented properly on unregistration. Always decrement the refcount once ghes_edac_unregister() is called to keep the refcount sane. * The ghes_pvt pointer is handed to the irq handler before registration finished. * The mci structure could be freed while the irq handler is running. Fix this by adding a mutex to ghes_edac_register(). This mutex serializes instances to register and unregister. The refcount is only increased if the registration succeeded. This makes sure the refcount is in a consistent state after registering or unregistering a device. Note: A spinlock cannot be used here as the code section may sleep. The ghes_pvt is protected by ghes_lock now. This ensures the pointer is not updated before registration was finished or while the irq handler is running. It is unset before unregistering the device including necessary (implicit) memory barriers making the changes visible to other CPUs. Thus, the device can not be used anymore by an interrupt. Also, rename ghes_init to ghes_refcount for better readability and switch to refcount API. A refcount is needed because there can be multiple GHES structures being defined (see ACPI 6.3 specification, 18.3.2.7 Generic Hardware Error Source, "Some platforms may describe multiple Generic Hardware Error Source structures with different notification types, ..."). Another approach to use the mci's device refcount (get_device()) and have a release function does not work here. A release function will be called only for device_release() with the last put_device() call. The device must be deleted *before* that with device_del(). This is only possible by maintaining an own refcount. [ bp: touchups. ] Fixes: 0fe5f281f749 ("EDAC, ghes: Model a single, logical memory controller") Fixes: 1e72e673b9d1 ("EDAC/ghes: Fix Use after free in ghes_edac remove path") Co-developed-by: James Morse <james.morse@arm.com> Signed-off-by: James Morse <james.morse@arm.com> Co-developed-by: Borislav Petkov <bp@suse.de> Signed-off-by: Borislav Petkov <bp@suse.de> Signed-off-by: Robert Richter <rrichter@marvell.com> Signed-off-by: Borislav Petkov <bp@suse.de> Cc: "linux-edac@vger.kernel.org" <linux-edac@vger.kernel.org> Cc: Mauro Carvalho Chehab <mchehab@kernel.org> Cc: Tony Luck <tony.luck@intel.com> Link: https://lkml.kernel.org/r/20191105200732.3053-1-rrichter@marvell.com
2019-10-17EDAC/ghes: Fix Use after free in ghes_edac remove pathJames Morse
ghes_edac models a single logical memory controller, and uses a global ghes_init variable to ensure only the first ghes_edac_register() will do anything. ghes_edac is registered the first time a GHES entry in the HEST is probed. There may be multiple entries, so subsequent attempts to register ghes_edac are silently ignored as the work has already been done. When a GHES entry is unregistered, it calls ghes_edac_unregister(), which free()s the memory behind the global variables in ghes_edac. But there may be multiple GHES entries, the next call to ghes_edac_unregister() will dereference the free()d memory, and attempt to free it a second time. This may also be triggered on a platform with one GHES entry, if the driver is unbound/re-bound and unbound. The re-bind step will do nothing because of ghes_init, the second unbind will then do the same work as the first. Doing the unregister work on the first call is unsafe, as another CPU may be processing a notification in ghes_edac_report_mem_error(), using the memory we are about to free. ghes_init is already half of the reference counting. We only need to do the register work for the first call, and the unregister work for the last. Add the unregister check. This means we no longer free ghes_edac's memory while there are GHES entries that may receive a notification. This was detected by KASAN and DEBUG_TEST_DRIVER_REMOVE. [ bp: merge into a single patch. ] Fixes: 0fe5f281f749 ("EDAC, ghes: Model a single, logical memory controller") Reported-by: John Garry <john.garry@huawei.com> Signed-off-by: James Morse <james.morse@arm.com> Signed-off-by: Borislav Petkov <bp@suse.de> Cc: linux-edac <linux-edac@vger.kernel.org> Cc: Mauro Carvalho Chehab <mchehab@kernel.org> Cc: Robert Richter <rrichter@marvell.com> Cc: Tony Luck <tony.luck@intel.com> Cc: <stable@vger.kernel.org> Link: https://lkml.kernel.org/r/20191014171919.85044-2-james.morse@arm.com Link: https://lkml.kernel.org/r/304df85b-8b56-b77e-1a11-aa23769f2e7c@huawei.com
2019-09-03EDAC: Prefer 'unsigned int' to bare use of 'unsigned'Robert Richter
Use of 'unsigned int' instead of bare use of 'unsigned'. Fix this for edac_mc*, ghes and the i5100 driver as reported by checkpatch.pl. While at it, struct member dev_ch_attribute->channel is always used as unsigned int. Change type to unsigned int to avoid type casts. [ bp: Massage. ] Signed-off-by: Robert Richter <rrichter@marvell.com> Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Mauro Carvalho Chehab <mchehab@kernel.org> Cc: "linux-edac@vger.kernel.org" <linux-edac@vger.kernel.org> Cc: James Morse <james.morse@arm.com> Cc: Tony Luck <tony.luck@intel.com> Link: https://lkml.kernel.org/r/20190902123216.9809-2-rrichter@marvell.com
2019-05-30treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 172Thomas Gleixner
Based on 1 normalized pattern(s): this file may be distributed under the terms of the gnu general public license version 2 extracted by the scancode license scanner the SPDX license identifier GPL-2.0-only has been chosen to replace the boilerplate/reference in 9 file(s). Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Allison Randal <allison@lohutok.net> Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org> Reviewed-by: Richard Fontana <rfontana@redhat.com> Cc: linux-spdx@vger.kernel.org Link: https://lkml.kernel.org/r/20190527070034.395589349@linutronix.de Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-09-22EDAC, ghes: Use CPER module handles to locate DIMMsFan Wu
Use SMBIOS module handle type 17, on platforms which provide valid ones, to locate the corresponding DIMM and thus have per-DIMM error counter updates. Signed-off-by: Fan Wu <wufan@codeaurora.org> [ Massage commit message. ] Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Tyler Baicar <baicar.tyler@gmail.com> Reviewed-by: James Morse <james.morse@arm.com> Tested-by: Toshi Kani <toshi.kani@hpe.com> Cc: Mauro Carvalho Chehab <mchehab@kernel.org> Cc: baicar.tyler@gmail.com Cc: john.garry@huawei.com Cc: linux-arm-kernel@lists.infradead.org Cc: linux-edac <linux-edac@vger.kernel.org> Cc: shiju.jose@huawei.com Cc: tanxiaofei@huawei.com Cc: wanghuiqiang@huawei.com Link: http://lkml.kernel.org/r/1537322340-1860-1-git-send-email-wufan@codeaurora.org
2018-05-21EDAC, ghes: Make platform-based whitelisting x86-onlyBorislav Petkov
ARM machines all have DMI tables so if they request hw error reporting through GHES, then the driver should be able to detect DIMMs and report errors successfully (famous last words :)). Make the platform-based list x86-specific so that ghes_edac can load on ARM. Reported-by: Qiang Zheng <zhengqiang10@huawei.com> Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: James Morse <james.morse@arm.com> Tested-by: James Morse <james.morse@arm.com> Tested-by: Qiang Zheng <zhengqiang10@huawei.com> Link: https://lkml.kernel.org/r/1526039543-180996-1-git-send-email-zhengqiang10@huawei.com
2018-05-12EDAC, ghes: Use BIT() macroBorislav Petkov
... for improved readability. Also, add a local mask variable for the same reason. No functional changes. Signed-off-by: Borislav Petkov <bp@suse.de> Cc: Toshi Kani <toshi.kani@hpe.com> Cc: Mauro Carvalho Chehab <mchehab@kernel.org>
2018-05-12EDAC, ghes: Add DDR4 and NVDIMM memory typesToshi Kani
The ghes_edac driver obtains memory type from SMBIOS type 17, but it does not recognize DDR4 and NVDIMM types. Add support of DDR4 and NVDIMM types. NVDIMM type is denoted by memory type DDR3/4 and non-volatile. Reported-by: Robert Elliott <elliott@hpe.com> Signed-off-by: Toshi Kani <toshi.kani@hpe.com> Cc: Mauro Carvalho Chehab <mchehab@kernel.org> Cc: linux-edac <linux-edac@vger.kernel.org> Link: http://lkml.kernel.org/r/20180509222030.9299-1-toshi.kani@hpe.com Signed-off-by: Borislav Petkov <bp@suse.de>
2018-05-12EDAC, ghes: Remove unused argument to ghes_edac_report_mem_error()Alexandru Gagniuc
The use of the @ghes argument was removed in a previous commit, but function signature was not updated to reflect this. Signed-off-by: Alexandru Gagniuc <mr.nuke.me@gmail.com> Acked-by: "Rafael J. Wysocki" <rafael@kernel.org> Cc: linux-acpi@vger.kernel.org Cc: linux-edac <linux-edac@vger.kernel.org> Link: http://lkml.kernel.org/r/20180430213358.8319-1-mr.nuke.me@gmail.com Signed-off-by: Borislav Petkov <bp@suse.de>
2018-05-02EDAC, ghes: Add a null pointer check in ghes_edac_unregister()Sughosh Ganu
Add a null check for ghes_pvt, before dereferencing it. The pointer could still be null in case the return path is taken before initialising ghes_pvt in the registration function. Reviewed-by: Toshi Kani <toshi.kani@hpe.com> Signed-off-by: Sughosh Ganu <sughosh.ganu@arm.com> Cc: Mauro Carvalho Chehab <mchehab@kernel.org> Cc: linux-edac <linux-edac@vger.kernel.org> Cc: lkml <linux-kernel@vger.kernel.org> Link: http://lkml.kernel.org/r/1524737809-24475-1-git-send-email-sughosh.ganu@arm.com Signed-off-by: Borislav Petkov <bp@suse.de>
2018-05-02ghes, EDAC: Fix ghes_edac registrationBorislav Petkov
Tony reported seeing "Internal error: Can't find EDAC structure" when injecting correctable errors due to the fact that ghes_edac would still load even if the whitelist won't hit. Drop the pr_err() in ghes_edac_report_mem_error() for now due to the hacky way how ghes_edac depends on ghes.c. While at it, make ghes_edac_register() return an error if it doesn't hit in the whitelist as it is the only sensible thing to do in that situation. Furthermore, move the call to it to happen last in ghes_probe() so that GHES initializing properly does not depend on ghes_edac init at all as latter is only reporting errors and not required for GHES's proper functioning. Reviewed-by: Toshi Kani <toshi.kani@hpe.com> Tested-by: Sughosh Ganu <sughosh.ganu@arm.com> Signed-off-by: Borislav Petkov <bp@suse.de> Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net> Cc: Tony Luck <tony.luck@intel.com> Link: https://lkml.kernel.org/r/20180420182015.zao3olss4tvvlxki@agluck-desk
2017-09-25EDAC, ghes: Add platform checkToshi Kani
The ghes_edac driver was introduced in 2013 [1], but it has not been enabled by any distro yet. This driver obtains error info from firmware interfaces (APEI), which are not properly implemented on many platforms, as the driver says on load: This EDAC driver relies on BIOS to enumerate memory and get error reports. Unfortunately, not all BIOSes reflect the memory layout correctly. So, the end result of using this driver varies from vendor to vendor. If you find incorrect reports, please contact your hardware vendor to correct its BIOS. To get out from this situation, add a platform check to selectively enable the driver on platforms that are known to have proper APEI firmware implementation. "ghes_edac.force_load=1" skips this platform check. [1]: https://lkml.kernel.org/r/cover.1360931635.git.mchehab@redhat.com Signed-off-by: Toshi Kani <toshi.kani@hpe.com> Cc: Mauro Carvalho Chehab <mchehab@kernel.org> Cc: Tony Luck <tony.luck@intel.com> Cc: linux-acpi@vger.kernel.org Cc: linux-edac <linux-edac@vger.kernel.org> Link: http://lkml.kernel.org/r/20170823225447.15608-4-toshi.kani@hpe.com Signed-off-by: Borislav Petkov <bp@suse.de>
2017-09-25EDAC, ghes: Model a single, logical memory controllerBorislav Petkov
We're enumerating the DIMMs through a DMI walk and since we can't get any more detailed topological information about which DIMMs belong to which memory controller, convert it to a single, logical controller which contains all the DIMMs. The error reporting path from GHES ghes_edac_report_mem_error() doesn't get called in NMI context but add a warning about it to catch any changes in the future as if so, our locking scheme will be insufficient then. Signed-off-by: Borislav Petkov <bp@suse.de>
2017-09-25EDAC, ghes: Remove symbol exportsBorislav Petkov
They're called from builtin code so no need for the exports. Signed-off-by: Borislav Petkov <bp@suse.de>
2017-07-17EDAC: Get rid of mci->mod_verBorislav Petkov
It is a write-only variable so get rid of it. Signed-off-by: Borislav Petkov <bp@suse.de> Acked-by: Robert Richter <rric@kernel.org> Acked-by: Michal Simek <michal.simek@xilinx.com> Acked-by: Thor Thayer <thor.thayer@linux.intel.com> Acked-by: Tony Luck <tony.luck@intel.com> Cc: Mark Gross <mark.gross@intel.com> Cc: Tim Small <tim@buttersideup.com> Cc: Ranganathan Desikan <ravi@jetztechnologies.com> Cc: "Arvind R." <arvino55@gmail.com> Cc: Jason Baron <jbaron@akamai.com> Cc: "Sören Brinkmann" <soren.brinkmann@xilinx.com> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: David Daney <david.daney@cavium.com> Cc: Loc Ho <lho@apm.com> Cc: linux-edac@vger.kernel.org Cc: linux-kernel@vger.kernel.org Cc: linux-arm-kernel@lists.infradead.org Cc: linux-mips@linux-mips.org
2016-12-15edac: rename edac_core.h to edac_mc.hMauro Carvalho Chehab
Now, all left at edac_core.h are at drivers/edac/edac_mc.c, so rename it to edac_mc.h. Signed-off-by: Mauro Carvalho Chehab <mchehab@s-opensource.com>
2015-10-22EDAC: Fix PAGES_TO_MiB macro misuseTan Xiaojun
The PAGES_TO_MiB macro is used for unit conversion but the trace_mc_event() tracepoint expects a page address. Fix that. Signed-off-by: Tan Xiaojun <tanxiaojun@huawei.com> Cc: Mauro Carvalho Chehab <mchehab@osg.samsung.com> Cc: linux-edac <linux-edac@vger.kernel.org> Link: http://lkml.kernel.org/r/1445341538-24271-1-git-send-email-tanxiaojun@huawei.com Signed-off-by: Borislav Petkov <bp@suse.de>
2015-09-23EDAC, ghes_edac: Remove redundant memory_type arrayAravind Gopalakrishnan
We already have edac_mem_types[] that enumerates the different kinds of memory. So, use that and remove the redundant memory_type[] array here. Signed-off-by: Aravind Gopalakrishnan <Aravind.Gopalakrishnan@amd.com> Cc: Mauro Carvalho Chehab <mchehab@osg.samsung.com> Cc: linux-edac <linux-edac@vger.kernel.org> Link: http://lkml.kernel.org/r/1442436811-23382-2-git-send-email-Aravind.Gopalakrishnan@amd.com Signed-off-by: Borislav Petkov <bp@suse.de>
2014-11-11ghes_edac: Use snprintf() to silence a static checker warningDan Carpenter
My static checker complains because the "e->location" has up to 256 characters but we are copying it into the "pvt->detail_location" which only has space for 240 characters. That's not counting the surrounding text and the "e->other_detail" string which can be over 80 characters long. I am not familiar with this code but presumably it normally works. Let's add a limit though for safety. Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Acked-by: Mauro Carvalho Chehab <mchehab@osg.samsung.com> Link: http://lkml.kernel.org/r/20140801082514.GD28869@mwanda Signed-off-by: Borislav Petkov <bp@suse.de>
2014-02-07[media, edac] Change my email addressMauro Carvalho Chehab
There are several left overs with my old email address. Remove their occurrences and add myself at CREDITS, to allow people to be able to reach me on my new addresses. Signed-off-by: Mauro Carvalho Chehab <m.chehab@samsung.com>
2013-10-23EDAC, GHES: Update ghes error record infoChen, Gong
In latest UEFI spec(by now it's 2.4) there are some new fields for memory error reporting. Add these new fields for ghes_edac interface. Signed-off-by: Chen, Gong <gong.chen@linux.intel.com> Cc: Mauro Carvalho Chehab <m.chehab@samsung.com> Signed-off-by: Tony Luck <tony.luck@intel.com>
2013-10-23ACPI, APEI, CPER: Add UEFI 2.4 support for memory errorChen, Gong
In latest UEFI spec(by now it is 2.4) memory error definition for CPER (UEFI 2.4 Appendix N Common Platform Error Record) adds some new fields. These fields help people to locate memory error to an actual DIMM location. Original-author: Tony Luck <tony.luck@intel.com> Signed-off-by: Chen, Gong <gong.chen@linux.intel.com> Reviewed-by: Borislav Petkov <bp@suse.de> Reviewed-by: Mauro Carvalho Chehab <m.chehab@samsung.com> Acked-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com> Signed-off-by: Tony Luck <tony.luck@intel.com>
2013-02-26ghes_edac: fix to use list_for_each_entry_safe() when delete list itemsWei Yongjun
Since we will remove items off the list using list_del() we need to use a safe version of the list_for_each_entry() macro aptly named list_for_each_entry_safe(). Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn> Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
2013-02-25ghes_edac: Fix RAS tracingMauro Carvalho Chehab
With the current version of CPER, there's no way to associate an error with the memory error. So, the error location in EDAC layers is unused. As CPER has its own idea about memory architectural layers, just output whatever is there inside the driver's detail at the RAS tracepoint. The EDAC location keeps untouched, in the case that, in some future, we could actually map the error into the dimm labels. Now, the error message: [ 72.396625] {1}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 0 [ 72.396627] {1}[Hardware Error]: APEI generic hardware error status [ 72.396628] {1}[Hardware Error]: severity: 2, corrected [ 72.396630] {1}[Hardware Error]: section: 0, severity: 2, corrected [ 72.396632] {1}[Hardware Error]: flags: 0x01 [ 72.396634] {1}[Hardware Error]: primary [ 72.396635] {1}[Hardware Error]: section_type: memory error [ 72.396637] {1}[Hardware Error]: error_status: 0x0000000000000400 [ 72.396638] {1}[Hardware Error]: node: 3 [ 72.396639] {1}[Hardware Error]: card: 0 [ 72.396640] {1}[Hardware Error]: module: 0 [ 72.396641] {1}[Hardware Error]: device: 0 [ 72.396643] {1}[Hardware Error]: error_type: 18, unknown [ 72.396666] EDAC MC0: 1 CE reserved error (18) on unknown label (node:3 card:0 module:0 page:0x0 offset:0x0 grain:0 syndrome:0x0 - status(0x0000000000000400): Storage error in DRAM memory) Is properly represented on the trace event: kworker/0:2-584 [000] .... 72.396657: mc_event: 1 Corrected error: reserved error (18) on unknown label (mc:0 location:-1:-1:-1 address:0x00000000 grain:1 syndrome:0x00000000 APEI location: node:3 card:0 module:0 status(0x0000000000000400): Storage error in DRAM memory) Tested on a 4 sockets E5-4650 Sandy Bridge machine. Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
2013-02-25ghes_edac: Make it compliant with UEFI spec 2.3.1Mauro Carvalho Chehab
The UEFI spec defines the memory error types ans the bits that validate each field on the memory error record, at Appendix N om items N.2.5 (Memory Error Section) and N.2.11 (Error Status). Make the error description compliant with it, only showing the valid fields. The EDAC error log is now properly reporting the error: [ 281.556854] mce: [Hardware Error]: Machine check events logged [ 281.557042] {2}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 0 [ 281.557044] {2}[Hardware Error]: APEI generic hardware error status [ 281.557046] {2}[Hardware Error]: severity: 2, corrected [ 281.557048] {2}[Hardware Error]: section: 0, severity: 2, corrected [ 281.557050] {2}[Hardware Error]: flags: 0x01 [ 281.557052] {2}[Hardware Error]: primary [ 281.557053] {2}[Hardware Error]: section_type: memory error [ 281.557055] {2}[Hardware Error]: error_status: 0x0000000000000400 [ 281.557056] {2}[Hardware Error]: node: 3 [ 281.557057] {2}[Hardware Error]: card: 0 [ 281.557058] {2}[Hardware Error]: module: 1 [ 281.557059] {2}[Hardware Error]: device: 0 [ 281.557061] {2}[Hardware Error]: error_type: 18, unknown [ 281.557067] EDAC DEBUG: ghes_edac_report_mem_error: error validation_bits: 0x000040b9 [ 281.557084] EDAC MC0: 1 CE reserved error (18) on unknown label (node:3 card:0 module:1 page:0x0 offset:0x0 grain:0 syndrome:0x0 - status(0x0000000000000400): Storage error in DRAM memory) Tested on a 4 CPUs E5-4650 Sandy Bridge machine. Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
2013-02-25ghes_edac: Improve driver's printk messagesMauro Carvalho Chehab
Provide a better infrastructure for printk's inside the driver: - use edac_dbg() for debug messages; - standardize the usage of pr_info(); - provide warning about the risk of relying on this driver. While here, changes the size of a fake memory to 1 page. This is as good or as bad as 1000 pages, but it is easier for userspace to detect, as I don't expect that any machine implementing GHES would provide just 1 page available ;) Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com> Conflicts: drivers/edac/ghes_edac.c
2013-02-25ghes_edac: Don't credit the same memory dimm twiceMauro Carvalho Chehab
On my tests on a 4xE5-4650 CPU's system, the GHES EDAC driver is called twice. As the SMBIOS DMI enumeration call will seek for the entire DIMM sockets in the system, on this machine, equipped with 128 GB of RAM, the memory is displayed twice: +-----------------------+ | mc0 | mc1 | ----------+-----------------------+ memory45: | 8192 MB | 8192 MB | memory44: | 0 MB | 0 MB | ----------+-----------------------+ memory43: | 0 MB | 0 MB | memory42: | 8192 MB | 8192 MB | ----------+-----------------------+ memory41: | 0 MB | 0 MB | memory40: | 0 MB | 0 MB | ----------+-----------------------+ memory39: | 8192 MB | 8192 MB | memory38: | 0 MB | 0 MB | ----------+-----------------------+ memory37: | 0 MB | 0 MB | memory36: | 8192 MB | 8192 MB | ----------+-----------------------+ memory35: | 0 MB | 0 MB | memory34: | 0 MB | 0 MB | ----------+-----------------------+ memory33: | 8192 MB | 8192 MB | memory32: | 0 MB | 0 MB | ----------+-----------------------+ memory31: | 0 MB | 0 MB | memory30: | 8192 MB | 8192 MB | ----------+-----------------------+ memory29: | 0 MB | 0 MB | memory28: | 0 MB | 0 MB | ----------+-----------------------+ memory27: | 8192 MB | 8192 MB | memory26: | 0 MB | 0 MB | ----------+-----------------------+ memory25: | 0 MB | 0 MB | memory24: | 8192 MB | 8192 MB | ----------+-----------------------+ memory23: | 0 MB | 0 MB | memory22: | 0 MB | 0 MB | ----------+-----------------------+ memory21: | 8192 MB | 8192 MB | memory20: | 0 MB | 0 MB | ----------+-----------------------+ memory19: | 0 MB | 0 MB | memory18: | 8192 MB | 8192 MB | ----------+-----------------------+ memory17: | 0 MB | 0 MB | memory16: | 0 MB | 0 MB | ----------+-----------------------+ memory15: | 8192 MB | 8192 MB | memory14: | 0 MB | 0 MB | ----------+-----------------------+ memory13: | 0 MB | 0 MB | memory12: | 8192 MB | 8192 MB | ----------+-----------------------+ memory11: | 0 MB | 0 MB | memory10: | 0 MB | 0 MB | ----------+-----------------------+ memory9: | 8192 MB | 8192 MB | memory8: | 0 MB | 0 MB | ----------+-----------------------+ memory7: | 0 MB | 0 MB | memory6: | 8192 MB | 8192 MB | ----------+-----------------------+ memory5: | 0 MB | 0 MB | memory4: | 0 MB | 0 MB | ----------+-----------------------+ memory3: | 8192 MB | 8192 MB | memory2: | 0 MB | 0 MB | ----------+-----------------------+ memory1: | 0 MB | 0 MB | memory0: | 8192 MB | 8192 MB | ----------+-----------------------+ Total sum of 256 GB. As there's no reliable way to credit DIMMS to the right memory controller, just put everything on memory controller 0 (with should always exist). Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
2013-02-25ghes_edac: do a better job of filling EDAC DIMM infoMauro Carvalho Chehab
Instead of just faking a random value for the DIMM data, get the information that it is available via DMI table. Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>