summaryrefslogtreecommitdiff
path: root/drivers/edac
AgeCommit message (Collapse)Author
2025-03-25Merge tag 'edac_updates_for_v6.15' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/ras/ras Pull EDAC updates from Borislav Petkov: - Add infrastructure support to EDAC in order to be able to register memory scrubbing RAS functionality with the kernel and expose sysfs nodes to control such scrubbing functionality. The main use case is CXL devices which provide different scrubbers for their built-in memories so that tools like rasdaemon can configure and control memory scrubbing and other, more advanced RAS functionality (Shiju Jose and Jonathan Cameron) - Add support to ie31200_edac for client SoCs like Raptor Lake-S which have multiple memory controllers and out-of-band ECC capability (Qiuxu Zhuo) - The usual round of cleanups, simplifications and fixlets * tag 'edac_updates_for_v6.15' of git://git.kernel.org/pub/scm/linux/kernel/git/ras/ras: (25 commits) MAINTAINERS: Add a secondary maintainer for bluefield_edac EDAC/ie31200: Switch Raptor Lake-S to interrupt mode EDAC/ie31200: Add Intel Raptor Lake-S SoCs support EDAC/ie31200: Break up ie31200_probe1() EDAC/ie31200: Fold the two channel loops into one loop EDAC/ie31200: Make struct dimm_data contain decoded information EDAC/ie31200: Make the memory controller resources configurable EDAC/ie31200: Simplify the pci_device_id table EDAC/ie31200: Fix the 3rd parameter name of *populate_dimm_info() EDAC/ie31200: Fix the error path order of ie31200_init() EDAC/ie31200: Fix the DIMM size mask for several SoCs EDAC/ie31200: Fix the size of EDAC_MC_LAYER_CHIP_SELECT layer EDAC/device: Fix dev_set_name() format string EDAC/pnd2: Make read-only const array intlv static EDAC/igen6: Constify struct res_config EDAC/amd64: Simplify return statement in dct_ecc_enabled() EDAC: Update memory repair control interface for memory sparing feature EDAC: Add a memory repair control feature EDAC: Use string choice helper functions EDAC: Add a Error Check Scrub control feature ...
2025-03-25Merge remote-tracking branches 'ras/edac-cxl', 'ras/edac-drivers' and ↵Borislav Petkov (AMD)
'ras/edac-misc' into edac-updates * ras/edac-cxl: EDAC/device: Fix dev_set_name() format string EDAC: Update memory repair control interface for memory sparing feature EDAC: Add a memory repair control feature EDAC: Add a Error Check Scrub control feature EDAC: Add scrub control feature EDAC: Add support for EDAC device features control * ras/edac-drivers: EDAC/ie31200: Switch Raptor Lake-S to interrupt mode EDAC/ie31200: Add Intel Raptor Lake-S SoCs support EDAC/ie31200: Break up ie31200_probe1() EDAC/ie31200: Fold the two channel loops into one loop EDAC/ie31200: Make struct dimm_data contain decoded information EDAC/ie31200: Make the memory controller resources configurable EDAC/ie31200: Simplify the pci_device_id table EDAC/ie31200: Fix the 3rd parameter name of *populate_dimm_info() EDAC/ie31200: Fix the error path order of ie31200_init() EDAC/ie31200: Fix the DIMM size mask for several SoCs EDAC/ie31200: Fix the size of EDAC_MC_LAYER_CHIP_SELECT layer EDAC/{skx_common,i10nm}: Fix some missing error reports on Emerald Rapids EDAC/igen6: Fix the flood of invalid error reports EDAC/ie31200: work around false positive build warning * ras/edac-misc: MAINTAINERS: Add a secondary maintainer for bluefield_edac EDAC/pnd2: Make read-only const array intlv static EDAC/igen6: Constify struct res_config EDAC/amd64: Simplify return statement in dct_ecc_enabled() EDAC: Use string choice helper functions Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
2025-03-10EDAC/ie31200: Switch Raptor Lake-S to interrupt modeQiuxu Zhuo
Raptor Lake-S SoCs notify correctable memory errors via CMCI (Corrected Machine Check Interrupt). Switch Raptor Lake-S EDAC support from polling to interrupt mode by registering the callback to the MCE decode notifier chain. Note that as Raptor Lake-S SoCs may not recover from uncorrectable memory errors, the system will hang as soon as this type of error occurs, and the registered callback on the MCE decode chain will not be executed. This is the expected behavior. Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com> Signed-off-by: Tony Luck <tony.luck@intel.com> Tested-by: Gary Wang <gary.c.wang@intel.com> Link: https://lore.kernel.org/r/20250310011411.31685-12-qiuxu.zhuo@intel.com
2025-03-10EDAC/ie31200: Add Intel Raptor Lake-S SoCs supportQiuxu Zhuo
The Intel Raptor Lake-S SoC contains two memory controllers with DDR5 memory type and out-of-band ECC capability. The resource definitions of the memory controller are different from previous generations. One notable difference is that the PCI ERRSTS register is deprecated and is not used to indicate the presence of errors or to clear the MMIO-mapped ECC error log regsiters. Extend the ie31200_edac driver to support multiple memory controllers, add a resource configuration table and use an MSR register to clear the ECC error log registers to provide EDAC support for Raptor Lake-S SoCs. Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com> Signed-off-by: Tony Luck <tony.luck@intel.com> Tested-by: Gary Wang <gary.c.wang@intel.com> Link: https://lore.kernel.org/r/20250310011411.31685-11-qiuxu.zhuo@intel.com
2025-03-10EDAC/ie31200: Break up ie31200_probe1()Qiuxu Zhuo
Split ie31200_probe1() into two helper functions to easily extend support for multiple memory controllers. No functional changes intended. Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com> Signed-off-by: Tony Luck <tony.luck@intel.com> Tested-by: Gary Wang <gary.c.wang@intel.com> Link: https://lore.kernel.org/r/20250310011411.31685-10-qiuxu.zhuo@intel.com
2025-03-10EDAC/ie31200: Fold the two channel loops into one loopQiuxu Zhuo
Fold the two channel loops to simplify the code and improve readability. Also, delete the comments related to the DRB register, as this register is not used here. No functional changes intended. Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com> Signed-off-by: Tony Luck <tony.luck@intel.com> Tested-by: Gary Wang <gary.c.wang@intel.com> Link: https://lore.kernel.org/r/20250310011411.31685-9-qiuxu.zhuo@intel.com
2025-03-10EDAC/ie31200: Make struct dimm_data contain decoded informationQiuxu Zhuo
The current dimm_data structure contains encoded DIMM information, which needs to be decoded for a given SoC when it is used. Make it contain decoded information when it's initialized so that the places where it is used do not need to decode it again, thereby simplifying the code. No functional changes intended. Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com> Signed-off-by: Tony Luck <tony.luck@intel.com> Tested-by: Gary Wang <gary.c.wang@intel.com> Link: https://lore.kernel.org/r/20250310011411.31685-8-qiuxu.zhuo@intel.com
2025-03-10EDAC/ie31200: Make the memory controller resources configurableQiuxu Zhuo
The resources such as MMIO, register offset, register mask, memory DIMM information, ECC error log location, etc., of the memory controller, and the number of memory controllers can be device-ID-specific. It requires adding numerous 'if (device_id == new_id)' special handling cases to the code to support a new SoC. Make these kinds of resources configurable and separate them from the code to facilitate the addition of new SoC support. No functional changes intended. Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com> Signed-off-by: Tony Luck <tony.luck@intel.com> Tested-by: Gary Wang <gary.c.wang@intel.com> Link: https://lore.kernel.org/r/20250310011411.31685-7-qiuxu.zhuo@intel.com
2025-03-10EDAC/ie31200: Simplify the pci_device_id tableQiuxu Zhuo
Use PCI_VDEVICE() to simplify the pci_device_id table. No functional changes intended. Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com> Signed-off-by: Tony Luck <tony.luck@intel.com> Tested-by: Gary Wang <gary.c.wang@intel.com> Link: https://lore.kernel.org/r/20250310011411.31685-6-qiuxu.zhuo@intel.com
2025-03-10EDAC/ie31200: Fix the 3rd parameter name of *populate_dimm_info()Qiuxu Zhuo
The 3rd parameter of *populate_dimm_info() pertains to the DIMM index within a channel, not the channel index. Fix the parameter name to dimm to reflect its actual purpose. No functional changes intended. Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com> Signed-off-by: Tony Luck <tony.luck@intel.com> Tested-by: Gary Wang <gary.c.wang@intel.com> Link: https://lore.kernel.org/r/20250310011411.31685-5-qiuxu.zhuo@intel.com
2025-03-10EDAC/ie31200: Fix the error path order of ie31200_init()Qiuxu Zhuo
The error path order of ie31200_init() is incorrect, fix it. Fixes: 709ed1bcef12 ("EDAC/ie31200: Fallback if host bridge device is already initialized") Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com> Signed-off-by: Tony Luck <tony.luck@intel.com> Tested-by: Gary Wang <gary.c.wang@intel.com> Link: https://lore.kernel.org/r/20250310011411.31685-4-qiuxu.zhuo@intel.com
2025-03-10EDAC/ie31200: Fix the DIMM size mask for several SoCsQiuxu Zhuo
The DIMM size mask for {Sky, Kaby, Coffee} Lake is not bits{7:0}, but bits{5:0}. Fix it. Fixes: 953dee9bbd24 ("EDAC, ie31200_edac: Add Skylake support") Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com> Signed-off-by: Tony Luck <tony.luck@intel.com> Tested-by: Gary Wang <gary.c.wang@intel.com> Link: https://lore.kernel.org/r/20250310011411.31685-3-qiuxu.zhuo@intel.com
2025-03-10EDAC/ie31200: Fix the size of EDAC_MC_LAYER_CHIP_SELECT layerQiuxu Zhuo
The EDAC_MC_LAYER_CHIP_SELECT layer pertains to the rank, not the DIMM. Fix its size to reflect the number of ranks instead of the number of DIMMs. Also delete the unused macros IE31200_{DIMMS,RANKS}. Fixes: 7ee40b897d18 ("ie31200_edac: Introduce the driver") Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com> Signed-off-by: Tony Luck <tony.luck@intel.com> Tested-by: Gary Wang <gary.c.wang@intel.com> Link: https://lore.kernel.org/r/20250310011411.31685-2-qiuxu.zhuo@intel.com
2025-03-05EDAC/device: Fix dev_set_name() format stringArnd Bergmann
Passing a variable string as the format to dev_set_name() causes a W=1 warning: drivers/edac/edac_device.c:736:9: error: format not a string literal and no format arguments [-Werror=format-security] 736 | ret = dev_set_name(&ctx->dev, name); | ^~~ Use a literal "%s" instead so the name can be the argument. Fixes: db99ea5f2c03 ("EDAC: Add support for EDAC device features control") Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/r/20250304143603.995820-1-arnd@kernel.org
2025-03-03EDAC/pnd2: Make read-only const array intlv staticColin Ian King
Don't populate the const read-only array intlv on the stack at run time, instead make it static. This also shrinks the object size: $ size pnd2_edac.o.* text data bss dec hex filename 15632 264 1384 17280 4380 pnd2_edac.o.new 15644 264 1384 17292 438c pnd2_edac.o.old Signed-off-by: Colin Ian King <colin.i.king@gmail.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com> Link: https://lore.kernel.org/r/20240919170427.497429-1-colin.i.king@gmail.com
2025-03-03EDAC/igen6: Constify struct res_configChristophe JAILLET
The res_config structs are not modified in this driver. Constifying these structures moves some data to a read-only section, so increase overall security, especially when the structure holds some function pointers. On a x86_64, with allmodconfig, as an example: Before: ====== text data bss dec hex filename 36777 2479 4304 43560 aa28 drivers/edac/igen6_edac.o After: ===== text data bss dec hex filename 37297 1959 4304 43560 aa28 drivers/edac/igen6_edac.o Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com> Link: https://lore.kernel.org/r/a06153870951a64b438e76adf97d440e02c1a1fc.1738355198.git.christophe.jaillet@wanadoo.fr
2025-02-28EDAC/amd64: Simplify return statement in dct_ecc_enabled()Thorsten Blum
Simplify the return statement to improve the code's readability. No functional changes. Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com> Reviewed-by: Yazen Ghannam <yazen.ghannam@amd.com> Link: https://lore.kernel.org/r/20250201130953.1377-2-thorsten.blum@linux.dev
2025-02-26EDAC: Update memory repair control interface for memory sparing featureShiju Jose
Update memory repair control interface for memory sparing feature. CXL memory devices can support soft and hard memory sparing at cacheline, row, bank and rank granularities. Memory sparing is defined as a repair function that replaces a portion of memory with a portion of functional memory at that same granularity. When a CXL device detects an error in memory, it will report to the host that there's need for a repair maintenance operation by using an event record where the "maintenance needed" flag is set. The event records contain the device physical address (DPA) and other attributes of the memory to repair such as bank group, bank, rank, row, column, channel etc. The kernel will report the corresponding CXL general media or DRAM trace event to userspace, and userspace tools (e.g. rasdaemon) will initiate a repair operation in response to the device request via the sysfs repair control. [ bp: Massage. ] Signed-off-by: Shiju Jose <shiju.jose@huawei.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/r/20250212143654.1893-15-shiju.jose@huawei.com
2025-02-26EDAC: Add a memory repair control featureShiju Jose
Add a generic EDAC memory repair control driver to manage memory repairs in the system, such as CXL Post Package Repair (PPR) and other soft and hard PPR features. For example, a CXL device with DRAM components that support PPR features may implement PPR maintenance operations. DRAM components may support two types of PPR: - hard PPR, for a permanent row repair, and - soft PPR, for a temporary row repair. Soft PPR is much faster than hard PPR, but the repair is lost with a power cycle. When a CXL device detects an error in a memory, it may report the need for a repair maintenance operation by using an event record where the "maintenance needed" flag is set. The event records contain the device physical address (DPA) and other optional attributes of the memory to repair. The kernel will report the corresponding CXL general media or DRAM trace event to userspace, and userspace tools (e.g. rasdaemon) will initiate a repair operation in response to the device request via the sysfs repair control. Device with memory repair features registers with EDAC device driver, which retrieves a memory repair descriptor from EDAC memory repair driver and exposes the sysfs repair control attributes to userspace in /sys/bus/edac/devices/<dev-name>/mem_repairX/. The common memory repair control interface abstracts the control of arbitrary memory repair functionality into a standardized set of functions. The sysfs memory repair attribute nodes are only available if the client driver has implemented the corresponding attribute callback function and provided operations to the EDAC device driver during registration. [ bp: Massage, fixup edac_dev_register() retvals, merge write_overflow fix to mem_repair_create_desc() ] Signed-off-by: Shiju Jose <shiju.jose@huawei.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/r/20250212143654.1893-5-shiju.jose@huawei.com
2025-02-25EDAC: Use string choice helper functionsThorsten Blum
Remove hard-coded strings by using the str_enabled_disabled(), str_yes_no(), str_write_read(), and str_plural() helper functions. Add a space in "All DIMMs support ECC: yes/no" to improve readability. Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Yazen Ghannam <yazen.ghannam@amd.com> Reviewed-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com> Link: https://lore.kernel.org/r/20250223212429.3466-2-thorsten.blum@linux.dev
2025-02-25EDAC: Add a Error Check Scrub control featureShiju Jose
Add an Error Check Scrub (ECS) control to manage a memory device's ECS feature. The ECS is a feature defined in JEDEC DDR5 SDRAM Specification (JESD79-5) and allows the DRAM to internally read, correct single-bit errors, and write back corrected data bits to the DRAM array while providing transparency to error counts. The DDR5 device contains a number of memory media Field Replaceable Units (FRU) per device. The DDR5 ECS feature and thus the ECS control driver supports configuring the ECS parameters per FRU. Memory devices support the ECS feature register with the EDAC device driver, which retrieves the ECS descriptor from the EDAC ECS driver. This driver exposes sysfs ECS control attributes to userspace via /sys/bus/edac/devices/<dev-name>/ecs_fruX/. The common sysfs ECS control interface abstracts the control of an arbitrary ECS functionality to a common set of functions. Support for the ECS feature is added separately because the control attributes of the DDR5 ECS feature differ from those of the scrub feature. The sysfs ECS attribute nodes are only present if the client driver has implemented the corresponding attribute callback function and passed the necessary operations to the EDAC RAS feature driver during registration. [ bp: Massage, fixup edac_dev_register() retvals. ] Co-developed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> Signed-off-by: Shiju Jose <shiju.jose@huawei.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Fan Ni <fan.ni@samsung.com> Tested-by: Fan Ni <fan.ni@samsung.com> Link: https://lore.kernel.org/r/20250212143654.1893-4-shiju.jose@huawei.com
2025-02-25EDAC: Add scrub control featureShiju Jose
Add a scrub control to manage memory scrubbers in the system. Devices with a scrub feature register with the EDAC device driver which retrieves the scrub descriptor from the scrub driver and exposes the control attributes for a instance to userspace at /sys/bus/edac/devices/<dev-name>/scrubX/. The common sysfs scrub control interface abstracts the control of arbitrary scrubbing functionality into a common set of functions. The attribute nodes are only present if the client driver has implemented the corresponding attribute callback function and passed the operations to the device driver during registration. [ bp: Massage commit message, docs and code, simplify text a bit. Integrate fixup for: https://lore.kernel.org/r/202502251009.0sGkolEJ-lkp@intel.com Reported-by: kernel test robot <lkp@intel.com> Reported-by: Dan Carpenter <dan.carpenter@linaro.org> ] Co-developed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> Signed-off-by: Shiju Jose <shiju.jose@huawei.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Tested-by: Daniel Ferguson <danielf@os.amperecomputing.com> Tested-by: Fan Ni <fan.ni@samsung.com> Link: https://lore.kernel.org/r/20250212143654.1893-3-shiju.jose@huawei.com
2025-02-25EDAC: Add support for EDAC device features controlShiju Jose
Add generic EDAC device feature controls supporting the registration of RAS features available in the system. The driver exposes control attributes for these features to userspace in /sys/bus/edac/devices/<dev-name>/<ras-feature> [ bp: Touch-up documentation, simplify, make edac_dev_type static, fixup edac_dev_register() retvals. ] Co-developed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> Signed-off-by: Shiju Jose <shiju.jose@huawei.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Fan Ni <fan.ni@samsung.com> Tested-by: Daniel Ferguson <danielf@os.amperecomputing.com> Tested-by: Fan Ni <fan.ni@samsung.com> Link: https://lore.kernel.org/r/20250212143654.1893-2-shiju.jose@huawei.com
2025-02-20EDAC/{skx_common,i10nm}: Fix some missing error reports on Emerald RapidsQiuxu Zhuo
When doing error injection to some memory DIMMs on certain Intel Emerald Rapids servers, the i10nm_edac missed error reports for some memory DIMMs. Certain BIOS configurations may hide some memory controllers, and the i10nm_edac doesn't enumerate these hidden memory controllers. However, the ADXL decodes memory errors using memory controller physical indices even if there are hidden memory controllers. Therefore, the memory controller physical indices reported by the ADXL may mismatch the logical indices enumerated by the i10nm_edac, resulting in missed error reports for some memory DIMMs. Fix this issue by creating a mapping table from memory controller physical indices (used by the ADXL) to logical indices (used by the i10nm_edac) and using it to convert the physical indices to the logical indices during the error handling process. Fixes: c545f5e41225 ("EDAC/i10nm: Skip the absent memory controllers") Reported-by: Kevin Chang <kevin1.chang@intel.com> Tested-by: Kevin Chang <kevin1.chang@intel.com> Reported-by: Thomas Chen <Thomas.Chen@intel.com> Tested-by: Thomas Chen <Thomas.Chen@intel.com> Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com> Signed-off-by: Tony Luck <tony.luck@intel.com> Link: https://lore.kernel.org/r/20250214002728.6287-1-qiuxu.zhuo@intel.com
2025-02-20EDAC/igen6: Fix the flood of invalid error reportsQiuxu Zhuo
The ECC_ERROR_LOG register of certain SoCs may contain the invalid value ~0, which results in a flood of invalid error reports in polling mode. Fix the flood of invalid error reports by skipping the invalid ECC error log value ~0. Fixes: e14232afa944 ("EDAC/igen6: Add polling support") Reported-by: Ramses <ramses@well-founded.dev> Closes: https://lore.kernel.org/all/OISL8Rv--F-9@well-founded.dev/ Tested-by: Ramses <ramses@well-founded.dev> Reported-by: John <therealgraysky@proton.me> Closes: https://lore.kernel.org/all/p5YcxOE6M3Ncxpn2-Ia_wCt61EM4LwIiN3LroQvT_-G2jMrFDSOW5k2A9D8UUzD2toGpQBN1eI0sL5dSKnkO8iteZegLoQEj-DwQaMhGx4A=@proton.me/ Tested-by: John <therealgraysky@proton.me> Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com> Signed-off-by: Tony Luck <tony.luck@intel.com> Link: https://lore.kernel.org/r/20250212083354.31919-1-qiuxu.zhuo@intel.com
2025-02-20EDAC/ie31200: work around false positive build warningArnd Bergmann
gcc-14 produces a bogus warning in some configurations: drivers/edac/ie31200_edac.c: In function 'ie31200_probe1.isra': drivers/edac/ie31200_edac.c:412:26: error: 'dimm_info' is used uninitialized [-Werror=uninitialized] 412 | struct dimm_data dimm_info[IE31200_CHANNELS][IE31200_DIMMS_PER_CHANNEL]; | ^~~~~~~~~ drivers/edac/ie31200_edac.c:412:26: note: 'dimm_info' declared here 412 | struct dimm_data dimm_info[IE31200_CHANNELS][IE31200_DIMMS_PER_CHANNEL]; | ^~~~~~~~~ I don't see any way the unintialized access could really happen here, but I can see why the compiler gets confused by the two loops. Instead, rework the two nested loops to only read the addr_decode registers and then keep only one instance of the dimm info structure. [Tony: Qiuxu pointed out that the "populate DIMM info" comment was left behind in the refactor and suggested moving it. I deleted the comment as unnecessry in front os a call to populate_dimm_info(). That seems pretty self-describing.] Signed-off-by: Arnd Bergmann <arnd@arndb.de> Acked-by: Jason Baron <jbaron@akamai.com> Signed-off-by: Tony Luck <tony.luck@intel.com> Link: https://lore.kernel.org/all/20250122065031.1321015-1-arnd@kernel.org
2025-02-14EDAC/qcom: Correct interrupt enable register configurationKomal Bajaj
The previous implementation incorrectly configured the cmn_interrupt_2_enable register for interrupt handling. Using cmn_interrupt_2_enable to configure Tag, Data RAM ECC interrupts would lead to issues like double handling of the interrupts (EL1 and EL3) as cmn_interrupt_2_enable is meant to be configured for interrupts which needs to be handled by EL3. EL1 LLCC EDAC driver needs to use cmn_interrupt_0_enable register to configure Tag, Data RAM ECC interrupts instead of cmn_interrupt_2_enable. Fixes: 27450653f1db ("drivers: edac: Add EDAC driver support for QCOM SoCs") Signed-off-by: Komal Bajaj <quic_kbajaj@quicinc.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org> Cc: <stable@kernel.org> Link: https://lore.kernel.org/r/20241119064608.12326-1-quic_kbajaj@quicinc.com
2025-01-21Merge tag 'x86_misc_for_v6.14_rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull misc x86 updates from Borislav Petkov: - The first part of a restructuring of AMD's representation of a northbridge which is legacy now, and the creation of the new AMD node concept which represents the Zen architecture of having a collection of I/O devices within an SoC. Those nodes comprise the so-called data fabric on Zen. This has at least one practical advantage of not having to add a PCI ID each time a new data fabric PCI device releases. Eventually, the lot more uniform provider of data fabric functionality amd_node.c will be used by all the drivers which need it - Smaller cleanups * tag 'x86_misc_for_v6.14_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/amd_node: Use defines for SMN register offsets x86/amd_node: Remove dependency on AMD_NB x86/amd_node: Update __amd_smn_rw() error paths x86/amd_nb: Move SMN access code to a new amd_node driver x86/amd_nb, hwmon: (k10temp): Simplify amd_pci_dev_to_node_id() x86/amd_nb: Simplify function 3 search x86/amd_nb: Use topology info to get AMD node count x86/amd_nb: Simplify root device search x86/amd_nb: Simplify function 4 search x86: Start moving AMD node functionality out of AMD_NB x86/amd_nb: Clean up early_is_amd_nb() x86/amd_nb: Restrict init function to AMD-based systems x86/mtrr: Rename mtrr_overwrite_state() to guest_force_mtrr_state()
2025-01-21Merge tag 'x86_cpu_for_v6.14_rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 cpuid updates from Borislav Petkov: - Remove the less generic CPU matching infra around struct x86_cpu_desc and use the generic struct x86_cpu_id thing - Remove magic naked numbers for CPUID functions and use proper defines of the prefix CPUID_LEAF_*. Consolidate some of the crazy use around the tree - Smaller cleanups and improvements * tag 'x86_cpu_for_v6.14_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/cpu: Make all all CPUID leaf names consistent x86/fpu: Remove unnecessary CPUID level check x86/fpu: Move CPUID leaf definitions to common code x86/tsc: Remove CPUID "frequency" leaf magic numbers. x86/tsc: Move away from TSC leaf magic numbers x86/cpu: Move TSC CPUID leaf definition x86/cpu: Refresh DCA leaf reading code x86/cpu: Remove unnecessary MwAIT leaf checks x86/cpu: Use MWAIT leaf definition x86/cpu: Move MWAIT leaf definition to common header x86/cpu: Remove 'x86_cpu_desc' infrastructure x86/cpu: Move AMD erratum 1386 table over to 'x86_cpu_id' x86/cpu: Replace PEBS use of 'x86_cpu_desc' use with 'x86_cpu_id' x86/cpu: Expose only stepping min/max interface x86/cpu: Introduce new microcode matching helper x86/cpufeature: Document cpu_feature_enabled() as the default to use x86/paravirt: Remove the WBINVD callback x86/cpufeatures: Free up unused feature bits
2025-01-21Merge tag 'edac_updates_for_v6.14' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/ras/ras Pull EDAC updates from Borislav Petkov: - Remove the EDAC PowerPC Cell driver due to the removal of the IBM Cell blades support - Add a new EDAC driver for Loongson SoCs which reports single-bit correctable errors - Extend the SKX and i10NM EDAC drivers to support UV systems which can have more than 8 nodes - Add Intel Clearwater Forest server support to i10nm_edac - Minor fix * tag 'edac_updates_for_v6.14' of git://git.kernel.org/pub/scm/linux/kernel/git/ras/ras: EDAC/cell: Remove powerpc Cell driver EDAC: Add an EDAC driver for the Loongson memory controller EDAC: Fix typos in comments EDAC/{i10nm,skx,skx_common}: Support UV systems EDAC/i10nm: Add Intel Clearwater Forest server support
2025-01-17Merge remote-tracking branches 'ras/edac-drivers' and 'ras/edac-misc' into ↵Borislav Petkov (AMD)
edac-updates * ras/edac-drivers: EDAC/cell: Remove powerpc Cell driver EDAC: Add an EDAC driver for the Loongson memory controller EDAC/{i10nm,skx,skx_common}: Support UV systems EDAC/i10nm: Add Intel Clearwater Forest server support * ras/edac-misc: EDAC: Fix typos in comments Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
2025-01-16EDAC/cell: Remove powerpc Cell driverMichael Ellerman
This driver can no longer be built since support for IBM Cell Blades was removed, in particular PPC_CELL_COMMON. Remove the driver. [ bp: Remove EDAC_CELL from Cell's defconfig too. ] Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/r/20241218105523.416573-23-mpe@ellerman.id.au
2025-01-08x86/amd_nb: Move SMN access code to a new amd_node driverMario Limonciello
SMN access was bolted into amd_nb mostly as convenience. This has limitations though that require incurring tech debt to keep it working. Move SMN access to the newly introduced AMD Node driver. Signed-off-by: Mario Limonciello <mario.limonciello@amd.com> Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Acked-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com> # pdx86 Acked-by: Shyam Sundar S K <Shyam-sundar.S-k@amd.com> # PMF, PMC Link: https://lore.kernel.org/r/20241206161210.163701-11-yazen.ghannam@amd.com
2025-01-04EDAC: Add an EDAC driver for the Loongson memory controllerZhao Qunqin
Add ECC support for Loongson SoC DDR controller. This driver reports single bit errors (CE) only. Only ACPI firmware is supported. [ bp: Document what last_ce_count is for. ] Signed-off-by: Zhao Qunqin <zhaoqunqin@loongson.cn> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Huacai Chen <chenhuacai@loongson.cn> Link: https://lore.kernel.org/r/20241219124846.1876-1-zhaoqunqin@loongson.cn Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
2024-12-17x86/cpu: Expose only stepping min/max interfaceDave Hansen
The x86_match_cpu() infrastructure can match CPU steppings. Since there are only 16 possible steppings, the matching infrastructure goes all out and stores the stepping match as a bitmap. That means it can match any possible steppings in a single list entry. Fun. But it exposes this bitmap to each of the X86_MATCH_*() helpers when none of them really need a bitmap. It makes up for this by exporting a helper (X86_STEPPINGS()) which converts a contiguous stepping range into the bitmap which every single user leverages. Instead of a bitmap, have the main helper for this sort of thing (X86_MATCH_VFM_STEPS()) just take a stepping range. This ends up actually being even more compact than before. Leave the helper in place (renamed to __X86_STEPPINGS()) to make it more clear what is going on instead of just having a random GENMASK() in the middle of an already complicated macro. One oddity that I hit was this macro: X86_MATCH_VFM_STEPS(vfm, X86_STEPPING_MIN, max_stepping, issues) It *could* have been converted over to take a min/max stepping value for each entry. But that would have been a bit too verbose and would prevent the one oddball in the list (INTEL_COMETLAKE_L stepping 0) from sticking out. Instead, just have it take a *maximum* stepping and imply that the match is from 0=>max_stepping. This is functional for all the cases now and also retains the nice property of having INTEL_COMETLAKE_L stepping 0 stick out like a sore thumb. skx_cpuids[] is goofy. It uses the stepping match but encodes all possible steppings. Just use a normal, non-stepping match helper. Suggested-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Link: https://lore.kernel.org/all/20241213185129.65527B2A%40davehans-spike.ostc.intel.com
2024-12-15EDAC: Fix typos in commentsYan Zhen
Fix the following typos: 'Alocate' ==> 'Allocate', 'specifed' ==> 'specified', 'Technlogy' ==> 'Technology', 'Brnach' ==> 'Branch', 'branchs' ==> 'branches'. Signed-off-by: Yan Zhen <yanzhen@vivo.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/r/20240930074023.618110-1-yanzhen@vivo.com
2024-12-13EDAC/{i10nm,skx,skx_common}: Support UV systemsKyle Meyer
The 3-bit source IDs in PCI configuration space registers, used to map devices to sockets, are limited to 8 unique IDs, and each ID is local to a UPI/QPI domain. Source IDs cannot be used to map devices to sockets on UV systems because they can exceed 8 sockets and have multiple UPI/QPI domains with identical, repeating source IDs. Use NUMA information to get package IDs instead of source IDs on UV systems, and use package/source IDs to name IMC information structures. Signed-off-by: Kyle Meyer <kyle.meyer@hpe.com> Signed-off-by: Tony Luck <tony.luck@intel.com> Tested-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com> Reviewed-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com> Link: https://lore.kernel.org/all/20241213012549.43099-1-kyle.meyer@hpe.com/
2024-12-11EDAC/amd64: Simplify ECC check on unified memory controllersBorislav Petkov (AMD)
The intent of the check is to see whether at least one UMC has ECC enabled. So do that instead of tracking which ones are enabled in masks which are too small in size anyway and lead to not loading the driver on Zen4 machines with UMCs enabled over UMC8. Fixes: e2be5955a886 ("EDAC/amd64: Add support for AMD Family 19h Models 10h-1Fh and A0h-AFh") Reported-by: Avadhut Naik <avadhut.naik@amd.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Tested-by: Avadhut Naik <avadhut.naik@amd.com> Reviewed-by: Avadhut Naik <avadhut.naik@amd.com> Cc: <stable@kernel.org> Link: https://lore.kernel.org/r/20241210212054.3895697-1-avadhut.naik@amd.com
2024-12-09EDAC/i10nm: Add Intel Clearwater Forest server supportQiuxu Zhuo
Clearwater Forest is the successor to Sierra Forest. Add Clearwater Forest CPU model ID for EDAC support. Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com> Signed-off-by: Tony Luck <tony.luck@intel.com> Tested-by: Yi Lai <yi1.lai@intel.com> Link: https://lore.kernel.org/r/20241203022038.72873-1-qiuxu.zhuo@intel.com
2024-12-01Get rid of 'remove_new' relic from platform driver structLinus Torvalds
The continual trickle of small conversion patches is grating on me, and is really not helping. Just get rid of the 'remove_new' member function, which is just an alias for the plain 'remove', and had a comment to that effect: /* * .remove_new() is a relic from a prototype conversion of .remove(). * New drivers are supposed to implement .remove(). Once all drivers are * converted to not use .remove_new any more, it will be dropped. */ This was just a tree-wide 'sed' script that replaced '.remove_new' with '.remove', with some care taken to turn a subsequent tab into two tabs to make things line up. I did do some minimal manual whitespace adjustment for places that used spaces to line things up. Then I just removed the old (sic) .remove_new member function, and this is the end result. No more unnecessary conversion noise. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2024-11-23Merge tag 'powerpc-6.13-1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux Pull powerpc updates from Michael Ellerman: - Rework kfence support for the HPT MMU to work on systems with >= 16TB of RAM. - Remove the powerpc "maple" platform, used by the "Yellow Dog Powerstation". - Add support for DYNAMIC_FTRACE_WITH_CALL_OPS, DYNAMIC_FTRACE_WITH_DIRECT_CALLS & BPF Trampolines. - Add support for running KVM nested guests on Power11. - Other small features, cleanups and fixes. Thanks to Amit Machhiwal, Arnd Bergmann, Christophe Leroy, Costa Shulyupin, David Hunter, David Wang, Disha Goel, Gautam Menghani, Geert Uytterhoeven, Hari Bathini, Julia Lawall, Kajol Jain, Keith Packard, Lukas Bulwahn, Madhavan Srinivasan, Markus Elfring, Michal Suchanek, Ming Lei, Mukesh Kumar Chaurasiya, Nathan Chancellor, Naveen N Rao, Nicholas Piggin, Nysal Jan K.A, Paulo Miguel Almeida, Pavithra Prakash, Ritesh Harjani (IBM), Rob Herring (Arm), Sachin P Bappalige, Shen Lichuan, Simon Horman, Sourabh Jain, Thomas Weißschuh, Thorsten Blum, Thorsten Leemhuis, Venkat Rao Bagalkote, Zhang Zekun, and zhang jiao. * tag 'powerpc-6.13-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: (89 commits) EDAC/powerpc: Remove PPC_MAPLE drivers powerpc/perf: Add per-task/process monitoring to vpa_pmu driver powerpc/kvm: Add vpa latency counters to kvm_vcpu_arch docs: ABI: sysfs-bus-event_source-devices-vpa-pmu: Document sysfs event format entries for vpa_pmu powerpc/perf: Add perf interface to expose vpa counters MAINTAINERS: powerpc: Mark Maddy as "M" powerpc/Makefile: Allow overriding CPP powerpc-km82xx.c: replace of_node_put() with __free ps3: Correct some typos in comments powerpc/kexec: Fix return of uninitialized variable macintosh: Use common error handling code in via_pmu_led_init() powerpc/powermac: Use of_property_match_string() in pmac_has_backlight_type() powerpc: remove dead config options for MPC85xx platform support powerpc/xive: Use cpumask_intersects() selftests/powerpc: Remove the path after initialization. powerpc/xmon: symbol lookup length fixed powerpc/ep8248e: Use %pa to format resource_size_t powerpc/ps3: Reorganize kerneldoc parameter names KVM: PPC: Book3S HV: Fix kmv -> kvm typo powerpc/sstep: make emulate_vsx_load and emulate_vsx_store static ...
2024-11-19Merge tag 'ras_core_for_v6.13' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull RAS updates from Borislav Petkov: - Log and handle twp new AMD-specific MCA registers: SYND1 and SYND2 and report the Field Replaceable Unit text info reported through them - Add support for handling variable-sized SMCA BERT records - Add the capability for reporting vendor-specific RAS error info without adding vendor-specific fields to struct mce - Cleanups * tag 'ras_core_for_v6.13' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: EDAC/mce_amd: Add support for FRU text in MCA x86/mce/apei: Handle variable SMCA BERT record size x86/MCE/AMD: Add support for new MCA_SYND{1,2} registers tracing: Add __print_dynamic_array() helper x86/mce: Add wrapper for struct mce to export vendor specific info x86/mce/intel: Use MCG_BANKCNT_MASK instead of 0xff x86/mce/mcelog: Use xchg() to get and clear the flags
2024-11-19Merge tag 'edac_updates_for_v6.13' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/ras/ras Pull EDAC updates from Borislav Petkov: - Add support for Bluefield-2 SOCs to bluefield_edac - Add support for Intel Panther Lake-H to igen6_edac - Add polling support to igen6_edac as some Intel M100 chips have trouble with error interrupts - Add Kaby Lake-S support to ie31200_edac - Fix memory source detection in the SKX common module which is used by a couple of Intel EDAC drivers - Add support for the NXP i.MX9 memory controller to fsl_edac - The usual fixes and cleanups all over the place * tag 'edac_updates_for_v6.13' of git://git.kernel.org/pub/scm/linux/kernel/git/ras/ras: EDAC/igen6: Add polling support EDAC/igen6: Initialize edac_op_state according to the configuration data EDAC/igen6: Avoid segmentation fault on module unload EDAC/ie31200: Add Kaby Lake-S dual-core host bridge ID MAINTAINERS: Change FSL DDR EDAC maintainership EDAC/{skx_common,i10nm}: Fix incorrect far-memory error source indicator EDAC/skx_common: Differentiate memory error sources EDAC/fsl_ddr: Add support for i.MX9 DDR controller dt-bindings: memory: fsl: Add compatible string nxp,imx9-memory-controller EDAC/fsl_ddr: Fix bad bit shift operations EDAC/fsl_ddr: Move global variables into struct fsl_mc_pdata EDAC/fsl_ddr: Pass down fsl_mc_pdata in ddr_in32() and ddr_out32() RAS/AMD/ATL: Add debug prints for DF register reads EDAC/bluefield: Use Arm SMC for EMI access on BlueField-2 EDAC/bluefield: Fix potential integer overflow EDAC/igen6: Add Intel Panther Lake-H SoCs support
2024-11-19EDAC/powerpc: Remove PPC_MAPLE driversMichael Ellerman
These two drivers are only buildable for the powerpc "maple" platform (CONFIG_PPC_MAPLE), which has now been removed, see commit 62f8f307c80e ("powerpc/64: Remove maple platform"). Remove the drivers. Acked-by: Borislav Petkov (AMD) <bp@alien8.de> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://patch.msgid.link/20241112084134.411964-1-mpe@ellerman.id.au
2024-11-18Merge branch 'edac-misc' into edac-updatesBorislav Petkov (AMD)
* edac-misc: MAINTAINERS: Change FSL DDR EDAC maintainership RAS/AMD/ATL: Add debug prints for DF register reads EDAC/bluefield: Use Arm SMC for EMI access on BlueField-2 EDAC/bluefield: Fix potential integer overflow EDAC/igen6: Add Intel Panther Lake-H SoCs support Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
2024-11-08EDAC/igen6: Add polling supportOrange Kao
Some PCs with Intel N100 (with PCI device 8086:461c, DID_ADL_N_SKU4) experienced issues with error interrupts not working, even with the following configuration in the BIOS. In-Band ECC Support: Enabled In-Band ECC Operation Mode: 2 (make all requests protected and ignore range checks) IBECC Error Injection Control: Inject Correctable Error on insertion counter Error Injection Insertion Count: 251658240 (0xf000000) Add polling mode support for these machines to ensure that memory error events are handled. Signed-off-by: Orange Kao <orange@aiven.io> Signed-off-by: Tony Luck <tony.luck@intel.com> Reviewed-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com> Link: https://lore.kernel.org/all/20241106114024.941659-3-orange@aiven.io
2024-11-08EDAC/igen6: Initialize edac_op_state according to the configuration dataQiuxu Zhuo
Currently, igen6_edac sets edac_op_state to EDAC_OPSTATE_NMI, while the driver also supports memory errors reported from Machine Check. Initialize edac_op_state to the correct value according to the configuration data that the driver probed. Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com> Signed-off-by: Tony Luck <tony.luck@intel.com> Link: https://lore.kernel.org/all/20241106114024.941659-2-orange@aiven.io
2024-11-04EDAC/igen6: Avoid segmentation fault on module unloadOrange Kao
The segmentation fault happens because: During modprobe: 1. In igen6_probe(), igen6_pvt will be allocated with kzalloc() 2. In igen6_register_mci(), mci->pvt_info will point to &igen6_pvt->imc[mc] During rmmod: 1. In mci_release() in edac_mc.c, it will kfree(mci->pvt_info) 2. In igen6_remove(), it will kfree(igen6_pvt); Fix this issue by setting mci->pvt_info to NULL to avoid the double kfree. Fixes: 10590a9d4f23 ("EDAC/igen6: Add EDAC driver for Intel client SoCs using IBECC") Closes: https://bugzilla.kernel.org/show_bug.cgi?id=219360 Signed-off-by: Orange Kao <orange@aiven.io> Signed-off-by: Tony Luck <tony.luck@intel.com> Link: https://lore.kernel.org/r/20241104124237.124109-2-orange@aiven.io
2024-11-04EDAC/ie31200: Add Kaby Lake-S dual-core host bridge IDJames Ye
Add device ID for dual-core Kaby Lake-S processors e.g. i3-7100. Signed-off-by: James Ye <jye836@gmail.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Acked-by: Jason Baron <jbaron@akamai.com> Link: https://lore.kernel.org/r/20240824120622.46226-1-jye836@gmail.com
2024-10-31EDAC/mce_amd: Add support for FRU text in MCAYazen Ghannam
A new "FRU Text in MCA" feature is defined where the Field Replaceable Unit (FRU) Text for a device is represented by a string in the new MCA_SYND1 and MCA_SYND2 registers. This feature is supported per MCA bank, and it is advertised by the McaFruTextInMca bit (MCA_CONFIG[9]). The FRU Text is populated dynamically for each individual error state (MCA_STATUS, MCA_ADDR, et al.). Handle the case where an MCA bank covers multiple devices, for example, a Unified Memory Controller (UMC) bank that manages two DIMMs. [ Yazen: Add Avadhut as co-developer for wrapper changes. ] [ bp: Do not expose MCA_CONFIG to userspace yet. ] Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com> Co-developed-by: Avadhut Naik <avadhut.naik@amd.com> Signed-off-by: Avadhut Naik <avadhut.naik@amd.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/r/20241022194158.110073-6-avadhut.naik@amd.com