summaryrefslogtreecommitdiff
path: root/Documentation/edac/memory_repair.rst
diff options
context:
space:
mode:
Diffstat (limited to 'Documentation/edac/memory_repair.rst')
-rw-r--r--Documentation/edac/memory_repair.rst152
1 files changed, 152 insertions, 0 deletions
diff --git a/Documentation/edac/memory_repair.rst b/Documentation/edac/memory_repair.rst
new file mode 100644
index 000000000000..5f8da7c9b186
--- /dev/null
+++ b/Documentation/edac/memory_repair.rst
@@ -0,0 +1,152 @@
+.. SPDX-License-Identifier: GPL-2.0 OR GFDL-1.2-no-invariants-or-later
+
+==========================
+EDAC Memory Repair Control
+==========================
+
+Copyright (c) 2024-2025 HiSilicon Limited.
+
+:Author: Shiju Jose <shiju.jose@huawei.com>
+:License: The GNU Free Documentation License, Version 1.2 without
+ Invariant Sections, Front-Cover Texts nor Back-Cover Texts.
+ (dual licensed under the GPL v2)
+:Original Reviewers:
+
+- Written for: 6.15
+
+Introduction
+------------
+
+Some memory devices support repair operations to address issues in their
+memory media. Post Package Repair (PPR) and memory sparing are examples of
+such features.
+
+Post Package Repair (PPR)
+~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Post Package Repair is a maintenance operation which requests the memory
+device to perform repair operation on its media. It is a memory self-healing
+feature that fixes a failing memory location by replacing it with a spare row
+in a DRAM device.
+
+For example, a CXL memory device with DRAM components that support PPR
+features implements maintenance operations. DRAM components support those
+types of PPR functions:
+
+ - hard PPR, for a permanent row repair, and
+ - soft PPR, for a temporary row repair.
+
+Soft PPR is much faster than hard PPR, but the repair is lost after a power
+cycle.
+
+The data may not be retained and memory requests may not be correctly
+processed during a repair operation. In such case, the repair operation should
+not be executed at runtime.
+
+For example, for CXL memory devices, see CXL spec rev 3.1 [1]_ sections
+8.2.9.7.1.1 PPR Maintenance Operations, 8.2.9.7.1.2 sPPR Maintenance Operation
+and 8.2.9.7.1.3 hPPR Maintenance Operation for more details.
+
+Memory Sparing
+~~~~~~~~~~~~~~
+
+Memory sparing is a repair function that replaces a portion of memory with
+a portion of functional memory at a particular granularity. Memory
+sparing has cacheline/row/bank/rank sparing granularities. For example, in
+rank memory-sparing mode, one memory rank serves as a spare for other ranks on
+the same channel in case they fail.
+
+The spare rank is held in reserve and not used as active memory until
+a failure is indicated, with reserved capacity subtracted from the total
+available memory in the system.
+
+After an error threshold is surpassed in a system protected by memory sparing,
+the content of a failing rank of DIMMs is copied to the spare rank. The
+failing rank is then taken offline and the spare rank placed online for use as
+active memory in place of the failed rank.
+
+For example, CXL memory devices can support various subclasses for sparing
+operation vary in terms of the scope of the sparing being performed.
+
+Cacheline sparing subclass refers to a sparing action that can replace a full
+cacheline. Row sparing is provided as an alternative to PPR sparing functions
+and its scope is that of a single DDR row. Bank sparing allows an entire bank
+to be replaced. Rank sparing is defined as an operation in which an entire DDR
+rank is replaced.
+
+See CXL spec 3.1 [1]_ section 8.2.9.7.1.4 Memory Sparing Maintenance
+Operations for more details.
+
+.. [1] https://computeexpresslink.org/cxl-specification/
+
+Use cases of generic memory repair features control
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+1. The soft PPR, hard PPR and memory-sparing features share similar control
+ attributes. Therefore, there is a need for a standardized, generic sysfs
+ repair control that is exposed to userspace and used by administrators,
+ scripts and tools.
+
+2. When a CXL device detects an error in a memory component, it informs the
+ host of the need for a repair maintenance operation by using an event
+ record where the "maintenance needed" flag is set. The event record
+ specifies the device physical address (DPA) and attributes of the memory
+ that requires repair. The kernel reports the corresponding CXL general
+ media or DRAM trace event to userspace, and userspace tools (e.g.
+ rasdaemon) initiate a repair maintenance operation in response to the
+ device request using the sysfs repair control.
+
+3. Userspace tools, such as rasdaemon, request a repair operation on a memory
+ region when maintenance need flag set or an uncorrected memory error or
+ excess of corrected memory errors above a threshold value is reported or an
+ exceed corrected errors threshold flag set for that memory.
+
+4. Multiple PPR/sparing instances may be present per memory device.
+
+5. Drivers should enforce that live repair is safe. In systems where memory
+ mapping functions can change between boots, one approach to this is to log
+ memory errors seen on this boot against which to check live memory repair
+ requests.
+
+The File System
+---------------
+
+The control attributes of a registered memory repair instance could be
+accessed in the /sys/bus/edac/devices/<dev-name>/mem_repairX/
+
+sysfs
+-----
+
+Sysfs files are documented in
+`Documentation/ABI/testing/sysfs-edac-memory-repair`.
+
+Examples
+--------
+
+The memory repair usage takes the form shown in this example:
+
+1. CXL memory sparing
+
+Memory sparing is defined as a repair function that replaces a portion of
+memory with a portion of functional memory at that same DPA. The subclass
+for this operation, cacheline/row/bank/rank sparing, vary in terms of the
+scope of the sparing being performed.
+
+Memory sparing maintenance operations may be supported by CXL devices that
+implement CXL.mem protocol. A sparing maintenance operation requests the
+CXL device to perform a repair operation on its media. For example, a CXL
+device with DRAM components that support memory sparing features may
+implement sparing maintenance operations.
+
+2. CXL memory Soft Post Package Repair (sPPR)
+
+Post Package Repair (PPR) maintenance operations may be supported by CXL
+devices that implement CXL.mem protocol. A PPR maintenance operation
+requests the CXL device to perform a repair operation on its media.
+For example, a CXL device with DRAM components that support PPR features
+may implement PPR Maintenance operations. Soft PPR (sPPR) is a temporary
+row repair. Soft PPR may be faster, but the repair is lost with a power
+cycle.
+
+Sysfs files for memory repair are documented in
+`Documentation/ABI/testing/sysfs-edac-memory-repair`