Age | Commit message (Collapse) | Author |
|
Pull CXL updates from Dan Williams:
"CXL has mechanisms to enumerate the performance characteristics of
memory devices. Those mechanisms allow Linux to build the equivalent
of ACPI SRAT, SLIT, and HMAT tables dynamically at runtime. That
capability is necessary because static ACPI can not represent dynamic
CXL configurations (and reconfigurations).
So, building on the v6.8 work to add "Quality of Service" enumeration,
this update plumbs CXL "access coordinates" (read/write access latency
and bandwidth) in all the same places that ACPI HMAT feeds similar
data. Follow-on patches from the -mm side can then use that data to
feed mechanisms like mm/memory-tiers.c. Greg has acked the touch to
drivers/base/.
The other feature update this cycle is support for CXL error injection
via the ACPI EINJ module. That facility enables injection of bus
protocol errors provided the user knows the magic address values to
insert in the interface. To hide that magic, and make this easier to
use, new error injection attributes were added to CXL debugfs. That
interface injects the errors relative to a CXL object rather than
require user tooling to know how to lookup and inject RCRB (Root
Complex Register Block) addresses into the raw EINJ debugfs interface.
It received some helpful review comments from Tony, but no explicit
acks from the ACPI side. The primary user visible change for existing
EINJ users is that they may find that einj.ko was already loaded by
cxl_core.ko. Previously, einj.ko was only loaded on demand.
The usual collection of miscellaneous cleanups are also present this
cycle.
Summary:
- Supplement ACPI HMAT reported memory performance with native CXL
memory performance enumeration
- Add support for CXL error injection via the ACPI EINJ mechanism
- Cleanup CXL DOE and CDAT integration
- Miscellaneous cleanups and fixes"
* tag 'cxl-for-6.9' of git://git.kernel.org/pub/scm/linux/kernel/git/cxl/cxl: (21 commits)
Documentation/ABI/testing/debugfs-cxl: Fix "Unexpected indentation"
lib/firmware_table: Provide buffer length argument to cdat_table_parse()
cxl/pci: Get rid of pointer arithmetic reading CDAT table
cxl/pci: Rename DOE mailbox handle to doe_mb
cxl: Fix the incorrect assignment of SSLBIS entry pointer initial location
cxl/core: Add CXL EINJ debugfs files
EINJ, Documentation: Update EINJ kernel doc
EINJ: Add CXL error type support
EINJ: Migrate to a platform driver
cxl/region: Deal with numa nodes not enumerated by SRAT
cxl/region: Add memory hotplug notifier for cxl region
cxl/region: Add sysfs attribute for locality attributes of CXL regions
cxl/region: Calculate performance data for a region
cxl: Set cxlmd->endpoint before adding port device
cxl: Move QoS class to be calculated from the nearest CPU
cxl: Split out host bridge access coordinates
cxl: Split out combine_coordinates() for common shared usage
ACPI: HMAT / cxl: Add retrieval of generic port coordinates for both access classes
ACPI: HMAT: Introduce 2 levels of generic port access class
base/node / ACPI: Enumerate node access class for 'struct access_coordinate'
...
|
|
Move CXL protocol error types from einj.c (now einj-core.c) to einj-cxl.c.
einj-cxl.c implements the necessary handling for CXL protocol error
injection and exposes an API for the CXL core to use said functionality,
while also allowing the EINJ module to be built without CXL support.
Because CXL error types targeting CXL 1.0/1.1 ports require special
handling, only allow them to be injected through the new cxl debugfs
interface (next commit) and return an error when attempting to inject
through the legacy interface.
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Signed-off-by: Ben Cheatham <Benjamin.Cheatham@amd.com>
Link: https://lore.kernel.org/r/20240311142508.31717-3-Benjamin.Cheatham@amd.com
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
|
|
Change the EINJ module to install a platform device/driver on module
init and move the module init() and exit() functions to driver probe and
remove. This change allows the EINJ module to load regardless of whether
setting up EINJ succeeds, which allows dependent modules to still load
(i.e. the CXL core).
Since EINJ may no longer be initialized when the module loads, any
functions that are called from dependent/external modules should
safegaurd against the case EINJ didn't load.
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Ben Cheatham <Benjamin.Cheatham@amd.com>
Link: https://lore.kernel.org/r/20240311142508.31717-2-Benjamin.Cheatham@amd.com
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
|
|
Architecture
To support GHES_ASSIST on Machine Check Architecture (MCA) error sources,
a set of GHES structures is provided by the system firmware for each MCA
error source. Each of these sets consists of a GHES structure for each MCA
bank on each logical CPU, with all structures of a set sharing a common
Related Source ID, equal to the Source ID of one of the MCA error source
structures.[1] On SOCs with large core counts, this typically equates to
tens of thousands of GHES_ASSIST structures for MCA under
"/sys/bus/platform/drivers/GHES".
Support for GHES_ASSIST however, hasn't been implemented in the kernel. As
such, the information provided through these structures is not consumed by
Linux. Moreover, these GHES_ASSIST structures for MCA, which are supposed
to provide supplemental information in context of an error reported by
hardware, are setup as independent error sources by the kernel during HEST
initialization.
Additionally, if the Type field of the Notification structure, associated
with these GHES_ASSIST structures for MCA, is set to Polled, the kernel
sets up a timer for each individual structure. The duration of the timer
is derived from the Poll Interval field of the Notification structure. On
SOCs with high core counts, this will result in tens of thousands of
timers expiring periodically causing unnecessary preemptions and wastage
of CPU cycles. The problem will particularly intensify if Poll Interval
duration is not sufficiently high.
Since GHES_ASSIST support is not present in kernel, skip initialization
of GHES_ASSIST structures for MCA to eliminate their performance impact.
[1] ACPI specification 6.5, section 18.7
Signed-off-by: Avadhut Naik <avadhut.naik@amd.com>
Reviewed-by: Yazen Ghannam <yazen.ghannam@amd.com>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
|
|
The .remove() callback for a platform driver returns an int which makes
many driver authors wrongly assume it's possible to do error handling by
returning an error code. However the value returned is ignored (apart
from emitting a warning) and this typically results in resource leaks.
To improve here there is a quest to make the remove callback return
void. In the first step of this quest all drivers are converted to
.remove_new(), which already returns void. Eventually after all drivers
are converted, .remove_new() will be renamed to .remove().
Instead of returning an error code, emit a better error message than the
core. Apart from the improved error message this patch has no effects
for the driver.
Signed-off-by: Uwe Kleine-König <u.kleine-koenig@pengutronix.de>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
|
|
Initial tests with the CXL CPER implementation identified that error
reports were being duplicated in the log and the trace event [1]. Then
it was discovered that the notification handler took sleeping locks
while the GHES event handling runs in spin_lock_irqsave() context [2]
While the duplicate reporting was fixed in v6.8-rc4, the fix for the
sleeping-lock-vs-atomic collision would enjoy more time to settle and
gain some test cycles. Given how late it is in the development cycle,
remove the CXL hookup for now and try again during the next merge
window.
Note that end result is that v6.8 does not emit CXL CPER payloads to the
kernel log, but this is in line with the CXL trend to move error
reporting to trace events instead of the kernel log.
Cc: Ard Biesheuvel <ardb@kernel.org>
Cc: Rafael J. Wysocki <rafael@kernel.org>
Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Link: http://lore.kernel.org/r/20240108165855.00002f5a@Huawei.com [1]
Closes: http://lore.kernel.org/r/b963c490-2c13-4b79-bbe7-34c6568423c7@moroto.mountain [2]
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
|
|
Jonathan reports that CXL CPER events dump an extra generic error
message.
{1}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 1
{1}[Hardware Error]: event severity: recoverable
{1}[Hardware Error]: Error 0, type: recoverable
{1}[Hardware Error]: section type: unknown, fbcd0a77-c260-417f-85a9-088b1621eba6
{1}[Hardware Error]: section length: 0x90
{1}[Hardware Error]: 00000000: 00000090 00000007 00000000 0d938086 ................
{1}[Hardware Error]: 00000010: 00100000 00000000 00040000 00000000 ................
...
CXL events were rerouted though the CXL subsystem for additional
processing. However, when that work was done it was missed that
cper_estatus_print_section() continued with a generic error message
which is confusing.
Teach CPER print code to ignore printing details of some section types.
Assign the CXL event GUIDs to this set to prevent confusing unknown
prints.
Reported-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Suggested-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Dave Jiang <dave.jiang@intel.com>
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Reviewed-by: Alison Schofield <alison.schofield@intel.com>
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
|
|
Pull CXL (Compute Express Link) updates from Dan Williams:
"The bulk of this update is support for enumerating the performance
capabilities of CXL memory targets and connecting that to a platform
CXL memory QoS class. Some follow-on work remains to hook up this data
into core-mm policy, but that is saved for v6.9.
The next significant update is unifying how CXL event records (things
like background scrub errors) are processed between so called
"firmware first" and native error record retrieval. The CXL driver
handler that processes the record retrieved from the device mailbox is
now the handler for that same record format coming from an EFI/ACPI
notification source.
This also contains miscellaneous feature updates, like Get Timestamp,
and other fixups.
Summary:
- Add support for parsing the Coherent Device Attribute Table (CDAT)
- Add support for calculating a platform CXL QoS class from CDAT data
- Unify the tracing of EFI CXL Events with native CXL Events.
- Add Get Timestamp support
- Miscellaneous cleanups and fixups"
* tag 'cxl-for-6.8' of git://git.kernel.org/pub/scm/linux/kernel/git/cxl/cxl: (41 commits)
cxl/core: use sysfs_emit() for attr's _show()
cxl/pci: Register for and process CPER events
PCI: Introduce cleanup helpers for device reference counts and locks
acpi/ghes: Process CXL Component Events
cxl/events: Create a CXL event union
cxl/events: Separate UUID from event structures
cxl/events: Remove passing a UUID to known event traces
cxl/events: Create common event UUID defines
cxl/events: Promote CXL event structures to a core header
cxl: Refactor to use __free() for cxl_root allocation in cxl_endpoint_port_probe()
cxl: Refactor to use __free() for cxl_root allocation in cxl_find_nvdimm_bridge()
cxl: Fix device reference leak in cxl_port_perf_data_calculate()
cxl: Convert find_cxl_root() to return a 'struct cxl_root *'
cxl: Introduce put_cxl_root() helper
cxl/port: Fix missing target list lock
cxl/port: Fix decoder initialization when nr_targets > interleave_ways
cxl/region: fix x9 interleave typo
cxl/trace: Pass UUID explicitly to event traces
cxl/region: use %pap format to print resource_size_t
cxl/region: Add dev_dbg() detail on failure to allocate HPA space
...
|
|
BIOS can configure memory devices as firmware first. This will send CXL
events to the firmware instead of the OS. The firmware can then send
these events to the OS via UEFI.
UEFI v2.10 section N.2.14 defines a Common Platform Error Record (CPER)
format for CXL Component Events. The format is mostly the same as the
CXL Common Event Record Format. The difference is the use of a GUID in
the Section Type rather than a UUID as part of the event itself.
Add GHES support to detect CXL CPER records and call a registered
callback with the event.
A notifier chain was considered for the callback but the complexity did
not justify the use case as only the CXL subsystem requires this event.
Enforce that only one callback can be registered at any time.
Cc: Ard Biesheuvel <ardb@kernel.org>
Cc: Rafael J. Wysocki <rafael@kernel.org>
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Link: https://lore.kernel.org/r/20231220-cxl-cper-v5-7-1bb8a4ca2c7a@intel.com
[djbw: fixup checkpatch errors]
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Acked-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
|
|
There are two major types of uncorrected recoverable (UCR) errors :
- Synchronous error: The error is detected and raised at the point of
the consumption in the execution flow, e.g. when a CPU tries to
access a poisoned cache line. The CPU will take a synchronous error
exception such as Synchronous External Abort (SEA) on Arm64 and
Machine Check Exception (MCE) on X86. OS requires to take action (for
example, offline failure page/kill failure thread) to recover this
uncorrectable error.
- Asynchronous error: The error is detected out of processor execution
context, e.g. when an error is detected by a background scrubber.
Some data in the memory are corrupted. But the data have not been
consumed. OS is optional to take action to recover this uncorrectable
error.
When APEI firmware first is enabled, a platform may describe one error
source for the handling of synchronous errors (e.g. MCE or SEA notification
), or for handling asynchronous errors (e.g. SCI or External Interrupt
notification). In other words, we can distinguish synchronous errors by
APEI notification. For synchronous errors, kernel will kill the current
process which accessing the poisoned page by sending SIGBUS with
BUS_MCEERR_AR. In addition, for asynchronous errors, kernel will notify the
process who owns the poisoned page by sending SIGBUS with BUS_MCEERR_AO in
early kill mode. However, the GHES driver always sets mf_flags to 0 so that
all synchronous errors are handled as asynchronous errors in memory failure.
To this end, set memory failure flags as MF_ACTION_REQUIRED on synchronous
events.
Signed-off-by: Shuai Xue <xueshuai@linux.alibaba.com>
Tested-by: Ma Wupeng <mawupeng1@huawei.com>
Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Reviewed-by: Xiaofei Tan <tanxiaofei@huawei.com>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Reviewed-by: James Morse <james.morse@arm.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
|
|
Vendor-Defined Error types are supported by the platform apart from
standard error types if bit 31 is set in the output of GET_ERROR_TYPE
Error Injection Action.[1] While the errors themselves and the length
of their associated "OEM Defined data structure" might vary between
vendors, the physical address of this structure can be computed through
vendor_extension and length fields of "SET_ERROR_TYPE_WITH_ADDRESS" and
"Vendor Error Type Extension" Structures respectively.[2][3]
Currently, however, the einj module only computes the physical address of
Vendor Error Type Extension Structure. Neither does it compute the physical
address of OEM Defined structure nor does it establish the memory mapping
required for injecting Vendor-defined errors. Consequently, userspace
tools have to establish the very mapping through /dev/mem, nopat kernel
parameter and system calls like mmap/munmap initially before injecting
Vendor-defined errors.
Circumvent the issue by computing the physical address of OEM Defined data
structure and establishing the required mapping with the structure. Create
a new file "oem_error", if the system supports Vendor-defined errors, to
export this mapping, through debugfs_create_blob(). Userspace tools can
then populate their respective OEM Defined structure instances and just
write to the file as part of injecting Vendor-defined Errors. Similarly,
the tools can also read from the file if the system firmware provides some
information through the OEM defined structure after error injection.
[1] ACPI specification 6.5, section 18.6.4
[2] ACPI specification 6.5, Table 18.31
[3] ACPI specification 6.5, Table 18.32
Suggested-by: Yazen Ghannam <yazen.ghannam@amd.com>
Signed-off-by: Avadhut Naik <Avadhut.Naik@amd.com>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Reviewed-by: Borislav Petkov (AMD) <bp@alien8.de>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
|
|
OSPM can discover the error injection capabilities of the platform by
executing GET_ERROR_TYPE error injection action.[1] The action returns
a DWORD representing a bitmap of platform supported error injections.[2]
The available_error_type_show() function determines the bits set within
this DWORD and provides a verbose output, from einj_error_type_string
array, through /sys/kernel/debug/apei/einj/available_error_type file.
The function however, assumes one to one correspondence between an error's
position in the bitmap and its array entry offset. Consequently, some
errors like Vendor Defined Error Type fail this assumption and will
incorrectly be shown as not supported, even if their corresponding bit is
set in the bitmap and they have an entry in the array.
Navigate around the issue by converting einj_error_type_string into an
array of structures with a predetermined mask for all error types
corresponding to their bit position in the DWORD returned by GET_ERROR_TYPE
action. The same breaks the aforementioned assumption resulting in all
supported error types by a platform being outputted through the above
available_error_type file.
[1] ACPI specification 6.5, Table 18.25
[2] ACPI specification 6.5, Table 18.30
Suggested-by: Alexey Kardashevskiy <alexey.kardashevskiy@amd.com>
Signed-off-by: Avadhut Naik <Avadhut.Naik@amd.com>
Reviewed-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
|
|
Slow devices such as flash may not meet the default 1ms timeout value,
so use the ERST max execution time value that they provide as the
timeout if it is larger.
Signed-off-by: Jeshua Smith <jeshuas@nvidia.com>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
|
|
ghes_handle_aer() passes AER data to the PCI core for logging and
recovery by calling aer_recover_queue() with a pointer to struct
aer_capability_regs.
The problem was that aer_recover_queue() queues the pointer directly
without copying the aer_capability_regs data. The pointer was to
the ghes->estatus buffer, which could be reused before
aer_recover_work_func() reads the data.
To avoid this problem, allocate a new aer_capability_regs structure
from the ghes_estatus_pool, copy the AER data from the ghes->estatus
buffer into it, pass a pointer to the new struct to
aer_recover_queue(), and free it after aer_recover_work_func() has
processed it.
Reported-by: Bjorn Helgaas <helgaas@kernel.org>
Acked-by: Bjorn Helgaas <bhelgaas@google.com>
Signed-off-by: Shiju Jose <shiju.jose@huawei.com>
[ rjw: Subject edits ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
|
|
Since 315bada690e0 ("EDAC: Check for GHES preference in the
chipset-specific EDAC drivers"), vendor specific EDAC driver will not
probe correctly when CONFIG_ACPI_APEI_GHES is enabled but no GHES device
is present. Make ghes_get_devices() return NULL when the GHES device
list is empty to fix the problem.
Fixes: 9057a3f7ac36 ("EDAC/ghes: Prepare to make ghes_edac a proper module")
Signed-off-by: Li Yang <leoyang.li@nxp.com>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
|
|
It's only used inside the __init section. Mark it __initdata.
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
|
|
ghes_estatus_pool_size_request() is unused now, so remove it.
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
[ rjw: Subject and changelog edits ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
|
|
The cper.c file needs to include an extra header, and efi_zboot_entry
needs an extern declaration to avoid these 'make W=1' warnings:
drivers/firmware/efi/libstub/zboot.c:65:1: error: no previous prototype for 'efi_zboot_entry' [-Werror=missing-prototypes]
drivers/firmware/efi/efi.c:176:16: error: no previous prototype for 'efi_attr_is_visible' [-Werror=missing-prototypes]
drivers/firmware/efi/cper.c:626:6: error: no previous prototype for 'cper_estatus_print' [-Werror=missing-prototypes]
drivers/firmware/efi/cper.c:649:5: error: no previous prototype for 'cper_estatus_check_header' [-Werror=missing-prototypes]
drivers/firmware/efi/cper.c:662:5: error: no previous prototype for 'cper_estatus_check' [-Werror=missing-prototypes]
To make this easier, move the cper specific declarations to
include/linux/cper.h.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
|
|
OSPM executes an EXECUTE_OPERATION action to instruct the platform to begin
the injection operation, then executes a GET_COMMAND_STATUS action to
determine the status of the completed operation. The ACPI Specification
documented error codes[1] are:
0 = Success (Linux #define EINJ_STATUS_SUCCESS)
1 = Unknown failure (Linux #define EINJ_STATUS_FAIL)
2 = Invalid Access (Linux #define EINJ_STATUS_INVAL)
The original code report -EBUSY for both "Unknown Failure" and "Invalid
Access" cases. Actually, firmware could do some platform dependent sanity
checks and returns different error codes, e.g. "Invalid Access" to indicate
to the user that the parameters they supplied cannot be used for injection.
To this end, fix to return -EINVAL in the __einj_error_inject() error
handling case instead of always -EBUSY, when explicitly indicated by the
platform in the status of the completed operation.
[1] ACPI Specification 6.5 18.6.1. Error Injection Table
Signed-off-by: Shuai Xue <xueshuai@linux.alibaba.com>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
|
|
ACPI 6.5 added six new error types for CXL. See chapter 18
table 18.30.
Add strings for the new types so that Linux will list them in the
/sys/kernel/debug/apei/einj/available_error_types file.
It seems no other changes are needed. Linux already accepts
the CXL codes (on a BIOS that advertises them).
Signed-off-by: Tony Luck <tony.luck@intel.com>
Reviewed-by: Davidlohr Bueso <dave@stgolabs.net>
Reviewed-by: Dave Jiang <dave.jiang@intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
|
|
The bit map of error types to inject is 32-bit width [1].
Add parameter check to reflect the fact.
[1] ACPI Specification 6.4, Section 18.6.4. Error Types
Signed-off-by: Shuai Xue <xueshuai@linux.alibaba.com>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
|
|
Due to several bugs caused by timers being re-armed after they are
shutdown and just before they are freed, a new state of timers was added
called "shutdown". After a timer is set to this state, then it can no
longer be re-armed.
The following script was run to find all the trivial locations where
del_timer() or del_timer_sync() is called in the same function that the
object holding the timer is freed. It also ignores any locations where
the timer->function is modified between the del_timer*() and the free(),
as that is not considered a "trivial" case.
This was created by using a coccinelle script and the following
commands:
$ cat timer.cocci
@@
expression ptr, slab;
identifier timer, rfield;
@@
(
- del_timer(&ptr->timer);
+ timer_shutdown(&ptr->timer);
|
- del_timer_sync(&ptr->timer);
+ timer_shutdown_sync(&ptr->timer);
)
... when strict
when != ptr->timer
(
kfree_rcu(ptr, rfield);
|
kmem_cache_free(slab, ptr);
|
kfree(ptr);
)
$ spatch timer.cocci . > /tmp/t.patch
$ patch -p1 < /tmp/t.patch
Link: https://lore.kernel.org/lkml/20221123201306.823305113@linutronix.de/
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Acked-by: Pavel Machek <pavel@ucw.cz> [ LED ]
Acked-by: Kalle Valo <kvalo@kernel.org> [ wireless ]
Acked-by: Paolo Abeni <pabeni@redhat.com> [ networking ]
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/ras/ras
Pull EDAC updates from Borislav Petkov:
- Make ghes_edac a simple module like the rest of the EDAC drivers and
drop the forced built-in only configuration by disentangling it from
GHES (Jia He)
- The usual small cleanups and improvements all over EDAC land
* tag 'edac_updates_for_6.2' of git://git.kernel.org/pub/scm/linux/kernel/git/ras/ras:
EDAC/i10nm: fix refcount leak in pci_get_dev_wrapper()
EDAC/i5400: Fix typo in comment: vaious -> various
EDAC/mc_sysfs: Increase legacy channel support to 12
MAINTAINERS: Make Mauro EDAC reviewer
MAINTAINERS: Make Manivannan Sadhasivam the maintainer of qcom_edac
EDAC/igen6: Return the correct error type when not the MC owner
apei/ghes: Use xchg_release() for updating new cache slot instead of cmpxchg()
EDAC: Check for GHES preference in the chipset-specific EDAC drivers
EDAC/ghes: Make ghes_edac a proper module
EDAC/ghes: Prepare to make ghes_edac a proper module
EDAC/ghes: Add a notifier for reporting memory errors
efi/cper: Export several helpers for ghes_edac to use
EDAC/i5000: Mark as BROKEN
|
|
Move error type descriptions into an array and loop over error types
to improve readability and maintainability.
Replace seq_printf() with seq_puts() as recommended by checkpatch.pl.
Signed-off-by: Jay Lu <jaylu102@amd.com>
Co-developed-by: Ben Cheatham <benjamin.cheatham@amd.com>
Signed-off-by: Ben Cheatham <benjamin.cheatham@amd.com>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
|
|
Checkpatch reveals warnings and an error due to missing lines and
incorrect indentations. Add the missing lines after declarations and
fix the suspect indentations.
Signed-off-by: Jay Lu <jaylu102@amd.com>
Signed-off-by: Ben Cheatham <benjamin.cheatham@amd.com>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
|
|
This file does not use rcu, so there is no point in including
<linux/rculist.h>.
So just remove it.
Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
|
|
Silence the following warnings when make W=1:
| CC drivers/acpi/apei/apei-base.c
| warning: no previous prototype for 'arch_apei_enable_cmcff' [-Wmissing-prototypes]
| int __weak arch_apei_enable_cmcff(struct acpi_hest_header *hest_hdr,
| ^
| CC drivers/acpi/apei/apei-base.c
| warning: no previous prototype for 'arch_apei_report_mem_error' [-Wmissing-prototypes]
| void __weak arch_apei_report_mem_error(int sev,
| ^
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Sudeep Holla <sudeep.holla@arm.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
|
|
Some documentation first, about how this machinery works:
It seems, the intent of the GHES error records cache is to collect
already reported errors - see the ghes_estatus_cached() checks. There's
even a sentence trying to say what this does:
/*
* GHES error status reporting throttle, to report more kinds of
* errors, instead of just most frequently occurred errors.
*/
New elements are added to the cache this way:
if (!ghes_estatus_cached(estatus)) {
if (ghes_print_estatus(NULL, ghes->generic, estatus))
ghes_estatus_cache_add(ghes->generic, estatus);
The intent being, once this new error record is reported, it gets cached
so that it doesn't get reported for a while due to too many, same-type
error records getting reported in burst-like scenarios. I.e., new,
unreported error types can have a higher chance of getting reported.
Now, the loop in ghes_estatus_cache_add() is trying to pick out the
oldest element in there. Meaning, something which got reported already
but a long while ago, i.e., a LRU-type scheme.
And the cmpxchg() is there presumably to make sure when that selected
element slot_cache is removed, it really *is* that element that gets
removed and not one which replaced it in the meantime.
Now, ghes_estatus_cache_add() selects a slot, and either succeeds in
replacing its contents with a pointer to a newly cached item, or it just
gives up and frees the new item again, without attempting to select
another slot even if one might be available.
Since only inserting new items is being done here, the race can only
cause a failure if the selected slot was updated with another new item
concurrently, which means that it is arbitrary which of those two items
gets dropped.
And "dropped" here means, the item doesn't get added to the cache so
the next time it is seen, it'll get reported again and an insertion
attempt will be done again. Eventually, it'll get inserted and all those
times when the insertion fails, the item will get reported although the
cache is supposed to prevent that and "ratelimit" those repeated error
records. Not a big deal in any case.
This means the cmpxchg() and the special case are not necessary.
Therefore, just drop the existing item unconditionally.
Move the xchg_release() and call_rcu() out of rcu_read_lock/unlock
section since there is no actually dereferencing the pointer at all.
[ bp:
- Flesh out and summarize what was discussed on the thread now
that that cache contraption is understood;
- Touch up code style. ]
Co-developed-by: Jia He <justin.he@arm.com>
Signed-off-by: Jia He <justin.he@arm.com>
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lore.kernel.org/r/20221010023559.69655-7-justin.he@arm.com
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
|
|
Since commit 0998d0631001 ("device-core: Ensure drvdata = NULL when no
driver is bound") the driver core cares for cleaning driver data, so
don't do it in the driver, too.
Signed-off-by: Uwe Kleine-König <u.kleine-koenig@pengutronix.de>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
|
|
Some documentation first, about how this machinery works:
It seems, the intent of the GHES error records cache is to collect
already reported errors - see the ghes_estatus_cached() checks. There's
even a sentence trying to say what this does:
/*
* GHES error status reporting throttle, to report more kinds of
* errors, instead of just most frequently occurred errors.
*/
New elements are added to the cache this way:
if (!ghes_estatus_cached(estatus)) {
if (ghes_print_estatus(NULL, ghes->generic, estatus))
ghes_estatus_cache_add(ghes->generic, estatus);
The intent being, once this new error record is reported, it gets cached
so that it doesn't get reported for a while due to too many, same-type
error records getting reported in burst-like scenarios. I.e., new,
unreported error types can have a higher chance of getting reported.
Now, the loop in ghes_estatus_cache_add() is trying to pick out the
oldest element in there. Meaning, something which got reported already
but a long while ago, i.e., a LRU-type scheme.
And the cmpxchg() is there presumably to make sure when that selected
element slot_cache is removed, it really *is* that element that gets
removed and not one which replaced it in the meantime.
Now, ghes_estatus_cache_add() selects a slot, and either succeeds in
replacing its contents with a pointer to a newly cached item, or it just
gives up and frees the new item again, without attempting to select
another slot even if one might be available.
Since only inserting new items is being done here, the race can only
cause a failure if the selected slot was updated with another new item
concurrently, which means that it is arbitrary which of those two items
gets dropped.
And "dropped" here means, the item doesn't get added to the cache so
the next time it is seen, it'll get reported again and an insertion
attempt will be done again. Eventually, it'll get inserted and all those
times when the insertion fails, the item will get reported although the
cache is supposed to prevent that and "ratelimit" those repeated error
records. Not a big deal in any case.
This means the cmpxchg() and the special case are not necessary.
Therefore, just drop the existing item unconditionally.
Move the xchg_release() and call_rcu() out of rcu_read_lock/unlock
section since there is no actually dereferencing the pointer at all.
[ bp:
- Flesh out and summarize what was discussed on the thread now
that that cache contraption is understood;
- Touch up code style. ]
Co-developed-by: Jia He <justin.he@arm.com>
Signed-off-by: Jia He <justin.he@arm.com>
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lore.kernel.org/r/20221010023559.69655-7-justin.he@arm.com
|
|
Commit
dc4e8c07e9e2 ("ACPI: APEI: explicit init of HEST and GHES in apci_init()")
introduced a bug leading to ghes_edac_register() to be invoked before
edac_init(). Because at that time the bus "edac" hadn't been even
registered, this created sysfs nodes as /devices/mc0 instead of
/sys/devices/system/edac/mc/mc0 on an Ampere eMag server.
Fix this by turning ghes_edac into a proper module.
The list of GHES devices returned is not protected from being modified
concurrently but it is pretty static as it gets created only during GHES
init and latter is not a module so...
[ bp: Massage. ]
Fixes: dc4e8c07e9e2 ("ACPI: APEI: explicit init of HEST and GHES in apci_init()")
Co-developed-by: Borislav Petkov <bp@alien8.de>
Signed-off-by: Borislav Petkov <bp@alien8.de>
Signed-off-by: Jia He <justin.he@arm.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lore.kernel.org/r/20221010023559.69655-5-justin.he@arm.com
|
|
To make ghes_edac a proper module, prepare to decouple its dependencies
from GHES.
Move the ghes_edac.force_load parameter to ghes.c in order to
properly control whether ghes_edac should be force-loaded: In
ghes_edac_register() it is too late to set the module flag.
Introduce a helper ghes_get_devices(), which returns the list of GHES
devices which got probed when the platform-check passes on the system.
The previous force_load check is not needed in ghes_edac_unregister()
since it will be checked in the module's init function of ghes_edac
later.
[ bp: Massage. ]
Suggested-by: Toshi Kani <toshi.kani@hpe.com>
Suggested-by: Borislav Petkov <bp@alien8.de>
Signed-off-by: Jia He <justin.he@arm.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lore.kernel.org/r/20221010023559.69655-4-justin.he@arm.com
|
|
In order to make it a proper module and disentangle it from facilities,
add a notifier for reporting memory errors. Use an atomic notifier
because calls sites like ghes_proc_in_irq() run in interrupt context.
[ bp: Massage commit message. ]
Suggested-by: Borislav Petkov <bp@alien8.de>
Signed-off-by: Jia He <justin.he@arm.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lore.kernel.org/r/20221010023559.69655-3-justin.he@arm.com
|
|
Change num_ghes from int to unsigned int, preventing an overflow
and causing subsequent vmalloc() to fail.
The overflow happens in ghes_estatus_pool_init() when calculating
len during execution of the statement below as both multiplication
operands here are signed int:
len += (num_ghes * GHES_ESOURCE_PREALLOC_MAX_SIZE);
The following call trace is observed because of this bug:
[ 9.317108] swapper/0: vmalloc error: size 18446744071562596352, exceeds total pages, mode:0xcc0(GFP_KERNEL), nodemask=(null),cpuset=/,mems_allowed=0-1
[ 9.317131] Call Trace:
[ 9.317134] <TASK>
[ 9.317137] dump_stack_lvl+0x49/0x5f
[ 9.317145] dump_stack+0x10/0x12
[ 9.317146] warn_alloc.cold+0x7b/0xdf
[ 9.317150] ? __device_attach+0x16a/0x1b0
[ 9.317155] __vmalloc_node_range+0x702/0x740
[ 9.317160] ? device_add+0x17f/0x920
[ 9.317164] ? dev_set_name+0x53/0x70
[ 9.317166] ? platform_device_add+0xf9/0x240
[ 9.317168] __vmalloc_node+0x49/0x50
[ 9.317170] ? ghes_estatus_pool_init+0x43/0xa0
[ 9.317176] vmalloc+0x21/0x30
[ 9.317177] ghes_estatus_pool_init+0x43/0xa0
[ 9.317179] acpi_hest_init+0x129/0x19c
[ 9.317185] acpi_init+0x434/0x4a4
[ 9.317188] ? acpi_sleep_proc_init+0x2a/0x2a
[ 9.317190] do_one_initcall+0x48/0x200
[ 9.317195] kernel_init_freeable+0x221/0x284
[ 9.317200] ? rest_init+0xe0/0xe0
[ 9.317204] kernel_init+0x1a/0x130
[ 9.317205] ret_from_fork+0x22/0x30
[ 9.317208] </TASK>
Signed-off-by: Ashish Kalra <ashish.kalra@amd.com>
[ rjw: Subject and changelog edits ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
|
|
If an error is detected as a result of user-space process accessing a
corrupt memory location, the CPU may take an abort. Then the platform
firmware reports kernel via NMI like notifications, e.g. NOTIFY_SEA,
NOTIFY_SOFTWARE_DELEGATED, etc.
For NMI like notifications, commit 7f17b4a121d0 ("ACPI: APEI: Kick the
memory_failure() queue for synchronous errors") keep track of whether
memory_failure() work was queued, and make task_work pending to flush out
the queue so that the work is processed before return to user-space.
The code use init_mm to check whether the error occurs in user space:
if (current->mm != &init_mm)
The condition is always true, becase _nobody_ ever has "init_mm" as a real
VM any more.
In addition to abort, errors can also be signaled as asynchronous
exceptions, such as interrupt and SError. In such case, the interrupted
current process could be any kind of thread. When a kernel thread is
interrupted, the work ghes_kick_task_work deferred to task_work will never
be processed because entry_handler returns to call ret_to_kernel() instead
of ret_to_user(). Consequently, the estatus_node alloced from
ghes_estatus_pool in ghes_in_nmi_queue_one_entry() will not be freed.
After around 200 allocations in our platform, the ghes_estatus_pool will
run of memory and ghes_in_nmi_queue_one_entry() returns ENOMEM. As a
result, the event failed to be processed.
sdei: event 805 on CPU 113 failed with error: -2
Finally, a lot of unhandled events may cause platform firmware to exceed
some threshold and reboot.
The condition should generally just do
if (current->mm)
as described in active_mm.rst documentation.
Then if an asynchronous error is detected when a kernel thread is running,
(e.g. when detected by a background scrubber), do not add task_work to it
as the original patch intends to do.
Fixes: 7f17b4a121d0 ("ACPI: APEI: Kick the memory_failure() queue for synchronous errors")
Signed-off-by: Shuai Xue <xueshuai@linux.alibaba.com>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
|
|
Return the erst_get_record_id_begin() and apei_exec_write_register()
return values directly instead of storing them in redundant local
variables.
Reported-by: Zeal Robot <zealci@zte.com.cn>
Signed-off-by: ye xingchen <ye.xingchen@zte.com.cn>
[ rjw: Changelog edits ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
|
|
Print total number of records found during BERT log parsing.
This also simplify dmesg parser implementation for BERT events.
Signed-off-by: Dmitry Monakhov <dmtrmonakhov@yandex-team.ru>
Acked-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
|
|
When a platform marks a memory range as "special purpose" it is not
onlined as System RAM by default. However, it is still suitable for
error injection. Add IORES_DESC_SOFT_RESERVED to einj_error_inject() as
a permissible memory type in the sanity checking of the arguments to
_EINJ.
Fixes: 262b45ae3ab4 ("x86/efi: EFI soft reservation to E820 enumeration")
Reviewed-by: Tony Luck <tony.luck@intel.com>
Reported-by: Omar Avelar <omar.avelar@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
|
|
The fix in commit 3f8dec116210 ("ACPI/APEI: Limit printable size of BERT
table data") does not work as intended on systems where the BIOS has a
fixed size block of memory for the BERT table, relying on s/w to quit
when it finds a record with estatus->block_status == 0. On these systems
all errors are suppressed because the check:
if (region_len < ACPI_BERT_PRINT_MAX_LEN)
always fails.
New scheme skips individual CPER records that are too large, and also
limits the total number of records that will be printed to 5.
Fixes: 3f8dec116210 ("ACPI/APEI: Limit printable size of BERT table data")
Cc: All applicable <stable@vger.kernel.org>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
|
|
Delete the redundant word 'the'.
Signed-off-by: Xiang wangx <wangxiang@cdjrlc.com>
[ rjw: New subject ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
|
|
Some validation tests dynamically inject errors into memory used by
applications to check that the system can recover from a variety of
poison consumption sceenarios.
But sometimes the virtual address picked by these tests is mapped to
the zero page.
This causes additional unexpected machine checks as other processes that
map the zero page also consume the poison.
Disallow injection to the zero page.
Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
|
|
Read a record is cleared by others, but the deleted record cache entry is
still created by erst_get_record_id_next. When next enumerate the records,
get the cached deleted record, then erst_read() return -ENOENT and try to
get next record, loop back to first ID will return 0 in function
__erst_record_id_cache_add_one and then set record_id as
APEI_ERST_INVALID_RECORD_ID, finished this time read operation.
It will result in read the records just in the cache hereafter.
This patch cleared the deleted record cache, fix the issue that
"./erst-inject -p" shows record counts not equal to "./erst-inject -n".
A reproducer of the problem(retry many times):
[root@localhost erst-inject]# ./erst-inject -c 0xaaaaa00011
[root@localhost erst-inject]# ./erst-inject -p
rc: 273
rcd sig: CPER
rcd id: 0xaaaaa00012
rc: 273
rcd sig: CPER
rcd id: 0xaaaaa00013
rc: 273
rcd sig: CPER
rcd id: 0xaaaaa00014
[root@localhost erst-inject]# ./erst-inject -i 0xaaaaa000006
[root@localhost erst-inject]# ./erst-inject -i 0xaaaaa000007
[root@localhost erst-inject]# ./erst-inject -i 0xaaaaa000008
[root@localhost erst-inject]# ./erst-inject -p
rc: 273
rcd sig: CPER
rcd id: 0xaaaaa00012
rc: 273
rcd sig: CPER
rcd id: 0xaaaaa00013
rc: 273
rcd sig: CPER
rcd id: 0xaaaaa00014
[root@localhost erst-inject]# ./erst-inject -n
total error record count: 6
Signed-off-by: Liu Xinpeng <liuxp11@chinatelecom.cn>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
|
|
While the original code is valid, it is not the obvious choice for the
sizeof() call and in preparation to limit the scope of the list iterator
variable the sizeof should be changed to the size of the variable
being allocated.
Signed-off-by: Jakob Koschel <jakobkoschel@gmail.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
|
|
Platforms with large BERT table data can trigger soft lockup errors
while attempting to print the entire BERT table data to the console at
boot:
watchdog: BUG: soft lockup - CPU#160 stuck for 23s! [swapper/0:1]
Observed on Ampere Altra systems with a single BERT record of ~250KB.
The original bert driver appears to have assumed relatively small table
data. Since it is impractical to reassemble large table data from
interwoven console messages, and the table data is available in
/sys/firmware/acpi/tables/data/BERT
limit the size for tables printed to the console to 1024 (for no reason
other than it seemed like a good place to kick off the discussion, would
appreciate feedback from existing users in terms of what size would
maintain their current usage model).
Alternatively, we could make printing a CONFIG option, use the
bert_disable boot arg (or something similar), or use a debug log level.
However, all those solutions require extra steps or change the existing
behavior for small table data. Limiting the size preserves existing
behavior on existing platforms with small table data, and eliminates the
soft lockups for platforms with large table data, while still making it
available.
Signed-off-by: Darren Hart <darren@os.amperecomputing.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
|
|
__setup() handlers should return 1 to indicate that the boot option
has been handled. Returning 0 causes a boot option to be listed in
the Unknown kernel command line parameters and also added to init's
arg list (if no '=' sign) or environment list (if of the form 'a=b').
Unknown kernel command line parameters "erst_disable
bert_disable hest_disable BOOT_IMAGE=/boot/bzImage-517rc6", will be
passed to user space.
Run /sbin/init as init process
with arguments:
/sbin/init
erst_disable
bert_disable
hest_disable
with environment:
HOME=/
TERM=linux
BOOT_IMAGE=/boot/bzImage-517rc6
Fixes: a3e2acc5e37b ("ACPI / APEI: Add Boot Error Record Table (BERT) support")
Fixes: a08f82d08053 ("ACPI, APEI, Error Record Serialization Table (ERST) support")
Fixes: 9dc966641677 ("ACPI, APEI, HEST table parsing")
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Reported-by: Igor Zhbanov <i.zhbanov@omprussia.ru>
Link: lore.kernel.org/r/64644a2f-4a20-bab3-1e15-3b2cdd0defe3@omprussia.ru
Reviewed-by: "Huang, Ying" <ying.huang@intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
|
|
ghes_init() sticks out in acpi_init() because it is the only functions
without an "acpi_" prefix.
Rename ghes_init with an "acpi_" prefix, then all looks fine.
Signed-off-by: Shuai Xue <xueshuai@linux.alibaba.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
|
|
From commit e147133a42cb ("ACPI / APEI: Make hest.c manage the estatus
memory pool") was merged, ghes_init() relies on acpi_hest_init() to manage
the estatus memory pool. On the other hand, ghes_init() relies on
sdei_init() to detect the SDEI version and (un)register events. The
dependencies are as follows:
ghes_init() => acpi_hest_init() => acpi_bus_init() => acpi_init()
ghes_init() => sdei_init()
HEST is not PCI-specific and initcall ordering is implicit and not
well-defined within a level.
Based on above, remove acpi_hest_init() from acpi_pci_root_init() and
convert ghes_init() and sdei_init() from initcalls to explicit calls in the
following order:
acpi_hest_init()
ghes_init()
sdei_init()
Signed-off-by: Shuai Xue <xueshuai@linux.alibaba.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
|
|
SGX EPC pages do not have a "struct page" associated with them so the
pfn_valid() sanity check fails and results in a warning message to
the console.
Add an additional check to skip the warning if the address of the error
is in an SGX EPC page.
Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Jarkko Sakkinen <jarkko@kernel.org>
Tested-by: Reinette Chatre <reinette.chatre@intel.com>
Link: https://lkml.kernel.org/r/20211026220050.697075-8-tony.luck@intel.com
|
|
SGX reserved memory does not appear in the standard address maps.
Add hook to call into the SGX code to check if an address is located
in SGX memory.
There are other challenges in injecting errors into SGX. Update the
documentation with a sequence of operations to inject.
Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Jarkko Sakkinen <jarkko@kernel.org>
Tested-by: Reinette Chatre <reinette.chatre@intel.com>
Link: https://lkml.kernel.org/r/20211026220050.697075-7-tony.luck@intel.com
|
|
apei_hest_parse() is only used in hest.c, so mark it static.
Signed-off-by: Christoph Hellwig <hch@lst.de>
[ rjw: Minor subject and changelog edits ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
|