diff options
Diffstat (limited to 'Documentation/arch')
-rw-r--r-- | Documentation/arch/arm64/booting.rst | 23 | ||||
-rw-r--r-- | Documentation/arch/arm64/elf_hwcaps.rst | 6 | ||||
-rw-r--r-- | Documentation/arch/arm64/tagged-pointers.rst | 11 | ||||
-rw-r--r-- | Documentation/arch/riscv/cmodx.rst | 46 | ||||
-rw-r--r-- | Documentation/arch/riscv/hwprobe.rst | 26 | ||||
-rw-r--r-- | Documentation/arch/x86/amd-hfi.rst | 133 | ||||
-rw-r--r-- | Documentation/arch/x86/index.rst | 1 | ||||
-rw-r--r-- | Documentation/arch/x86/mds.rst | 8 | ||||
-rw-r--r-- | Documentation/arch/x86/x86_64/mm.rst | 2 |
9 files changed, 238 insertions, 18 deletions
diff --git a/Documentation/arch/arm64/booting.rst b/Documentation/arch/arm64/booting.rst index dee7b6de864f..12bac2ea141c 100644 --- a/Documentation/arch/arm64/booting.rst +++ b/Documentation/arch/arm64/booting.rst @@ -234,7 +234,7 @@ Before jumping into the kernel, the following conditions must be met: - If the kernel is entered at EL1: - - ICC.SRE_EL2.Enable (bit 3) must be initialised to 0b1 + - ICC_SRE_EL2.Enable (bit 3) must be initialised to 0b1 - ICC_SRE_EL2.SRE (bit 0) must be initialised to 0b1. - The DT or ACPI tables must describe a GICv3 interrupt controller. @@ -388,6 +388,27 @@ Before jumping into the kernel, the following conditions must be met: - SMCR_EL2.EZT0 (bit 30) must be initialised to 0b1. + For CPUs with the Branch Record Buffer Extension (FEAT_BRBE): + + - If EL3 is present: + + - MDCR_EL3.SBRBE (bits 33:32) must be initialised to 0b01 or 0b11. + + - If the kernel is entered at EL1 and EL2 is present: + + - BRBCR_EL2.CC (bit 3) must be initialised to 0b1. + - BRBCR_EL2.MPRED (bit 4) must be initialised to 0b1. + + - HDFGRTR_EL2.nBRBDATA (bit 61) must be initialised to 0b1. + - HDFGRTR_EL2.nBRBCTL (bit 60) must be initialised to 0b1. + - HDFGRTR_EL2.nBRBIDR (bit 59) must be initialised to 0b1. + + - HDFGWTR_EL2.nBRBDATA (bit 61) must be initialised to 0b1. + - HDFGWTR_EL2.nBRBCTL (bit 60) must be initialised to 0b1. + + - HFGITR_EL2.nBRBIALL (bit 56) must be initialised to 0b1. + - HFGITR_EL2.nBRBINJ (bit 55) must be initialised to 0b1. + For CPUs with the Performance Monitors Extension (FEAT_PMUv3p9): - If EL3 is present: diff --git a/Documentation/arch/arm64/elf_hwcaps.rst b/Documentation/arch/arm64/elf_hwcaps.rst index 69d7afe56853..f58ada4d6cb2 100644 --- a/Documentation/arch/arm64/elf_hwcaps.rst +++ b/Documentation/arch/arm64/elf_hwcaps.rst @@ -435,6 +435,12 @@ HWCAP2_SME_SF8DP4 HWCAP2_POE Functionality implied by ID_AA64MMFR3_EL1.S1POE == 0b0001. +HWCAP3_MTE_FAR + Functionality implied by ID_AA64PFR2_EL1.MTEFAR == 0b0001. + +HWCAP3_MTE_STORE_ONLY + Functionality implied by ID_AA64PFR2_EL1.MTESTOREONLY == 0b0001. + 4. Unused AT_HWCAP bits ----------------------- diff --git a/Documentation/arch/arm64/tagged-pointers.rst b/Documentation/arch/arm64/tagged-pointers.rst index 81b6c2a770dd..f87a925ca9a5 100644 --- a/Documentation/arch/arm64/tagged-pointers.rst +++ b/Documentation/arch/arm64/tagged-pointers.rst @@ -60,11 +60,12 @@ that signal handlers in applications making use of tags cannot rely on the tag information for user virtual addresses being maintained in these fields unless the flag was set. -Due to architecture limitations, bits 63:60 of the fault address -are not preserved in response to synchronous tag check faults -(SEGV_MTESERR) even if SA_EXPOSE_TAGBITS was set. Applications should -treat the values of these bits as undefined in order to accommodate -future architecture revisions which may preserve the bits. +If FEAT_MTE_TAGGED_FAR (Armv8.9) is supported, bits 63:60 of the fault address +are preserved in response to synchronous tag check faults (SEGV_MTESERR) +otherwise not preserved even if SA_EXPOSE_TAGBITS was set. +Applications should interpret the values of these bits based on +the support for the HWCAP3_MTE_FAR. If the support is not present, +the values of these bits should be considered as undefined otherwise valid. For signals raised in response to watchpoint debug exceptions, the tag information will be preserved regardless of the SA_EXPOSE_TAGBITS diff --git a/Documentation/arch/riscv/cmodx.rst b/Documentation/arch/riscv/cmodx.rst index 8c48bcff3df9..40ba53bed5df 100644 --- a/Documentation/arch/riscv/cmodx.rst +++ b/Documentation/arch/riscv/cmodx.rst @@ -10,13 +10,45 @@ modified by the program itself. Instruction storage and the instruction cache program must enforce its own synchronization with the unprivileged fence.i instruction. -However, the default Linux ABI prohibits the use of fence.i in userspace -applications. At any point the scheduler may migrate a task onto a new hart. If -migration occurs after the userspace synchronized the icache and instruction -storage with fence.i, the icache on the new hart will no longer be clean. This -is due to the behavior of fence.i only affecting the hart that it is called on. -Thus, the hart that the task has been migrated to may not have synchronized -instruction storage and icache. +CMODX in the Kernel Space +------------------------- + +Dynamic ftrace +--------------------- + +Essentially, dynamic ftrace directs the control flow by inserting a function +call at each patchable function entry, and patches it dynamically at runtime to +enable or disable the redirection. In the case of RISC-V, 2 instructions, +AUIPC + JALR, are required to compose a function call. However, it is impossible +to patch 2 instructions and expect that a concurrent read-side executes them +without a race condition. This series makes atmoic code patching possible in +RISC-V ftrace. Kernel preemption makes things even worse as it allows the old +state to persist across the patching process with stop_machine(). + +In order to get rid of stop_machine() and run dynamic ftrace with full kernel +preemption, we partially initialize each patchable function entry at boot-time, +setting the first instruction to AUIPC, and the second to NOP. Now, atmoic +patching is possible because the kernel only has to update one instruction. +According to Ziccif, as long as an instruction is naturally aligned, the ISA +guarantee an atomic update. + +By fixing down the first instruction, AUIPC, the range of the ftrace trampoline +is limited to +-2K from the predetermined target, ftrace_caller, due to the lack +of immediate encoding space in RISC-V. To address the issue, we introduce +CALL_OPS, where an 8B naturally align metadata is added in front of each +pacthable function. The metadata is resolved at the first trampoline, then the +execution can be derect to another custom trampoline. + +CMODX in the User Space +----------------------- + +Though fence.i is an unprivileged instruction, the default Linux ABI prohibits +the use of fence.i in userspace applications. At any point the scheduler may +migrate a task onto a new hart. If migration occurs after the userspace +synchronized the icache and instruction storage with fence.i, the icache on the +new hart will no longer be clean. This is due to the behavior of fence.i only +affecting the hart that it is called on. Thus, the hart that the task has been +migrated to may not have synchronized instruction storage and icache. There are two ways to solve this problem: use the riscv_flush_icache() syscall, or use the ``PR_RISCV_SET_ICACHE_FLUSH_CTX`` prctl() and emit fence.i in diff --git a/Documentation/arch/riscv/hwprobe.rst b/Documentation/arch/riscv/hwprobe.rst index f60bf5991755..2aa9be272d5d 100644 --- a/Documentation/arch/riscv/hwprobe.rst +++ b/Documentation/arch/riscv/hwprobe.rst @@ -271,6 +271,10 @@ The following keys are defined: * :c:macro:`RISCV_HWPROBE_EXT_ZICBOM`: The Zicbom extension is supported, as ratified in commit 3dd606f ("Create cmobase-v1.0.pdf") of riscv-CMOs. + * :c:macro:`RISCV_HWPROBE_EXT_ZABHA`: The Zabha extension is supported as + ratified in commit 49f49c842ff9 ("Update to Rafified state") of + riscv-zabha. + * :c:macro:`RISCV_HWPROBE_KEY_CPUPERF_0`: Deprecated. Returns similar values to :c:macro:`RISCV_HWPROBE_KEY_MISALIGNED_SCALAR_PERF`, but the key was mistakenly classified as a bitmask rather than a value. @@ -335,3 +339,25 @@ The following keys are defined: * :c:macro:`RISCV_HWPROBE_KEY_ZICBOM_BLOCK_SIZE`: An unsigned int which represents the size of the Zicbom block in bytes. + +* :c:macro:`RISCV_HWPROBE_KEY_VENDOR_EXT_SIFIVE_0`: A bitmask containing the + sifive vendor extensions that are compatible with the + :c:macro:`RISCV_HWPROBE_BASE_BEHAVIOR_IMA`: base system behavior. + + * SIFIVE + + * :c:macro:`RISCV_HWPROBE_VENDOR_EXT_XSFVQMACCDOD`: The Xsfqmaccdod vendor + extension is supported in version 1.1 of SiFive Int8 Matrix Multiplication + Extensions Specification. + + * :c:macro:`RISCV_HWPROBE_VENDOR_EXT_XSFVQMACCQOQ`: The Xsfqmaccqoq vendor + extension is supported in version 1.1 of SiFive Int8 Matrix Multiplication + Instruction Extensions Specification. + + * :c:macro:`RISCV_HWPROBE_VENDOR_EXT_XSFVFNRCLIPXFQF`: The Xsfvfnrclipxfqf + vendor extension is supported in version 1.0 of SiFive FP32-to-int8 Ranged + Clip Instructions Extensions Specification. + + * :c:macro:`RISCV_HWPROBE_VENDOR_EXT_XSFVFWMACCQQQ`: The Xsfvfwmaccqqq + vendor extension is supported in version 1.0 of Matrix Multiply Accumulate + Instruction Extensions Specification.
\ No newline at end of file diff --git a/Documentation/arch/x86/amd-hfi.rst b/Documentation/arch/x86/amd-hfi.rst new file mode 100644 index 000000000000..bf3d3a1985a2 --- /dev/null +++ b/Documentation/arch/x86/amd-hfi.rst @@ -0,0 +1,133 @@ +.. SPDX-License-Identifier: GPL-2.0 + +====================================================================== +Hardware Feedback Interface For Hetero Core Scheduling On AMD Platform +====================================================================== + +:Copyright: 2025 Advanced Micro Devices, Inc. All Rights Reserved. + +:Author: Perry Yuan <perry.yuan@amd.com> +:Author: Mario Limonciello <mario.limonciello@amd.com> + +Overview +-------- + +AMD Heterogeneous Core implementations are comprised of more than one +architectural class and CPUs are comprised of cores of various efficiency and +power capabilities: performance-oriented *classic cores* and power-efficient +*dense cores*. As such, power management strategies must be designed to +accommodate the complexities introduced by incorporating different core types. +Heterogeneous systems can also extend to more than two architectural classes +as well. The purpose of the scheduling feedback mechanism is to provide +information to the operating system scheduler in real time such that the +scheduler can direct threads to the optimal core. + +The goal of AMD's heterogeneous architecture is to attain power benefit by +sending background threads to the dense cores while sending high priority +threads to the classic cores. From a performance perspective, sending +background threads to dense cores can free up power headroom and allow the +classic cores to optimally service demanding threads. Furthermore, the area +optimized nature of the dense cores allows for an increasing number of +physical cores. This improved core density will have positive multithreaded +performance impact. + +AMD Heterogeneous Core Driver +----------------------------- + +The ``amd_hfi`` driver delivers the operating system a performance and energy +efficiency capability data for each CPU in the system. The scheduler can use +the ranking data from the HFI driver to make task placement decisions. + +Thread Classification and Ranking Table Interaction +---------------------------------------------------- + +The thread classification is used to select into a ranking table that +describes an efficiency and performance ranking for each classification. + +Threads are classified during runtime into enumerated classes. The classes +represent thread performance/power characteristics that may benefit from +special scheduling behaviors. The below table depicts an example of thread +classification and a preference where a given thread should be scheduled +based on its thread class. The real time thread classification is consumed +by the operating system and is used to inform the scheduler of where the +thread should be placed. + +Thread Classification Example Table +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ ++----------+----------------+-------------------------------+---------------------+---------+ +| class ID | Classification | Preferred scheduling behavior | Preemption priority | Counter | ++----------+----------------+-------------------------------+---------------------+---------+ +| 0 | Default | Performant | Highest | | ++----------+----------------+-------------------------------+---------------------+---------+ +| 1 | Non-scalable | Efficient | Lowest | PMCx1A1 | ++----------+----------------+-------------------------------+---------------------+---------+ +| 2 | I/O bound | Efficient | Lowest | PMCx044 | ++----------+----------------+-------------------------------+---------------------+---------+ + +Thread classification is performed by the hardware each time that the thread is switched out. +Threads that don't meet any hardware specified criteria are classified as "default". + +AMD Hardware Feedback Interface +-------------------------------- + +The Hardware Feedback Interface provides to the operating system information +about the performance and energy efficiency of each CPU in the system. Each +capability is given as a unit-less quantity in the range [0-255]. A higher +performance value indicates higher performance capability, and a higher +efficiency value indicates more efficiency. Energy efficiency and performance +are reported in separate capabilities in the shared memory based ranking table. + +These capabilities may change at runtime as a result of changes in the +operating conditions of the system or the action of external factors. +Power Management firmware is responsible for detecting events that require +a reordering of the performance and efficiency ranking. Table updates happen +relatively infrequently and occur on the time scale of seconds or more. + +The following events trigger a table update: + * Thermal Stress Events + * Silent Compute + * Extreme Low Battery Scenarios + +The kernel or a userspace policy daemon can use these capabilities to modify +task placement decisions. For instance, if either the performance or energy +capabilities of a given logical processor becomes zero, it is an indication +that the hardware recommends to the operating system to not schedule any tasks +on that processor for performance or energy efficiency reasons, respectively. + +Implementation details for Linux +-------------------------------- + +The implementation of threads scheduling consists of the following steps: + +1. A thread is spawned and scheduled to the ideal core using the default + heterogeneous scheduling policy. +2. The processor profiles thread execution and assigns an enumerated + classification ID. + This classification is communicated to the OS via logical processor + scope MSR. +3. During the thread context switch out the operating system consumes the + workload (WL) classification which resides in a logical processor scope MSR. +4. The OS triggers the hardware to clear its history by writing to an MSR, + after consuming the WL classification and before switching in the new thread. +5. If due to the classification, ranking table, and processor availability, + the thread is not on its ideal processor, the OS will then consider + scheduling the thread on its ideal processor (if available). + +Ranking Table +------------- +The ranking table is a shared memory region that is used to communicate the +performance and energy efficiency capabilities of each CPU in the system. + +The ranking table design includes rankings for each APIC ID in the system and +rankings both for performance and efficiency for each workload classification. + +.. kernel-doc:: drivers/platform/x86/amd/hfi/hfi.c + :doc: amd_shmem_info + +Ranking Table update +--------------------------- +The power management firmware issues an platform interrupt after updating the +ranking table and is ready for the operating system to consume it. CPUs receive +such interrupt and read new ranking table from shared memory which PCCT table +has provided, then ``amd_hfi`` driver parses the new table to provide new +consume data for scheduling decisions. diff --git a/Documentation/arch/x86/index.rst b/Documentation/arch/x86/index.rst index 8ea762494bcc..f88bcfceb7f2 100644 --- a/Documentation/arch/x86/index.rst +++ b/Documentation/arch/x86/index.rst @@ -28,6 +28,7 @@ x86-specific Documentation amd-debugging amd-memory-encryption amd_hsmp + amd-hfi tdx pti mds diff --git a/Documentation/arch/x86/mds.rst b/Documentation/arch/x86/mds.rst index 5a2e6c0ef04a..3518671e1a85 100644 --- a/Documentation/arch/x86/mds.rst +++ b/Documentation/arch/x86/mds.rst @@ -93,7 +93,7 @@ enters a C-state. The kernel provides a function to invoke the buffer clearing: - mds_clear_cpu_buffers() + x86_clear_cpu_buffers() Also macro CLEAR_CPU_BUFFERS can be used in ASM late in exit-to-user path. Other than CFLAGS.ZF, this macro doesn't clobber any registers. @@ -185,9 +185,9 @@ Mitigation points idle clearing would be a window dressing exercise and is therefore not activated. - The invocation is controlled by the static key mds_idle_clear which is - switched depending on the chosen mitigation mode and the SMT state of - the system. + The invocation is controlled by the static key cpu_buf_idle_clear which is + switched depending on the chosen mitigation mode and the SMT state of the + system. The buffer clear is only invoked before entering the C-State to prevent that stale data from the idling CPU from spilling to the Hyper-Thread diff --git a/Documentation/arch/x86/x86_64/mm.rst b/Documentation/arch/x86/x86_64/mm.rst index f2db178b353f..a6cf05d51bd8 100644 --- a/Documentation/arch/x86/x86_64/mm.rst +++ b/Documentation/arch/x86/x86_64/mm.rst @@ -176,5 +176,5 @@ Be very careful vs. KASLR when changing anything here. The KASLR address range must not overlap with anything except the KASAN shadow area, which is correct as KASAN disables KASLR. -For both 4- and 5-level layouts, the STACKLEAK_POISON value in the last 2MB +For both 4- and 5-level layouts, the KSTACK_ERASE_POISON value in the last 2MB hole: ffffffffffff4111 |