diff options
Diffstat (limited to 'Documentation/arch')
54 files changed, 2024 insertions, 796 deletions
diff --git a/Documentation/arch/arm/mem_alignment.rst b/Documentation/arch/arm/mem_alignment.rst index aa22893b62bc..64bd77959300 100644 --- a/Documentation/arch/arm/mem_alignment.rst +++ b/Documentation/arch/arm/mem_alignment.rst @@ -12,7 +12,7 @@ ones. Of course this is a bad idea to rely on the alignment trap to perform unaligned memory access in general. If those access are predictable, you -are better to use the macros provided by include/asm/unaligned.h. The +are better to use the macros provided by include/linux/unaligned.h. The alignment trap can fixup misaligned access for the exception cases, but at a high performance cost. It better be rare. diff --git a/Documentation/arch/arm/stm32/stm32-dma-mdma-chaining.rst b/Documentation/arch/arm/stm32/stm32-dma-mdma-chaining.rst index 2945e0e33104..301aa30890ae 100644 --- a/Documentation/arch/arm/stm32/stm32-dma-mdma-chaining.rst +++ b/Documentation/arch/arm/stm32/stm32-dma-mdma-chaining.rst @@ -359,7 +359,7 @@ Driver updates for STM32 DMA-MDMA chaining support in foo driver descriptor you want a callback to be called at the end of the transfer (dmaengine_prep_slave_sg()) or the period (dmaengine_prep_dma_cyclic()). Depending on the direction, set the callback on the descriptor that finishes - the overal transfer: + the overall transfer: * DMA_DEV_TO_MEM: set the callback on the "MDMA" descriptor * DMA_MEM_TO_DEV: set the callback on the "DMA" descriptor @@ -371,7 +371,7 @@ Driver updates for STM32 DMA-MDMA chaining support in foo driver As STM32 MDMA channel transfer is triggered by STM32 DMA, you must issue STM32 MDMA channel before STM32 DMA channel. - If any, your callback will be called to warn you about the end of the overal + If any, your callback will be called to warn you about the end of the overall transfer or the period completion. Don't forget to terminate both channels. STM32 DMA channel is configured in diff --git a/Documentation/arch/arm64/arm-cca.rst b/Documentation/arch/arm64/arm-cca.rst new file mode 100644 index 000000000000..c48b7d4ab6bd --- /dev/null +++ b/Documentation/arch/arm64/arm-cca.rst @@ -0,0 +1,69 @@ +.. SPDX-License-Identifier: GPL-2.0 + +===================================== +Arm Confidential Compute Architecture +===================================== + +Arm systems that support the Realm Management Extension (RME) contain +hardware to allow a VM guest to be run in a way which protects the code +and data of the guest from the hypervisor. It extends the older "two +world" model (Normal and Secure World) into four worlds: Normal, Secure, +Root and Realm. Linux can then also be run as a guest to a monitor +running in the Realm world. + +The monitor running in the Realm world is known as the Realm Management +Monitor (RMM) and implements the Realm Management Monitor +specification[1]. The monitor acts a bit like a hypervisor (e.g. it runs +in EL2 and manages the stage 2 page tables etc of the guests running in +Realm world), however much of the control is handled by a hypervisor +running in the Normal World. The Normal World hypervisor uses the Realm +Management Interface (RMI) defined by the RMM specification to request +the RMM to perform operations (e.g. mapping memory or executing a vCPU). + +The RMM defines an environment for guests where the address space (IPA) +is split into two. The lower half is protected - any memory that is +mapped in this half cannot be seen by the Normal World and the RMM +restricts what operations the Normal World can perform on this memory +(e.g. the Normal World cannot replace pages in this region without the +guest's cooperation). The upper half is shared, the Normal World is free +to make changes to the pages in this region, and is able to emulate MMIO +devices in this region too. + +A guest running in a Realm may also communicate with the RMM using the +Realm Services Interface (RSI) to request changes in its environment or +to perform attestation about its environment. In particular it may +request that areas of the protected address space are transitioned +between 'RAM' and 'EMPTY' (in either direction). This allows a Realm +guest to give up memory to be returned to the Normal World, or to +request new memory from the Normal World. Without an explicit request +from the Realm guest the RMM will otherwise prevent the Normal World +from making these changes. + +Linux as a Realm Guest +---------------------- + +To run Linux as a guest within a Realm, the following must be provided +either by the VMM or by a `boot loader` run in the Realm before Linux: + + * All protected RAM described to Linux (by DT or ACPI) must be marked + RIPAS RAM before handing control over to Linux. + + * MMIO devices must be either unprotected (e.g. emulated by the Normal + World) or marked RIPAS DEV. + + * MMIO devices emulated by the Normal World and used very early in boot + (specifically earlycon) must be specified in the upper half of IPA. + For earlycon this can be done by specifying the address on the + command line, e.g. with an IPA size of 33 bits and the base address + of the emulated UART at 0x1000000: ``earlycon=uart,mmio,0x101000000`` + + * Linux will use bounce buffers for communicating with unprotected + devices. It will transition some protected memory to RIPAS EMPTY and + expect to be able to access unprotected pages at the same IPA address + but with the highest valid IPA bit set. The expectation is that the + VMM will remove the physical pages from the protected mapping and + provide those pages as unprotected pages. + +References +---------- +[1] https://developer.arm.com/documentation/den0137/ diff --git a/Documentation/arch/arm64/asymmetric-32bit.rst b/Documentation/arch/arm64/asymmetric-32bit.rst index 64a0b505da7d..1ca2b359a907 100644 --- a/Documentation/arch/arm64/asymmetric-32bit.rst +++ b/Documentation/arch/arm64/asymmetric-32bit.rst @@ -153,3 +153,11 @@ asymmetric system, a broken guest at EL1 could still attempt to execute mode will return to host userspace with an ``exit_reason`` of ``KVM_EXIT_FAIL_ENTRY`` and will remain non-runnable until successfully re-initialised by a subsequent ``KVM_ARM_VCPU_INIT`` operation. + +NOHZ FULL +--------- + +To avoid perturbing an adaptive-ticks CPU (specified using +``nohz_full=``) when a 32-bit task is forcefully migrated, these CPUs +are treated as 64-bit-only when support for asymmetric 32-bit systems +is enabled. diff --git a/Documentation/arch/arm64/booting.rst b/Documentation/arch/arm64/booting.rst index b57776a68f15..cad6fdc96b98 100644 --- a/Documentation/arch/arm64/booting.rst +++ b/Documentation/arch/arm64/booting.rst @@ -41,6 +41,9 @@ to automatically locate and size all RAM, or it may use knowledge of the RAM in the machine, or any other method the boot loader designer sees fit.) +For Arm Confidential Compute Realms this includes ensuring that all +protected RAM has a Realm IPA state (RIPAS) of "RAM". + 2. Setup the device tree ------------------------- @@ -385,6 +388,9 @@ Before jumping into the kernel, the following conditions must be met: - HCRX_EL2.MSCEn (bit 11) must be initialised to 0b1. + - HCRX_EL2.MCE2 (bit 10) must be initialised to 0b1 and the hypervisor + must handle MOPS exceptions as described in :ref:`arm64_mops_hyp`. + For CPUs with the Extended Translation Control Register feature (FEAT_TCR2): - If EL3 is present: @@ -411,6 +417,50 @@ Before jumping into the kernel, the following conditions must be met: - HFGRWR_EL2.nPIRE0_EL1 (bit 57) must be initialised to 0b1. + - For CPUs with Guarded Control Stacks (FEAT_GCS): + + - GCSCR_EL1 must be initialised to 0. + + - GCSCRE0_EL1 must be initialised to 0. + + - If EL3 is present: + + - SCR_EL3.GCSEn (bit 39) must be initialised to 0b1. + + - If EL2 is present: + + - GCSCR_EL2 must be initialised to 0. + + - If the kernel is entered at EL1 and EL2 is present: + + - HCRX_EL2.GCSEn must be initialised to 0b1. + + - HFGITR_EL2.nGCSEPP (bit 59) must be initialised to 0b1. + + - HFGITR_EL2.nGCSSTR_EL1 (bit 58) must be initialised to 0b1. + + - HFGITR_EL2.nGCSPUSHM_EL1 (bit 57) must be initialised to 0b1. + + - HFGRTR_EL2.nGCS_EL1 (bit 53) must be initialised to 0b1. + + - HFGRTR_EL2.nGCS_EL0 (bit 52) must be initialised to 0b1. + + - HFGWTR_EL2.nGCS_EL1 (bit 53) must be initialised to 0b1. + + - HFGWTR_EL2.nGCS_EL0 (bit 52) must be initialised to 0b1. + + - For CPUs with debug architecture i.e FEAT_Debugv8pN (all versions): + + - If EL3 is present: + + - MDCR_EL3.TDA (bit 9) must be initialized to 0b0 + + - For CPUs with FEAT_PMUv3: + + - If EL3 is present: + + - MDCR_EL3.TPM (bit 6) must be initialized to 0b0 + The requirements described above for CPU mode, caches, MMUs, architected timers, coherency and system registers apply to all CPUs. All CPUs must enter the kernel in the same exception level. Where the values documented diff --git a/Documentation/arch/arm64/cpu-feature-registers.rst b/Documentation/arch/arm64/cpu-feature-registers.rst index 44f9bd78539d..253e9743de2f 100644 --- a/Documentation/arch/arm64/cpu-feature-registers.rst +++ b/Documentation/arch/arm64/cpu-feature-registers.rst @@ -152,6 +152,8 @@ infrastructure: +------------------------------+---------+---------+ | DIT | [51-48] | y | +------------------------------+---------+---------+ + | MPAM | [43-40] | n | + +------------------------------+---------+---------+ | SVE | [35-32] | y | +------------------------------+---------+---------+ | GIC | [27-24] | n | diff --git a/Documentation/arch/arm64/cpu-hotplug.rst b/Documentation/arch/arm64/cpu-hotplug.rst new file mode 100644 index 000000000000..8fb438bf7781 --- /dev/null +++ b/Documentation/arch/arm64/cpu-hotplug.rst @@ -0,0 +1,79 @@ +.. SPDX-License-Identifier: GPL-2.0 +.. _cpuhp_index: + +==================== +CPU Hotplug and ACPI +==================== + +CPU hotplug in the arm64 world is commonly used to describe the kernel taking +CPUs online/offline using PSCI. This document is about ACPI firmware allowing +CPUs that were not available during boot to be added to the system later. + +``possible`` and ``present`` refer to the state of the CPU as seen by linux. + + +CPU Hotplug on physical systems - CPUs not present at boot +---------------------------------------------------------- + +Physical systems need to mark a CPU that is ``possible`` but not ``present`` as +being ``present``. An example would be a dual socket machine, where the package +in one of the sockets can be replaced while the system is running. + +This is not supported. + +In the arm64 world CPUs are not a single device but a slice of the system. +There are no systems that support the physical addition (or removal) of CPUs +while the system is running, and ACPI is not able to sufficiently describe +them. + +e.g. New CPUs come with new caches, but the platform's cache topology is +described in a static table, the PPTT. How caches are shared between CPUs is +not discoverable, and must be described by firmware. + +e.g. The GIC redistributor for each CPU must be accessed by the driver during +boot to discover the system wide supported features. ACPI's MADT GICC +structures can describe a redistributor associated with a disabled CPU, but +can't describe whether the redistributor is accessible, only that it is not +'always on'. + +arm64's ACPI tables assume that everything described is ``present``. + + +CPU Hotplug on virtual systems - CPUs not enabled at boot +--------------------------------------------------------- + +Virtual systems have the advantage that all the properties the system will +ever have can be described at boot. There are no power-domain considerations +as such devices are emulated. + +CPU Hotplug on virtual systems is supported. It is distinct from physical +CPU Hotplug as all resources are described as ``present``, but CPUs may be +marked as disabled by firmware. Only the CPU's online/offline behaviour is +influenced by firmware. An example is where a virtual machine boots with a +single CPU, and additional CPUs are added once a cloud orchestrator deploys +the workload. + +For a virtual machine, the VMM (e.g. Qemu) plays the part of firmware. + +Virtual hotplug is implemented as a firmware policy affecting which CPUs can be +brought online. Firmware can enforce its policy via PSCI's return codes. e.g. +``DENIED``. + +The ACPI tables must describe all the resources of the virtual machine. CPUs +that firmware wishes to disable either from boot (or later) should not be +``enabled`` in the MADT GICC structures, but should have the ``online capable`` +bit set, to indicate they can be enabled later. The boot CPU must be marked as +``enabled``. The 'always on' GICR structure must be used to describe the +redistributors. + +CPUs described as ``online capable`` but not ``enabled`` can be set to enabled +by the DSDT's Processor object's _STA method. On virtual systems the _STA method +must always report the CPU as ``present``. Changes to the firmware policy can +be notified to the OS via device-check or eject-request. + +CPUs described as ``enabled`` in the static table, should not have their _STA +modified dynamically by firmware. Soft-restart features such as kexec will +re-read the static properties of the system from these static tables, and +may malfunction if these no longer describe the running system. Linux will +re-discover the dynamic properties of the system from the _STA method later +during boot. diff --git a/Documentation/arch/arm64/elf_hwcaps.rst b/Documentation/arch/arm64/elf_hwcaps.rst index ced7b335e2e0..69d7afe56853 100644 --- a/Documentation/arch/arm64/elf_hwcaps.rst +++ b/Documentation/arch/arm64/elf_hwcaps.rst @@ -16,9 +16,9 @@ architected discovery mechanism available to userspace code at EL0. The kernel exposes the presence of these features to userspace through a set of flags called hwcaps, exposed in the auxiliary vector. -Userspace software can test for features by acquiring the AT_HWCAP or -AT_HWCAP2 entry of the auxiliary vector, and testing whether the relevant -flags are set, e.g.:: +Userspace software can test for features by acquiring the AT_HWCAP, +AT_HWCAP2 or AT_HWCAP3 entry of the auxiliary vector, and testing +whether the relevant flags are set, e.g.:: bool floating_point_is_present(void) { @@ -170,26 +170,86 @@ HWCAP_PACG ID_AA64ISAR1_EL1.GPI == 0b0001, as described by Documentation/arch/arm64/pointer-authentication.rst. +HWCAP_GCS + Functionality implied by ID_AA64PFR1_EL1.GCS == 0b1, as + described by Documentation/arch/arm64/gcs.rst. + +HWCAP_CMPBR + Functionality implied by ID_AA64ISAR2_EL1.CSSC == 0b0010. + +HWCAP_FPRCVT + Functionality implied by ID_AA64ISAR3_EL1.FPRCVT == 0b0001. + +HWCAP_F8MM8 + Functionality implied by ID_AA64FPFR0_EL1.F8MM8 == 0b0001. + +HWCAP_F8MM4 + Functionality implied by ID_AA64FPFR0_EL1.F8MM4 == 0b0001. + +HWCAP_SVE_F16MM + Functionality implied by ID_AA64PFR0_EL1.SVE == 0b0001 and + ID_AA64ZFR0_EL1.F16MM == 0b0001. + +HWCAP_SVE_ELTPERM + Functionality implied by ID_AA64PFR0_EL1.SVE == 0b0001 and + ID_AA64ZFR0_EL1.ELTPERM == 0b0001. + +HWCAP_SVE_AES2 + Functionality implied by ID_AA64PFR0_EL1.SVE == 0b0001 and + ID_AA64ZFR0_EL1.AES == 0b0011. + +HWCAP_SVE_BFSCALE + Functionality implied by ID_AA64PFR0_EL1.SVE == 0b0001 and + ID_AA64ZFR0_EL1.B16B16 == 0b0010. + +HWCAP_SVE2P2 + Functionality implied by ID_AA64PFR0_EL1.SVE == 0b0001 and + ID_AA64ZFR0_EL1.SVEver == 0b0011. + +HWCAP_SME2P2 + Functionality implied by ID_AA64SMFR0_EL1.SMEver == 0b0011. + +HWCAP_SME_SBITPERM + Functionality implied by ID_AA64SMFR0_EL1.SBitPerm == 0b1. + +HWCAP_SME_AES + Functionality implied by ID_AA64SMFR0_EL1.AES == 0b1. + +HWCAP_SME_SFEXPA + Functionality implied by ID_AA64SMFR0_EL1.SFEXPA == 0b1. + +HWCAP_SME_STMOP + Functionality implied by ID_AA64SMFR0_EL1.STMOP == 0b1. + +HWCAP_SME_SMOP4 + Functionality implied by ID_AA64SMFR0_EL1.SMOP4 == 0b1. + HWCAP2_DCPODP Functionality implied by ID_AA64ISAR1_EL1.DPB == 0b0010. HWCAP2_SVE2 - Functionality implied by ID_AA64ZFR0_EL1.SVEver == 0b0001. + Functionality implied by ID_AA64PFR0_EL1.SVE == 0b0001 and + ID_AA64ZFR0_EL1.SVEver == 0b0001. HWCAP2_SVEAES - Functionality implied by ID_AA64ZFR0_EL1.AES == 0b0001. + Functionality implied by ID_AA64PFR0_EL1.SVE == 0b0001 and + ID_AA64ZFR0_EL1.AES == 0b0001. HWCAP2_SVEPMULL - Functionality implied by ID_AA64ZFR0_EL1.AES == 0b0010. + Functionality implied by ID_AA64PFR0_EL1.SVE == 0b0001 and + ID_AA64ZFR0_EL1.AES == 0b0010. HWCAP2_SVEBITPERM - Functionality implied by ID_AA64ZFR0_EL1.BitPerm == 0b0001. + Functionality implied by ID_AA64PFR0_EL1.SVE == 0b0001 and + ID_AA64ZFR0_EL1.BitPerm == 0b0001. HWCAP2_SVESHA3 - Functionality implied by ID_AA64ZFR0_EL1.SHA3 == 0b0001. + Functionality implied by ID_AA64PFR0_EL1.SVE == 0b0001 and + ID_AA64ZFR0_EL1.SHA3 == 0b0001. HWCAP2_SVESM4 - Functionality implied by ID_AA64ZFR0_EL1.SM4 == 0b0001. + Functionality implied by ID_AA64PFR0_EL1.SVE == 0b0001 and + ID_AA64ZFR0_EL1.SM4 == 0b0001. HWCAP2_FLAGM2 Functionality implied by ID_AA64ISAR0_EL1.TS == 0b0010. @@ -198,16 +258,20 @@ HWCAP2_FRINT Functionality implied by ID_AA64ISAR1_EL1.FRINTTS == 0b0001. HWCAP2_SVEI8MM - Functionality implied by ID_AA64ZFR0_EL1.I8MM == 0b0001. + Functionality implied by ID_AA64PFR0_EL1.SVE == 0b0001 and + ID_AA64ZFR0_EL1.I8MM == 0b0001. HWCAP2_SVEF32MM - Functionality implied by ID_AA64ZFR0_EL1.F32MM == 0b0001. + Functionality implied by ID_AA64PFR0_EL1.SVE == 0b0001 and + ID_AA64ZFR0_EL1.F32MM == 0b0001. HWCAP2_SVEF64MM - Functionality implied by ID_AA64ZFR0_EL1.F64MM == 0b0001. + Functionality implied by ID_AA64PFR0_EL1.SVE == 0b0001 and + ID_AA64ZFR0_EL1.F64MM == 0b0001. HWCAP2_SVEBF16 - Functionality implied by ID_AA64ZFR0_EL1.BF16 == 0b0001. + Functionality implied by ID_AA64PFR0_EL1.SVE == 0b0001 and + ID_AA64ZFR0_EL1.BF16 == 0b0001. HWCAP2_I8MM Functionality implied by ID_AA64ISAR1_EL1.I8MM == 0b0001. @@ -273,7 +337,8 @@ HWCAP2_EBF16 Functionality implied by ID_AA64ISAR1_EL1.BF16 == 0b0010. HWCAP2_SVE_EBF16 - Functionality implied by ID_AA64ZFR0_EL1.BF16 == 0b0010. + Functionality implied by ID_AA64PFR0_EL1.SVE == 0b0001 and + ID_AA64ZFR0_EL1.BF16 == 0b0010. HWCAP2_CSSC Functionality implied by ID_AA64ISAR2_EL1.CSSC == 0b0001. @@ -282,7 +347,8 @@ HWCAP2_RPRFM Functionality implied by ID_AA64ISAR2_EL1.RPRFM == 0b0001. HWCAP2_SVE2P1 - Functionality implied by ID_AA64ZFR0_EL1.SVEver == 0b0010. + Functionality implied by ID_AA64PFR0_EL1.SVE == 0b0001 and + ID_AA64ZFR0_EL1.SVEver == 0b0010. HWCAP2_SME2 Functionality implied by ID_AA64SMFR0_EL1.SMEver == 0b0001. @@ -309,7 +375,8 @@ HWCAP2_HBC Functionality implied by ID_AA64ISAR2_EL1.BC == 0b0001. HWCAP2_SVE_B16B16 - Functionality implied by ID_AA64ZFR0_EL1.B16B16 == 0b0001. + Functionality implied by ID_AA64PFR0_EL1.SVE == 0b0001 and + ID_AA64ZFR0_EL1.B16B16 == 0b0001. HWCAP2_LRCPC3 Functionality implied by ID_AA64ISAR1_EL1.LRCPC == 0b0011. @@ -317,6 +384,57 @@ HWCAP2_LRCPC3 HWCAP2_LSE128 Functionality implied by ID_AA64ISAR0_EL1.Atomic == 0b0011. +HWCAP2_FPMR + Functionality implied by ID_AA64PFR2_EL1.FMR == 0b0001. + +HWCAP2_LUT + Functionality implied by ID_AA64ISAR2_EL1.LUT == 0b0001. + +HWCAP2_FAMINMAX + Functionality implied by ID_AA64ISAR3_EL1.FAMINMAX == 0b0001. + +HWCAP2_F8CVT + Functionality implied by ID_AA64FPFR0_EL1.F8CVT == 0b1. + +HWCAP2_F8FMA + Functionality implied by ID_AA64FPFR0_EL1.F8FMA == 0b1. + +HWCAP2_F8DP4 + Functionality implied by ID_AA64FPFR0_EL1.F8DP4 == 0b1. + +HWCAP2_F8DP2 + Functionality implied by ID_AA64FPFR0_EL1.F8DP2 == 0b1. + +HWCAP2_F8E4M3 + Functionality implied by ID_AA64FPFR0_EL1.F8E4M3 == 0b1. + +HWCAP2_F8E5M2 + Functionality implied by ID_AA64FPFR0_EL1.F8E5M2 == 0b1. + +HWCAP2_SME_LUTV2 + Functionality implied by ID_AA64SMFR0_EL1.LUTv2 == 0b1. + +HWCAP2_SME_F8F16 + Functionality implied by ID_AA64SMFR0_EL1.F8F16 == 0b1. + +HWCAP2_SME_F8F32 + Functionality implied by ID_AA64SMFR0_EL1.F8F32 == 0b1. + +HWCAP2_SME_SF8FMA + Functionality implied by ID_AA64SMFR0_EL1.SF8FMA == 0b1. + +HWCAP2_SME_SF8DP4 + Functionality implied by ID_AA64SMFR0_EL1.SF8DP4 == 0b1. + +HWCAP2_SME_SF8DP2 + Functionality implied by ID_AA64SMFR0_EL1.SF8DP2 == 0b1. + +HWCAP2_SME_SF8DP4 + Functionality implied by ID_AA64SMFR0_EL1.SF8DP4 == 0b1. + +HWCAP2_POE + Functionality implied by ID_AA64MMFR3_EL1.S1POE == 0b0001. + 4. Unused AT_HWCAP bits ----------------------- diff --git a/Documentation/arch/arm64/gcs.rst b/Documentation/arch/arm64/gcs.rst new file mode 100644 index 000000000000..226c0b008456 --- /dev/null +++ b/Documentation/arch/arm64/gcs.rst @@ -0,0 +1,227 @@ +=============================================== +Guarded Control Stack support for AArch64 Linux +=============================================== + +This document outlines briefly the interface provided to userspace by Linux in +order to support use of the ARM Guarded Control Stack (GCS) feature. + +This is an outline of the most important features and issues only and not +intended to be exhaustive. + + + +1. General +----------- + +* GCS is an architecture feature intended to provide greater protection + against return oriented programming (ROP) attacks and to simplify the + implementation of features that need to collect stack traces such as + profiling. + +* When GCS is enabled a separate guarded control stack is maintained by the + PE which is writeable only through specific GCS operations. This + stores the call stack only, when a procedure call instruction is + performed the current PC is pushed onto the GCS and on RET the + address in the LR is verified against that on the top of the GCS. + +* When active the current GCS pointer is stored in the system register + GCSPR_EL0. This is readable by userspace but can only be updated + via specific GCS instructions. + +* The architecture provides instructions for switching between guarded + control stacks with checks to ensure that the new stack is a valid + target for switching. + +* The functionality of GCS is similar to that provided by the x86 Shadow + Stack feature, due to sharing of userspace interfaces the ABI refers to + shadow stacks rather than GCS. + +* Support for GCS is reported to userspace via HWCAP_GCS in the aux vector + AT_HWCAP entry. + +* GCS is enabled per thread. While there is support for disabling GCS + at runtime this should be done with great care. + +* GCS memory access faults are reported as normal memory access faults. + +* GCS specific errors (those reported with EC 0x2d) will be reported as + SIGSEGV with a si_code of SEGV_CPERR (control protection error). + +* GCS is supported only for AArch64. + +* On systems where GCS is supported GCSPR_EL0 is always readable by EL0 + regardless of the GCS configuration for the thread. + +* The architecture supports enabling GCS without verifying that return values + in LR match those in the GCS, the LR will be ignored. This is not supported + by Linux. + + + +2. Enabling and disabling Guarded Control Stacks +------------------------------------------------- + +* GCS is enabled and disabled for a thread via the PR_SET_SHADOW_STACK_STATUS + prctl(), this takes a single flags argument specifying which GCS features + should be used. + +* When set PR_SHADOW_STACK_ENABLE flag allocates a Guarded Control Stack + and enables GCS for the thread, enabling the functionality controlled by + GCSCRE0_EL1.{nTR, RVCHKEN, PCRSEL}. + +* When set the PR_SHADOW_STACK_PUSH flag enables the functionality controlled + by GCSCRE0_EL1.PUSHMEn, allowing explicit GCS pushes. + +* When set the PR_SHADOW_STACK_WRITE flag enables the functionality controlled + by GCSCRE0_EL1.STREn, allowing explicit stores to the Guarded Control Stack. + +* Any unknown flags will cause PR_SET_SHADOW_STACK_STATUS to return -EINVAL. + +* PR_LOCK_SHADOW_STACK_STATUS is passed a bitmask of features with the same + values as used for PR_SET_SHADOW_STACK_STATUS. Any future changes to the + status of the specified GCS mode bits will be rejected. + +* PR_LOCK_SHADOW_STACK_STATUS allows any bit to be locked, this allows + userspace to prevent changes to any future features. + +* There is no support for a process to remove a lock that has been set for + it. + +* PR_SET_SHADOW_STACK_STATUS and PR_LOCK_SHADOW_STACK_STATUS affect only the + thread that called them, any other running threads will be unaffected. + +* New threads inherit the GCS configuration of the thread that created them. + +* GCS is disabled on exec(). + +* The current GCS configuration for a thread may be read with the + PR_GET_SHADOW_STACK_STATUS prctl(), this returns the same flags that + are passed to PR_SET_SHADOW_STACK_STATUS. + +* If GCS is disabled for a thread after having previously been enabled then + the stack will remain allocated for the lifetime of the thread. At present + any attempt to reenable GCS for the thread will be rejected, this may be + revisited in future. + +* It should be noted that since enabling GCS will result in GCS becoming + active immediately it is not normally possible to return from the function + that invoked the prctl() that enabled GCS. It is expected that the normal + usage will be that GCS is enabled very early in execution of a program. + + + +3. Allocation of Guarded Control Stacks +---------------------------------------- + +* When GCS is enabled for a thread a new Guarded Control Stack will be + allocated for it of half the standard stack size or 2 gigabytes, + whichever is smaller. + +* When a new thread is created by a thread which has GCS enabled then a + new Guarded Control Stack will be allocated for the new thread with + half the size of the standard stack. + +* When a stack is allocated by enabling GCS or during thread creation then + the top 8 bytes of the stack will be initialised to 0 and GCSPR_EL0 will + be set to point to the address of this 0 value, this can be used to + detect the top of the stack. + +* Additional Guarded Control Stacks can be allocated using the + map_shadow_stack() system call. + +* Stacks allocated using map_shadow_stack() can optionally have an end of + stack marker and cap placed at the top of the stack. If the flag + SHADOW_STACK_SET_TOKEN is specified a cap will be placed on the stack, + if SHADOW_STACK_SET_MARKER is not specified the cap will be the top 8 + bytes of the stack and if it is specified then the cap will be the next + 8 bytes. While specifying just SHADOW_STACK_SET_MARKER by itself is + valid since the marker is all bits 0 it has no observable effect. + +* Stacks allocated using map_shadow_stack() must have a size which is a + multiple of 8 bytes larger than 8 bytes and must be 8 bytes aligned. + +* An address can be specified to map_shadow_stack(), if one is provided then + it must be aligned to a page boundary. + +* When a thread is freed the Guarded Control Stack initially allocated for + that thread will be freed. Note carefully that if the stack has been + switched this may not be the stack currently in use by the thread. + + +4. Signal handling +-------------------- + +* A new signal frame record gcs_context encodes the current GCS mode and + pointer for the interrupted context on signal delivery. This will always + be present on systems that support GCS. + +* The record contains a flag field which reports the current GCS configuration + for the interrupted context as PR_GET_SHADOW_STACK_STATUS would. + +* The signal handler is run with the same GCS configuration as the interrupted + context. + +* When GCS is enabled for the interrupted thread a signal handling specific + GCS cap token will be written to the GCS, this is an architectural GCS cap + with the token type (bits 0..11) all clear. The GCSPR_EL0 reported in the + signal frame will point to this cap token. + +* The signal handler will use the same GCS as the interrupted context. + +* When GCS is enabled on signal entry a frame with the address of the signal + return handler will be pushed onto the GCS, allowing return from the signal + handler via RET as normal. This will not be reported in the gcs_context in + the signal frame. + + +5. Signal return +----------------- + +When returning from a signal handler: + +* If there is a gcs_context record in the signal frame then the GCS flags + and GCSPR_EL0 will be restored from that context prior to further + validation. + +* If there is no gcs_context record in the signal frame then the GCS + configuration will be unchanged. + +* If GCS is enabled on return from a signal handler then GCSPR_EL0 must + point to a valid GCS signal cap record, this will be popped from the + GCS prior to signal return. + +* If the GCS configuration is locked when returning from a signal then any + attempt to change the GCS configuration will be treated as an error. This + is true even if GCS was not enabled prior to signal entry. + +* GCS may be disabled via signal return but any attempt to enable GCS via + signal return will be rejected. + + +6. ptrace extensions +--------------------- + +* A new regset NT_ARM_GCS is defined for use with PTRACE_GETREGSET and + PTRACE_SETREGSET. + +* The GCS mode, including enable and disable, may be configured via ptrace. + If GCS is enabled via ptrace no new GCS will be allocated for the thread. + +* Configuration via ptrace ignores locking of GCS mode bits. + + +7. ELF coredump extensions +--------------------------- + +* NT_ARM_GCS notes will be added to each coredump for each thread of the + dumped process. The contents will be equivalent to the data that would + have been read if a PTRACE_GETREGSET of the corresponding type were + executed for each thread when the coredump was generated. + + + +8. /proc extensions +-------------------- + +* Guarded Control Stack pages will include "ss" in their VmFlags in + /proc/<pid>/smaps. diff --git a/Documentation/arch/arm64/index.rst b/Documentation/arch/arm64/index.rst index d08e924204bf..6a012c98bdcd 100644 --- a/Documentation/arch/arm64/index.rst +++ b/Documentation/arch/arm64/index.rst @@ -10,15 +10,19 @@ ARM64 Architecture acpi_object_usage amu arm-acpi + arm-cca asymmetric-32bit booting cpu-feature-registers + cpu-hotplug elf_hwcaps + gcs hugetlbpage kdump legacy_instructions memory memory-tagging-extension + mops perf pointer-authentication ptdump diff --git a/Documentation/arch/arm64/memory.rst b/Documentation/arch/arm64/memory.rst index 55a55f30eed8..678fbb418c3a 100644 --- a/Documentation/arch/arm64/memory.rst +++ b/Documentation/arch/arm64/memory.rst @@ -18,77 +18,10 @@ ARMv8.2 adds optional support for Large Virtual Address space. This is only available when running with a 64KB page size and expands the number of descriptors in the first level of translation. -User addresses have bits 63:48 set to 0 while the kernel addresses have -the same bits set to 1. TTBRx selection is given by bit 63 of the -virtual address. The swapper_pg_dir contains only kernel (global) -mappings while the user pgd contains only user (non-global) mappings. -The swapper_pg_dir address is written to TTBR1 and never written to -TTBR0. - - -AArch64 Linux memory layout with 4KB pages + 4 levels (48-bit):: - - Start End Size Use - ----------------------------------------------------------------------- - 0000000000000000 0000ffffffffffff 256TB user - ffff000000000000 ffff7fffffffffff 128TB kernel logical memory map - [ffff600000000000 ffff7fffffffffff] 32TB [kasan shadow region] - ffff800000000000 ffff80007fffffff 2GB modules - ffff800080000000 fffffbffefffffff 124TB vmalloc - fffffbfff0000000 fffffbfffdffffff 224MB fixed mappings (top down) - fffffbfffe000000 fffffbfffe7fffff 8MB [guard region] - fffffbfffe800000 fffffbffff7fffff 16MB PCI I/O space - fffffbffff800000 fffffbffffffffff 8MB [guard region] - fffffc0000000000 fffffdffffffffff 2TB vmemmap - fffffe0000000000 ffffffffffffffff 2TB [guard region] - - -AArch64 Linux memory layout with 64KB pages + 3 levels (52-bit with HW support):: - - Start End Size Use - ----------------------------------------------------------------------- - 0000000000000000 000fffffffffffff 4PB user - fff0000000000000 ffff7fffffffffff ~4PB kernel logical memory map - [fffd800000000000 ffff7fffffffffff] 512TB [kasan shadow region] - ffff800000000000 ffff80007fffffff 2GB modules - ffff800080000000 fffffbffefffffff 124TB vmalloc - fffffbfff0000000 fffffbfffdffffff 224MB fixed mappings (top down) - fffffbfffe000000 fffffbfffe7fffff 8MB [guard region] - fffffbfffe800000 fffffbffff7fffff 16MB PCI I/O space - fffffbffff800000 fffffbffffffffff 8MB [guard region] - fffffc0000000000 ffffffdfffffffff ~4TB vmemmap - ffffffe000000000 ffffffffffffffff 128GB [guard region] - - -Translation table lookup with 4KB pages:: - - +--------+--------+--------+--------+--------+--------+--------+--------+ - |63 56|55 48|47 40|39 32|31 24|23 16|15 8|7 0| - +--------+--------+--------+--------+--------+--------+--------+--------+ - | | | | | | - | | | | | v - | | | | | [11:0] in-page offset - | | | | +-> [20:12] L3 index - | | | +-----------> [29:21] L2 index - | | +---------------------> [38:30] L1 index - | +-------------------------------> [47:39] L0 index - +-------------------------------------------------> [63] TTBR0/1 - - -Translation table lookup with 64KB pages:: - - +--------+--------+--------+--------+--------+--------+--------+--------+ - |63 56|55 48|47 40|39 32|31 24|23 16|15 8|7 0| - +--------+--------+--------+--------+--------+--------+--------+--------+ - | | | | | - | | | | v - | | | | [15:0] in-page offset - | | | +----------> [28:16] L3 index - | | +--------------------------> [41:29] L2 index - | +-------------------------------> [47:42] L1 index (48-bit) - | [51:42] L1 index (52-bit) - +-------------------------------------------------> [63] TTBR0/1 - +TTBRx selection is given by bit 55 of the virtual address. The +swapper_pg_dir contains only kernel (global) mappings while the user pgd +contains only user (non-global) mappings. The swapper_pg_dir address is +written to TTBR1 and never written to TTBR0. When using KVM without the Virtualization Host Extensions, the hypervisor maps kernel pages in EL2 at a fixed (and potentially diff --git a/Documentation/arch/arm64/mops.rst b/Documentation/arch/arm64/mops.rst new file mode 100644 index 000000000000..2ef5b147f8dc --- /dev/null +++ b/Documentation/arch/arm64/mops.rst @@ -0,0 +1,44 @@ +.. SPDX-License-Identifier: GPL-2.0 + +=================================== +Memory copy/set instructions (MOPS) +=================================== + +A MOPS memory copy/set operation consists of three consecutive CPY* or SET* +instructions: a prologue, main and epilogue (for example: CPYP, CPYM, CPYE). + +A main or epilogue instruction can take a MOPS exception for various reasons, +for example when a task is migrated to a CPU with a different MOPS +implementation, or when the instruction's alignment and size requirements are +not met. The software exception handler is then expected to reset the registers +and restart execution from the prologue instruction. Normally this is handled +by the kernel. + +For more details refer to "D1.3.5.7 Memory Copy and Memory Set exceptions" in +the Arm Architecture Reference Manual DDI 0487K.a (Arm ARM). + +.. _arm64_mops_hyp: + +Hypervisor requirements +----------------------- + +A hypervisor running a Linux guest must handle all MOPS exceptions from the +guest kernel, as Linux may not be able to handle the exception at all times. +For example, a MOPS exception can be taken when the hypervisor migrates a vCPU +to another physical CPU with a different MOPS implementation. + +To do this, the hypervisor must: + + - Set HCRX_EL2.MCE2 to 1 so that the exception is taken to the hypervisor. + + - Have an exception handler that implements the algorithm from the Arm ARM + rules CNTMJ and MWFQH. + + - Set the guest's PSTATE.SS to 0 in the exception handler, to handle a + potential step of the current instruction. + + Note: Clearing PSTATE.SS is needed so that a single step exception is taken + on the next instruction (the prologue instruction). Otherwise prologue + would get silently stepped over and the single step exception taken on the + main instruction. Note that if the guest instruction is not being stepped + then clearing PSTATE.SS has no effect. diff --git a/Documentation/arch/arm64/silicon-errata.rst b/Documentation/arch/arm64/silicon-errata.rst index 45a7f4932fe0..f074f6219f5c 100644 --- a/Documentation/arch/arm64/silicon-errata.rst +++ b/Documentation/arch/arm64/silicon-errata.rst @@ -35,8 +35,9 @@ can be triggered by Linux). For software workarounds that may adversely impact systems unaffected by the erratum in question, a Kconfig entry is added under "Kernel Features" -> "ARM errata workarounds via the alternatives framework". -These are enabled by default and patched in at runtime when an affected -CPU is detected. For less-intrusive workarounds, a Kconfig option is not +With the exception of workarounds for errata deemed "rare" by Arm, these +are enabled by default and patched in at runtime when an affected CPU is +detected. For less-intrusive workarounds, a Kconfig option is not available and the code is structured (preferably with a comment) in such a way that the erratum will not be hit. @@ -54,6 +55,8 @@ stable kernels. +----------------+-----------------+-----------------+-----------------------------+ | Ampere | AmpereOne | AC03_CPU_38 | AMPERE_ERRATUM_AC03_CPU_38 | +----------------+-----------------+-----------------+-----------------------------+ +| Ampere | AmpereOne AC04 | AC04_CPU_10 | AMPERE_ERRATUM_AC03_CPU_38 | ++----------------+-----------------+-----------------+-----------------------------+ +----------------+-----------------+-----------------+-----------------------------+ | ARM | Cortex-A510 | #2457168 | ARM64_ERRATUM_2457168 | +----------------+-----------------+-----------------+-----------------------------+ @@ -121,24 +124,52 @@ stable kernels. +----------------+-----------------+-----------------+-----------------------------+ | ARM | Cortex-A76 | #1490853 | N/A | +----------------+-----------------+-----------------+-----------------------------+ +| ARM | Cortex-A76 | #3324349 | ARM64_ERRATUM_3194386 | ++----------------+-----------------+-----------------+-----------------------------+ | ARM | Cortex-A77 | #1491015 | N/A | +----------------+-----------------+-----------------+-----------------------------+ | ARM | Cortex-A77 | #1508412 | ARM64_ERRATUM_1508412 | +----------------+-----------------+-----------------+-----------------------------+ +| ARM | Cortex-A77 | #3324348 | ARM64_ERRATUM_3194386 | ++----------------+-----------------+-----------------+-----------------------------+ +| ARM | Cortex-A78 | #3324344 | ARM64_ERRATUM_3194386 | ++----------------+-----------------+-----------------+-----------------------------+ +| ARM | Cortex-A78C | #3324346,3324347| ARM64_ERRATUM_3194386 | ++----------------+-----------------+-----------------+-----------------------------+ | ARM | Cortex-A710 | #2119858 | ARM64_ERRATUM_2119858 | +----------------+-----------------+-----------------+-----------------------------+ | ARM | Cortex-A710 | #2054223 | ARM64_ERRATUM_2054223 | +----------------+-----------------+-----------------+-----------------------------+ | ARM | Cortex-A710 | #2224489 | ARM64_ERRATUM_2224489 | +----------------+-----------------+-----------------+-----------------------------+ +| ARM | Cortex-A710 | #3324338 | ARM64_ERRATUM_3194386 | ++----------------+-----------------+-----------------+-----------------------------+ | ARM | Cortex-A715 | #2645198 | ARM64_ERRATUM_2645198 | +----------------+-----------------+-----------------+-----------------------------+ +| ARM | Cortex-A715 | #3456084 | ARM64_ERRATUM_3194386 | ++----------------+-----------------+-----------------+-----------------------------+ +| ARM | Cortex-A720 | #3456091 | ARM64_ERRATUM_3194386 | ++----------------+-----------------+-----------------+-----------------------------+ +| ARM | Cortex-A725 | #3456106 | ARM64_ERRATUM_3194386 | ++----------------+-----------------+-----------------+-----------------------------+ | ARM | Cortex-X1 | #1502854 | N/A | +----------------+-----------------+-----------------+-----------------------------+ +| ARM | Cortex-X1 | #3324344 | ARM64_ERRATUM_3194386 | ++----------------+-----------------+-----------------+-----------------------------+ +| ARM | Cortex-X1C | #3324346 | ARM64_ERRATUM_3194386 | ++----------------+-----------------+-----------------+-----------------------------+ | ARM | Cortex-X2 | #2119858 | ARM64_ERRATUM_2119858 | +----------------+-----------------+-----------------+-----------------------------+ | ARM | Cortex-X2 | #2224489 | ARM64_ERRATUM_2224489 | +----------------+-----------------+-----------------+-----------------------------+ +| ARM | Cortex-X2 | #3324338 | ARM64_ERRATUM_3194386 | ++----------------+-----------------+-----------------+-----------------------------+ +| ARM | Cortex-X3 | #3324335 | ARM64_ERRATUM_3194386 | ++----------------+-----------------+-----------------+-----------------------------+ +| ARM | Cortex-X4 | #3194386 | ARM64_ERRATUM_3194386 | ++----------------+-----------------+-----------------+-----------------------------+ +| ARM | Cortex-X925 | #3324334 | ARM64_ERRATUM_3194386 | ++----------------+-----------------+-----------------+-----------------------------+ | ARM | Neoverse-N1 | #1188873,1418040| ARM64_ERRATUM_1418040 | +----------------+-----------------+-----------------+-----------------------------+ | ARM | Neoverse-N1 | #1349291 | N/A | @@ -147,15 +178,28 @@ stable kernels. +----------------+-----------------+-----------------+-----------------------------+ | ARM | Neoverse-N1 | #1542419 | ARM64_ERRATUM_1542419 | +----------------+-----------------+-----------------+-----------------------------+ +| ARM | Neoverse-N1 | #3324349 | ARM64_ERRATUM_3194386 | ++----------------+-----------------+-----------------+-----------------------------+ | ARM | Neoverse-N2 | #2139208 | ARM64_ERRATUM_2139208 | +----------------+-----------------+-----------------+-----------------------------+ | ARM | Neoverse-N2 | #2067961 | ARM64_ERRATUM_2067961 | +----------------+-----------------+-----------------+-----------------------------+ | ARM | Neoverse-N2 | #2253138 | ARM64_ERRATUM_2253138 | +----------------+-----------------+-----------------+-----------------------------+ +| ARM | Neoverse-N2 | #3324339 | ARM64_ERRATUM_3194386 | ++----------------+-----------------+-----------------+-----------------------------+ +| ARM | Neoverse-N3 | #3456111 | ARM64_ERRATUM_3194386 | ++----------------+-----------------+-----------------+-----------------------------+ | ARM | Neoverse-V1 | #1619801 | N/A | +----------------+-----------------+-----------------+-----------------------------+ -| ARM | MMU-500 | #841119,826419 | N/A | +| ARM | Neoverse-V1 | #3324341 | ARM64_ERRATUM_3194386 | ++----------------+-----------------+-----------------+-----------------------------+ +| ARM | Neoverse-V2 | #3324336 | ARM64_ERRATUM_3194386 | ++----------------+-----------------+-----------------+-----------------------------+ +| ARM | Neoverse-V3 | #3312417 | ARM64_ERRATUM_3194386 | ++----------------+-----------------+-----------------+-----------------------------+ +| ARM | MMU-500 | #841119,826419 | ARM_SMMU_MMU_500_CPRE_ERRATA| +| | | #562869,1047329 | | +----------------+-----------------+-----------------+-----------------------------+ | ARM | MMU-600 | #1076982,1209401| N/A | +----------------+-----------------+-----------------+-----------------------------+ @@ -212,8 +256,11 @@ stable kernels. +----------------+-----------------+-----------------+-----------------------------+ | Hisilicon | Hip08 SMMU PMCG | #162001800 | N/A | +----------------+-----------------+-----------------+-----------------------------+ -| Hisilicon | Hip08 SMMU PMCG | #162001900 | N/A | -| | Hip09 SMMU PMCG | | | +| Hisilicon | Hip{08,09,09A,10| #162001900 | N/A | +| | ,10C,11} | | | +| | SMMU PMCG | | | ++----------------+-----------------+-----------------+-----------------------------+ +| Hisilicon | Hip09 | #162100801 | HISILICON_ERRATUM_162100801 | +----------------+-----------------+-----------------+-----------------------------+ +----------------+-----------------+-----------------+-----------------------------+ | Qualcomm Tech. | Kryo/Falkor v1 | E1003 | QCOM_FALKOR_ERRATUM_1003 | @@ -250,3 +297,5 @@ stable kernels. +----------------+-----------------+-----------------+-----------------------------+ | Microsoft | Azure Cobalt 100| #2253138 | ARM64_ERRATUM_2253138 | +----------------+-----------------+-----------------+-----------------------------+ +| Microsoft | Azure Cobalt 100| #3324339 | ARM64_ERRATUM_3194386 | ++----------------+-----------------+-----------------+-----------------------------+ diff --git a/Documentation/arch/arm64/sme.rst b/Documentation/arch/arm64/sme.rst index 3d0e53ecac4f..b2fa01f85cb5 100644 --- a/Documentation/arch/arm64/sme.rst +++ b/Documentation/arch/arm64/sme.rst @@ -75,7 +75,7 @@ model features for SME is included in Appendix A. 2. Vector lengths ------------------ -SME defines a second vector length similar to the SVE vector length which is +SME defines a second vector length similar to the SVE vector length which controls the size of the streaming mode SVE vectors and the ZA matrix array. The ZA matrix is square with each side having as many bytes as a streaming mode SVE vector. @@ -238,12 +238,12 @@ prctl(PR_SME_SET_VL, unsigned long arg) bits of Z0..Z31 except for Z0 bits [127:0] .. Z31 bits [127:0] to become unspecified, including both streaming and non-streaming SVE state. Calling PR_SME_SET_VL with vl equal to the thread's current vector - length, or calling PR_SME_SET_VL with the PR_SVE_SET_VL_ONEXEC flag, + length, or calling PR_SME_SET_VL with the PR_SME_SET_VL_ONEXEC flag, does not constitute a change to the vector length for this purpose. * Changing the vector length causes PSTATE.ZA and PSTATE.SM to be cleared. Calling PR_SME_SET_VL with vl equal to the thread's current vector - length, or calling PR_SME_SET_VL with the PR_SVE_SET_VL_ONEXEC flag, + length, or calling PR_SME_SET_VL with the PR_SME_SET_VL_ONEXEC flag, does not constitute a change to the vector length for this purpose. @@ -346,6 +346,10 @@ The regset data starts with struct user_za_header, containing: * Writes to NT_ARM_ZT will set PSTATE.ZA to 1. +* If any register data is provided along with SME_PT_VL_ONEXEC then the + registers data will be interpreted with the current vector length, not + the vector length configured for use on exec. + 8. ELF coredump extensions --------------------------- @@ -379,9 +383,8 @@ The regset data starts with struct user_za_header, containing: /proc/sys/abi/sme_default_vector_length Writing the text representation of an integer to this file sets the system - default vector length to the specified value, unless the value is greater - than the maximum vector length supported by the system in which case the - default vector length is set to that maximum. + default vector length to the specified value rounded to a supported value + using the same rules as for setting vector length via PR_SME_SET_VL. The result can be determined by reopening the file and reading its contents. diff --git a/Documentation/arch/arm64/sve.rst b/Documentation/arch/arm64/sve.rst index 0d9a426e9f85..28152492c29c 100644 --- a/Documentation/arch/arm64/sve.rst +++ b/Documentation/arch/arm64/sve.rst @@ -117,11 +117,6 @@ the SVE instruction set architecture. * The SVE registers are not used to pass arguments to or receive results from any syscall. -* In practice the affected registers/bits will be preserved or will be replaced - with zeros on return from a syscall, but userspace should not make - assumptions about this. The kernel behaviour may vary on a case-by-case - basis. - * All other SVE state of a thread, including the currently configured vector length, the state of the PR_SVE_VL_INHERIT flag, and the deferred vector length (if any), is preserved across all syscalls, subject to the specific @@ -407,6 +402,10 @@ The regset data starts with struct user_sve_header, containing: streaming mode and any SETREGSET of NT_ARM_SSVE will enter streaming mode if the target was not in streaming mode. +* If any register data is provided along with SVE_PT_VL_ONEXEC then the + registers data will be interpreted with the current vector length, not + the vector length configured for use on exec. + * The effect of writing a partial, incomplete payload is unspecified. @@ -428,9 +427,8 @@ The regset data starts with struct user_sve_header, containing: /proc/sys/abi/sve_default_vector_length Writing the text representation of an integer to this file sets the system - default vector length to the specified value, unless the value is greater - than the maximum vector length supported by the system in which case the - default vector length is set to that maximum. + default vector length to the specified value rounded to a supported value + using the same rules as for setting vector length via PR_SVE_SET_VL. The result can be determined by reopening the file and reading its contents. diff --git a/Documentation/arch/loongarch/irq-chip-model.rst b/Documentation/arch/loongarch/irq-chip-model.rst index 7988f4192363..a7ecce11e445 100644 --- a/Documentation/arch/loongarch/irq-chip-model.rst +++ b/Documentation/arch/loongarch/irq-chip-model.rst @@ -85,6 +85,102 @@ to CPUINTC directly:: | Devices | +---------+ +Virtual Extended IRQ model +========================== + +In this model, IPI (Inter-Processor Interrupt) and CPU Local Timer interrupt +go to CPUINTC directly, CPU UARTS interrupts go to PCH-PIC, while all other +devices interrupts go to PCH-PIC/PCH-MSI and gathered by V-EIOINTC (Virtual +Extended I/O Interrupt Controller), and then go to CPUINTC directly:: + + +-----+ +-------------------+ +-------+ + | IPI |--> | CPUINTC(0-255vcpu)| <-- | Timer | + +-----+ +-------------------+ +-------+ + ^ + | + +-----------+ + | V-EIOINTC | + +-----------+ + ^ ^ + | | + +---------+ +---------+ + | PCH-PIC | | PCH-MSI | + +---------+ +---------+ + ^ ^ ^ + | | | + +--------+ +---------+ +---------+ + | UARTs | | Devices | | Devices | + +--------+ +---------+ +---------+ + + +Description +----------- +V-EIOINTC (Virtual Extended I/O Interrupt Controller) is an extension of +EIOINTC, it only works in VM mode which runs in KVM hypervisor. Interrupts can +be routed to up to four vCPUs via standard EIOINTC, however with V-EIOINTC +interrupts can be routed to up to 256 virtual cpus. + +With standard EIOINTC, interrupt routing setting includes two parts: eight +bits for CPU selection and four bits for CPU IP (Interrupt Pin) selection. +For CPU selection there is four bits for EIOINTC node selection, four bits +for EIOINTC CPU selection. Bitmap method is used for CPU selection and +CPU IP selection, so interrupt can only route to CPU0 - CPU3 and IP0-IP3 in +one EIOINTC node. + +With V-EIOINTC it supports to route more CPUs and CPU IP (Interrupt Pin), +there are two newly added registers with V-EIOINTC. + +EXTIOI_VIRT_FEATURES +-------------------- +This register is read-only register, which indicates supported features with +V-EIOINTC. Feature EXTIOI_HAS_INT_ENCODE and EXTIOI_HAS_CPU_ENCODE is added. + +Feature EXTIOI_HAS_INT_ENCODE is part of standard EIOINTC. If it is 1, it +indicates that CPU Interrupt Pin selection can be normal method rather than +bitmap method, so interrupt can be routed to IP0 - IP15. + +Feature EXTIOI_HAS_CPU_ENCODE is entension of V-EIOINTC. If it is 1, it +indicates that CPU selection can be normal method rather than bitmap method, +so interrupt can be routed to CPU0 - CPU255. + +EXTIOI_VIRT_CONFIG +------------------ +This register is read-write register, for compatibility intterupt routed uses +the default method which is the same with standard EIOINTC. If the bit is set +with 1, it indicated HW to use normal method rather than bitmap method. + +Advanced Extended IRQ model +=========================== + +In this model, IPI (Inter-Processor Interrupt) and CPU Local Timer interrupt go +to CPUINTC directly, CPU UARTS interrupts go to LIOINTC, PCH-MSI interrupts go +to AVECINTC, and then go to CPUINTC directly, while all other devices interrupts +go to PCH-PIC/PCH-LPC and gathered by EIOINTC, and then go to CPUINTC directly:: + + +-----+ +-----------------------+ +-------+ + | IPI | --> | CPUINTC | <-- | Timer | + +-----+ +-----------------------+ +-------+ + ^ ^ ^ + | | | + +---------+ +----------+ +---------+ +-------+ + | EIOINTC | | AVECINTC | | LIOINTC | <-- | UARTs | + +---------+ +----------+ +---------+ +-------+ + ^ ^ + | | + +---------+ +---------+ + | PCH-PIC | | PCH-MSI | + +---------+ +---------+ + ^ ^ ^ + | | | + +---------+ +---------+ +---------+ + | Devices | | PCH-LPC | | Devices | + +---------+ +---------+ +---------+ + ^ + | + +---------+ + | Devices | + +---------+ + ACPI-related definitions ======================== diff --git a/Documentation/arch/m68k/buddha-driver.rst b/Documentation/arch/m68k/buddha-driver.rst index 20e401413991..5d1bc824978b 100644 --- a/Documentation/arch/m68k/buddha-driver.rst +++ b/Documentation/arch/m68k/buddha-driver.rst @@ -173,7 +173,7 @@ When accessing IDE registers with A6=1 (for example $84x), the timing will always be mode 0 8-bit compatible, no matter what you have selected in the speed register: -781ns select, IOR/IOW after 4 clock cycles (=314ns) aktive. +781ns select, IOR/IOW after 4 clock cycles (=314ns) active. All the timings with a very short select-signal (the 355ns fast accesses) depend on the accelerator card used in the diff --git a/Documentation/arch/powerpc/booting.rst b/Documentation/arch/powerpc/booting.rst index 11aa440f98cc..472e97891aef 100644 --- a/Documentation/arch/powerpc/booting.rst +++ b/Documentation/arch/powerpc/booting.rst @@ -93,8 +93,8 @@ given platform based on the content of the device-tree. Thus, you should: a) add your platform support as a _boolean_ option in - arch/powerpc/Kconfig, following the example of PPC_PSERIES, - PPC_PMAC and PPC_MAPLE. The latter is probably a good + arch/powerpc/Kconfig, following the example of PPC_PSERIES + and PPC_PMAC. The latter is probably a good example of a board support to start from. b) create your main platform file as diff --git a/Documentation/arch/powerpc/cpu_families.rst b/Documentation/arch/powerpc/cpu_families.rst index eb7e60649b43..f55433c6b8f3 100644 --- a/Documentation/arch/powerpc/cpu_families.rst +++ b/Documentation/arch/powerpc/cpu_families.rst @@ -128,24 +128,6 @@ IBM BookE - All 32 bit:: +--------------+ - | 401 | - +--------------+ - | - | - v - +--------------+ - | 403 | - +--------------+ - | - | - v - +--------------+ - | 405 | - +--------------+ - | - | - v - +--------------+ | 440 | +--------------+ | diff --git a/Documentation/arch/powerpc/cxl.rst b/Documentation/arch/powerpc/cxl.rst index d2d77057610e..778adda740d2 100644 --- a/Documentation/arch/powerpc/cxl.rst +++ b/Documentation/arch/powerpc/cxl.rst @@ -18,6 +18,7 @@ Introduction both access system memory directly and with the same effective addresses. + **This driver is deprecated and will be removed in a future release.** Hardware overview ================= @@ -453,7 +454,7 @@ Sysfs Class A cxl sysfs class is added under /sys/class/cxl to facilitate enumeration and tuning of the accelerators. Its layout is - described in Documentation/ABI/testing/sysfs-class-cxl + described in Documentation/ABI/obsolete/sysfs-class-cxl Udev rules diff --git a/Documentation/arch/powerpc/dexcr.rst b/Documentation/arch/powerpc/dexcr.rst index 615a631f51fa..ab0724212fcd 100644 --- a/Documentation/arch/powerpc/dexcr.rst +++ b/Documentation/arch/powerpc/dexcr.rst @@ -36,8 +36,145 @@ state for a process. Configuration ============= -The DEXCR is currently unconfigurable. All threads are run with the -NPHIE aspect enabled. +prctl +----- + +A process can control its own userspace DEXCR value using the +``PR_PPC_GET_DEXCR`` and ``PR_PPC_SET_DEXCR`` pair of +:manpage:`prctl(2)` commands. These calls have the form:: + + prctl(PR_PPC_GET_DEXCR, unsigned long which, 0, 0, 0); + prctl(PR_PPC_SET_DEXCR, unsigned long which, unsigned long ctrl, 0, 0); + +The possible 'which' and 'ctrl' values are as follows. Note there is no relation +between the 'which' value and the DEXCR aspect's index. + +.. flat-table:: + :header-rows: 1 + :widths: 2 7 1 + + * - ``prctl()`` which + - Aspect name + - Aspect index + + * - ``PR_PPC_DEXCR_SBHE`` + - Speculative Branch Hint Enable (SBHE) + - 0 + + * - ``PR_PPC_DEXCR_IBRTPD`` + - Indirect Branch Recurrent Target Prediction Disable (IBRTPD) + - 3 + + * - ``PR_PPC_DEXCR_SRAPD`` + - Subroutine Return Address Prediction Disable (SRAPD) + - 4 + + * - ``PR_PPC_DEXCR_NPHIE`` + - Non-Privileged Hash Instruction Enable (NPHIE) + - 5 + +.. flat-table:: + :header-rows: 1 + :widths: 2 8 + + * - ``prctl()`` ctrl + - Meaning + + * - ``PR_PPC_DEXCR_CTRL_EDITABLE`` + - This aspect can be configured with PR_PPC_SET_DEXCR (get only) + + * - ``PR_PPC_DEXCR_CTRL_SET`` + - This aspect is set / set this aspect + + * - ``PR_PPC_DEXCR_CTRL_CLEAR`` + - This aspect is clear / clear this aspect + + * - ``PR_PPC_DEXCR_CTRL_SET_ONEXEC`` + - This aspect will be set after exec / set this aspect after exec + + * - ``PR_PPC_DEXCR_CTRL_CLEAR_ONEXEC`` + - This aspect will be clear after exec / clear this aspect after exec + +Note that + +* which is a plain value, not a bitmask. Aspects must be worked with individually. + +* ctrl is a bitmask. ``PR_PPC_GET_DEXCR`` returns both the current and onexec + configuration. For example, ``PR_PPC_GET_DEXCR`` may return + ``PR_PPC_DEXCR_CTRL_EDITABLE | PR_PPC_DEXCR_CTRL_SET | + PR_PPC_DEXCR_CTRL_CLEAR_ONEXEC``. This would indicate the aspect is currently + set, it will be cleared when you run exec, and you can change this with the + ``PR_PPC_SET_DEXCR`` prctl. + +* The set/clear terminology refers to setting/clearing the bit in the DEXCR. + For example:: + + prctl(PR_PPC_SET_DEXCR, PR_PPC_DEXCR_IBRTPD, PR_PPC_DEXCR_CTRL_SET, 0, 0); + + will set the IBRTPD aspect bit in the DEXCR, causing indirect branch prediction + to be disabled. + +* The status returned by ``PR_PPC_GET_DEXCR`` represents what value the process + would like applied. It does not include any alternative overrides, such as if + the hypervisor is enforcing the aspect be set. To see the true DEXCR state + software should read the appropriate SPRs directly. + +* The aspect state when starting a process is copied from the parent's state on + :manpage:`fork(2)`. The state is reset to a fixed value on + :manpage:`execve(2)`. The PR_PPC_SET_DEXCR prctl() can control both of these + values. + +* The ``*_ONEXEC`` controls do not change the current process's DEXCR. + +Use ``PR_PPC_SET_DEXCR`` with one of ``PR_PPC_DEXCR_CTRL_SET`` or +``PR_PPC_DEXCR_CTRL_CLEAR`` to edit a given aspect. + +Common error codes for both getting and setting the DEXCR are as follows: + +.. flat-table:: + :header-rows: 1 + :widths: 2 8 + + * - Error + - Meaning + + * - ``EINVAL`` + - The DEXCR is not supported by the kernel. + + * - ``ENODEV`` + - The aspect is not recognised by the kernel or not supported by the + hardware. + +``PR_PPC_SET_DEXCR`` may also report the following error codes: + +.. flat-table:: + :header-rows: 1 + :widths: 2 8 + + * - Error + - Meaning + + * - ``EINVAL`` + - The ctrl value contains unrecognised flags. + + * - ``EINVAL`` + - The ctrl value contains mutually conflicting flags (e.g., + ``PR_PPC_DEXCR_CTRL_SET | PR_PPC_DEXCR_CTRL_CLEAR``) + + * - ``EPERM`` + - This aspect cannot be modified with prctl() (check for the + PR_PPC_DEXCR_CTRL_EDITABLE flag with PR_PPC_GET_DEXCR). + + * - ``EPERM`` + - The process does not have sufficient privilege to perform the operation. + For example, clearing NPHIE on exec is a privileged operation (a process + can still clear its own NPHIE aspect without privileges). + +This interface allows a process to control its own DEXCR aspects, and also set +the initial DEXCR value for any children in its process tree (up to the next +child to use an ``*_ONEXEC`` control). This allows fine-grained control over the +default value of the DEXCR, for example allowing containers to run with different +default values. coredump and ptrace diff --git a/Documentation/arch/powerpc/elf_hwcaps.rst b/Documentation/arch/powerpc/elf_hwcaps.rst index 4c896cf077c2..fce7489877b5 100644 --- a/Documentation/arch/powerpc/elf_hwcaps.rst +++ b/Documentation/arch/powerpc/elf_hwcaps.rst @@ -91,6 +91,7 @@ PPC_FEATURE_HAS_MMU PPC_FEATURE_HAS_4xxMAC The processor is 40x or 44x family. + Unused in the kernel since 732b32daef80 ("powerpc: Remove core support for 40x") PPC_FEATURE_UNIFIED_CACHE The processor has a unified L1 cache for instructions and data, as diff --git a/Documentation/arch/powerpc/firmware-assisted-dump.rst b/Documentation/arch/powerpc/firmware-assisted-dump.rst index e363fc48529a..7e37aadd1f77 100644 --- a/Documentation/arch/powerpc/firmware-assisted-dump.rst +++ b/Documentation/arch/powerpc/firmware-assisted-dump.rst @@ -134,12 +134,12 @@ that are run. If there is dump data, then the memory is held. If there is no waiting dump data, then only the memory required to -hold CPU state, HPTE region, boot memory dump, FADump header and -elfcore header, is usually reserved at an offset greater than boot -memory size (see Fig. 1). This area is *not* released: this region -will be kept permanently reserved, so that it can act as a receptacle -for a copy of the boot memory content in addition to CPU state and -HPTE region, in the case a crash does occur. +hold CPU state, HPTE region, boot memory dump, and FADump header is +usually reserved at an offset greater than boot memory size (see Fig. 1). +This area is *not* released: this region will be kept permanently +reserved, so that it can act as a receptacle for a copy of the boot +memory content in addition to CPU state and HPTE region, in the case +a crash does occur. Since this reserved memory area is used only after the system crash, there is no point in blocking this significant chunk of memory from @@ -153,22 +153,22 @@ that were present in CMA region:: o Memory Reservation during first kernel - Low memory Top of memory - 0 boot memory size |<--- Reserved dump area --->| | - | | | Permanent Reservation | | - V V | | V - +-----------+-----/ /---+---+----+-------+-----+-----+----+--+ - | | |///|////| DUMP | HDR | ELF |////| | - +-----------+-----/ /---+---+----+-------+-----+-----+----+--+ - | ^ ^ ^ ^ ^ - | | | | | | - \ CPU HPTE / | | - ------------------------------ | | - Boot memory content gets transferred | | - to reserved area by firmware at the | | - time of crash. | | - FADump Header | - (meta area) | + Low memory Top of memory + 0 boot memory size |<------ Reserved dump area ----->| | + | | | Permanent Reservation | | + V V | | V + +-----------+-----/ /---+---+----+-----------+-------+----+-----+ + | | |///|////| DUMP | HDR |////| | + +-----------+-----/ /---+---+----+-----------+-------+----+-----+ + | ^ ^ ^ ^ ^ + | | | | | | + \ CPU HPTE / | | + -------------------------------- | | + Boot memory content gets transferred | | + to reserved area by firmware at the | | + time of crash. | | + FADump Header | + (meta area) | | | Metadata: This area holds a metadata structure whose @@ -186,13 +186,20 @@ that were present in CMA region:: 0 boot memory size | | |<------------ Crash preserved area ------------>| V V |<--- Reserved dump area --->| | - +-----------+-----/ /---+---+----+-------+-----+-----+----+--+ - | | |///|////| DUMP | HDR | ELF |////| | - +-----------+-----/ /---+---+----+-------+-----+-----+----+--+ - | | - V V - Used by second /proc/vmcore - kernel to boot + +----+---+--+-----/ /---+---+----+-------+-----+-----+-------+ + | |ELF| | |///|////| DUMP | HDR |/////| | + +----+---+--+-----/ /---+---+----+-------+-----+-----+-------+ + | | | | | | + ----- ------------------------------ --------------- + \ | | + \ | | + \ | | + \ | ---------------------------- + \ | / + \ | / + \ | / + /proc/vmcore + +---+ |///| -> Regions (CPU, HPTE & Metadata) marked like this in the above @@ -200,6 +207,12 @@ that were present in CMA region:: does not have CPU & HPTE regions while Metadata region is not supported on pSeries currently. + +---+ + |ELF| -> elfcorehdr, it is created in second kernel after crash. + +---+ + + Note: Memory from 0 to the boot memory size is used by second kernel + Fig. 2 @@ -353,26 +366,6 @@ TODO: - Need to come up with the better approach to find out more accurate boot memory size that is required for a kernel to boot successfully when booted with restricted memory. - - The FADump implementation introduces a FADump crash info structure - in the scratch area before the ELF core header. The idea of introducing - this structure is to pass some important crash info data to the second - kernel which will help second kernel to populate ELF core header with - correct data before it gets exported through /proc/vmcore. The current - design implementation does not address a possibility of introducing - additional fields (in future) to this structure without affecting - compatibility. Need to come up with the better approach to address this. - - The possible approaches are: - - 1. Introduce version field for version tracking, bump up the version - whenever a new field is added to the structure in future. The version - field can be used to find out what fields are valid for the current - version of the structure. - 2. Reserve the area of predefined size (say PAGE_SIZE) for this - structure and have unused area as reserved (initialized to zero) - for future field additions. - - The advantage of approach 1 over 2 is we don't need to reserve extra space. Author: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com> diff --git a/Documentation/arch/powerpc/kvm-nested.rst b/Documentation/arch/powerpc/kvm-nested.rst index 630602a8aa00..5defd13cc6c1 100644 --- a/Documentation/arch/powerpc/kvm-nested.rst +++ b/Documentation/arch/powerpc/kvm-nested.rst @@ -546,7 +546,9 @@ table information. +--------+-------+----+--------+----------------------------------+ | 0x1052 | 0x08 | RW | T | CTRL | +--------+-------+----+--------+----------------------------------+ -| 0x1053-| | | | Reserved | +| 0x1053 | 0x08 | RW | T | DPDES | ++--------+-------+----+--------+----------------------------------+ +| 0x1054-| | | | Reserved | | 0x1FFF | | | | | +--------+-------+----+--------+----------------------------------+ | 0x2000 | 0x04 | RW | T | CR | diff --git a/Documentation/arch/powerpc/ultravisor.rst b/Documentation/arch/powerpc/ultravisor.rst index ba6b1bf1cc44..6d0407b2f5a1 100644 --- a/Documentation/arch/powerpc/ultravisor.rst +++ b/Documentation/arch/powerpc/ultravisor.rst @@ -134,7 +134,7 @@ Hardware * PTCR and partition table entries (partition table is in secure memory). An attempt to write to PTCR will cause a Hypervisor - Emulation Assitance interrupt. + Emulation Assistance interrupt. * LDBAR (LD Base Address Register) and IMC (In-Memory Collection) non-architected registers. An attempt to write to them will cause a diff --git a/Documentation/arch/riscv/cmodx.rst b/Documentation/arch/riscv/cmodx.rst new file mode 100644 index 000000000000..8c48bcff3df9 --- /dev/null +++ b/Documentation/arch/riscv/cmodx.rst @@ -0,0 +1,98 @@ +.. SPDX-License-Identifier: GPL-2.0 + +============================================================================== +Concurrent Modification and Execution of Instructions (CMODX) for RISC-V Linux +============================================================================== + +CMODX is a programming technique where a program executes instructions that were +modified by the program itself. Instruction storage and the instruction cache +(icache) are not guaranteed to be synchronized on RISC-V hardware. Therefore, the +program must enforce its own synchronization with the unprivileged fence.i +instruction. + +However, the default Linux ABI prohibits the use of fence.i in userspace +applications. At any point the scheduler may migrate a task onto a new hart. If +migration occurs after the userspace synchronized the icache and instruction +storage with fence.i, the icache on the new hart will no longer be clean. This +is due to the behavior of fence.i only affecting the hart that it is called on. +Thus, the hart that the task has been migrated to may not have synchronized +instruction storage and icache. + +There are two ways to solve this problem: use the riscv_flush_icache() syscall, +or use the ``PR_RISCV_SET_ICACHE_FLUSH_CTX`` prctl() and emit fence.i in +userspace. The syscall performs a one-off icache flushing operation. The prctl +changes the Linux ABI to allow userspace to emit icache flushing operations. + +As an aside, "deferred" icache flushes can sometimes be triggered in the kernel. +At the time of writing, this only occurs during the riscv_flush_icache() syscall +and when the kernel uses copy_to_user_page(). These deferred flushes happen only +when the memory map being used by a hart changes. If the prctl() context caused +an icache flush, this deferred icache flush will be skipped as it is redundant. +Therefore, there will be no additional flush when using the riscv_flush_icache() +syscall inside of the prctl() context. + +prctl() Interface +--------------------- + +Call prctl() with ``PR_RISCV_SET_ICACHE_FLUSH_CTX`` as the first argument. The +remaining arguments will be delegated to the riscv_set_icache_flush_ctx +function detailed below. + +.. kernel-doc:: arch/riscv/mm/cacheflush.c + :identifiers: riscv_set_icache_flush_ctx + +Example usage: + +The following files are meant to be compiled and linked with each other. The +modify_instruction() function replaces an add with 0 with an add with one, +causing the instruction sequence in get_value() to change from returning a zero +to returning a one. + +cmodx.c:: + + #include <stdio.h> + #include <sys/prctl.h> + + extern int get_value(); + extern void modify_instruction(); + + int main() + { + int value = get_value(); + printf("Value before cmodx: %d\n", value); + + // Call prctl before first fence.i is called inside modify_instruction + prctl(PR_RISCV_SET_ICACHE_FLUSH_CTX, PR_RISCV_CTX_SW_FENCEI_ON, PR_RISCV_SCOPE_PER_PROCESS); + modify_instruction(); + // Call prctl after final fence.i is called in process + prctl(PR_RISCV_SET_ICACHE_FLUSH_CTX, PR_RISCV_CTX_SW_FENCEI_OFF, PR_RISCV_SCOPE_PER_PROCESS); + + value = get_value(); + printf("Value after cmodx: %d\n", value); + return 0; + } + +cmodx.S:: + + .option norvc + + .text + .global modify_instruction + modify_instruction: + lw a0, new_insn + lui a5,%hi(old_insn) + sw a0,%lo(old_insn)(a5) + fence.i + ret + + .section modifiable, "awx" + .global get_value + get_value: + li a0, 0 + old_insn: + addi a0, a0, 0 + ret + + .data + new_insn: + addi a0, a0, 1 diff --git a/Documentation/arch/riscv/hwprobe.rst b/Documentation/arch/riscv/hwprobe.rst index b2bcc9eed9aa..f273ea15a8e8 100644 --- a/Documentation/arch/riscv/hwprobe.rst +++ b/Documentation/arch/riscv/hwprobe.rst @@ -188,25 +188,118 @@ The following keys are defined: manual starting from commit 95cf1f9 ("Add changes requested by Ved during signoff") -* :c:macro:`RISCV_HWPROBE_KEY_CPUPERF_0`: A bitmask that contains performance - information about the selected set of processors. + * :c:macro:`RISCV_HWPROBE_EXT_ZIHINTPAUSE`: The Zihintpause extension is + supported as defined in the RISC-V ISA manual starting from commit + d8ab5c78c207 ("Zihintpause is ratified"). - * :c:macro:`RISCV_HWPROBE_MISALIGNED_UNKNOWN`: The performance of misaligned - accesses is unknown. + * :c:macro:`RISCV_HWPROBE_EXT_ZVE32X`: The Vector sub-extension Zve32x is + supported, as defined by version 1.0 of the RISC-V Vector extension manual. - * :c:macro:`RISCV_HWPROBE_MISALIGNED_EMULATED`: Misaligned accesses are - emulated via software, either in or below the kernel. These accesses are - always extremely slow. + * :c:macro:`RISCV_HWPROBE_EXT_ZVE32F`: The Vector sub-extension Zve32f is + supported, as defined by version 1.0 of the RISC-V Vector extension manual. - * :c:macro:`RISCV_HWPROBE_MISALIGNED_SLOW`: Misaligned accesses are slower - than equivalent byte accesses. Misaligned accesses may be supported - directly in hardware, or trapped and emulated by software. + * :c:macro:`RISCV_HWPROBE_EXT_ZVE64X`: The Vector sub-extension Zve64x is + supported, as defined by version 1.0 of the RISC-V Vector extension manual. - * :c:macro:`RISCV_HWPROBE_MISALIGNED_FAST`: Misaligned accesses are faster - than equivalent byte accesses. + * :c:macro:`RISCV_HWPROBE_EXT_ZVE64F`: The Vector sub-extension Zve64f is + supported, as defined by version 1.0 of the RISC-V Vector extension manual. - * :c:macro:`RISCV_HWPROBE_MISALIGNED_UNSUPPORTED`: Misaligned accesses are - not supported at all and will generate a misaligned address fault. + * :c:macro:`RISCV_HWPROBE_EXT_ZVE64D`: The Vector sub-extension Zve64d is + supported, as defined by version 1.0 of the RISC-V Vector extension manual. + + * :c:macro:`RISCV_HWPROBE_EXT_ZIMOP`: The Zimop May-Be-Operations extension is + supported as defined in the RISC-V ISA manual starting from commit + 58220614a5f ("Zimop is ratified/1.0"). + + * :c:macro:`RISCV_HWPROBE_EXT_ZCA`: The Zca extension part of Zc* standard + extensions for code size reduction, as ratified in commit 8be3419c1c0 + ("Zcf doesn't exist on RV64 as it contains no instructions") of + riscv-code-size-reduction. + + * :c:macro:`RISCV_HWPROBE_EXT_ZCB`: The Zcb extension part of Zc* standard + extensions for code size reduction, as ratified in commit 8be3419c1c0 + ("Zcf doesn't exist on RV64 as it contains no instructions") of + riscv-code-size-reduction. + + * :c:macro:`RISCV_HWPROBE_EXT_ZCD`: The Zcd extension part of Zc* standard + extensions for code size reduction, as ratified in commit 8be3419c1c0 + ("Zcf doesn't exist on RV64 as it contains no instructions") of + riscv-code-size-reduction. + + * :c:macro:`RISCV_HWPROBE_EXT_ZCF`: The Zcf extension part of Zc* standard + extensions for code size reduction, as ratified in commit 8be3419c1c0 + ("Zcf doesn't exist on RV64 as it contains no instructions") of + riscv-code-size-reduction. + + * :c:macro:`RISCV_HWPROBE_EXT_ZCMOP`: The Zcmop May-Be-Operations extension is + supported as defined in the RISC-V ISA manual starting from commit + c732a4f39a4 ("Zcmop is ratified/1.0"). + + * :c:macro:`RISCV_HWPROBE_EXT_ZAWRS`: The Zawrs extension is supported as + ratified in commit 98918c844281 ("Merge pull request #1217 from + riscv/zawrs") of riscv-isa-manual. + + * :c:macro:`RISCV_HWPROBE_EXT_SUPM`: The Supm extension is supported as + defined in version 1.0 of the RISC-V Pointer Masking extensions. + +* :c:macro:`RISCV_HWPROBE_KEY_CPUPERF_0`: Deprecated. Returns similar values to + :c:macro:`RISCV_HWPROBE_KEY_MISALIGNED_SCALAR_PERF`, but the key was + mistakenly classified as a bitmask rather than a value. + +* :c:macro:`RISCV_HWPROBE_KEY_MISALIGNED_SCALAR_PERF`: An enum value describing + the performance of misaligned scalar native word accesses on the selected set + of processors. + + * :c:macro:`RISCV_HWPROBE_MISALIGNED_SCALAR_UNKNOWN`: The performance of + misaligned scalar accesses is unknown. + + * :c:macro:`RISCV_HWPROBE_MISALIGNED_SCALAR_EMULATED`: Misaligned scalar + accesses are emulated via software, either in or below the kernel. These + accesses are always extremely slow. + + * :c:macro:`RISCV_HWPROBE_MISALIGNED_SCALAR_SLOW`: Misaligned scalar native + word sized accesses are slower than the equivalent quantity of byte + accesses. Misaligned accesses may be supported directly in hardware, or + trapped and emulated by software. + + * :c:macro:`RISCV_HWPROBE_MISALIGNED_SCALAR_FAST`: Misaligned scalar native + word sized accesses are faster than the equivalent quantity of byte + accesses. + + * :c:macro:`RISCV_HWPROBE_MISALIGNED_SCALAR_UNSUPPORTED`: Misaligned scalar + accesses are not supported at all and will generate a misaligned address + fault. * :c:macro:`RISCV_HWPROBE_KEY_ZICBOZ_BLOCK_SIZE`: An unsigned int which represents the size of the Zicboz block in bytes. + +* :c:macro:`RISCV_HWPROBE_KEY_HIGHEST_VIRT_ADDRESS`: An unsigned long which + represent the highest userspace virtual address usable. + +* :c:macro:`RISCV_HWPROBE_KEY_TIME_CSR_FREQ`: Frequency (in Hz) of `time CSR`. + +* :c:macro:`RISCV_HWPROBE_KEY_MISALIGNED_VECTOR_PERF`: An enum value describing the + performance of misaligned vector accesses on the selected set of processors. + + * :c:macro:`RISCV_HWPROBE_MISALIGNED_VECTOR_UNKNOWN`: The performance of misaligned + vector accesses is unknown. + + * :c:macro:`RISCV_HWPROBE_MISALIGNED_VECTOR_SLOW`: 32-bit misaligned accesses using vector + registers are slower than the equivalent quantity of byte accesses via vector registers. + Misaligned accesses may be supported directly in hardware, or trapped and emulated by software. + + * :c:macro:`RISCV_HWPROBE_MISALIGNED_VECTOR_FAST`: 32-bit misaligned accesses using vector + registers are faster than the equivalent quantity of byte accesses via vector registers. + + * :c:macro:`RISCV_HWPROBE_MISALIGNED_VECTOR_UNSUPPORTED`: Misaligned vector accesses are + not supported at all and will generate a misaligned address fault. + +* :c:macro:`RISCV_HWPROBE_KEY_VENDOR_EXT_THEAD_0`: A bitmask containing the + thead vendor extensions that are compatible with the + :c:macro:`RISCV_HWPROBE_BASE_BEHAVIOR_IMA`: base system behavior. + + * T-HEAD + + * :c:macro:`RISCV_HWPROBE_VENDOR_EXT_XTHEADVECTOR`: The xtheadvector vendor + extension is supported in the T-Head ISA extensions spec starting from + commit a18c801634 ("Add T-Head VECTOR vendor extension. "). diff --git a/Documentation/arch/riscv/index.rst b/Documentation/arch/riscv/index.rst index 4dab0cb4b900..eecf347ce849 100644 --- a/Documentation/arch/riscv/index.rst +++ b/Documentation/arch/riscv/index.rst @@ -13,6 +13,7 @@ RISC-V architecture patch-acceptance uabi vector + cmodx features diff --git a/Documentation/arch/riscv/uabi.rst b/Documentation/arch/riscv/uabi.rst index 54d199dce78b..243e40062e34 100644 --- a/Documentation/arch/riscv/uabi.rst +++ b/Documentation/arch/riscv/uabi.rst @@ -65,4 +65,22 @@ the extension, or may have deliberately removed it from the listing. Misaligned accesses ------------------- -Misaligned accesses are supported in userspace, but they may perform poorly. +Misaligned scalar accesses are supported in userspace, but they may perform +poorly. Misaligned vector accesses are only supported if the Zicclsm extension +is supported. + +Pointer masking +--------------- + +Support for pointer masking in userspace (the Supm extension) is provided via +the ``PR_SET_TAGGED_ADDR_CTRL`` and ``PR_GET_TAGGED_ADDR_CTRL`` ``prctl()`` +operations. Pointer masking is disabled by default. To enable it, userspace +must call ``PR_SET_TAGGED_ADDR_CTRL`` with the ``PR_PMLEN`` field set to the +number of mask/tag bits needed by the application. ``PR_PMLEN`` is interpreted +as a lower bound; if the kernel is unable to satisfy the request, the +``PR_SET_TAGGED_ADDR_CTRL`` operation will fail. The actual number of tag bits +is returned in ``PR_PMLEN`` by the ``PR_GET_TAGGED_ADDR_CTRL`` operation. + +Additionally, when pointer masking is enabled (``PR_PMLEN`` is greater than 0), +a tagged address ABI is supported, with the same interface and behavior as +documented for AArch64 (Documentation/arch/arm64/tagged-address-abi.rst). diff --git a/Documentation/arch/riscv/vector.rst b/Documentation/arch/riscv/vector.rst index 75dd88a62e1d..3987f5f76a9d 100644 --- a/Documentation/arch/riscv/vector.rst +++ b/Documentation/arch/riscv/vector.rst @@ -15,7 +15,7 @@ status for the use of Vector in userspace. The intended usage guideline for these interfaces is to give init systems a way to modify the availability of V for processes running under its domain. Calling these interfaces is not recommended in libraries routines because libraries should not override policies -configured from the parant process. Also, users must noted that these interfaces +configured from the parent process. Also, users must note that these interfaces are not portable to non-Linux, nor non-RISC-V environments, so it is discourage to use in a portable code. To get the availability of V in an ELF program, please read :c:macro:`COMPAT_HWCAP_ISA_V` bit of :c:macro:`ELF_HWCAP` in the diff --git a/Documentation/arch/riscv/vm-layout.rst b/Documentation/arch/riscv/vm-layout.rst index 69ff6da1dbf8..eabec99b5852 100644 --- a/Documentation/arch/riscv/vm-layout.rst +++ b/Documentation/arch/riscv/vm-layout.rst @@ -47,11 +47,12 @@ RISC-V Linux Kernel SV39 | Kernel-space virtual memory, shared between all processes: ____________________________________________________________|___________________________________________________________ | | | | - ffffffc6fea00000 | -228 GB | ffffffc6feffffff | 6 MB | fixmap - ffffffc6ff000000 | -228 GB | ffffffc6ffffffff | 16 MB | PCI io - ffffffc700000000 | -228 GB | ffffffc7ffffffff | 4 GB | vmemmap - ffffffc800000000 | -224 GB | ffffffd7ffffffff | 64 GB | vmalloc/ioremap space - ffffffd800000000 | -160 GB | fffffff6ffffffff | 124 GB | direct mapping of all physical memory + ffffffc4fea00000 | -236 GB | ffffffc4feffffff | 6 MB | fixmap + ffffffc4ff000000 | -236 GB | ffffffc4ffffffff | 16 MB | PCI io + ffffffc500000000 | -236 GB | ffffffc5ffffffff | 4 GB | vmemmap + ffffffc600000000 | -232 GB | ffffffd5ffffffff | 64 GB | vmalloc/ioremap space + ffffffd600000000 | -168 GB | fffffff5ffffffff | 128 GB | direct mapping of all physical memory + | | | | fffffff700000000 | -36 GB | fffffffeffffffff | 32 GB | kasan __________________|____________|__________________|_________|____________________________________________________________ | @@ -133,25 +134,3 @@ RISC-V Linux Kernel SV57 ffffffff00000000 | -4 GB | ffffffff7fffffff | 2 GB | modules, BPF ffffffff80000000 | -2 GB | ffffffffffffffff | 2 GB | kernel __________________|____________|__________________|_________|____________________________________________________________ - - -Userspace VAs --------------------- -To maintain compatibility with software that relies on the VA space with a -maximum of 48 bits the kernel will, by default, return virtual addresses to -userspace from a 48-bit range (sv48). This default behavior is achieved by -passing 0 into the hint address parameter of mmap. On CPUs with an address space -smaller than sv48, the CPU maximum supported address space will be the default. - -Software can "opt-in" to receiving VAs from another VA space by providing -a hint address to mmap. A hint address passed to mmap will cause the largest -address space that fits entirely into the hint to be used, unless there is no -space left in the address space. If there is no space available in the requested -address space, an address in the next smallest available address space will be -returned. - -For example, in order to obtain 48-bit VA space, a hint address greater than -:code:`1 << 47` must be provided. Note that this is 47 due to sv48 userspace -ending at :code:`1 << 47` and the addresses beyond this are reserved for the -kernel. Similarly, to obtain 57-bit VA space addresses, a hint address greater -than or equal to :code:`1 << 56` must be provided. diff --git a/Documentation/arch/s390/index.rst b/Documentation/arch/s390/index.rst index 73c79bf586fd..e75a6e5d2505 100644 --- a/Documentation/arch/s390/index.rst +++ b/Documentation/arch/s390/index.rst @@ -8,6 +8,7 @@ s390 Architecture cds 3270 driver-model + mm monreader qeth s390dbf diff --git a/Documentation/arch/s390/mm.rst b/Documentation/arch/s390/mm.rst new file mode 100644 index 000000000000..084adad5eef9 --- /dev/null +++ b/Documentation/arch/s390/mm.rst @@ -0,0 +1,111 @@ +.. SPDX-License-Identifier: GPL-2.0 + +================= +Memory Management +================= + +Virtual memory layout +===================== + +.. note:: + + - Some aspects of the virtual memory layout setup are not + clarified (number of page levels, alignment, DMA memory). + + - Unused gaps in the virtual memory layout could be present + or not - depending on how partucular system is configured. + No page tables are created for the unused gaps. + + - The virtual memory regions are tracked or untracked by KASAN + instrumentation, as well as the KASAN shadow memory itself is + created only when CONFIG_KASAN configuration option is enabled. + +:: + + ============================================================================= + | Physical | Virtual | VM area description + ============================================================================= + +- 0 --------------+- 0 --------------+ + | | S390_lowcore | Low-address memory + | +- 8 KB -----------+ + | | | + | | | + | | ... unused gap | KASAN untracked + | | | + +- AMODE31_START --+- AMODE31_START --+ .amode31 rand. phys/virt start + |.amode31 text/data|.amode31 text/data| KASAN untracked + +- AMODE31_END ----+- AMODE31_END ----+ .amode31 rand. phys/virt end (<2GB) + | | | + | | | + +- __kaslr_offset_phys | kernel rand. phys start + | | | + | kernel text/data | | + | | | + +------------------+ | kernel phys end + | | | + | | | + | | | + | | | + +- ident_map_size -+ | + | | + | ... unused gap | KASAN untracked + | | + +- __identity_base + identity mapping start (>= 2GB) + | | + | identity | phys == virt - __identity_base + | mapping | virt == phys + __identity_base + | | + | | KASAN tracked + | | + | | + | | + | | + | | + | | + | | + | | + | | + | | + | | + | | + | | + | | + | | + +---- vmemmap -----+ 'struct page' array start + | | + | virtually mapped | + | memory map | KASAN untracked + | | + +- __abs_lowcore --+ + | | + | Absolute Lowcore | KASAN untracked + | | + +- __memcpy_real_area + | | + | Real Memory Copy| KASAN untracked + | | + +- VMALLOC_START --+ vmalloc area start + | | KASAN untracked or + | vmalloc area | KASAN shallowly populated in case + | | CONFIG_KASAN_VMALLOC=y + +- MODULES_VADDR --+ modules area start + | | KASAN allocated per module or + | modules area | KASAN shallowly populated in case + | | CONFIG_KASAN_VMALLOC=y + +- __kaslr_offset -+ kernel rand. virt start + | | KASAN tracked + | kernel text/data | phys == (kvirt - __kaslr_offset) + + | | __kaslr_offset_phys + +- kernel .bss end + kernel rand. virt end + | | + | ... unused gap | KASAN untracked + | | + +------------------+ UltraVisor Secure Storage limit + | | + | ... unused gap | KASAN untracked + | | + +KASAN_SHADOW_START+ KASAN shadow memory start + | | + | KASAN shadow | KASAN untracked + | | + +------------------+ ASCE limit diff --git a/Documentation/arch/s390/vfio-ap.rst b/Documentation/arch/s390/vfio-ap.rst index 929ee1c1c940..eba1991fbdba 100644 --- a/Documentation/arch/s390/vfio-ap.rst +++ b/Documentation/arch/s390/vfio-ap.rst @@ -380,6 +380,36 @@ matrix device. control_domains: A read-only file for displaying the control domain numbers assigned to the vfio_ap mediated device. + ap_config: + A read/write file that, when written to, allows all three of the + vfio_ap mediated device's ap matrix masks to be replaced in one shot. + Three masks are given, one for adapters, one for domains, and one for + control domains. If the given state cannot be set then no changes are + made to the vfio-ap mediated device. + + The format of the data written to ap_config is as follows: + {amask},{dmask},{cmask}\n + + \n is a newline character. + + amask, dmask, and cmask are masks identifying which adapters, domains, + and control domains should be assigned to the mediated device. + + The format of a mask is as follows: + 0xNN..NN + + Where NN..NN is 64 hexadecimal characters representing a 256-bit value. + The leftmost (highest order) bit represents adapter/domain 0. + + For an example set of masks that represent your mdev's current + configuration, simply cat ap_config. + + Setting an adapter or domain number greater than the maximum allowed for + the system will result in an error. + + This attribute is intended to be used by automation. End users would be + better served using the respective assign/unassign attributes for + adapters, domains, and control domains. * functions: @@ -550,7 +580,7 @@ These are the steps: following Kconfig elements selected: * IOMMU_SUPPORT * S390 - * ZCRYPT + * AP * VFIO * KVM @@ -969,6 +999,36 @@ the vfio_ap mediated device to which it is assigned as long as each new APQN resulting from plugging it in references a queue device bound to the vfio_ap device driver. +Driver Features +=============== +The vfio_ap driver exposes a sysfs file containing supported features. +This exists so third party tools (like Libvirt and mdevctl) can query the +availability of specific features. + +The features list can be found here: /sys/bus/matrix/devices/matrix/features + +Entries are space delimited. Each entry consists of a combination of +alphanumeric and underscore characters. + +Example: +cat /sys/bus/matrix/devices/matrix/features +guest_matrix dyn ap_config + +the following features are advertised: + +---------------+---------------------------------------------------------------+ +| Flag | Description | ++==============+===============================================================+ +| guest_matrix | guest_matrix attribute exists. It reports the matrix of | +| | adapters and domains that are or will be passed through to a | +| | guest when the mdev is attached to it. | ++--------------+---------------------------------------------------------------+ +| dyn | Indicates hot plug/unplug of AP adapters, domains and control | +| | domains for a guest to which the mdev is attached. | ++------------+-----------------------------------------------------------------+ +| ap_config | ap_config interface for one-shot modifications to mdev config | ++--------------+---------------------------------------------------------------+ + Limitations =========== Live guest migration is not supported for guests using AP devices without diff --git a/Documentation/arch/sparc/oradax/dax-hv-api.txt b/Documentation/arch/sparc/oradax/dax-hv-api.txt index 7ecd0bf4957b..ef1a4c2bf08b 100644 --- a/Documentation/arch/sparc/oradax/dax-hv-api.txt +++ b/Documentation/arch/sparc/oradax/dax-hv-api.txt @@ -41,7 +41,7 @@ Chapter 36. Coprocessor services submissions until they succeed; waiting for an outstanding CCB to complete is not necessary, and would not be a guarantee that a future submission would succeed. - The availablility of DAX coprocessor command service is indicated by the presence of the DAX virtual + The availability of DAX coprocessor command service is indicated by the presence of the DAX virtual device node in the guest MD (Section 8.24.17, “Database Analytics Accelerators (DAX) virtual-device node”). diff --git a/Documentation/arch/x86/amd-memory-encryption.rst b/Documentation/arch/x86/amd-memory-encryption.rst index 07caa8fff852..bd840df708ea 100644 --- a/Documentation/arch/x86/amd-memory-encryption.rst +++ b/Documentation/arch/x86/amd-memory-encryption.rst @@ -87,14 +87,14 @@ The state of SME in the Linux kernel can be documented as follows: kernel is non-zero). SME can also be enabled and activated in the BIOS. If SME is enabled and -activated in the BIOS, then all memory accesses will be encrypted and it will -not be necessary to activate the Linux memory encryption support. If the BIOS -merely enables SME (sets bit 23 of the MSR_AMD64_SYSCFG), then Linux can activate -memory encryption by default (CONFIG_AMD_MEM_ENCRYPT_ACTIVE_BY_DEFAULT=y) or -by supplying mem_encrypt=on on the kernel command line. However, if BIOS does -not enable SME, then Linux will not be able to activate memory encryption, even -if configured to do so by default or the mem_encrypt=on command line parameter -is specified. +activated in the BIOS, then all memory accesses will be encrypted and it +will not be necessary to activate the Linux memory encryption support. + +If the BIOS merely enables SME (sets bit 23 of the MSR_AMD64_SYSCFG), +then memory encryption can be enabled by supplying mem_encrypt=on on the +kernel command line. However, if BIOS does not enable SME, then Linux +will not be able to activate memory encryption, even if configured to do +so by default or the mem_encrypt=on command line parameter is specified. Secure Nested Paging (SNP) ========================== @@ -130,4 +130,149 @@ SNP feature support. More details in AMD64 APM[1] Vol 2: 15.34.10 SEV_STATUS MSR -[1] https://www.amd.com/content/dam/amd/en/documents/processor-tech-docs/programmer-references/24593.pdf +Reverse Map Table (RMP) +======================= + +The RMP is a structure in system memory that is used to ensure a one-to-one +mapping between system physical addresses and guest physical addresses. Each +page of memory that is potentially assignable to guests has one entry within +the RMP. + +The RMP table can be either contiguous in memory or a collection of segments +in memory. + +Contiguous RMP +-------------- + +Support for this form of the RMP is present when support for SEV-SNP is +present, which can be determined using the CPUID instruction:: + + 0x8000001f[eax]: + Bit[4] indicates support for SEV-SNP + +The location of the RMP is identified to the hardware through two MSRs:: + + 0xc0010132 (RMP_BASE): + System physical address of the first byte of the RMP + + 0xc0010133 (RMP_END): + System physical address of the last byte of the RMP + +Hardware requires that RMP_BASE and (RPM_END + 1) be 8KB aligned, but SEV +firmware increases the alignment requirement to require a 1MB alignment. + +The RMP consists of a 16KB region used for processor bookkeeping followed +by the RMP entries, which are 16 bytes in size. The size of the RMP +determines the range of physical memory that the hypervisor can assign to +SEV-SNP guests. The RMP covers the system physical address from:: + + 0 to ((RMP_END + 1 - RMP_BASE - 16KB) / 16B) x 4KB. + +The current Linux support relies on BIOS to allocate/reserve the memory for +the RMP and to set RMP_BASE and RMP_END appropriately. Linux uses the MSR +values to locate the RMP and determine the size of the RMP. The RMP must +cover all of system memory in order for Linux to enable SEV-SNP. + +Segmented RMP +------------- + +Segmented RMP support is a new way of representing the layout of an RMP. +Initial RMP support required the RMP table to be contiguous in memory. +RMP accesses from a NUMA node on which the RMP doesn't reside +can take longer than accesses from a NUMA node on which the RMP resides. +Segmented RMP support allows the RMP entries to be located on the same +node as the memory the RMP is covering, potentially reducing latency +associated with accessing an RMP entry associated with the memory. Each +RMP segment covers a specific range of system physical addresses. + +Support for this form of the RMP can be determined using the CPUID +instruction:: + + 0x8000001f[eax]: + Bit[23] indicates support for segmented RMP + +If supported, segmented RMP attributes can be found using the CPUID +instruction:: + + 0x80000025[eax]: + Bits[5:0] minimum supported RMP segment size + Bits[11:6] maximum supported RMP segment size + + 0x80000025[ebx]: + Bits[9:0] number of cacheable RMP segment definitions + Bit[10] indicates if the number of cacheable RMP segments + is a hard limit + +To enable a segmented RMP, a new MSR is available:: + + 0xc0010136 (RMP_CFG): + Bit[0] indicates if segmented RMP is enabled + Bits[13:8] contains the size of memory covered by an RMP + segment (expressed as a power of 2) + +The RMP segment size defined in the RMP_CFG MSR applies to all segments +of the RMP. Therefore each RMP segment covers a specific range of system +physical addresses. For example, if the RMP_CFG MSR value is 0x2401, then +the RMP segment coverage value is 0x24 => 36, meaning the size of memory +covered by an RMP segment is 64GB (1 << 36). So the first RMP segment +covers physical addresses from 0 to 0xF_FFFF_FFFF, the second RMP segment +covers physical addresses from 0x10_0000_0000 to 0x1F_FFFF_FFFF, etc. + +When a segmented RMP is enabled, RMP_BASE points to the RMP bookkeeping +area as it does today (16K in size). However, instead of RMP entries +beginning immediately after the bookkeeping area, there is a 4K RMP +segment table (RST). Each entry in the RST is 8-bytes in size and represents +an RMP segment:: + + Bits[19:0] mapped size (in GB) + The mapped size can be less than the defined segment size. + A value of zero, indicates that no RMP exists for the range + of system physical addresses associated with this segment. + Bits[51:20] segment physical address + This address is left shift 20-bits (or just masked when + read) to form the physical address of the segment (1MB + alignment). + +The RST can hold 512 segment entries but can be limited in size to the number +of cacheable RMP segments (CPUID 0x80000025_EBX[9:0]) if the number of cacheable +RMP segments is a hard limit (CPUID 0x80000025_EBX[10]). + +The current Linux support relies on BIOS to allocate/reserve the memory for +the segmented RMP (the bookkeeping area, RST, and all segments), build the RST +and to set RMP_BASE, RMP_END, and RMP_CFG appropriately. Linux uses the MSR +values to locate the RMP and determine the size and location of the RMP +segments. The RMP must cover all of system memory in order for Linux to enable +SEV-SNP. + +More details in the AMD64 APM Vol 2, section "15.36.3 Reverse Map Table", +docID: 24593. + +Secure VM Service Module (SVSM) +=============================== + +SNP provides a feature called Virtual Machine Privilege Levels (VMPL) which +defines four privilege levels at which guest software can run. The most +privileged level is 0 and numerically higher numbers have lesser privileges. +More details in the AMD64 APM Vol 2, section "15.35.7 Virtual Machine +Privilege Levels", docID: 24593. + +When using that feature, different services can run at different protection +levels, apart from the guest OS but still within the secure SNP environment. +They can provide services to the guest, like a vTPM, for example. + +When a guest is not running at VMPL0, it needs to communicate with the software +running at VMPL0 to perform privileged operations or to interact with secure +services. An example fur such a privileged operation is PVALIDATE which is +*required* to be executed at VMPL0. + +In this scenario, the software running at VMPL0 is usually called a Secure VM +Service Module (SVSM). Discovery of an SVSM and the API used to communicate +with it is documented in "Secure VM Service Module for SEV-SNP Guests", docID: +58019. + +(Latest versions of the above-mentioned documents can be found by using +a search engine like duckduckgo.com and typing in: + + site:amd.com "Secure VM Service Module for SEV-SNP Guests", docID: 58019 + +for example.) diff --git a/Documentation/arch/x86/amd_hsmp.rst b/Documentation/arch/x86/amd_hsmp.rst index c92bfd55359f..2fd917638e42 100644 --- a/Documentation/arch/x86/amd_hsmp.rst +++ b/Documentation/arch/x86/amd_hsmp.rst @@ -4,8 +4,9 @@ AMD HSMP interface ============================================ -Newer Fam19h EPYC server line of processors from AMD support system -management functionality via HSMP (Host System Management Port). +Newer Fam19h(model 0x00-0x1f, 0x30-0x3f, 0x90-0x9f, 0xa0-0xaf), +Fam1Ah(model 0x00-0x1f) EPYC server line of processors from AMD support +system management functionality via HSMP (Host System Management Port). The Host System Management Port (HSMP) is an interface to provide OS-level software with access to system management functions via a @@ -13,16 +14,28 @@ set of mailbox registers. More details on the interface can be found in chapter "7 Host System Management Port (HSMP)" of the family/model PPR -Eg: https://www.amd.com/system/files/TechDocs/55898_B1_pub_0.50.zip +Eg: https://www.amd.com/content/dam/amd/en/documents/epyc-technical-docs/programmer-references/55898_B1_pub_0_50.zip -HSMP interface is supported on EPYC server CPU models only. + +HSMP interface is supported on EPYC line of server CPUs and MI300A (APU). HSMP device ============================================ -amd_hsmp driver under the drivers/platforms/x86/ creates miscdevice -/dev/hsmp to let user space programs run hsmp mailbox commands. +amd_hsmp driver under drivers/platforms/x86/amd/hsmp/ has separate driver files +for ACPI object based probing, platform device based probing and for the common +code for these two drivers. + +Kconfig option CONFIG_AMD_HSMP_PLAT compiles plat.c and creates amd_hsmp.ko. +Kconfig option CONFIG_AMD_HSMP_ACPI compiles acpi.c and creates hsmp_acpi.ko. +Selecting any of these two configs automatically selects CONFIG_AMD_HSMP. This +compiles common code hsmp.c and creates hsmp_common.ko module. + +Both the ACPI and plat drivers create the miscdevice /dev/hsmp to let +user space programs run hsmp mailbox commands. + +The ACPI object format supported by the driver is defined below. $ ls -al /dev/hsmp crw-r--r-- 1 root root 10, 123 Jan 21 21:41 /dev/hsmp @@ -58,6 +71,51 @@ Note: lseek() is not supported as entire metrics table is read. Metrics table definitions will be documented as part of Public PPR. The same is defined in the amd_hsmp.h header. +ACPI device object format +========================= +The ACPI object format expected from the amd_hsmp driver +for socket with ID00 is given below:: + + Device(HSMP) + { + Name(_HID, "AMDI0097") + Name(_UID, "ID00") + Name(HSE0, 0x00000001) + Name(RBF0, ResourceTemplate() + { + Memory32Fixed(ReadWrite, 0xxxxxxx, 0x00100000) + }) + Method(_CRS, 0, NotSerialized) + { + Return(RBF0) + } + Method(_STA, 0, NotSerialized) + { + If(LEqual(HSE0, One)) + { + Return(0x0F) + } + Else + { + Return(Zero) + } + } + Name(_DSD, Package(2) + { + Buffer(0x10) + { + 0x9D, 0x61, 0x4D, 0xB7, 0x07, 0x57, 0xBD, 0x48, + 0xA6, 0x9F, 0x4E, 0xA2, 0x87, 0x1F, 0xC2, 0xF6 + }, + Package(3) + { + Package(2) {"MsgIdOffset", 0x00010934}, + Package(2) {"MsgRspOffset", 0x00010980}, + Package(2) {"MsgArgOffset", 0x000109E0} + } + }) + } + An example ========== @@ -97,8 +155,8 @@ what happened. The transaction returns 0 on success. More details on the interface and message definitions can be found in chapter "7 Host System Management Port (HSMP)" of the respective family/model PPR -eg: https://www.amd.com/system/files/TechDocs/55898_B1_pub_0.50.zip +eg: https://www.amd.com/content/dam/amd/en/documents/epyc-technical-docs/programmer-references/55898_B1_pub_0_50.zip User space C-APIs are made available by linking against the esmi library, -which is provided by the E-SMS project https://developer.amd.com/e-sms/. +which is provided by the E-SMS project https://www.amd.com/en/developer/e-sms.html. See: https://github.com/amd/esmi_ib_library diff --git a/Documentation/arch/x86/boot.rst b/Documentation/arch/x86/boot.rst index c513855a54bb..76f53d3450e7 100644 --- a/Documentation/arch/x86/boot.rst +++ b/Documentation/arch/x86/boot.rst @@ -77,7 +77,7 @@ Protocol 2.14 BURNT BY INCORRECT COMMIT Protocol 2.15 (Kernel 5.5) Added the kernel_info and kernel_info.setup_type_max. ============= ============================================================ - .. note:: +.. note:: The protocol version number should be changed only if the setup header is changed. There is no need to update the version number if boot_params or kernel_info are changed. Additionally, it is recommended to use @@ -95,27 +95,27 @@ Memory Layout The traditional memory map for the kernel loader, used for Image or zImage kernels, typically looks like:: - | | - 0A0000 +------------------------+ - | Reserved for BIOS | Do not use. Reserved for BIOS EBDA. - 09A000 +------------------------+ - | Command line | - | Stack/heap | For use by the kernel real-mode code. - 098000 +------------------------+ - | Kernel setup | The kernel real-mode code. - 090200 +------------------------+ - | Kernel boot sector | The kernel legacy boot sector. - 090000 +------------------------+ - | Protected-mode kernel | The bulk of the kernel image. - 010000 +------------------------+ - | Boot loader | <- Boot sector entry point 0000:7C00 - 001000 +------------------------+ - | Reserved for MBR/BIOS | - 000800 +------------------------+ - | Typically used by MBR | - 000600 +------------------------+ - | BIOS use only | - 000000 +------------------------+ + | | + 0A0000 +------------------------+ + | Reserved for BIOS | Do not use. Reserved for BIOS EBDA. + 09A000 +------------------------+ + | Command line | + | Stack/heap | For use by the kernel real-mode code. + 098000 +------------------------+ + | Kernel setup | The kernel real-mode code. + 090200 +------------------------+ + | Kernel boot sector | The kernel legacy boot sector. + 090000 +------------------------+ + | Protected-mode kernel | The bulk of the kernel image. + 010000 +------------------------+ + | Boot loader | <- Boot sector entry point 0000:7C00 + 001000 +------------------------+ + | Reserved for MBR/BIOS | + 000800 +------------------------+ + | Typically used by MBR | + 000600 +------------------------+ + | BIOS use only | + 000000 +------------------------+ When using bzImage, the protected-mode kernel was relocated to 0x100000 ("high memory"), and the kernel real-mode block (boot sector, @@ -142,28 +142,28 @@ above the 0x9A000 point; too many BIOSes will break above that point. For a modern bzImage kernel with boot protocol version >= 2.02, a memory layout like the following is suggested:: - ~ ~ - | Protected-mode kernel | - 100000 +------------------------+ - | I/O memory hole | - 0A0000 +------------------------+ - | Reserved for BIOS | Leave as much as possible unused - ~ ~ - | Command line | (Can also be below the X+10000 mark) - X+10000 +------------------------+ - | Stack/heap | For use by the kernel real-mode code. - X+08000 +------------------------+ - | Kernel setup | The kernel real-mode code. - | Kernel boot sector | The kernel legacy boot sector. - X +------------------------+ - | Boot loader | <- Boot sector entry point 0000:7C00 - 001000 +------------------------+ - | Reserved for MBR/BIOS | - 000800 +------------------------+ - | Typically used by MBR | - 000600 +------------------------+ - | BIOS use only | - 000000 +------------------------+ + ~ ~ + | Protected-mode kernel | + 100000 +------------------------+ + | I/O memory hole | + 0A0000 +------------------------+ + | Reserved for BIOS | Leave as much as possible unused + ~ ~ + | Command line | (Can also be below the X+10000 mark) + X+10000 +------------------------+ + | Stack/heap | For use by the kernel real-mode code. + X+08000 +------------------------+ + | Kernel setup | The kernel real-mode code. + | Kernel boot sector | The kernel legacy boot sector. + X +------------------------+ + | Boot loader | <- Boot sector entry point 0000:7C00 + 001000 +------------------------+ + | Reserved for MBR/BIOS | + 000800 +------------------------+ + | Typically used by MBR | + 000600 +------------------------+ + | BIOS use only | + 000000 +------------------------+ ... where the address X is as low as the design of the boot loader permits. @@ -229,22 +229,22 @@ Offset/Size Proto Name Meaning =========== ======== ===================== ============================================ .. note:: - (1) For backwards compatibility, if the setup_sects field contains 0, the - real value is 4. + (1) For backwards compatibility, if the setup_sects field contains 0, + the real value is 4. - (2) For boot protocol prior to 2.04, the upper two bytes of the syssize - field are unusable, which means the size of a bzImage kernel - cannot be determined. + (2) For boot protocol prior to 2.04, the upper two bytes of the syssize + field are unusable, which means the size of a bzImage kernel + cannot be determined. - (3) Ignored, but safe to set, for boot protocols 2.02-2.09. + (3) Ignored, but safe to set, for boot protocols 2.02-2.09. If the "HdrS" (0x53726448) magic number is not found at offset 0x202, the boot protocol version is "old". Loading an old kernel, the following parameters should be assumed:: - Image type = zImage - initrd not supported - Real-mode kernel must be located at 0x90000. + Image type = zImage + initrd not supported + Real-mode kernel must be located at 0x90000. Otherwise, the "version" field contains the protocol version, e.g. protocol version 2.01 will contain 0x0201 in this field. When @@ -265,7 +265,7 @@ All general purpose boot loaders should write the fields marked nonstandard address should fill in the fields marked (reloc); other boot loaders can ignore those fields. -The byte order of all fields is littleendian (this is x86, after all.) +The byte order of all fields is little endian (this is x86, after all.) ============ =========== Field name: setup_sects @@ -365,7 +365,7 @@ Offset/size: 0x206/2 Protocol: 2.00+ ============ ======= - Contains the boot protocol version, in (major << 8)+minor format, + Contains the boot protocol version, in (major << 8) + minor format, e.g. 0x0204 for version 2.04, and 0x0a11 for a hypothetical version 10.17. @@ -397,17 +397,17 @@ Protocol: 2.00+ If set to a nonzero value, contains a pointer to a NUL-terminated human-readable kernel version number string, less 0x200. This can be used to display the kernel version to the user. This value - should be less than (0x200*setup_sects). + should be less than (0x200 * setup_sects). For example, if this value is set to 0x1c00, the kernel version number string can be found at offset 0x1e00 in the kernel file. This is a valid value if and only if the "setup_sects" field contains the value 15 or higher, as:: - 0x1c00 < 15*0x200 (= 0x1e00) but - 0x1c00 >= 14*0x200 (= 0x1c00) + 0x1c00 < 15 * 0x200 (= 0x1e00) but + 0x1c00 >= 14 * 0x200 (= 0x1c00) - 0x1c00 >> 9 = 14, So the minimum value for setup_secs is 15. + 0x1c00 >> 9 = 14, So the minimum value for setup_secs is 15. ============ ================== Field name: type_of_loader @@ -427,9 +427,9 @@ Protocol: 2.00+ For example, for T = 0x15, V = 0x234, write:: - type_of_loader <- 0xE4 - ext_loader_type <- 0x05 - ext_loader_ver <- 0x23 + type_of_loader <- 0xE4 + ext_loader_type <- 0x05 + ext_loader_ver <- 0x23 Assigned boot loader ids (hexadecimal): @@ -686,7 +686,7 @@ Protocol: 2.10+ If a boot loader makes use of this field, it should update the kernel_alignment field with the alignment unit desired; typically:: - kernel_alignment = 1 << min_alignment + kernel_alignment = 1 << min_alignment; There may be a considerable performance cost with an excessively misaligned kernel. Therefore, a loader should typically try each @@ -754,7 +754,7 @@ Protocol: 2.07+ 0x00000000 The default x86/PC environment 0x00000001 lguest 0x00000002 Xen - 0x00000003 Moorestown MID + 0x00000003 Intel MID (Moorestown, CloverTrail, Merrifield, Moorefield) 0x00000004 CE4100 TV Platform ========== ============================== @@ -808,13 +808,13 @@ Protocol: 2.09+ parameters passing mechanism. The definition of struct setup_data is as follow:: - struct setup_data { - u64 next; - u32 type; - u32 len; - u8 data[0]; - }; - + struct setup_data { + __u64 next; + __u32 type; + __u32 len; + __u8 data[]; + } + Where, the next is a 64-bit physical pointer to the next node of linked list, the next field of the last node is 0; the type is used to identify the contents of data; the len is the length of data @@ -834,12 +834,12 @@ Protocol: 2.09+ Thus setup_indirect struct and SETUP_INDIRECT type were introduced in protocol 2.15:: - struct setup_indirect { - __u32 type; - __u32 reserved; /* Reserved, must be set to zero. */ - __u64 len; - __u64 addr; - }; + struct setup_indirect { + __u32 type; + __u32 reserved; /* Reserved, must be set to zero. */ + __u64 len; + __u64 addr; + }; The type member is a SETUP_INDIRECT | SETUP_* type. However, it cannot be SETUP_INDIRECT itself since making the setup_indirect a tree structure @@ -849,17 +849,17 @@ Protocol: 2.09+ Let's give an example how to point to SETUP_E820_EXT data using setup_indirect. In this case setup_data and setup_indirect will look like this:: - struct setup_data { - __u64 next = 0 or <addr_of_next_setup_data_struct>; - __u32 type = SETUP_INDIRECT; - __u32 len = sizeof(setup_indirect); - __u8 data[sizeof(setup_indirect)] = struct setup_indirect { - __u32 type = SETUP_INDIRECT | SETUP_E820_EXT; - __u32 reserved = 0; - __u64 len = <len_of_SETUP_E820_EXT_data>; - __u64 addr = <addr_of_SETUP_E820_EXT_data>; - } - } + struct setup_data { + .next = 0, /* or <addr_of_next_setup_data_struct> */ + .type = SETUP_INDIRECT, + .len = sizeof(setup_indirect), + .data[sizeof(setup_indirect)] = (struct setup_indirect) { + .type = SETUP_INDIRECT | SETUP_E820_EXT, + .reserved = 0, + .len = <len_of_SETUP_E820_EXT_data>, + .addr = <addr_of_SETUP_E820_EXT_data>, + }, + } .. note:: SETUP_INDIRECT | SETUP_NONE objects cannot be properly distinguished @@ -878,7 +878,8 @@ Protocol: 2.10+ address if possible. A non-relocatable kernel will unconditionally move itself and to run - at this address. + at this address. A relocatable kernel will move itself to this address if it + loaded below this address. ============ ======= Field name: init_size @@ -895,10 +896,19 @@ Offset/size: 0x260/4 The kernel runtime start address is determined by the following algorithm:: - if (relocatable_kernel) - runtime_start = align_up(load_address, kernel_alignment) - else - runtime_start = pref_address + if (relocatable_kernel) { + if (load_address < pref_address) + load_address = pref_address; + runtime_start = align_up(load_address, kernel_alignment); + } else { + runtime_start = pref_address; + } + +Hence the necessary memory window location and size can be estimated by +a boot loader as:: + + memory_window_start = runtime_start; + memory_window_size = init_size; ============ =============== Field name: handover_offset @@ -928,12 +938,12 @@ The kernel_info =============== The relationships between the headers are analogous to the various data -sections: +sections:: setup_header = .data boot_params/setup_data = .bss -What is missing from the above list? That's right: +What is missing from the above list? That's right:: kernel_info = .rodata @@ -965,22 +975,22 @@ after kernel_info_var_len_data label. Each chunk of variable size data has to be prefixed with header/magic and its size, e.g.:: kernel_info: - .ascii "LToP" /* Header, Linux top (structure). */ - .long kernel_info_var_len_data - kernel_info - .long kernel_info_end - kernel_info - .long 0x01234567 /* Some fixed size data for the bootloaders. */ + .ascii "LToP" /* Header, Linux top (structure). */ + .long kernel_info_var_len_data - kernel_info + .long kernel_info_end - kernel_info + .long 0x01234567 /* Some fixed size data for the bootloaders. */ kernel_info_var_len_data: - example_struct: /* Some variable size data for the bootloaders. */ - .ascii "0123" /* Header/Magic. */ - .long example_struct_end - example_struct - .ascii "Struct" - .long 0x89012345 + example_struct: /* Some variable size data for the bootloaders. */ + .ascii "0123" /* Header/Magic. */ + .long example_struct_end - example_struct + .ascii "Struct" + .long 0x89012345 example_struct_end: - example_strings: /* Some variable size data for the bootloaders. */ - .ascii "ABCD" /* Header/Magic. */ - .long example_strings_end - example_strings - .asciz "String_0" - .asciz "String_1" + example_strings: /* Some variable size data for the bootloaders. */ + .ascii "ABCD" /* Header/Magic. */ + .long example_strings_end - example_strings + .asciz "String_0" + .asciz "String_1" example_strings_end: kernel_info_end: @@ -1129,67 +1139,63 @@ mode segment. Such a boot loader should enter the following fields in the header:: - unsigned long base_ptr; /* base address for real-mode segment */ + unsigned long base_ptr; /* base address for real-mode segment */ - if ( setup_sects == 0 ) { - setup_sects = 4; - } + if (setup_sects == 0) + setup_sects = 4; - if ( protocol >= 0x0200 ) { - type_of_loader = <type code>; - if ( loading_initrd ) { - ramdisk_image = <initrd_address>; - ramdisk_size = <initrd_size>; - } + if (protocol >= 0x0200) { + type_of_loader = <type code>; + if (loading_initrd) { + ramdisk_image = <initrd_address>; + ramdisk_size = <initrd_size>; + } - if ( protocol >= 0x0202 && loadflags & 0x01 ) - heap_end = 0xe000; - else - heap_end = 0x9800; + if (protocol >= 0x0202 && loadflags & 0x01) + heap_end = 0xe000; + else + heap_end = 0x9800; - if ( protocol >= 0x0201 ) { - heap_end_ptr = heap_end - 0x200; - loadflags |= 0x80; /* CAN_USE_HEAP */ - } + if (protocol >= 0x0201) { + heap_end_ptr = heap_end - 0x200; + loadflags |= 0x80; /* CAN_USE_HEAP */ + } - if ( protocol >= 0x0202 ) { - cmd_line_ptr = base_ptr + heap_end; - strcpy(cmd_line_ptr, cmdline); - } else { - cmd_line_magic = 0xA33F; - cmd_line_offset = heap_end; - setup_move_size = heap_end + strlen(cmdline)+1; - strcpy(base_ptr+cmd_line_offset, cmdline); - } - } else { - /* Very old kernel */ + if (protocol >= 0x0202) { + cmd_line_ptr = base_ptr + heap_end; + strcpy(cmd_line_ptr, cmdline); + } else { + cmd_line_magic = 0xA33F; + cmd_line_offset = heap_end; + setup_move_size = heap_end + strlen(cmdline) + 1; + strcpy(base_ptr + cmd_line_offset, cmdline); + } + } else { + /* Very old kernel */ - heap_end = 0x9800; + heap_end = 0x9800; - cmd_line_magic = 0xA33F; - cmd_line_offset = heap_end; + cmd_line_magic = 0xA33F; + cmd_line_offset = heap_end; - /* A very old kernel MUST have its real-mode code - loaded at 0x90000 */ + /* A very old kernel MUST have its real-mode code loaded at 0x90000 */ + if (base_ptr != 0x90000) { + /* Copy the real-mode kernel */ + memcpy(0x90000, base_ptr, (setup_sects + 1) * 512); + base_ptr = 0x90000; /* Relocated */ + } - if ( base_ptr != 0x90000 ) { - /* Copy the real-mode kernel */ - memcpy(0x90000, base_ptr, (setup_sects+1)*512); - base_ptr = 0x90000; /* Relocated */ - } + strcpy(0x90000 + cmd_line_offset, cmdline); - strcpy(0x90000+cmd_line_offset, cmdline); - - /* It is recommended to clear memory up to the 32K mark */ - memset(0x90000 + (setup_sects+1)*512, 0, - (64-(setup_sects+1))*512); - } + /* It is recommended to clear memory up to the 32K mark */ + memset(0x90000 + (setup_sects + 1) * 512, 0, (64 - (setup_sects + 1)) * 512); + } Loading The Rest of The Kernel ============================== -The 32-bit (non-real-mode) kernel starts at offset (setup_sects+1)*512 +The 32-bit (non-real-mode) kernel starts at offset (setup_sects + 1) * 512 in the kernel file (again, if setup_sects == 0 the real value is 4.) It should be loaded at address 0x10000 for Image/zImage kernels and 0x100000 for bzImage kernels. @@ -1197,13 +1203,14 @@ It should be loaded at address 0x10000 for Image/zImage kernels and The kernel is a bzImage kernel if the protocol >= 2.00 and the 0x01 bit (LOAD_HIGH) in the loadflags field is set:: - is_bzImage = (protocol >= 0x0200) && (loadflags & 0x01); - load_address = is_bzImage ? 0x100000 : 0x10000; + is_bzImage = (protocol >= 0x0200) && (loadflags & 0x01); + load_address = is_bzImage ? 0x100000 : 0x10000; -Note that Image/zImage kernels can be up to 512K in size, and thus use -the entire 0x10000-0x90000 range of memory. This means it is pretty -much a requirement for these kernels to load the real-mode part at -0x90000. bzImage kernels allow much more flexibility. +.. note:: + Image/zImage kernels can be up to 512K in size, and thus use the entire + 0x10000-0x90000 range of memory. This means it is pretty much a + requirement for these kernels to load the real-mode part at 0x90000. + bzImage kernels allow much more flexibility. Special Command Line Options ============================ @@ -1272,19 +1279,20 @@ es = ss. In our example from above, we would do:: - /* Note: in the case of the "old" kernel protocol, base_ptr must - be == 0x90000 at this point; see the previous sample code */ + /* + * Note: in the case of the "old" kernel protocol, base_ptr must + * be == 0x90000 at this point; see the previous sample code. + */ + seg = base_ptr >> 4; - seg = base_ptr >> 4; + cli(); /* Enter with interrupts disabled! */ - cli(); /* Enter with interrupts disabled! */ + /* Set up the real-mode kernel stack */ + _SS = seg; + _SP = heap_end; - /* Set up the real-mode kernel stack */ - _SS = seg; - _SP = heap_end; - - _DS = _ES = _FS = _GS = seg; - jmp_far(seg+0x20, 0); /* Run the kernel */ + _DS = _ES = _FS = _GS = seg; + jmp_far(seg + 0x20, 0); /* Run the kernel */ If your boot sector accesses a floppy drive, it is recommended to switch off the floppy motor before running the kernel, since the @@ -1339,7 +1347,7 @@ from offset 0x01f1 of kernel image on should be loaded into struct boot_params and examined. The end of setup header can be calculated as follow:: - 0x0202 + byte value at offset 0x0201 + 0x0202 + byte value at offset 0x0201 In addition to read/modify/write the setup header of the struct boot_params as that of 16-bit boot protocol, the boot loader should @@ -1375,7 +1383,7 @@ Then, the setup header at offset 0x01f1 of kernel image on should be loaded into struct boot_params and examined. The end of setup header can be calculated as follows:: - 0x0202 + byte value at offset 0x0201 + 0x0202 + byte value at offset 0x0201 In addition to read/modify/write the setup header of the struct boot_params as that of 16-bit boot protocol, the boot loader should @@ -1417,7 +1425,7 @@ execution context provided by the EFI firmware. The function prototype for the handover entry point looks like this:: - efi_stub_entry(void *handle, efi_system_table_t *table, struct boot_params *bp) + void efi_stub_entry(void *handle, efi_system_table_t *table, struct boot_params *bp); 'handle' is the EFI image handle passed to the boot loader by the EFI firmware, 'table' is the EFI system table - these are the first two @@ -1432,12 +1440,13 @@ The boot loader *must* fill out the following fields in bp:: All other fields should be zero. -NOTE: The EFI Handover Protocol is deprecated in favour of the ordinary PE/COFF - entry point, combined with the LINUX_EFI_INITRD_MEDIA_GUID based initrd - loading protocol (refer to [0] for an example of the bootloader side of - this), which removes the need for any knowledge on the part of the EFI - bootloader regarding the internal representation of boot_params or any - requirements/limitations regarding the placement of the command line - and ramdisk in memory, or the placement of the kernel image itself. +.. note:: + The EFI Handover Protocol is deprecated in favour of the ordinary PE/COFF + entry point, combined with the LINUX_EFI_INITRD_MEDIA_GUID based initrd + loading protocol (refer to [0] for an example of the bootloader side of + this), which removes the need for any knowledge on the part of the EFI + bootloader regarding the internal representation of boot_params or any + requirements/limitations regarding the placement of the command line + and ramdisk in memory, or the placement of the kernel image itself. [0] https://github.com/u-boot/u-boot/commit/ec80b4735a593961fe701cc3a5d717d4739b0fd0 diff --git a/Documentation/arch/x86/buslock.rst b/Documentation/arch/x86/buslock.rst index 4c5a4822eeb7..31f1bfdff16f 100644 --- a/Documentation/arch/x86/buslock.rst +++ b/Documentation/arch/x86/buslock.rst @@ -26,7 +26,8 @@ Detection ========= Intel processors may support either or both of the following hardware -mechanisms to detect split locks and bus locks. +mechanisms to detect split locks and bus locks. Some AMD processors also +support bus lock detect. #AC exception for split lock detection -------------------------------------- diff --git a/Documentation/arch/x86/cpuinfo.rst b/Documentation/arch/x86/cpuinfo.rst index 8895784d4784..6ef426a52cdc 100644 --- a/Documentation/arch/x86/cpuinfo.rst +++ b/Documentation/arch/x86/cpuinfo.rst @@ -112,7 +112,7 @@ conditions are met, the features are enabled by the set_cpu_cap or setup_force_cpu_cap macros. For example, if bit 5 is set in MSR_IA32_CORE_CAPS, the feature X86_FEATURE_SPLIT_LOCK_DETECT will be enabled and "split_lock_detect" will be displayed. The flag "ring3mwait" will be -displayed only when running on INTEL_FAM6_XEON_PHI_[KNL|KNM] processors. +displayed only when running on INTEL_XEON_PHI_[KNL|KNM] processors. d: Flags can represent purely software features. ------------------------------------------------ diff --git a/Documentation/arch/x86/exception-tables.rst b/Documentation/arch/x86/exception-tables.rst index efde1fef4fbd..6e7177363f8f 100644 --- a/Documentation/arch/x86/exception-tables.rst +++ b/Documentation/arch/x86/exception-tables.rst @@ -297,7 +297,7 @@ vma occurs? c) execution continues at local label 2 (address of the instruction immediately after the faulting user access). -The steps 8a to 8c in a certain way emulate the faulting instruction. + The steps a to c above in a certain way emulate the faulting instruction. That's it, mostly. If you look at our example, you might ask why we set EAX to -EFAULT in the exception handler code. Well, the diff --git a/Documentation/arch/x86/mds.rst b/Documentation/arch/x86/mds.rst index c58c72362911..5a2e6c0ef04a 100644 --- a/Documentation/arch/x86/mds.rst +++ b/Documentation/arch/x86/mds.rst @@ -162,7 +162,7 @@ Mitigation points 3. It would take a large number of these precisely-timed NMIs to mount an actual attack. There's presumably not enough bandwidth. 4. The NMI in question occurs after a VERW, i.e. when user state is - restored and most interesting data is already scrubbed. Whats left + restored and most interesting data is already scrubbed. What's left is only the data that NMI touches, and that may or may not be of any interest. diff --git a/Documentation/arch/x86/pti.rst b/Documentation/arch/x86/pti.rst index e08d35177bc0..57e8392f61d3 100644 --- a/Documentation/arch/x86/pti.rst +++ b/Documentation/arch/x86/pti.rst @@ -26,9 +26,9 @@ comments in pti.c). This approach helps to ensure that side-channel attacks leveraging the paging structures do not function when PTI is enabled. It can be -enabled by setting CONFIG_PAGE_TABLE_ISOLATION=y at compile time. -Once enabled at compile-time, it can be disabled at boot with the -'nopti' or 'pti=' kernel parameters (see kernel-parameters.txt). +enabled by setting CONFIG_MITIGATION_PAGE_TABLE_ISOLATION=y at compile +time. Once enabled at compile-time, it can be disabled at boot with +the 'nopti' or 'pti=' kernel parameters (see kernel-parameters.txt). Page Table Management ===================== diff --git a/Documentation/arch/x86/resctrl.rst b/Documentation/arch/x86/resctrl.rst index a6279df64a9d..6768fc1fad16 100644 --- a/Documentation/arch/x86/resctrl.rst +++ b/Documentation/arch/x86/resctrl.rst @@ -45,7 +45,7 @@ mount options are: Enable code/data prioritization in L2 cache allocations. "mba_MBps": Enable the MBA Software Controller(mba_sc) to specify MBA - bandwidth in MBps + bandwidth in MiBps "debug": Make debug files accessible. Available debug files are annotated with "Available only with debug option". @@ -375,11 +375,25 @@ When monitoring is enabled all MON groups will also contain: all tasks in the group. In CTRL_MON groups these files provide the sum for all tasks in the CTRL_MON group and all tasks in MON groups. Please see example section for more details on usage. + On systems with Sub-NUMA Cluster (SNC) enabled there are extra + directories for each node (located within the "mon_L3_XX" directory + for the L3 cache they occupy). These are named "mon_sub_L3_YY" + where "YY" is the node number. "mon_hw_id": Available only with debug option. The identifier used by hardware for the monitor group. On x86 this is the RMID. +When the "mba_MBps" mount option is used all CTRL_MON groups will also contain: + +"mba_MBps_event": + Reading this file shows which memory bandwidth event is used + as input to the software feedback loop that keeps memory bandwidth + below the value specified in the schemata file. Writing the + name of one of the supported memory bandwidth events found in + /sys/fs/resctrl/info/L3_MON/mon_features changes the input + event. + Resource allocation rules ------------------------- @@ -446,6 +460,12 @@ during mkdir. max_threshold_occupancy is a user configurable value to determine the occupancy at which an RMID can be freed. +The mon_llc_occupancy_limbo tracepoint gives the precise occupancy in bytes +for a subset of RMID that are not immediately available for allocation. +This can't be relied on to produce output every second, it may be necessary +to attempt to create an empty monitor group to force an update. Output may +only be produced if creation of a control or monitor group fails. + Schemata files - general concepts --------------------------------- Each line in the file describes one resource. The line starts with @@ -478,6 +498,29 @@ if non-contiguous 1s value is supported. On a system with a 20-bit mask each bit represents 5% of the capacity of the cache. You could partition the cache into four equal parts with masks: 0x1f, 0x3e0, 0x7c00, 0xf8000. +Notes on Sub-NUMA Cluster mode +============================== +When SNC mode is enabled, Linux may load balance tasks between Sub-NUMA +nodes much more readily than between regular NUMA nodes since the CPUs +on Sub-NUMA nodes share the same L3 cache and the system may report +the NUMA distance between Sub-NUMA nodes with a lower value than used +for regular NUMA nodes. + +The top-level monitoring files in each "mon_L3_XX" directory provide +the sum of data across all SNC nodes sharing an L3 cache instance. +Users who bind tasks to the CPUs of a specific Sub-NUMA node can read +the "llc_occupancy", "mbm_total_bytes", and "mbm_local_bytes" in the +"mon_sub_L3_YY" directories to get node local data. + +Memory bandwidth allocation is still performed at the L3 cache +level. I.e. throttling controls are applied to all SNC nodes. + +L3 cache allocation bitmaps also apply to all SNC nodes. But note that +the amount of L3 cache represented by each bit is divided by the number +of SNC nodes per L3 cache. E.g. with a 100MB cache on a system with 10-bit +allocation masks each bit normally represents 10MB. With SNC mode enabled +with two SNC nodes per L3 cache, each bit only represents 5MB. + Memory bandwidth Allocation and monitoring ========================================== @@ -526,7 +569,7 @@ threads start using more cores in an rdtgroup, the actual bandwidth may increase or vary although user specified bandwidth percentage is same. In order to mitigate this and make the interface more user friendly, -resctrl added support for specifying the bandwidth in MBps as well. The +resctrl added support for specifying the bandwidth in MiBps as well. The kernel underneath would use a software feedback mechanism or a "Software Controller(mba_sc)" which reads the actual bandwidth using MBM counters and adjust the memory bandwidth percentages to ensure:: @@ -573,13 +616,13 @@ Memory b/w domain is L3 cache. MB:<cache_id0>=bandwidth0;<cache_id1>=bandwidth1;... -Memory bandwidth Allocation specified in MBps ---------------------------------------------- +Memory bandwidth Allocation specified in MiBps +---------------------------------------------- Memory bandwidth domain is L3 cache. :: - MB:<cache_id0>=bw_MBps0;<cache_id1>=bw_MBps1;... + MB:<cache_id0>=bw_MiBps0;<cache_id1>=bw_MiBps1;... Slow Memory Bandwidth Allocation (SMBA) --------------------------------------- diff --git a/Documentation/arch/x86/sva.rst b/Documentation/arch/x86/sva.rst index 33cb05005982..6a759984d471 100644 --- a/Documentation/arch/x86/sva.rst +++ b/Documentation/arch/x86/sva.rst @@ -25,7 +25,7 @@ to cache translations for virtual addresses. The IOMMU driver uses the mmu_notifier() support to keep the device TLB cache and the CPU cache in sync. When an ATS lookup fails for a virtual address, the device should use the PRI in order to request the virtual address to be paged into the -CPU page tables. The device must use ATS again in order the fetch the +CPU page tables. The device must use ATS again in order to fetch the translation before use. Shared Hardware Workqueues @@ -216,7 +216,7 @@ submitting work and processing completions. Single Root I/O Virtualization (SR-IOV) focuses on providing independent hardware interfaces for virtualizing hardware. Hence, it's required to be -almost fully functional interface to software supporting the traditional +an almost fully functional interface to software supporting the traditional BARs, space for interrupts via MSI-X, its own register layout. Virtual Functions (VFs) are assisted by the Physical Function (PF) driver. diff --git a/Documentation/arch/x86/topology.rst b/Documentation/arch/x86/topology.rst index 08ebf9edbfc1..c12837e61bda 100644 --- a/Documentation/arch/x86/topology.rst +++ b/Documentation/arch/x86/topology.rst @@ -47,17 +47,21 @@ AMD nomenclature for package is 'Node'. Package-related topology information in the kernel: - - cpuinfo_x86.x86_max_cores: + - topology_num_threads_per_package() - The number of cores in a package. This information is retrieved via CPUID. + The number of threads in a package. - - cpuinfo_x86.x86_max_dies: + - topology_num_cores_per_package() - The number of dies in a package. This information is retrieved via CPUID. + The number of cores in a package. + + - topology_max_dies_per_package() + + The maximum number of dies in a package. - cpuinfo_x86.topo.die_id: - The physical ID of the die. This information is retrieved via CPUID. + The physical ID of the die. - cpuinfo_x86.topo.pkg_id: @@ -96,16 +100,6 @@ are SMT- or CMT-type threads. AMDs nomenclature for a CMT core is "Compute Unit". The kernel always uses "core". -Core-related topology information in the kernel: - - - smp_num_siblings: - - The number of threads in a core. The number of threads in a package can be - calculated by:: - - threads_per_package = cpuinfo_x86.x86_max_cores * smp_num_siblings - - Threads ======= A thread is a single scheduling unit. It's the equivalent to a logical Linux @@ -141,6 +135,10 @@ Thread-related topology information in the kernel: The ID of the core to which a thread belongs. It is also printed in /proc/cpuinfo "core_id." + - topology_logical_core_id(); + + The logical core ID to which a thread belongs. + System topology examples diff --git a/Documentation/arch/x86/x86_64/boot-options.rst b/Documentation/arch/x86/x86_64/boot-options.rst deleted file mode 100644 index 137432d34109..000000000000 --- a/Documentation/arch/x86/x86_64/boot-options.rst +++ /dev/null @@ -1,319 +0,0 @@ -.. SPDX-License-Identifier: GPL-2.0 - -=========================== -AMD64 Specific Boot Options -=========================== - -There are many others (usually documented in driver documentation), but -only the AMD64 specific ones are listed here. - -Machine check -============= -Please see Documentation/arch/x86/x86_64/machinecheck.rst for sysfs runtime tunables. - - mce=off - Disable machine check - mce=no_cmci - Disable CMCI(Corrected Machine Check Interrupt) that - Intel processor supports. Usually this disablement is - not recommended, but it might be handy if your hardware - is misbehaving. - Note that you'll get more problems without CMCI than with - due to the shared banks, i.e. you might get duplicated - error logs. - mce=dont_log_ce - Don't make logs for corrected errors. All events reported - as corrected are silently cleared by OS. - This option will be useful if you have no interest in any - of corrected errors. - mce=ignore_ce - Disable features for corrected errors, e.g. polling timer - and CMCI. All events reported as corrected are not cleared - by OS and remained in its error banks. - Usually this disablement is not recommended, however if - there is an agent checking/clearing corrected errors - (e.g. BIOS or hardware monitoring applications), conflicting - with OS's error handling, and you cannot deactivate the agent, - then this option will be a help. - mce=no_lmce - Do not opt-in to Local MCE delivery. Use legacy method - to broadcast MCEs. - mce=bootlog - Enable logging of machine checks left over from booting. - Disabled by default on AMD Fam10h and older because some BIOS - leave bogus ones. - If your BIOS doesn't do that it's a good idea to enable though - to make sure you log even machine check events that result - in a reboot. On Intel systems it is enabled by default. - mce=nobootlog - Disable boot machine check logging. - mce=monarchtimeout (number) - monarchtimeout: - Sets the time in us to wait for other CPUs on machine checks. 0 - to disable. - mce=bios_cmci_threshold - Don't overwrite the bios-set CMCI threshold. This boot option - prevents Linux from overwriting the CMCI threshold set by the - bios. Without this option, Linux always sets the CMCI - threshold to 1. Enabling this may make memory predictive failure - analysis less effective if the bios sets thresholds for memory - errors since we will not see details for all errors. - mce=recovery - Force-enable recoverable machine check code paths - - nomce (for compatibility with i386) - same as mce=off - - Everything else is in sysfs now. - -APICs -===== - - apic - Use IO-APIC. Default - - noapic - Don't use the IO-APIC. - - disableapic - Don't use the local APIC - - nolapic - Don't use the local APIC (alias for i386 compatibility) - - pirq=... - See Documentation/arch/x86/i386/IO-APIC.rst - - noapictimer - Don't set up the APIC timer - - no_timer_check - Don't check the IO-APIC timer. This can work around - problems with incorrect timer initialization on some boards. - - apicpmtimer - Do APIC timer calibration using the pmtimer. Implies - apicmaintimer. Useful when your PIT timer is totally broken. - -Timing -====== - - notsc - Deprecated, use tsc=unstable instead. - - nohpet - Don't use the HPET timer. - -Idle loop -========= - - idle=poll - Don't do power saving in the idle loop using HLT, but poll for rescheduling - event. This will make the CPUs eat a lot more power, but may be useful - to get slightly better performance in multiprocessor benchmarks. It also - makes some profiling using performance counters more accurate. - Please note that on systems with MONITOR/MWAIT support (like Intel EM64T - CPUs) this option has no performance advantage over the normal idle loop. - It may also interact badly with hyperthreading. - -Rebooting -========= - - reboot=b[ios] | t[riple] | k[bd] | a[cpi] | e[fi] | p[ci] [, [w]arm | [c]old] - bios - Use the CPU reboot vector for warm reset - warm - Don't set the cold reboot flag - cold - Set the cold reboot flag - triple - Force a triple fault (init) - kbd - Use the keyboard controller. cold reset (default) - acpi - Use the ACPI RESET_REG in the FADT. If ACPI is not configured or - the ACPI reset does not work, the reboot path attempts the reset - using the keyboard controller. - efi - Use efi reset_system runtime service. If EFI is not configured or - the EFI reset does not work, the reboot path attempts the reset using - the keyboard controller. - pci - Use a write to the PCI config space register 0xcf9 to trigger reboot. - - Using warm reset will be much faster especially on big memory - systems because the BIOS will not go through the memory check. - Disadvantage is that not all hardware will be completely reinitialized - on reboot so there may be boot problems on some systems. - - reboot=force - Don't stop other CPUs on reboot. This can make reboot more reliable - in some cases. - - reboot=default - There are some built-in platform specific "quirks" - you may see: - "reboot: <name> series board detected. Selecting <type> for reboots." - In the case where you think the quirk is in error (e.g. you have - newer BIOS, or newer board) using this option will ignore the built-in - quirk table, and use the generic default reboot actions. - -NUMA -==== - - numa=off - Only set up a single NUMA node spanning all memory. - - numa=noacpi - Don't parse the SRAT table for NUMA setup - - numa=nohmat - Don't parse the HMAT table for NUMA setup, or soft-reserved memory - partitioning. - - numa=fake=<size>[MG] - If given as a memory unit, fills all system RAM with nodes of - size interleaved over physical nodes. - - numa=fake=<N> - If given as an integer, fills all system RAM with N fake nodes - interleaved over physical nodes. - - numa=fake=<N>U - If given as an integer followed by 'U', it will divide each - physical node into N emulated nodes. - -ACPI -==== - - acpi=off - Don't enable ACPI - acpi=ht - Use ACPI boot table parsing, but don't enable ACPI interpreter - acpi=force - Force ACPI on (currently not needed) - acpi=strict - Disable out of spec ACPI workarounds. - acpi_sci={edge,level,high,low} - Set up ACPI SCI interrupt. - acpi=noirq - Don't route interrupts - acpi=nocmcff - Disable firmware first mode for corrected errors. This - disables parsing the HEST CMC error source to check if - firmware has set the FF flag. This may result in - duplicate corrected error reports. - -PCI -=== - - pci=off - Don't use PCI - pci=conf1 - Use conf1 access. - pci=conf2 - Use conf2 access. - pci=rom - Assign ROMs. - pci=assign-busses - Assign busses - pci=irqmask=MASK - Set PCI interrupt mask to MASK - pci=lastbus=NUMBER - Scan up to NUMBER busses, no matter what the mptable says. - pci=noacpi - Don't use ACPI to set up PCI interrupt routing. - -IOMMU (input/output memory management unit) -=========================================== -Multiple x86-64 PCI-DMA mapping implementations exist, for example: - - 1. <kernel/dma/direct.c>: use no hardware/software IOMMU at all - (e.g. because you have < 3 GB memory). - Kernel boot message: "PCI-DMA: Disabling IOMMU" - - 2. <arch/x86/kernel/amd_gart_64.c>: AMD GART based hardware IOMMU. - Kernel boot message: "PCI-DMA: using GART IOMMU" - - 3. <arch/x86_64/kernel/pci-swiotlb.c> : Software IOMMU implementation. Used - e.g. if there is no hardware IOMMU in the system and it is need because - you have >3GB memory or told the kernel to us it (iommu=soft)) - Kernel boot message: "PCI-DMA: Using software bounce buffering - for IO (SWIOTLB)" - -:: - - iommu=[<size>][,noagp][,off][,force][,noforce] - [,memaper[=<order>]][,merge][,fullflush][,nomerge] - [,noaperture] - -General iommu options: - - off - Don't initialize and use any kind of IOMMU. - noforce - Don't force hardware IOMMU usage when it is not needed. (default). - force - Force the use of the hardware IOMMU even when it is - not actually needed (e.g. because < 3 GB memory). - soft - Use software bounce buffering (SWIOTLB) (default for - Intel machines). This can be used to prevent the usage - of an available hardware IOMMU. - -iommu options only relevant to the AMD GART hardware IOMMU: - - <size> - Set the size of the remapping area in bytes. - allowed - Overwrite iommu off workarounds for specific chipsets. - fullflush - Flush IOMMU on each allocation (default). - nofullflush - Don't use IOMMU fullflush. - memaper[=<order>] - Allocate an own aperture over RAM with size 32MB<<order. - (default: order=1, i.e. 64MB) - merge - Do scatter-gather (SG) merging. Implies "force" (experimental). - nomerge - Don't do scatter-gather (SG) merging. - noaperture - Ask the IOMMU not to touch the aperture for AGP. - noagp - Don't initialize the AGP driver and use full aperture. - panic - Always panic when IOMMU overflows. - -iommu options only relevant to the software bounce buffering (SWIOTLB) IOMMU -implementation: - - swiotlb=<slots>[,force,noforce] - <slots> - Prereserve that many 2K slots for the software IO bounce buffering. - force - Force all IO through the software TLB. - noforce - Do not initialize the software TLB. - - -Miscellaneous -============= - - nogbpages - Do not use GB pages for kernel direct mappings. - gbpages - Use GB pages for kernel direct mappings. - - -AMD SEV (Secure Encrypted Virtualization) -========================================= -Options relating to AMD SEV, specified via the following format: - -:: - - sev=option1[,option2] - -The available options are: - - debug - Enable debug messages. diff --git a/Documentation/arch/x86/x86_64/fake-numa-for-cpusets.rst b/Documentation/arch/x86/x86_64/fake-numa-for-cpusets.rst index ba74617d4999..970ee94eb551 100644 --- a/Documentation/arch/x86/x86_64/fake-numa-for-cpusets.rst +++ b/Documentation/arch/x86/x86_64/fake-numa-for-cpusets.rst @@ -18,7 +18,7 @@ For more information on the features of cpusets, see Documentation/admin-guide/cgroup-v1/cpusets.rst. There are a number of different configurations you can use for your needs. For more information on the numa=fake command line option and its various ways of -configuring fake nodes, see Documentation/arch/x86/x86_64/boot-options.rst. +configuring fake nodes, see Documentation/admin-guide/kernel-parameters.txt For the purposes of this introduction, we'll assume a very primitive NUMA emulation setup of "numa=fake=4*512,". This will split our system memory into diff --git a/Documentation/arch/x86/x86_64/fred.rst b/Documentation/arch/x86/x86_64/fred.rst new file mode 100644 index 000000000000..9f57e7b91f7e --- /dev/null +++ b/Documentation/arch/x86/x86_64/fred.rst @@ -0,0 +1,96 @@ +.. SPDX-License-Identifier: GPL-2.0 + +========================================= +Flexible Return and Event Delivery (FRED) +========================================= + +Overview +======== + +The FRED architecture defines simple new transitions that change +privilege level (ring transitions). The FRED architecture was +designed with the following goals: + +1) Improve overall performance and response time by replacing event + delivery through the interrupt descriptor table (IDT event + delivery) and event return by the IRET instruction with lower + latency transitions. + +2) Improve software robustness by ensuring that event delivery + establishes the full supervisor context and that event return + establishes the full user context. + +The new transitions defined by the FRED architecture are FRED event +delivery and, for returning from events, two FRED return instructions. +FRED event delivery can effect a transition from ring 3 to ring 0, but +it is used also to deliver events incident to ring 0. One FRED +instruction (ERETU) effects a return from ring 0 to ring 3, while the +other (ERETS) returns while remaining in ring 0. Collectively, FRED +event delivery and the FRED return instructions are FRED transitions. + +In addition to these transitions, the FRED architecture defines a new +instruction (LKGS) for managing the state of the GS segment register. +The LKGS instruction can be used by 64-bit operating systems that do +not use the new FRED transitions. + +Furthermore, the FRED architecture is easy to extend for future CPU +architectures. + +Software based event dispatching +================================ + +FRED operates differently from IDT in terms of event handling. Instead +of directly dispatching an event to its handler based on the event +vector, FRED requires the software to dispatch an event to its handler +based on both the event's type and vector. Therefore, an event dispatch +framework must be implemented to facilitate the event-to-handler +dispatch process. The FRED event dispatch framework takes control +once an event is delivered, and employs a two-level dispatch. + +The first level dispatching is event type based, and the second level +dispatching is event vector based. + +Full supervisor/user context +============================ + +FRED event delivery atomically save and restore full supervisor/user +context upon event delivery and return. Thus it avoids the problem of +transient states due to %cr2 and/or %dr6, and it is no longer needed +to handle all the ugly corner cases caused by half baked entry states. + +FRED allows explicit unblock of NMI with new event return instructions +ERETS/ERETU, avoiding the mess caused by IRET which unconditionally +unblocks NMI, e.g., when an exception happens during NMI handling. + +FRED always restores the full value of %rsp, thus ESPFIX is no longer +needed when FRED is enabled. + +LKGS +==== + +LKGS behaves like the MOV to GS instruction except that it loads the +base address into the IA32_KERNEL_GS_BASE MSR instead of the GS +segment’s descriptor cache. With LKGS, it ends up with avoiding +mucking with kernel GS, i.e., an operating system can always operate +with its own GS base address. + +Because FRED event delivery from ring 3 and ERETU both swap the value +of the GS base address and that of the IA32_KERNEL_GS_BASE MSR, plus +the introduction of LKGS instruction, the SWAPGS instruction is no +longer needed when FRED is enabled, thus is disallowed (#UD). + +Stack levels +============ + +4 stack levels 0~3 are introduced to replace the nonreentrant IST for +event handling, and each stack level should be configured to use a +dedicated stack. + +The current stack level could be unchanged or go higher upon FRED +event delivery. If unchanged, the CPU keeps using the current event +stack. If higher, the CPU switches to a new event stack specified by +the MSR of the new stack level, i.e., MSR_IA32_FRED_RSP[123]. + +Only execution of a FRED return instruction ERET[US], could lower the +current stack level, causing the CPU to switch back to the stack it was +on before a previous event delivery that promoted the stack level. diff --git a/Documentation/arch/x86/x86_64/fsgs.rst b/Documentation/arch/x86/x86_64/fsgs.rst index 50960e09e1f6..d07e445dac5c 100644 --- a/Documentation/arch/x86/x86_64/fsgs.rst +++ b/Documentation/arch/x86/x86_64/fsgs.rst @@ -125,7 +125,7 @@ FSGSBASE instructions enablement FSGSBASE instructions compiler support ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ -GCC version 4.6.4 and newer provide instrinsics for the FSGSBASE +GCC version 4.6.4 and newer provide intrinsics for the FSGSBASE instructions. Clang 5 supports them as well. =================== =========================== @@ -135,7 +135,7 @@ instructions. Clang 5 supports them as well. _writegsbase_u64() Write the GS base register =================== =========================== -To utilize these instrinsics <immintrin.h> must be included in the source +To utilize these intrinsics <immintrin.h> must be included in the source code and the compiler option -mfsgsbase has to be added. Compiler support for FS/GS based addressing diff --git a/Documentation/arch/x86/x86_64/index.rst b/Documentation/arch/x86/x86_64/index.rst index a56070fc8e77..a0261957a08a 100644 --- a/Documentation/arch/x86/x86_64/index.rst +++ b/Documentation/arch/x86/x86_64/index.rst @@ -7,7 +7,6 @@ x86_64 Support .. toctree:: :maxdepth: 2 - boot-options uefi mm 5level-paging @@ -15,3 +14,4 @@ x86_64 Support cpu-hotplug-spec machinecheck fsgs + fred diff --git a/Documentation/arch/x86/x86_64/mm.rst b/Documentation/arch/x86/x86_64/mm.rst index 35e5e18c83d0..f2db178b353f 100644 --- a/Documentation/arch/x86/x86_64/mm.rst +++ b/Documentation/arch/x86/x86_64/mm.rst @@ -29,15 +29,27 @@ Complete virtual memory map with 4-level page tables Start addr | Offset | End addr | Size | VM area description ======================================================================================================================== | | | | - 0000000000000000 | 0 | 00007fffffffffff | 128 TB | user-space virtual memory, different per mm + 0000000000000000 | 0 | 00007fffffffefff | ~128 TB | user-space virtual memory, different per mm + 00007ffffffff000 | ~128 TB | 00007fffffffffff | 4 kB | ... guard hole __________________|____________|__________________|_________|___________________________________________________________ | | | | - 0000800000000000 | +128 TB | ffff7fffffffffff | ~16M TB | ... huge, almost 64 bits wide hole of non-canonical - | | | | virtual memory addresses up to the -128 TB + 0000800000000000 | +128 TB | 7fffffffffffffff | ~8 EB | ... huge, almost 63 bits wide hole of non-canonical + | | | | virtual memory addresses up to the -8 EB | | | | starting offset of kernel mappings. + | | | | + | | | | LAM relaxes canonicallity check allowing to create aliases + | | | | for userspace memory here. __________________|____________|__________________|_________|___________________________________________________________ | | Kernel-space virtual memory, shared between all processes: + __________________|____________|__________________|_________|___________________________________________________________ + | | | | + 8000000000000000 | -8 EB | ffff7fffffffffff | ~8 EB | ... huge, almost 63 bits wide hole of non-canonical + | | | | virtual memory addresses up to the -128 TB + | | | | starting offset of kernel mappings. + | | | | + | | | | LAM_SUP relaxes canonicallity check allowing to create + | | | | aliases for kernel memory here. ____________________________________________________________|___________________________________________________________ | | | | ffff800000000000 | -128 TB | ffff87ffffffffff | 8 TB | ... guard hole, also reserved for hypervisor @@ -88,16 +100,27 @@ Complete virtual memory map with 5-level page tables Start addr | Offset | End addr | Size | VM area description ======================================================================================================================== | | | | - 0000000000000000 | 0 | 00ffffffffffffff | 64 PB | user-space virtual memory, different per mm + 0000000000000000 | 0 | 00fffffffffff000 | ~64 PB | user-space virtual memory, different per mm + 00fffffffffff000 | ~64 PB | 00ffffffffffffff | 4 kB | ... guard hole __________________|____________|__________________|_________|___________________________________________________________ | | | | - 0100000000000000 | +64 PB | feffffffffffffff | ~16K PB | ... huge, still almost 64 bits wide hole of non-canonical - | | | | virtual memory addresses up to the -64 PB + 0100000000000000 | +64 PB | 7fffffffffffffff | ~8 EB | ... huge, almost 63 bits wide hole of non-canonical + | | | | virtual memory addresses up to the -8EB TB | | | | starting offset of kernel mappings. + | | | | + | | | | LAM relaxes canonicallity check allowing to create aliases + | | | | for userspace memory here. __________________|____________|__________________|_________|___________________________________________________________ | | Kernel-space virtual memory, shared between all processes: ____________________________________________________________|___________________________________________________________ + 8000000000000000 | -8 EB | feffffffffffffff | ~8 EB | ... huge, almost 63 bits wide hole of non-canonical + | | | | virtual memory addresses up to the -64 PB + | | | | starting offset of kernel mappings. + | | | | + | | | | LAM_SUP relaxes canonicallity check allowing to create + | | | | aliases for kernel memory here. + ____________________________________________________________|___________________________________________________________ | | | | ff00000000000000 | -64 PB | ff0fffffffffffff | 4 PB | ... guard hole, also reserved for hypervisor ff10000000000000 | -60 PB | ff10ffffffffffff | 0.25 PB | LDT remap for PTI diff --git a/Documentation/arch/x86/x86_64/uefi.rst b/Documentation/arch/x86/x86_64/uefi.rst index fbc30c9a071d..e84592dbd6c1 100644 --- a/Documentation/arch/x86/x86_64/uefi.rst +++ b/Documentation/arch/x86/x86_64/uefi.rst @@ -12,14 +12,20 @@ with EFI firmware and specifications are listed below. 1. UEFI specification: http://www.uefi.org -2. Booting Linux kernel on UEFI x86_64 platform requires bootloader - support. Elilo with x86_64 support can be used. +2. Booting Linux kernel on UEFI x86_64 platform can either be + done using the <Documentation/admin-guide/efi-stub.rst> or using a + separate bootloader. 3. x86_64 platform with EFI/UEFI firmware. Mechanics --------- +Refer to <Documentation/admin-guide/efi-stub.rst> to learn how to use the EFI stub. + +Below are general EFI setup guidelines on the x86_64 platform, +regardless of whether you use the EFI stub or a separate bootloader. + - Build the kernel with the following configuration:: CONFIG_FB_EFI=y @@ -31,16 +37,27 @@ Mechanics CONFIG_EFI=y CONFIG_EFIVAR_FS=y or m # optional -- Create a VFAT partition on the disk -- Copy the following to the VFAT partition: +- Create a VFAT partition on the disk with the EFI System flag + You can do this with fdisk with the following commands: + + 1. g - initialize a GPT partition table + 2. n - create a new partition + 3. t - change the partition type to "EFI System" (number 1) + 4. w - write and save the changes + + Afterwards, initialize the VFAT filesystem by running mkfs:: + + mkfs.fat /dev/<your-partition> + +- Copy the boot files to the VFAT partition: + If you use the EFI stub method, the kernel acts also as an EFI executable. + + You can just copy the bzImage to the EFI/boot/bootx64.efi path on the partition + so that it will automatically get booted, see the <Documentation/admin-guide/efi-stub.rst> page + for additional instructions regarding passage of kernel parameters and initramfs. - elilo bootloader with x86_64 support, elilo configuration file, - kernel image built in first step and corresponding - initrd. Instructions on building elilo and its dependencies - can be found in the elilo sourceforge project. + If you use a custom bootloader, refer to the relevant documentation for help on this part. -- Boot to EFI shell and invoke elilo choosing the kernel image built - in first step. - If some or all EFI runtime services don't work, you can try following kernel command line parameters to turn off some or all EFI runtime services. diff --git a/Documentation/arch/x86/xstate.rst b/Documentation/arch/x86/xstate.rst index ae5c69e48b11..cec05ac464c1 100644 --- a/Documentation/arch/x86/xstate.rst +++ b/Documentation/arch/x86/xstate.rst @@ -138,7 +138,7 @@ Note this example does not include the sigaltstack preparation. Dynamic features in signal frames --------------------------------- -Dynamcally enabled features are not written to the signal frame upon signal +Dynamically enabled features are not written to the signal frame upon signal entry if the feature is in its initial configuration. This differs from non-dynamic features which are always written regardless of their configuration. Signal handlers can examine the XSAVE buffer's XSTATE_BV |