Diffstat (limited to 'Documentation/arch/x86')
-rw-r--r--  Documentation/arch/x86/amd-memory-encryption.rst  163
-rw-r--r--  Documentation/arch/x86/amd_hsmp.rst  74
-rw-r--r--  Documentation/arch/x86/boot.rst  371
-rw-r--r--  Documentation/arch/x86/buslock.rst  3
-rw-r--r--  Documentation/arch/x86/cpuinfo.rst  2
-rw-r--r--  Documentation/arch/x86/exception-tables.rst  2
-rw-r--r--  Documentation/arch/x86/mds.rst  2
-rw-r--r--  Documentation/arch/x86/pti.rst  6
-rw-r--r--  Documentation/arch/x86/resctrl.rst  53
-rw-r--r--  Documentation/arch/x86/sva.rst  4
-rw-r--r--  Documentation/arch/x86/topology.rst  28
-rw-r--r--  Documentation/arch/x86/x86_64/boot-options.rst  319
-rw-r--r--  Documentation/arch/x86/x86_64/fake-numa-for-cpusets.rst  2
-rw-r--r--  Documentation/arch/x86/x86_64/fred.rst  96
-rw-r--r--  Documentation/arch/x86/x86_64/fsgs.rst  4
-rw-r--r--  Documentation/arch/x86/x86_64/index.rst  2
-rw-r--r--  Documentation/arch/x86/x86_64/mm.rst  35
-rw-r--r--  Documentation/arch/x86/x86_64/uefi.rst  37
-rw-r--r--  Documentation/arch/x86/xstate.rst  2
19 files changed, 638 insertions, 567 deletions
diff --git a/Documentation/arch/x86/amd-memory-encryption.rst b/Documentation/arch/x86/amd-memory-encryption.rst
index 07caa8fff852..bd840df708ea 100644
--- a/Documentation/arch/x86/amd-memory-encryption.rst
+++ b/Documentation/arch/x86/amd-memory-encryption.rst
@@ -87,14 +87,14 @@ The state of SME in the Linux kernel can be documented as follows:
kernel is non-zero).
SME can also be enabled and activated in the BIOS. If SME is enabled and
-activated in the BIOS, then all memory accesses will be encrypted and it will
-not be necessary to activate the Linux memory encryption support. If the BIOS
-merely enables SME (sets bit 23 of the MSR_AMD64_SYSCFG), then Linux can activate
-memory encryption by default (CONFIG_AMD_MEM_ENCRYPT_ACTIVE_BY_DEFAULT=y) or
-by supplying mem_encrypt=on on the kernel command line. However, if BIOS does
-not enable SME, then Linux will not be able to activate memory encryption, even
-if configured to do so by default or the mem_encrypt=on command line parameter
-is specified.
+activated in the BIOS, then all memory accesses will be encrypted and it
+will not be necessary to activate the Linux memory encryption support.
+
+If the BIOS merely enables SME (sets bit 23 of the MSR_AMD64_SYSCFG),
+then memory encryption can be enabled by supplying mem_encrypt=on on the
+kernel command line. However, if the BIOS does not enable SME, then Linux
+will not be able to activate memory encryption, even if configured to do
+so by default or if the mem_encrypt=on command line parameter is specified.
Secure Nested Paging (SNP)
==========================
@@ -130,4 +130,149 @@ SNP feature support.
More details in AMD64 APM[1] Vol 2: 15.34.10 SEV_STATUS MSR
-[1] https://www.amd.com/content/dam/amd/en/documents/processor-tech-docs/programmer-references/24593.pdf
+Reverse Map Table (RMP)
+=======================
+
+The RMP is a structure in system memory that is used to ensure a one-to-one
+mapping between system physical addresses and guest physical addresses. Each
+page of memory that is potentially assignable to guests has one entry within
+the RMP.
+
+The RMP table can be either contiguous in memory or a collection of segments
+in memory.
+
+Contiguous RMP
+--------------
+
+Support for this form of the RMP is present when support for SEV-SNP is
+present, which can be determined using the CPUID instruction::
+
+ 0x8000001f[eax]:
+ Bit[4] indicates support for SEV-SNP
+
+The location of the RMP is identified to the hardware through two MSRs::
+
+ 0xc0010132 (RMP_BASE):
+ System physical address of the first byte of the RMP
+
+ 0xc0010133 (RMP_END):
+ System physical address of the last byte of the RMP
+
+Hardware requires that RMP_BASE and (RMP_END + 1) be 8KB aligned, but SEV
+firmware increases the alignment requirement to 1MB.
+
+The RMP consists of a 16KB region used for processor bookkeeping followed
+by the RMP entries, which are 16 bytes in size. The size of the RMP
+determines the range of physical memory that the hypervisor can assign to
+SEV-SNP guests. The RMP covers the system physical address from::
+
+ 0 to ((RMP_END + 1 - RMP_BASE - 16KB) / 16B) x 4KB.
+
+The current Linux support relies on BIOS to allocate/reserve the memory for
+the RMP and to set RMP_BASE and RMP_END appropriately. Linux uses the MSR
+values to locate the RMP and determine the size of the RMP. The RMP must
+cover all of system memory in order for Linux to enable SEV-SNP.
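+
+As a minimal sketch (illustrative only, not kernel code), the covered
+range follows from the MSR values together with the 16KB bookkeeping
+area and 16-byte entry size described above; rmp_base and rmp_end hold
+the values read from RMP_BASE and RMP_END::
+
+	u64 rmp_bytes   = rmp_end + 1 - rmp_base;       /* total RMP size                 */
+	u64 nr_entries  = (rmp_bytes - 16 * 1024) / 16; /* minus bookkeeping, 16B entries */
+	u64 covered_end = nr_entries * 4096;            /* RMP covers 0 .. covered_end    */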
+
+Segmented RMP
+-------------
+
+Segmented RMP support is a new way of representing the layout of an RMP.
+Initial RMP support required the RMP table to be contiguous in memory.
+RMP accesses from a NUMA node on which the RMP doesn't reside
+can take longer than accesses from a NUMA node on which the RMP resides.
+Segmented RMP support allows the RMP entries to be located on the same
+node as the memory they cover, potentially reducing the latency of
+accessing an RMP entry for that memory. Each RMP segment covers a
+specific range of system physical addresses.
+
+Support for this form of the RMP can be determined using the CPUID
+instruction::
+
+ 0x8000001f[eax]:
+ Bit[23] indicates support for segmented RMP
+
+If supported, segmented RMP attributes can be found using the CPUID
+instruction::
+
+ 0x80000025[eax]:
+ Bits[5:0] minimum supported RMP segment size
+ Bits[11:6] maximum supported RMP segment size
+
+ 0x80000025[ebx]:
+ Bits[9:0] number of cacheable RMP segment definitions
+ Bit[10] indicates if the number of cacheable RMP segments
+ is a hard limit
+
+To enable a segmented RMP, a new MSR is available::
+
+ 0xc0010136 (RMP_CFG):
+ Bit[0] indicates if segmented RMP is enabled
+ Bits[13:8] contains the size of memory covered by an RMP
+ segment (expressed as a power of 2)
+
+The RMP segment size defined in the RMP_CFG MSR applies to all segments
+of the RMP. Therefore each RMP segment covers a specific range of system
+physical addresses. For example, if the RMP_CFG MSR value is 0x2401, then
+the RMP segment coverage value is 0x24 => 36, meaning the size of memory
+covered by an RMP segment is 64GB (1 << 36). So the first RMP segment
+covers physical addresses from 0 to 0xF_FFFF_FFFF, the second RMP segment
+covers physical addresses from 0x10_0000_0000 to 0x1F_FFFF_FFFF, etc.
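+
+A minimal sketch (illustrative only) of decoding these RMP_CFG fields,
+using the example value from above::
+
+	u64  rmp_cfg   = 0x2401;                /* as read from MSR 0xc0010136   */
+	bool enabled   = rmp_cfg & 0x1;         /* Bit[0]: segmented RMP enabled */
+	u64  seg_shift = (rmp_cfg >> 8) & 0x3f; /* Bits[13:8] = 0x24 = 36        */
+	u64  seg_size  = 1ULL << seg_shift;     /* 64GB per RMP segment          */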
+
+When a segmented RMP is enabled, RMP_BASE points to the RMP bookkeeping
+area as it does today (16K in size). However, instead of RMP entries
+beginning immediately after the bookkeeping area, there is a 4K RMP
+segment table (RST). Each entry in the RST is 8-bytes in size and represents
+an RMP segment::
+
+ Bits[19:0] mapped size (in GB)
+ The mapped size can be less than the defined segment size.
+ A value of zero indicates that no RMP exists for the range
+ of system physical addresses associated with this segment.
+ Bits[51:20] segment physical address
+ This address is left-shifted 20 bits (or just masked when
+ read) to form the physical address of the segment (1MB
+ aligned).
+
+The RST can hold 512 segment entries but can be limited in size to the number
+of cacheable RMP segments (CPUID 0x80000025_EBX[9:0]) if the number of cacheable
+RMP segments is a hard limit (CPUID 0x80000025_EBX[10]).
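+
+A minimal sketch (illustrative only; "rst" and "i" are hypothetical names
+for the mapped RST and an entry index) of decoding one 8-byte RST entry::
+
+	u64 entry      = rst[i];                      /* i-th RST entry              */
+	u64 mapped_gb  = entry & 0xfffff;             /* Bits[19:0], 0 = no RMP here */
+	u64 segment_pa = entry & 0xffffffff00000ULL;  /* Bits[51:20], 1MB aligned    */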
+
+The current Linux support relies on BIOS to allocate/reserve the memory for
+the segmented RMP (the bookkeeping area, RST, and all segments), build the RST
+and to set RMP_BASE, RMP_END, and RMP_CFG appropriately. Linux uses the MSR
+values to locate the RMP and determine the size and location of the RMP
+segments. The RMP must cover all of system memory in order for Linux to enable
+SEV-SNP.
+
+More details in the AMD64 APM Vol 2, section "15.36.3 Reverse Map Table",
+docID: 24593.
+
+Secure VM Service Module (SVSM)
+===============================
+
+SNP provides a feature called Virtual Machine Privilege Levels (VMPL) which
+defines four privilege levels at which guest software can run. The most
+privileged level is 0 and numerically higher levels have less privilege.
+More details in the AMD64 APM Vol 2, section "15.35.7 Virtual Machine
+Privilege Levels", docID: 24593.
+
+When using that feature, different services can run at different protection
+levels, apart from the guest OS but still within the secure SNP environment.
+They can provide services to the guest, like a vTPM, for example.
+
+When a guest is not running at VMPL0, it needs to communicate with the software
+running at VMPL0 to perform privileged operations or to interact with secure
+services. An example of such a privileged operation is PVALIDATE, which is
+*required* to be executed at VMPL0.
+
+In this scenario, the software running at VMPL0 is usually called a Secure VM
+Service Module (SVSM). Discovery of an SVSM and the API used to communicate
+with it is documented in "Secure VM Service Module for SEV-SNP Guests", docID:
+58019.
+
+(Latest versions of the above-mentioned documents can be found by using
+a search engine like duckduckgo.com and typing in:
+
+ site:amd.com "Secure VM Service Module for SEV-SNP Guests", docID: 58019
+
+for example.)
diff --git a/Documentation/arch/x86/amd_hsmp.rst b/Documentation/arch/x86/amd_hsmp.rst
index c92bfd55359f..2fd917638e42 100644
--- a/Documentation/arch/x86/amd_hsmp.rst
+++ b/Documentation/arch/x86/amd_hsmp.rst
@@ -4,8 +4,9 @@
AMD HSMP interface
============================================
-Newer Fam19h EPYC server line of processors from AMD support system
-management functionality via HSMP (Host System Management Port).
+Newer Fam19h (models 0x00-0x1f, 0x30-0x3f, 0x90-0x9f, 0xa0-0xaf) and
+Fam1Ah (models 0x00-0x1f) EPYC server line of processors from AMD support
+system management functionality via HSMP (Host System Management Port).
The Host System Management Port (HSMP) is an interface to provide
OS-level software with access to system management functions via a
@@ -13,16 +14,28 @@ set of mailbox registers.
More details on the interface can be found in chapter
"7 Host System Management Port (HSMP)" of the family/model PPR
-Eg: https://www.amd.com/system/files/TechDocs/55898_B1_pub_0.50.zip
+Eg: https://www.amd.com/content/dam/amd/en/documents/epyc-technical-docs/programmer-references/55898_B1_pub_0_50.zip
-HSMP interface is supported on EPYC server CPU models only.
+
+The HSMP interface is supported on the EPYC line of server CPUs and MI300A (APU).
HSMP device
============================================
-amd_hsmp driver under the drivers/platforms/x86/ creates miscdevice
-/dev/hsmp to let user space programs run hsmp mailbox commands.
+The amd_hsmp driver under drivers/platform/x86/amd/hsmp/ has separate driver
+files for ACPI object based probing, platform device based probing, and for
+the code common to these two drivers.
+
+Kconfig option CONFIG_AMD_HSMP_PLAT compiles plat.c and creates amd_hsmp.ko.
+Kconfig option CONFIG_AMD_HSMP_ACPI compiles acpi.c and creates hsmp_acpi.ko.
+Selecting either of these two configs automatically selects CONFIG_AMD_HSMP.
+This compiles the common code hsmp.c and creates the hsmp_common.ko module.
+
+Both the ACPI and plat drivers create the miscdevice /dev/hsmp to let
+user space programs run hsmp mailbox commands.
+
+The ACPI object format supported by the driver is defined below.
$ ls -al /dev/hsmp
crw-r--r-- 1 root root 10, 123 Jan 21 21:41 /dev/hsmp
@@ -58,6 +71,51 @@ Note: lseek() is not supported as entire metrics table is read.
Metrics table definitions will be documented as part of Public PPR.
The same is defined in the amd_hsmp.h header.
+ACPI device object format
+=========================
+The ACPI object format expected from the amd_hsmp driver
+for socket with ID00 is given below::
+
+ Device(HSMP)
+ {
+ Name(_HID, "AMDI0097")
+ Name(_UID, "ID00")
+ Name(HSE0, 0x00000001)
+ Name(RBF0, ResourceTemplate()
+ {
+ Memory32Fixed(ReadWrite, 0xxxxxxx, 0x00100000)
+ })
+ Method(_CRS, 0, NotSerialized)
+ {
+ Return(RBF0)
+ }
+ Method(_STA, 0, NotSerialized)
+ {
+ If(LEqual(HSE0, One))
+ {
+ Return(0x0F)
+ }
+ Else
+ {
+ Return(Zero)
+ }
+ }
+ Name(_DSD, Package(2)
+ {
+ Buffer(0x10)
+ {
+ 0x9D, 0x61, 0x4D, 0xB7, 0x07, 0x57, 0xBD, 0x48,
+ 0xA6, 0x9F, 0x4E, 0xA2, 0x87, 0x1F, 0xC2, 0xF6
+ },
+ Package(3)
+ {
+ Package(2) {"MsgIdOffset", 0x00010934},
+ Package(2) {"MsgRspOffset", 0x00010980},
+ Package(2) {"MsgArgOffset", 0x000109E0}
+ }
+ })
+ }
+
An example
==========
@@ -97,8 +155,8 @@ what happened. The transaction returns 0 on success.
More details on the interface and message definitions can be found in chapter
"7 Host System Management Port (HSMP)" of the respective family/model PPR
-eg: https://www.amd.com/system/files/TechDocs/55898_B1_pub_0.50.zip
+eg: https://www.amd.com/content/dam/amd/en/documents/epyc-technical-docs/programmer-references/55898_B1_pub_0_50.zip
User space C-APIs are made available by linking against the esmi library,
-which is provided by the E-SMS project https://developer.amd.com/e-sms/.
+which is provided by the E-SMS project https://www.amd.com/en/developer/e-sms.html.
See: https://github.com/amd/esmi_ib_library
diff --git a/Documentation/arch/x86/boot.rst b/Documentation/arch/x86/boot.rst
index c513855a54bb..76f53d3450e7 100644
--- a/Documentation/arch/x86/boot.rst
+++ b/Documentation/arch/x86/boot.rst
@@ -77,7 +77,7 @@ Protocol 2.14 BURNT BY INCORRECT COMMIT
Protocol 2.15 (Kernel 5.5) Added the kernel_info and kernel_info.setup_type_max.
============= ============================================================
- .. note::
+.. note::
The protocol version number should be changed only if the setup header
is changed. There is no need to update the version number if boot_params
or kernel_info are changed. Additionally, it is recommended to use
@@ -95,27 +95,27 @@ Memory Layout
The traditional memory map for the kernel loader, used for Image or
zImage kernels, typically looks like::
- | |
- 0A0000 +------------------------+
- | Reserved for BIOS | Do not use. Reserved for BIOS EBDA.
- 09A000 +------------------------+
- | Command line |
- | Stack/heap | For use by the kernel real-mode code.
- 098000 +------------------------+
- | Kernel setup | The kernel real-mode code.
- 090200 +------------------------+
- | Kernel boot sector | The kernel legacy boot sector.
- 090000 +------------------------+
- | Protected-mode kernel | The bulk of the kernel image.
- 010000 +------------------------+
- | Boot loader | <- Boot sector entry point 0000:7C00
- 001000 +------------------------+
- | Reserved for MBR/BIOS |
- 000800 +------------------------+
- | Typically used by MBR |
- 000600 +------------------------+
- | BIOS use only |
- 000000 +------------------------+
+ | |
+ 0A0000 +------------------------+
+ | Reserved for BIOS | Do not use. Reserved for BIOS EBDA.
+ 09A000 +------------------------+
+ | Command line |
+ | Stack/heap | For use by the kernel real-mode code.
+ 098000 +------------------------+
+ | Kernel setup | The kernel real-mode code.
+ 090200 +------------------------+
+ | Kernel boot sector | The kernel legacy boot sector.
+ 090000 +------------------------+
+ | Protected-mode kernel | The bulk of the kernel image.
+ 010000 +------------------------+
+ | Boot loader | <- Boot sector entry point 0000:7C00
+ 001000 +------------------------+
+ | Reserved for MBR/BIOS |
+ 000800 +------------------------+
+ | Typically used by MBR |
+ 000600 +------------------------+
+ | BIOS use only |
+ 000000 +------------------------+
When using bzImage, the protected-mode kernel was relocated to
0x100000 ("high memory"), and the kernel real-mode block (boot sector,
@@ -142,28 +142,28 @@ above the 0x9A000 point; too many BIOSes will break above that point.
For a modern bzImage kernel with boot protocol version >= 2.02, a
memory layout like the following is suggested::
- ~ ~
- | Protected-mode kernel |
- 100000 +------------------------+
- | I/O memory hole |
- 0A0000 +------------------------+
- | Reserved for BIOS | Leave as much as possible unused
- ~ ~
- | Command line | (Can also be below the X+10000 mark)
- X+10000 +------------------------+
- | Stack/heap | For use by the kernel real-mode code.
- X+08000 +------------------------+
- | Kernel setup | The kernel real-mode code.
- | Kernel boot sector | The kernel legacy boot sector.
- X +------------------------+
- | Boot loader | <- Boot sector entry point 0000:7C00
- 001000 +------------------------+
- | Reserved for MBR/BIOS |
- 000800 +------------------------+
- | Typically used by MBR |
- 000600 +------------------------+
- | BIOS use only |
- 000000 +------------------------+
+ ~ ~
+ | Protected-mode kernel |
+ 100000 +------------------------+
+ | I/O memory hole |
+ 0A0000 +------------------------+
+ | Reserved for BIOS | Leave as much as possible unused
+ ~ ~
+ | Command line | (Can also be below the X+10000 mark)
+ X+10000 +------------------------+
+ | Stack/heap | For use by the kernel real-mode code.
+ X+08000 +------------------------+
+ | Kernel setup | The kernel real-mode code.
+ | Kernel boot sector | The kernel legacy boot sector.
+ X +------------------------+
+ | Boot loader | <- Boot sector entry point 0000:7C00
+ 001000 +------------------------+
+ | Reserved for MBR/BIOS |
+ 000800 +------------------------+
+ | Typically used by MBR |
+ 000600 +------------------------+
+ | BIOS use only |
+ 000000 +------------------------+
... where the address X is as low as the design of the boot loader permits.
@@ -229,22 +229,22 @@ Offset/Size Proto Name Meaning
=========== ======== ===================== ============================================
.. note::
- (1) For backwards compatibility, if the setup_sects field contains 0, the
- real value is 4.
+ (1) For backwards compatibility, if the setup_sects field contains 0,
+ the real value is 4.
- (2) For boot protocol prior to 2.04, the upper two bytes of the syssize
- field are unusable, which means the size of a bzImage kernel
- cannot be determined.
+ (2) For boot protocol prior to 2.04, the upper two bytes of the syssize
+ field are unusable, which means the size of a bzImage kernel
+ cannot be determined.
- (3) Ignored, but safe to set, for boot protocols 2.02-2.09.
+ (3) Ignored, but safe to set, for boot protocols 2.02-2.09.
If the "HdrS" (0x53726448) magic number is not found at offset 0x202,
the boot protocol version is "old". Loading an old kernel, the
following parameters should be assumed::
- Image type = zImage
- initrd not supported
- Real-mode kernel must be located at 0x90000.
+ Image type = zImage
+ initrd not supported
+ Real-mode kernel must be located at 0x90000.
Otherwise, the "version" field contains the protocol version,
e.g. protocol version 2.01 will contain 0x0201 in this field. When
@@ -265,7 +265,7 @@ All general purpose boot loaders should write the fields marked
nonstandard address should fill in the fields marked (reloc); other
boot loaders can ignore those fields.
-The byte order of all fields is littleendian (this is x86, after all.)
+The byte order of all fields is little endian (this is x86, after all.)
============ ===========
Field name: setup_sects
@@ -365,7 +365,7 @@ Offset/size: 0x206/2
Protocol: 2.00+
============ =======
- Contains the boot protocol version, in (major << 8)+minor format,
+ Contains the boot protocol version, in (major << 8) + minor format,
e.g. 0x0204 for version 2.04, and 0x0a11 for a hypothetical version
10.17.
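+
+  For illustration, a boot loader can split the field back into its parts::
+
+	major = version >> 8;   /* 0x0204 -> 2 */
+	minor = version & 0xff; /* 0x0204 -> 4 */
+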
@@ -397,17 +397,17 @@ Protocol: 2.00+
If set to a nonzero value, contains a pointer to a NUL-terminated
human-readable kernel version number string, less 0x200. This can
be used to display the kernel version to the user. This value
- should be less than (0x200*setup_sects).
+ should be less than (0x200 * setup_sects).
For example, if this value is set to 0x1c00, the kernel version
number string can be found at offset 0x1e00 in the kernel file.
This is a valid value if and only if the "setup_sects" field
contains the value 15 or higher, as::
- 0x1c00 < 15*0x200 (= 0x1e00) but
- 0x1c00 >= 14*0x200 (= 0x1c00)
+ 0x1c00 < 15 * 0x200 (= 0x1e00) but
+ 0x1c00 >= 14 * 0x200 (= 0x1c00)
- 0x1c00 >> 9 = 14, So the minimum value for setup_secs is 15.
+ 0x1c00 >> 9 = 14, so the minimum value for setup_sects is 15.
============ ==================
Field name: type_of_loader
@@ -427,9 +427,9 @@ Protocol: 2.00+
For example, for T = 0x15, V = 0x234, write::
- type_of_loader <- 0xE4
- ext_loader_type <- 0x05
- ext_loader_ver <- 0x23
+ type_of_loader <- 0xE4
+ ext_loader_type <- 0x05
+ ext_loader_ver <- 0x23
Assigned boot loader ids (hexadecimal):
@@ -686,7 +686,7 @@ Protocol: 2.10+
If a boot loader makes use of this field, it should update the
kernel_alignment field with the alignment unit desired; typically::
- kernel_alignment = 1 << min_alignment
+ kernel_alignment = 1 << min_alignment;
There may be a considerable performance cost with an excessively
misaligned kernel. Therefore, a loader should typically try each
@@ -754,7 +754,7 @@ Protocol: 2.07+
0x00000000 The default x86/PC environment
0x00000001 lguest
0x00000002 Xen
- 0x00000003 Moorestown MID
+ 0x00000003 Intel MID (Moorestown, CloverTrail, Merrifield, Moorefield)
0x00000004 CE4100 TV Platform
========== ==============================
@@ -808,13 +808,13 @@ Protocol: 2.09+
parameters passing mechanism. The definition of struct setup_data is
as follow::
- struct setup_data {
- u64 next;
- u32 type;
- u32 len;
- u8 data[0];
- };
-
+ struct setup_data {
+ __u64 next;
+ __u32 type;
+ __u32 len;
+ __u8 data[];
+ };
+
Where, the next is a 64-bit physical pointer to the next node of
linked list, the next field of the last node is 0; the type is used
to identify the contents of data; the len is the length of data
@@ -834,12 +834,12 @@ Protocol: 2.09+
Thus setup_indirect struct and SETUP_INDIRECT type were introduced in
protocol 2.15::
- struct setup_indirect {
- __u32 type;
- __u32 reserved; /* Reserved, must be set to zero. */
- __u64 len;
- __u64 addr;
- };
+ struct setup_indirect {
+ __u32 type;
+ __u32 reserved; /* Reserved, must be set to zero. */
+ __u64 len;
+ __u64 addr;
+ };
The type member is a SETUP_INDIRECT | SETUP_* type. However, it cannot be
SETUP_INDIRECT itself since making the setup_indirect a tree structure
@@ -849,17 +849,17 @@ Protocol: 2.09+
Let's give an example how to point to SETUP_E820_EXT data using setup_indirect.
In this case setup_data and setup_indirect will look like this::
- struct setup_data {
- __u64 next = 0 or <addr_of_next_setup_data_struct>;
- __u32 type = SETUP_INDIRECT;
- __u32 len = sizeof(setup_indirect);
- __u8 data[sizeof(setup_indirect)] = struct setup_indirect {
- __u32 type = SETUP_INDIRECT | SETUP_E820_EXT;
- __u32 reserved = 0;
- __u64 len = <len_of_SETUP_E820_EXT_data>;
- __u64 addr = <addr_of_SETUP_E820_EXT_data>;
- }
- }
+ struct setup_data {
+ .next = 0, /* or <addr_of_next_setup_data_struct> */
+ .type = SETUP_INDIRECT,
+ .len = sizeof(setup_indirect),
+ .data[sizeof(setup_indirect)] = (struct setup_indirect) {
+ .type = SETUP_INDIRECT | SETUP_E820_EXT,
+ .reserved = 0,
+ .len = <len_of_SETUP_E820_EXT_data>,
+ .addr = <addr_of_SETUP_E820_EXT_data>,
+ },
+ }
.. note::
SETUP_INDIRECT | SETUP_NONE objects cannot be properly distinguished
@@ -878,7 +878,8 @@ Protocol: 2.10+
address if possible.
A non-relocatable kernel will unconditionally move itself and to run
- at this address.
+ at this address. A relocatable kernel will move itself to this address if it
+ is loaded below this address.
============ =======
Field name: init_size
@@ -895,10 +896,19 @@ Offset/size: 0x260/4
The kernel runtime start address is determined by the following algorithm::
- if (relocatable_kernel)
- runtime_start = align_up(load_address, kernel_alignment)
- else
- runtime_start = pref_address
+ if (relocatable_kernel) {
+ if (load_address < pref_address)
+ load_address = pref_address;
+ runtime_start = align_up(load_address, kernel_alignment);
+ } else {
+ runtime_start = pref_address;
+ }
+
+Hence the necessary memory window location and size can be estimated by
+a boot loader as::
+
+ memory_window_start = runtime_start;
+ memory_window_size = init_size;
============ ===============
Field name: handover_offset
@@ -928,12 +938,12 @@ The kernel_info
===============
The relationships between the headers are analogous to the various data
-sections:
+sections::
setup_header = .data
boot_params/setup_data = .bss
-What is missing from the above list? That's right:
+What is missing from the above list? That's right::
kernel_info = .rodata
@@ -965,22 +975,22 @@ after kernel_info_var_len_data label. Each chunk of variable size data has to
be prefixed with header/magic and its size, e.g.::
kernel_info:
- .ascii "LToP" /* Header, Linux top (structure). */
- .long kernel_info_var_len_data - kernel_info
- .long kernel_info_end - kernel_info
- .long 0x01234567 /* Some fixed size data for the bootloaders. */
+ .ascii "LToP" /* Header, Linux top (structure). */
+ .long kernel_info_var_len_data - kernel_info
+ .long kernel_info_end - kernel_info
+ .long 0x01234567 /* Some fixed size data for the bootloaders. */
kernel_info_var_len_data:
- example_struct: /* Some variable size data for the bootloaders. */
- .ascii "0123" /* Header/Magic. */
- .long example_struct_end - example_struct
- .ascii "Struct"
- .long 0x89012345
+ example_struct: /* Some variable size data for the bootloaders. */
+ .ascii "0123" /* Header/Magic. */
+ .long example_struct_end - example_struct
+ .ascii "Struct"
+ .long 0x89012345
example_struct_end:
- example_strings: /* Some variable size data for the bootloaders. */
- .ascii "ABCD" /* Header/Magic. */
- .long example_strings_end - example_strings
- .asciz "String_0"
- .asciz "String_1"
+ example_strings: /* Some variable size data for the bootloaders. */
+ .ascii "ABCD" /* Header/Magic. */
+ .long example_strings_end - example_strings
+ .asciz "String_0"
+ .asciz "String_1"
example_strings_end:
kernel_info_end:
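+
+A minimal sketch (illustrative only, not taken from any boot loader) of
+walking these variable size data chunks, assuming "info" points at
+kernel_info in the loaded image::
+
+	void walk_kernel_info(__u8 *info)
+	{
+		__u32 var_len_offset = *(__u32 *)(info + 4); /* .long after "LToP"            */
+		__u32 total_size     = *(__u32 *)(info + 8); /* kernel_info_end - kernel_info */
+		__u8 *chunk = info + var_len_offset;
+		__u8 *end   = info + total_size;
+
+		while (chunk + 8 <= end) {
+			/* chunk[0..3] is the header/magic, e.g. "0123" or "ABCD" */
+			__u32 chunk_size = *(__u32 *)(chunk + 4);
+
+			if (!chunk_size)
+				break;
+			chunk += chunk_size; /* size covers magic, length and payload */
+		}
+	}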
@@ -1129,67 +1139,63 @@ mode segment.
Such a boot loader should enter the following fields in the header::
- unsigned long base_ptr; /* base address for real-mode segment */
+ unsigned long base_ptr; /* base address for real-mode segment */
- if ( setup_sects == 0 ) {
- setup_sects = 4;
- }
+ if (setup_sects == 0)
+ setup_sects = 4;
- if ( protocol >= 0x0200 ) {
- type_of_loader = <type code>;
- if ( loading_initrd ) {
- ramdisk_image = <initrd_address>;
- ramdisk_size = <initrd_size>;
- }
+ if (protocol >= 0x0200) {
+ type_of_loader = <type code>;
+ if (loading_initrd) {
+ ramdisk_image = <initrd_address>;
+ ramdisk_size = <initrd_size>;
+ }
- if ( protocol >= 0x0202 && loadflags & 0x01 )
- heap_end = 0xe000;
- else
- heap_end = 0x9800;
+ if (protocol >= 0x0202 && loadflags & 0x01)
+ heap_end = 0xe000;
+ else
+ heap_end = 0x9800;
- if ( protocol >= 0x0201 ) {
- heap_end_ptr = heap_end - 0x200;
- loadflags |= 0x80; /* CAN_USE_HEAP */
- }
+ if (protocol >= 0x0201) {
+ heap_end_ptr = heap_end - 0x200;
+ loadflags |= 0x80; /* CAN_USE_HEAP */
+ }
- if ( protocol >= 0x0202 ) {
- cmd_line_ptr = base_ptr + heap_end;
- strcpy(cmd_line_ptr, cmdline);
- } else {
- cmd_line_magic = 0xA33F;
- cmd_line_offset = heap_end;
- setup_move_size = heap_end + strlen(cmdline)+1;
- strcpy(base_ptr+cmd_line_offset, cmdline);
- }
- } else {
- /* Very old kernel */
+ if (protocol >= 0x0202) {
+ cmd_line_ptr = base_ptr + heap_end;
+ strcpy(cmd_line_ptr, cmdline);
+ } else {
+ cmd_line_magic = 0xA33F;
+ cmd_line_offset = heap_end;
+ setup_move_size = heap_end + strlen(cmdline) + 1;
+ strcpy(base_ptr + cmd_line_offset, cmdline);
+ }
+ } else {
+ /* Very old kernel */
- heap_end = 0x9800;
+ heap_end = 0x9800;
- cmd_line_magic = 0xA33F;
- cmd_line_offset = heap_end;
+ cmd_line_magic = 0xA33F;
+ cmd_line_offset = heap_end;
- /* A very old kernel MUST have its real-mode code
- loaded at 0x90000 */
+ /* A very old kernel MUST have its real-mode code loaded at 0x90000 */
+ if (base_ptr != 0x90000) {
+ /* Copy the real-mode kernel */
+ memcpy(0x90000, base_ptr, (setup_sects + 1) * 512);
+ base_ptr = 0x90000; /* Relocated */
+ }
- if ( base_ptr != 0x90000 ) {
- /* Copy the real-mode kernel */
- memcpy(0x90000, base_ptr, (setup_sects+1)*512);
- base_ptr = 0x90000; /* Relocated */
- }
+ strcpy(0x90000 + cmd_line_offset, cmdline);
- strcpy(0x90000+cmd_line_offset, cmdline);
-
- /* It is recommended to clear memory up to the 32K mark */
- memset(0x90000 + (setup_sects+1)*512, 0,
- (64-(setup_sects+1))*512);
- }
+ /* It is recommended to clear memory up to the 32K mark */
+ memset(0x90000 + (setup_sects + 1) * 512, 0, (64 - (setup_sects + 1)) * 512);
+ }
Loading The Rest of The Kernel
==============================
-The 32-bit (non-real-mode) kernel starts at offset (setup_sects+1)*512
+The 32-bit (non-real-mode) kernel starts at offset (setup_sects + 1) * 512
in the kernel file (again, if setup_sects == 0 the real value is 4.)
It should be loaded at address 0x10000 for Image/zImage kernels and
0x100000 for bzImage kernels.
@@ -1197,13 +1203,14 @@ It should be loaded at address 0x10000 for Image/zImage kernels and
The kernel is a bzImage kernel if the protocol >= 2.00 and the 0x01
bit (LOAD_HIGH) in the loadflags field is set::
- is_bzImage = (protocol >= 0x0200) && (loadflags & 0x01);
- load_address = is_bzImage ? 0x100000 : 0x10000;
+ is_bzImage = (protocol >= 0x0200) && (loadflags & 0x01);
+ load_address = is_bzImage ? 0x100000 : 0x10000;
-Note that Image/zImage kernels can be up to 512K in size, and thus use
-the entire 0x10000-0x90000 range of memory. This means it is pretty
-much a requirement for these kernels to load the real-mode part at
-0x90000. bzImage kernels allow much more flexibility.
+.. note::
+ Image/zImage kernels can be up to 512K in size, and thus use the entire
+ 0x10000-0x90000 range of memory. This means it is pretty much a
+ requirement for these kernels to load the real-mode part at 0x90000.
+ bzImage kernels allow much more flexibility.
Special Command Line Options
============================
@@ -1272,19 +1279,20 @@ es = ss.
In our example from above, we would do::
- /* Note: in the case of the "old" kernel protocol, base_ptr must
- be == 0x90000 at this point; see the previous sample code */
+ /*
+ * Note: in the case of the "old" kernel protocol, base_ptr must
+ * be == 0x90000 at this point; see the previous sample code.
+ */
+ seg = base_ptr >> 4;
- seg = base_ptr >> 4;
+ cli(); /* Enter with interrupts disabled! */
- cli(); /* Enter with interrupts disabled! */
+ /* Set up the real-mode kernel stack */
+ _SS = seg;
+ _SP = heap_end;
- /* Set up the real-mode kernel stack */
- _SS = seg;
- _SP = heap_end;
-
- _DS = _ES = _FS = _GS = seg;
- jmp_far(seg+0x20, 0); /* Run the kernel */
+ _DS = _ES = _FS = _GS = seg;
+ jmp_far(seg + 0x20, 0); /* Run the kernel */
If your boot sector accesses a floppy drive, it is recommended to
switch off the floppy motor before running the kernel, since the
@@ -1339,7 +1347,7 @@ from offset 0x01f1 of kernel image on should be loaded into struct
boot_params and examined. The end of setup header can be calculated as
follow::
- 0x0202 + byte value at offset 0x0201
+ 0x0202 + byte value at offset 0x0201
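+
+For example (illustrative only; "image" is assumed to point at the kernel
+file in memory and "bp" at the struct boot_params being filled in)::
+
+	setup_header_end = 0x0202 + image[0x0201];
+	memcpy(&bp->hdr, image + 0x01f1, setup_header_end - 0x01f1);
+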
In addition to read/modify/write the setup header of the struct
boot_params as that of 16-bit boot protocol, the boot loader should
@@ -1375,7 +1383,7 @@ Then, the setup header at offset 0x01f1 of kernel image on should be
loaded into struct boot_params and examined. The end of setup header
can be calculated as follows::
- 0x0202 + byte value at offset 0x0201
+ 0x0202 + byte value at offset 0x0201
In addition to read/modify/write the setup header of the struct
boot_params as that of 16-bit boot protocol, the boot loader should
@@ -1417,7 +1425,7 @@ execution context provided by the EFI firmware.
The function prototype for the handover entry point looks like this::
- efi_stub_entry(void *handle, efi_system_table_t *table, struct boot_params *bp)
+ void efi_stub_entry(void *handle, efi_system_table_t *table, struct boot_params *bp);
'handle' is the EFI image handle passed to the boot loader by the EFI
firmware, 'table' is the EFI system table - these are the first two
@@ -1432,12 +1440,13 @@ The boot loader *must* fill out the following fields in bp::
All other fields should be zero.
-NOTE: The EFI Handover Protocol is deprecated in favour of the ordinary PE/COFF
- entry point, combined with the LINUX_EFI_INITRD_MEDIA_GUID based initrd
- loading protocol (refer to [0] for an example of the bootloader side of
- this), which removes the need for any knowledge on the part of the EFI
- bootloader regarding the internal representation of boot_params or any
- requirements/limitations regarding the placement of the command line
- and ramdisk in memory, or the placement of the kernel image itself.
+.. note::
+ The EFI Handover Protocol is deprecated in favour of the ordinary PE/COFF
+ entry point, combined with the LINUX_EFI_INITRD_MEDIA_GUID based initrd
+ loading protocol (refer to [0] for an example of the bootloader side of
+ this), which removes the need for any knowledge on the part of the EFI
+ bootloader regarding the internal representation of boot_params or any
+ requirements/limitations regarding the placement of the command line
+ and ramdisk in memory, or the placement of the kernel image itself.
[0] https://github.com/u-boot/u-boot/commit/ec80b4735a593961fe701cc3a5d717d4739b0fd0
diff --git a/Documentation/arch/x86/buslock.rst b/Documentation/arch/x86/buslock.rst
index 4c5a4822eeb7..31f1bfdff16f 100644
--- a/Documentation/arch/x86/buslock.rst
+++ b/Documentation/arch/x86/buslock.rst
@@ -26,7 +26,8 @@ Detection
=========
Intel processors may support either or both of the following hardware
-mechanisms to detect split locks and bus locks.
+mechanisms to detect split locks and bus locks. Some AMD processors also
+support bus lock detect.
#AC exception for split lock detection
--------------------------------------
diff --git a/Documentation/arch/x86/cpuinfo.rst b/Documentation/arch/x86/cpuinfo.rst
index 8895784d4784..6ef426a52cdc 100644
--- a/Documentation/arch/x86/cpuinfo.rst
+++ b/Documentation/arch/x86/cpuinfo.rst
@@ -112,7 +112,7 @@ conditions are met, the features are enabled by the set_cpu_cap or
setup_force_cpu_cap macros. For example, if bit 5 is set in MSR_IA32_CORE_CAPS,
the feature X86_FEATURE_SPLIT_LOCK_DETECT will be enabled and
"split_lock_detect" will be displayed. The flag "ring3mwait" will be
-displayed only when running on INTEL_FAM6_XEON_PHI_[KNL|KNM] processors.
+displayed only when running on INTEL_XEON_PHI_[KNL|KNM] processors.
d: Flags can represent purely software features.
------------------------------------------------
diff --git a/Documentation/arch/x86/exception-tables.rst b/Documentation/arch/x86/exception-tables.rst
index efde1fef4fbd..6e7177363f8f 100644
--- a/Documentation/arch/x86/exception-tables.rst
+++ b/Documentation/arch/x86/exception-tables.rst
@@ -297,7 +297,7 @@ vma occurs?
c) execution continues at local label 2 (address of the
instruction immediately after the faulting user access).
-The steps 8a to 8c in a certain way emulate the faulting instruction.
+ The steps a to c above essentially emulate the faulting instruction.
That's it, mostly. If you look at our example, you might ask why
we set EAX to -EFAULT in the exception handler code. Well, the
diff --git a/Documentation/arch/x86/mds.rst b/Documentation/arch/x86/mds.rst
index c58c72362911..5a2e6c0ef04a 100644
--- a/Documentation/arch/x86/mds.rst
+++ b/Documentation/arch/x86/mds.rst
@@ -162,7 +162,7 @@ Mitigation points
3. It would take a large number of these precisely-timed NMIs to mount
an actual attack. There's presumably not enough bandwidth.
4. The NMI in question occurs after a VERW, i.e. when user state is
- restored and most interesting data is already scrubbed. Whats left
+ restored and most interesting data is already scrubbed. What's left
is only the data that NMI touches, and that may or may not be of
any interest.
diff --git a/Documentation/arch/x86/pti.rst b/Documentation/arch/x86/pti.rst
index e08d35177bc0..57e8392f61d3 100644
--- a/Documentation/arch/x86/pti.rst
+++ b/Documentation/arch/x86/pti.rst
@@ -26,9 +26,9 @@ comments in pti.c).
This approach helps to ensure that side-channel attacks leveraging
the paging structures do not function when PTI is enabled. It can be
-enabled by setting CONFIG_PAGE_TABLE_ISOLATION=y at compile time.
-Once enabled at compile-time, it can be disabled at boot with the
-'nopti' or 'pti=' kernel parameters (see kernel-parameters.txt).
+enabled by setting CONFIG_MITIGATION_PAGE_TABLE_ISOLATION=y at compile
+time. Once enabled at compile-time, it can be disabled at boot with
+the 'nopti' or 'pti=' kernel parameters (see kernel-parameters.txt).
Page Table Management
=====================
diff --git a/Documentation/arch/x86/resctrl.rst b/Documentation/arch/x86/resctrl.rst
index a6279df64a9d..6768fc1fad16 100644
--- a/Documentation/arch/x86/resctrl.rst
+++ b/Documentation/arch/x86/resctrl.rst
@@ -45,7 +45,7 @@ mount options are:
Enable code/data prioritization in L2 cache allocations.
"mba_MBps":
Enable the MBA Software Controller(mba_sc) to specify MBA
- bandwidth in MBps
+ bandwidth in MiBps
"debug":
Make debug files accessible. Available debug files are annotated with
"Available only with debug option".
@@ -375,11 +375,25 @@ When monitoring is enabled all MON groups will also contain:
all tasks in the group. In CTRL_MON groups these files provide
the sum for all tasks in the CTRL_MON group and all tasks in
MON groups. Please see example section for more details on usage.
+ On systems with Sub-NUMA Cluster (SNC) enabled there are extra
+ directories for each node (located within the "mon_L3_XX" directory
+ for the L3 cache they occupy). These are named "mon_sub_L3_YY"
+ where "YY" is the node number.
"mon_hw_id":
Available only with debug option. The identifier used by hardware
for the monitor group. On x86 this is the RMID.
+When the "mba_MBps" mount option is used all CTRL_MON groups will also contain:
+
+"mba_MBps_event":
+ Reading this file shows which memory bandwidth event is used
+ as input to the software feedback loop that keeps memory bandwidth
+ below the value specified in the schemata file. Writing the
+ name of one of the supported memory bandwidth events found in
+ /sys/fs/resctrl/info/L3_MON/mon_features changes the input
+ event.
+
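+A minimal sketch (illustrative only; the group name "grp1" is hypothetical)
+of selecting mbm_local_bytes as the feedback input for one CTRL_MON group::
+
+	int fd = open("/sys/fs/resctrl/grp1/mba_MBps_event", O_WRONLY);
+
+	if (fd >= 0) {
+		write(fd, "mbm_local_bytes\n", strlen("mbm_local_bytes\n"));
+		close(fd);
+	}
+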
Resource allocation rules
-------------------------
@@ -446,6 +460,12 @@ during mkdir.
max_threshold_occupancy is a user configurable value to determine the
occupancy at which an RMID can be freed.
+The mon_llc_occupancy_limbo tracepoint gives the precise occupancy in bytes
+for a subset of RMIDs that are not immediately available for allocation.
+This can't be relied on to produce output every second; it may be necessary
+to attempt to create an empty monitor group to force an update. Output may
+only be produced if creation of a control or monitor group fails.
+
Schemata files - general concepts
---------------------------------
Each line in the file describes one resource. The line starts with
@@ -478,6 +498,29 @@ if non-contiguous 1s value is supported. On a system with a 20-bit mask
each bit represents 5% of the capacity of the cache. You could partition
the cache into four equal parts with masks: 0x1f, 0x3e0, 0x7c00, 0xf8000.
+Notes on Sub-NUMA Cluster mode
+==============================
+When SNC mode is enabled, Linux may load balance tasks between Sub-NUMA
+nodes much more readily than between regular NUMA nodes since the CPUs
+on Sub-NUMA nodes share the same L3 cache and the system may report
+the NUMA distance between Sub-NUMA nodes with a lower value than used
+for regular NUMA nodes.
+
+The top-level monitoring files in each "mon_L3_XX" directory provide
+the sum of data across all SNC nodes sharing an L3 cache instance.
+Users who bind tasks to the CPUs of a specific Sub-NUMA node can read
+the "llc_occupancy", "mbm_total_bytes", and "mbm_local_bytes" in the
+"mon_sub_L3_YY" directories to get node local data.
+
+Memory bandwidth allocation is still performed at the L3 cache
+level. I.e. throttling controls are applied to all SNC nodes.
+
+L3 cache allocation bitmaps also apply to all SNC nodes. But note that
+the amount of L3 cache represented by each bit is divided by the number
+of SNC nodes per L3 cache. E.g. with a 100MB cache on a system with 10-bit
+allocation masks each bit normally represents 10MB. With SNC mode enabled
+with two SNC nodes per L3 cache, each bit only represents 5MB.
+
Memory bandwidth Allocation and monitoring
==========================================
@@ -526,7 +569,7 @@ threads start using more cores in an rdtgroup, the actual bandwidth may
increase or vary although user specified bandwidth percentage is same.
In order to mitigate this and make the interface more user friendly,
-resctrl added support for specifying the bandwidth in MBps as well. The
+resctrl added support for specifying the bandwidth in MiBps as well. The
kernel underneath would use a software feedback mechanism or a "Software
Controller(mba_sc)" which reads the actual bandwidth using MBM counters
and adjust the memory bandwidth percentages to ensure::
@@ -573,13 +616,13 @@ Memory b/w domain is L3 cache.
MB:<cache_id0>=bandwidth0;<cache_id1>=bandwidth1;...
-Memory bandwidth Allocation specified in MBps
----------------------------------------------
+Memory bandwidth Allocation specified in MiBps
+----------------------------------------------
Memory bandwidth domain is L3 cache.
::
- MB:<cache_id0>=bw_MBps0;<cache_id1>=bw_MBps1;...
+ MB:<cache_id0>=bw_MiBps0;<cache_id1>=bw_MiBps1;...
Slow Memory Bandwidth Allocation (SMBA)
---------------------------------------
diff --git a/Documentation/arch/x86/sva.rst b/Documentation/arch/x86/sva.rst
index 33cb05005982..6a759984d471 100644
--- a/Documentation/arch/x86/sva.rst
+++ b/Documentation/arch/x86/sva.rst
@@ -25,7 +25,7 @@ to cache translations for virtual addresses. The IOMMU driver uses the
mmu_notifier() support to keep the device TLB cache and the CPU cache in
sync. When an ATS lookup fails for a virtual address, the device should
use the PRI in order to request the virtual address to be paged into the
-CPU page tables. The device must use ATS again in order the fetch the
+CPU page tables. The device must use ATS again in order to fetch the
translation before use.
Shared Hardware Workqueues
@@ -216,7 +216,7 @@ submitting work and processing completions.
Single Root I/O Virtualization (SR-IOV) focuses on providing independent
hardware interfaces for virtualizing hardware. Hence, it's required to be
-almost fully functional interface to software supporting the traditional
+an almost fully functional interface to software supporting the traditional
BARs, space for interrupts via MSI-X, its own register layout.
Virtual Functions (VFs) are assisted by the Physical Function (PF)
driver.
diff --git a/Documentation/arch/x86/topology.rst b/Documentation/arch/x86/topology.rst
index 08ebf9edbfc1..c12837e61bda 100644
--- a/Documentation/arch/x86/topology.rst
+++ b/Documentation/arch/x86/topology.rst
@@ -47,17 +47,21 @@ AMD nomenclature for package is 'Node'.
Package-related topology information in the kernel:
- - cpuinfo_x86.x86_max_cores:
+ - topology_num_threads_per_package()
- The number of cores in a package. This information is retrieved via CPUID.
+ The number of threads in a package.
- - cpuinfo_x86.x86_max_dies:
+ - topology_num_cores_per_package()
- The number of dies in a package. This information is retrieved via CPUID.
+ The number of cores in a package.
+
+ - topology_max_dies_per_package()
+
+ The maximum number of dies in a package.
- cpuinfo_x86.topo.die_id:
- The physical ID of the die. This information is retrieved via CPUID.
+ The physical ID of the die.
- cpuinfo_x86.topo.pkg_id:
@@ -96,16 +100,6 @@ are SMT- or CMT-type threads.
AMDs nomenclature for a CMT core is "Compute Unit". The kernel always uses
"core".
-Core-related topology information in the kernel:
-
- - smp_num_siblings:
-
- The number of threads in a core. The number of threads in a package can be
- calculated by::
-
- threads_per_package = cpuinfo_x86.x86_max_cores * smp_num_siblings
-
-
Threads
=======
A thread is a single scheduling unit. It's the equivalent to a logical Linux
@@ -141,6 +135,10 @@ Thread-related topology information in the kernel:
The ID of the core to which a thread belongs. It is also printed in /proc/cpuinfo
"core_id."
+ - topology_logical_core_id()
+
+ The logical core ID to which a thread belongs.
+
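+A minimal sketch (kernel-internal, illustrative only) of how these helpers
+relate on a system where all cores are of the same type::
+
+	unsigned int threads_per_pkg  = topology_num_threads_per_package();
+	unsigned int cores_per_pkg    = topology_num_cores_per_package();
+	/* SMT siblings per core on a homogeneous package: */
+	unsigned int threads_per_core = threads_per_pkg / cores_per_pkg;
+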
System topology examples
diff --git a/Documentation/arch/x86/x86_64/boot-options.rst b/Documentation/arch/x86/x86_64/boot-options.rst
deleted file mode 100644
index 137432d34109..000000000000
--- a/Documentation/arch/x86/x86_64/boot-options.rst
+++ /dev/null
@@ -1,319 +0,0 @@
-.. SPDX-License-Identifier: GPL-2.0
-
-===========================
-AMD64 Specific Boot Options
-===========================
-
-There are many others (usually documented in driver documentation), but
-only the AMD64 specific ones are listed here.
-
-Machine check
-=============
-Please see Documentation/arch/x86/x86_64/machinecheck.rst for sysfs runtime tunables.
-
- mce=off
- Disable machine check
- mce=no_cmci
- Disable CMCI(Corrected Machine Check Interrupt) that
- Intel processor supports. Usually this disablement is
- not recommended, but it might be handy if your hardware
- is misbehaving.
- Note that you'll get more problems without CMCI than with
- due to the shared banks, i.e. you might get duplicated
- error logs.
- mce=dont_log_ce
- Don't make logs for corrected errors. All events reported
- as corrected are silently cleared by OS.
- This option will be useful if you have no interest in any
- of corrected errors.
- mce=ignore_ce
- Disable features for corrected errors, e.g. polling timer
- and CMCI. All events reported as corrected are not cleared
- by OS and remained in its error banks.
- Usually this disablement is not recommended, however if
- there is an agent checking/clearing corrected errors
- (e.g. BIOS or hardware monitoring applications), conflicting
- with OS's error handling, and you cannot deactivate the agent,
- then this option will be a help.
- mce=no_lmce
- Do not opt-in to Local MCE delivery. Use legacy method
- to broadcast MCEs.
- mce=bootlog
- Enable logging of machine checks left over from booting.
- Disabled by default on AMD Fam10h and older because some BIOS
- leave bogus ones.
- If your BIOS doesn't do that it's a good idea to enable though
- to make sure you log even machine check events that result
- in a reboot. On Intel systems it is enabled by default.
- mce=nobootlog
- Disable boot machine check logging.
- mce=monarchtimeout (number)
- monarchtimeout:
- Sets the time in us to wait for other CPUs on machine checks. 0
- to disable.
- mce=bios_cmci_threshold
- Don't overwrite the bios-set CMCI threshold. This boot option
- prevents Linux from overwriting the CMCI threshold set by the
- bios. Without this option, Linux always sets the CMCI
- threshold to 1. Enabling this may make memory predictive failure
- analysis less effective if the bios sets thresholds for memory
- errors since we will not see details for all errors.
- mce=recovery
- Force-enable recoverable machine check code paths
-
- nomce (for compatibility with i386)
- same as mce=off
-
- Everything else is in sysfs now.
-
-APICs
-=====
-
- apic
- Use IO-APIC. Default
-
- noapic
- Don't use the IO-APIC.
-
- disableapic
- Don't use the local APIC
-
- nolapic
- Don't use the local APIC (alias for i386 compatibility)
-
- pirq=...
- See Documentation/arch/x86/i386/IO-APIC.rst
-
- noapictimer
- Don't set up the APIC timer
-
- no_timer_check
- Don't check the IO-APIC timer. This can work around
- problems with incorrect timer initialization on some boards.
-
- apicpmtimer
- Do APIC timer calibration using the pmtimer. Implies
- apicmaintimer. Useful when your PIT timer is totally broken.
-
-Timing
-======
-
- notsc
- Deprecated, use tsc=unstable instead.
-
- nohpet
- Don't use the HPET timer.
-
-Idle loop
-=========
-
- idle=poll
- Don't do power saving in the idle loop using HLT, but poll for rescheduling
- event. This will make the CPUs eat a lot more power, but may be useful
- to get slightly better performance in multiprocessor benchmarks. It also
- makes some profiling using performance counters more accurate.
- Please note that on systems with MONITOR/MWAIT support (like Intel EM64T
- CPUs) this option has no performance advantage over the normal idle loop.
- It may also interact badly with hyperthreading.
-
-Rebooting
-=========
-
- reboot=b[ios] | t[riple] | k[bd] | a[cpi] | e[fi] | p[ci] [, [w]arm | [c]old]
- bios
- Use the CPU reboot vector for warm reset
- warm
- Don't set the cold reboot flag
- cold
- Set the cold reboot flag
- triple
- Force a triple fault (init)
- kbd
- Use the keyboard controller. cold reset (default)
- acpi
- Use the ACPI RESET_REG in the FADT. If ACPI is not configured or
- the ACPI reset does not work, the reboot path attempts the reset
- using the keyboard controller.
- efi
- Use efi reset_system runtime service. If EFI is not configured or
- the EFI reset does not work, the reboot path attempts the reset using
- the keyboard controller.
- pci
- Use a write to the PCI config space register 0xcf9 to trigger reboot.
-
- Using warm reset will be much faster especially on big memory
- systems because the BIOS will not go through the memory check.
- Disadvantage is that not all hardware will be completely reinitialized
- on reboot so there may be boot problems on some systems.
-
- reboot=force
- Don't stop other CPUs on reboot. This can make reboot more reliable
- in some cases.
-
- reboot=default
- There are some built-in platform specific "quirks" - you may see:
- "reboot: <name> series board detected. Selecting <type> for reboots."
- In the case where you think the quirk is in error (e.g. you have
- newer BIOS, or newer board) using this option will ignore the built-in
- quirk table, and use the generic default reboot actions.
-
-NUMA
-====
-
- numa=off
- Only set up a single NUMA node spanning all memory.
-
- numa=noacpi
- Don't parse the SRAT table for NUMA setup
-
- numa=nohmat
- Don't parse the HMAT table for NUMA setup, or soft-reserved memory
- partitioning.
-
- numa=fake=<size>[MG]
- If given as a memory unit, fills all system RAM with nodes of
- size interleaved over physical nodes.
-
- numa=fake=<N>
- If given as an integer, fills all system RAM with N fake nodes
- interleaved over physical nodes.
-
- numa=fake=<N>U
- If given as an integer followed by 'U', it will divide each
- physical node into N emulated nodes.
-
-ACPI
-====
-
- acpi=off
- Don't enable ACPI
- acpi=ht
- Use ACPI boot table parsing, but don't enable ACPI interpreter
- acpi=force
- Force ACPI on (currently not needed)
- acpi=strict
- Disable out of spec ACPI workarounds.
- acpi_sci={edge,level,high,low}
- Set up ACPI SCI interrupt.
- acpi=noirq
- Don't route interrupts
- acpi=nocmcff
- Disable firmware first mode for corrected errors. This
- disables parsing the HEST CMC error source to check if
- firmware has set the FF flag. This may result in
- duplicate corrected error reports.
-
-PCI
-===
-
- pci=off
- Don't use PCI
- pci=conf1
- Use conf1 access.
- pci=conf2
- Use conf2 access.
- pci=rom
- Assign ROMs.
- pci=assign-busses
- Assign busses
- pci=irqmask=MASK
- Set PCI interrupt mask to MASK
- pci=lastbus=NUMBER
- Scan up to NUMBER busses, no matter what the mptable says.
- pci=noacpi
- Don't use ACPI to set up PCI interrupt routing.
-
-IOMMU (input/output memory management unit)
-===========================================
-Multiple x86-64 PCI-DMA mapping implementations exist, for example:
-
- 1. <kernel/dma/direct.c>: use no hardware/software IOMMU at all
- (e.g. because you have < 3 GB memory).
- Kernel boot message: "PCI-DMA: Disabling IOMMU"
-
- 2. <arch/x86/kernel/amd_gart_64.c>: AMD GART based hardware IOMMU.
- Kernel boot message: "PCI-DMA: using GART IOMMU"
-
- 3. <arch/x86_64/kernel/pci-swiotlb.c> : Software IOMMU implementation. Used
- e.g. if there is no hardware IOMMU in the system and it is need because
- you have >3GB memory or told the kernel to us it (iommu=soft))
- Kernel boot message: "PCI-DMA: Using software bounce buffering
- for IO (SWIOTLB)"
-
-::
-
- iommu=[<size>][,noagp][,off][,force][,noforce]
- [,memaper[=<order>]][,merge][,fullflush][,nomerge]
- [,noaperture]
-
-General iommu options:
-
- off
- Don't initialize and use any kind of IOMMU.
- noforce
- Don't force hardware IOMMU usage when it is not needed. (default).
- force
- Force the use of the hardware IOMMU even when it is
- not actually needed (e.g. because < 3 GB memory).
- soft
- Use software bounce buffering (SWIOTLB) (default for
- Intel machines). This can be used to prevent the usage
- of an available hardware IOMMU.
-
-iommu options only relevant to the AMD GART hardware IOMMU:
-
- <size>
- Set the size of the remapping area in bytes.
- allowed
- Overwrite iommu off workarounds for specific chipsets.
- fullflush
- Flush IOMMU on each allocation (default).
- nofullflush
- Don't use IOMMU fullflush.
- memaper[=<order>]
- Allocate an own aperture over RAM with size 32MB<<order.
- (default: order=1, i.e. 64MB)
- merge
- Do scatter-gather (SG) merging. Implies "force" (experimental).
- nomerge
- Don't do scatter-gather (SG) merging.
- noaperture
- Ask the IOMMU not to touch the aperture for AGP.
- noagp
- Don't initialize the AGP driver and use full aperture.
- panic
- Always panic when IOMMU overflows.
-
-iommu options only relevant to the software bounce buffering (SWIOTLB) IOMMU
-implementation:
-
- swiotlb=<slots>[,force,noforce]
- <slots>
- Prereserve that many 2K slots for the software IO bounce buffering.
- force
- Force all IO through the software TLB.
- noforce
- Do not initialize the software TLB.
-
-
-Miscellaneous
-=============
-
- nogbpages
- Do not use GB pages for kernel direct mappings.
- gbpages
- Use GB pages for kernel direct mappings.
-
-
-AMD SEV (Secure Encrypted Virtualization)
-=========================================
-Options relating to AMD SEV, specified via the following format:
-
-::
-
- sev=option1[,option2]
-
-The available options are:
-
- debug
- Enable debug messages.
diff --git a/Documentation/arch/x86/x86_64/fake-numa-for-cpusets.rst b/Documentation/arch/x86/x86_64/fake-numa-for-cpusets.rst
index ba74617d4999..970ee94eb551 100644
--- a/Documentation/arch/x86/x86_64/fake-numa-for-cpusets.rst
+++ b/Documentation/arch/x86/x86_64/fake-numa-for-cpusets.rst
@@ -18,7 +18,7 @@ For more information on the features of cpusets, see
Documentation/admin-guide/cgroup-v1/cpusets.rst.
There are a number of different configurations you can use for your needs. For
more information on the numa=fake command line option and its various ways of
-configuring fake nodes, see Documentation/arch/x86/x86_64/boot-options.rst.
+configuring fake nodes, see Documentation/admin-guide/kernel-parameters.txt.
For the purposes of this introduction, we'll assume a very primitive NUMA
emulation setup of "numa=fake=4*512,". This will split our system memory into
diff --git a/Documentation/arch/x86/x86_64/fred.rst b/Documentation/arch/x86/x86_64/fred.rst
new file mode 100644
index 000000000000..9f57e7b91f7e
--- /dev/null
+++ b/Documentation/arch/x86/x86_64/fred.rst
@@ -0,0 +1,96 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=========================================
+Flexible Return and Event Delivery (FRED)
+=========================================
+
+Overview
+========
+
+The FRED architecture defines simple new transitions that change
+privilege level (ring transitions). The FRED architecture was
+designed with the following goals:
+
+1) Improve overall performance and response time by replacing event
+ delivery through the interrupt descriptor table (IDT event
+ delivery) and event return by the IRET instruction with lower
+ latency transitions.
+
+2) Improve software robustness by ensuring that event delivery
+ establishes the full supervisor context and that event return
+ establishes the full user context.
+
+The new transitions defined by the FRED architecture are FRED event
+delivery and, for returning from events, two FRED return instructions.
+FRED event delivery can effect a transition from ring 3 to ring 0, but
+it is used also to deliver events incident to ring 0. One FRED
+instruction (ERETU) effects a return from ring 0 to ring 3, while the
+other (ERETS) returns while remaining in ring 0. Collectively, FRED
+event delivery and the FRED return instructions are FRED transitions.
+
+In addition to these transitions, the FRED architecture defines a new
+instruction (LKGS) for managing the state of the GS segment register.
+The LKGS instruction can be used by 64-bit operating systems that do
+not use the new FRED transitions.
+
+Furthermore, the FRED architecture is easy to extend for future CPU
+architectures.
+
+Software based event dispatching
+================================
+
+FRED operates differently from IDT in terms of event handling. Instead
+of directly dispatching an event to its handler based on the event
+vector, FRED requires the software to dispatch an event to its handler
+based on both the event's type and vector. Therefore, an event dispatch
+framework must be implemented to facilitate the event-to-handler
+dispatch process. The FRED event dispatch framework takes control
+once an event is delivered, and employs a two-level dispatch.
+
+The first-level dispatch is based on the event type, and the
+second-level dispatch is based on the event vector.
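+
+As an illustration only (the names below are hypothetical, not the
+kernel's actual entry code), the two-level dispatch can be sketched
+in C as::
+
+    /*
+     * First level: pick a handler class from the event type.
+     * Second level: pick the concrete handler from the event vector.
+     */
+    static void fred_dispatch_sketch(struct pt_regs *regs,
+                                     unsigned int type, unsigned int vector)
+    {
+            switch (type) {
+            case EVENT_TYPE_EXTERNAL_INTERRUPT:
+                    handle_external_interrupt(regs, vector);
+                    break;
+            case EVENT_TYPE_HW_EXCEPTION:
+                    exception_handlers[vector](regs);
+                    break;
+            default:
+                    handle_unexpected_event(regs, type, vector);
+            }
+    }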
+
+Full supervisor/user context
+============================
+
+FRED transitions atomically save and restore the full supervisor/user
+context upon event delivery and return. This avoids the problem of
+transient states due to %cr2 and/or %dr6, and removes the need to
+handle all the ugly corner cases caused by half-baked entry states.
+
+FRED allows NMIs to be unblocked explicitly with the new event return
+instructions ERETS/ERETU, avoiding the mess caused by IRET, which
+unconditionally unblocks NMIs, e.g., when an exception happens during
+NMI handling.
+
+FRED always restores the full value of %rsp; thus ESPFIX is no longer
+needed when FRED is enabled.
+
+LKGS
+====
+
+LKGS behaves like the MOV to GS instruction except that it loads the
+base address into the IA32_KERNEL_GS_BASE MSR instead of the GS
+segment's descriptor cache. As a result, the OS no longer needs to
+muck with the kernel GS, i.e., an operating system can always operate
+with its own GS base address.
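+
+The difference between the two loads can be sketched in pseudocode
+(illustrative only, not an architectural definition)::
+
+    /* mov <sel>, %gs : legacy GS load */
+    GS.selector = sel;
+    GS.base     = descriptor(sel).base;          /* hidden descriptor cache */
+
+    /* lkgs <sel>     : FRED-era GS load */
+    GS.selector = sel;
+    IA32_KERNEL_GS_BASE = descriptor(sel).base;  /* GS.base is left untouched */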
+
+Because FRED event delivery from ring 3 and ERETU both swap the value
+of the GS base address with that of the IA32_KERNEL_GS_BASE MSR, and
+because of the introduction of the LKGS instruction, the SWAPGS
+instruction is no longer needed when FRED is enabled, and is therefore
+disallowed (#UD).
+
+Stack levels
+============
+
+Four stack levels, 0 through 3, are introduced to replace the
+non-reentrant IST for event handling, and each stack level should be
+configured to use a dedicated stack.
+
+Upon FRED event delivery, the current stack level either stays the
+same or goes higher. If it stays the same, the CPU keeps using the
+current event stack. If it goes higher, the CPU switches to a new
+event stack specified by the MSR of the new stack level, i.e.,
+MSR_IA32_FRED_RSP[123].
+
+Only execution of a FRED return instruction, ERETS or ERETU, can lower
+the current stack level, causing the CPU to switch back to the stack it
+was on before a previous event delivery that promoted the stack level.
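+
+As an illustrative sketch only (the stack-top variables below are
+hypothetical, and this is not the kernel's actual FRED setup code),
+pointing each higher stack level at its own dedicated stack could look
+like::
+
+    /* Each non-zero stack level gets the top of its own dedicated stack. */
+    wrmsrl(MSR_IA32_FRED_RSP1, (unsigned long)level1_stack_top);
+    wrmsrl(MSR_IA32_FRED_RSP2, (unsigned long)level2_stack_top);
+    wrmsrl(MSR_IA32_FRED_RSP3, (unsigned long)level3_stack_top);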
diff --git a/Documentation/arch/x86/x86_64/fsgs.rst b/Documentation/arch/x86/x86_64/fsgs.rst
index 50960e09e1f6..d07e445dac5c 100644
--- a/Documentation/arch/x86/x86_64/fsgs.rst
+++ b/Documentation/arch/x86/x86_64/fsgs.rst
@@ -125,7 +125,7 @@ FSGSBASE instructions enablement
FSGSBASE instructions compiler support
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-GCC version 4.6.4 and newer provide instrinsics for the FSGSBASE
+GCC version 4.6.4 and newer provide intrinsics for the FSGSBASE
instructions. Clang 5 supports them as well.
=================== ===========================
@@ -135,7 +135,7 @@ instructions. Clang 5 supports them as well.
_writegsbase_u64() Write the GS base register
=================== ===========================
-To utilize these instrinsics <immintrin.h> must be included in the source
+To utilize these intrinsics <immintrin.h> must be included in the source
code and the compiler option -mfsgsbase has to be added.
Compiler support for FS/GS based addressing
diff --git a/Documentation/arch/x86/x86_64/index.rst b/Documentation/arch/x86/x86_64/index.rst
index a56070fc8e77..a0261957a08a 100644
--- a/Documentation/arch/x86/x86_64/index.rst
+++ b/Documentation/arch/x86/x86_64/index.rst
@@ -7,7 +7,6 @@ x86_64 Support
.. toctree::
:maxdepth: 2
- boot-options
uefi
mm
5level-paging
@@ -15,3 +14,4 @@ x86_64 Support
cpu-hotplug-spec
machinecheck
fsgs
+ fred
diff --git a/Documentation/arch/x86/x86_64/mm.rst b/Documentation/arch/x86/x86_64/mm.rst
index 35e5e18c83d0..f2db178b353f 100644
--- a/Documentation/arch/x86/x86_64/mm.rst
+++ b/Documentation/arch/x86/x86_64/mm.rst
@@ -29,15 +29,27 @@ Complete virtual memory map with 4-level page tables
Start addr | Offset | End addr | Size | VM area description
========================================================================================================================
| | | |
- 0000000000000000 | 0 | 00007fffffffffff | 128 TB | user-space virtual memory, different per mm
+ 0000000000000000 | 0 | 00007fffffffefff | ~128 TB | user-space virtual memory, different per mm
+ 00007ffffffff000 | ~128 TB | 00007fffffffffff | 4 kB | ... guard hole
__________________|____________|__________________|_________|___________________________________________________________
| | | |
- 0000800000000000 | +128 TB | ffff7fffffffffff | ~16M TB | ... huge, almost 64 bits wide hole of non-canonical
- | | | | virtual memory addresses up to the -128 TB
+ 0000800000000000 | +128 TB | 7fffffffffffffff | ~8 EB | ... huge, almost 63 bits wide hole of non-canonical
+ | | | | virtual memory addresses up to the -8 EB
| | | | starting offset of kernel mappings.
+ | | | |
+ | | | | LAM relaxes the canonicality check, allowing aliases
+ | | | | for userspace memory to be created here.
__________________|____________|__________________|_________|___________________________________________________________
|
| Kernel-space virtual memory, shared between all processes:
+ __________________|____________|__________________|_________|___________________________________________________________
+ | | | |
+ 8000000000000000 | -8 EB | ffff7fffffffffff | ~8 EB | ... huge, almost 63 bits wide hole of non-canonical
+ | | | | virtual memory addresses up to the -128 TB
+ | | | | starting offset of kernel mappings.
+ | | | |
+ | | | | LAM_SUP relaxes the canonicality check, allowing aliases
+ | | | | for kernel memory to be created here.
____________________________________________________________|___________________________________________________________
| | | |
ffff800000000000 | -128 TB | ffff87ffffffffff | 8 TB | ... guard hole, also reserved for hypervisor
@@ -88,16 +100,27 @@ Complete virtual memory map with 5-level page tables
Start addr | Offset | End addr | Size | VM area description
========================================================================================================================
| | | |
- 0000000000000000 | 0 | 00ffffffffffffff | 64 PB | user-space virtual memory, different per mm
+ 0000000000000000 | 0 | 00fffffffffff000 | ~64 PB | user-space virtual memory, different per mm
+ 00fffffffffff000 | ~64 PB | 00ffffffffffffff | 4 kB | ... guard hole
__________________|____________|__________________|_________|___________________________________________________________
| | | |
- 0100000000000000 | +64 PB | feffffffffffffff | ~16K PB | ... huge, still almost 64 bits wide hole of non-canonical
- | | | | virtual memory addresses up to the -64 PB
+ 0100000000000000 | +64 PB | 7fffffffffffffff | ~8 EB | ... huge, almost 63 bits wide hole of non-canonical
+ | | | | virtual memory addresses up to the -8 EB
| | | | starting offset of kernel mappings.
+ | | | |
+ | | | | LAM relaxes the canonicality check, allowing aliases
+ | | | | for userspace memory to be created here.
__________________|____________|__________________|_________|___________________________________________________________
|
| Kernel-space virtual memory, shared between all processes:
____________________________________________________________|___________________________________________________________
+ 8000000000000000 | -8 EB | feffffffffffffff | ~8 EB | ... huge, almost 63 bits wide hole of non-canonical
+ | | | | virtual memory addresses up to the -64 PB
+ | | | | starting offset of kernel mappings.
+ | | | |
+ | | | | LAM_SUP relaxes the canonicality check, allowing aliases
+ | | | | for kernel memory to be created here.
+ ____________________________________________________________|___________________________________________________________
| | | |
ff00000000000000 | -64 PB | ff0fffffffffffff | 4 PB | ... guard hole, also reserved for hypervisor
ff10000000000000 | -60 PB | ff10ffffffffffff | 0.25 PB | LDT remap for PTI
diff --git a/Documentation/arch/x86/x86_64/uefi.rst b/Documentation/arch/x86/x86_64/uefi.rst
index fbc30c9a071d..e84592dbd6c1 100644
--- a/Documentation/arch/x86/x86_64/uefi.rst
+++ b/Documentation/arch/x86/x86_64/uefi.rst
@@ -12,14 +12,20 @@ with EFI firmware and specifications are listed below.
1. UEFI specification: http://www.uefi.org
-2. Booting Linux kernel on UEFI x86_64 platform requires bootloader
- support. Elilo with x86_64 support can be used.
+2. Booting the Linux kernel on a UEFI x86_64 platform can be done
+   either using the EFI stub (see <Documentation/admin-guide/efi-stub.rst>)
+   or using a separate bootloader.
3. x86_64 platform with EFI/UEFI firmware.
Mechanics
---------
+Refer to <Documentation/admin-guide/efi-stub.rst> to learn how to use the EFI stub.
+
+Below are general EFI setup guidelines on the x86_64 platform,
+regardless of whether you use the EFI stub or a separate bootloader.
+
- Build the kernel with the following configuration::
CONFIG_FB_EFI=y
@@ -31,16 +37,27 @@ Mechanics
CONFIG_EFI=y
CONFIG_EFIVAR_FS=y or m # optional
-- Create a VFAT partition on the disk
-- Copy the following to the VFAT partition:
+- Create a VFAT partition on the disk with the EFI System flag.
+  You can do this with fdisk using the following commands:
+
+ 1. g - initialize a GPT partition table
+ 2. n - create a new partition
+ 3. t - change the partition type to "EFI System" (number 1)
+ 4. w - write and save the changes
+
+ Afterwards, initialize the VFAT filesystem by running mkfs::
+
+ mkfs.fat /dev/<your-partition>
+
+- Copy the boot files to the VFAT partition:
+  If you use the EFI stub method, the kernel also acts as an EFI executable.
+
+  You can just copy the bzImage to the EFI/boot/bootx64.efi path on the
+  partition so that it is booted automatically; see the
+  <Documentation/admin-guide/efi-stub.rst> page for additional instructions
+  on passing kernel parameters and an initramfs.
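+
+  For example (the device name and mount point below are illustrative;
+  adjust them for your system)::
+
+      mount /dev/<your-partition> /mnt
+      mkdir -p /mnt/EFI/boot
+      cp arch/x86/boot/bzImage /mnt/EFI/boot/bootx64.efi
+      umount /mnt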
- elilo bootloader with x86_64 support, elilo configuration file,
- kernel image built in first step and corresponding
- initrd. Instructions on building elilo and its dependencies
- can be found in the elilo sourceforge project.
+ If you use a custom bootloader, refer to the relevant documentation for help on this part.
-- Boot to EFI shell and invoke elilo choosing the kernel image built
- in first step.
- If some or all EFI runtime services don't work, you can try following
kernel command line parameters to turn off some or all EFI runtime
services.
diff --git a/Documentation/arch/x86/xstate.rst b/Documentation/arch/x86/xstate.rst
index ae5c69e48b11..cec05ac464c1 100644
--- a/Documentation/arch/x86/xstate.rst
+++ b/Documentation/arch/x86/xstate.rst
@@ -138,7 +138,7 @@ Note this example does not include the sigaltstack preparation.
Dynamic features in signal frames
---------------------------------
-Dynamcally enabled features are not written to the signal frame upon signal
+Dynamically enabled features are not written to the signal frame upon signal
entry if the feature is in its initial configuration. This differs from
non-dynamic features which are always written regardless of their
configuration. Signal handlers can examine the XSAVE buffer's XSTATE_BV