summaryrefslogtreecommitdiff
path: root/Documentation/arch/x86
diff options
context:
space:
mode:
Diffstat (limited to 'Documentation/arch/x86')
-rw-r--r--Documentation/arch/x86/amd-memory-encryption.rst147
-rw-r--r--Documentation/arch/x86/amd_hsmp.rst67
-rw-r--r--Documentation/arch/x86/boot.rst368
-rw-r--r--Documentation/arch/x86/buslock.rst3
-rw-r--r--Documentation/arch/x86/cpuinfo.rst2
-rw-r--r--Documentation/arch/x86/exception-tables.rst2
-rw-r--r--Documentation/arch/x86/mds.rst2
-rw-r--r--Documentation/arch/x86/resctrl.rst43
-rw-r--r--Documentation/arch/x86/topology.rst4
-rw-r--r--Documentation/arch/x86/x86_64/boot-options.rst319
-rw-r--r--Documentation/arch/x86/x86_64/fake-numa-for-cpusets.rst2
-rw-r--r--Documentation/arch/x86/x86_64/fsgs.rst4
-rw-r--r--Documentation/arch/x86/x86_64/index.rst1
-rw-r--r--Documentation/arch/x86/x86_64/mm.rst35
-rw-r--r--Documentation/arch/x86/x86_64/uefi.rst37
-rw-r--r--Documentation/arch/x86/xstate.rst2
16 files changed, 508 insertions, 530 deletions
diff --git a/Documentation/arch/x86/amd-memory-encryption.rst b/Documentation/arch/x86/amd-memory-encryption.rst
index 414bc7402ae7..bd840df708ea 100644
--- a/Documentation/arch/x86/amd-memory-encryption.rst
+++ b/Documentation/arch/x86/amd-memory-encryption.rst
@@ -130,4 +130,149 @@ SNP feature support.
More details in AMD64 APM[1] Vol 2: 15.34.10 SEV_STATUS MSR
-[1] https://www.amd.com/content/dam/amd/en/documents/processor-tech-docs/programmer-references/24593.pdf
+Reverse Map Table (RMP)
+=======================
+
+The RMP is a structure in system memory that is used to ensure a one-to-one
+mapping between system physical addresses and guest physical addresses. Each
+page of memory that is potentially assignable to guests has one entry within
+the RMP.
+
+The RMP table can be either contiguous in memory or a collection of segments
+in memory.
+
+Contiguous RMP
+--------------
+
+Support for this form of the RMP is present when support for SEV-SNP is
+present, which can be determined using the CPUID instruction::
+
+ 0x8000001f[eax]:
+ Bit[4] indicates support for SEV-SNP
+
+The location of the RMP is identified to the hardware through two MSRs::
+
+ 0xc0010132 (RMP_BASE):
+ System physical address of the first byte of the RMP
+
+ 0xc0010133 (RMP_END):
+ System physical address of the last byte of the RMP
+
+Hardware requires that RMP_BASE and (RPM_END + 1) be 8KB aligned, but SEV
+firmware increases the alignment requirement to require a 1MB alignment.
+
+The RMP consists of a 16KB region used for processor bookkeeping followed
+by the RMP entries, which are 16 bytes in size. The size of the RMP
+determines the range of physical memory that the hypervisor can assign to
+SEV-SNP guests. The RMP covers the system physical address from::
+
+ 0 to ((RMP_END + 1 - RMP_BASE - 16KB) / 16B) x 4KB.
+
+The current Linux support relies on BIOS to allocate/reserve the memory for
+the RMP and to set RMP_BASE and RMP_END appropriately. Linux uses the MSR
+values to locate the RMP and determine the size of the RMP. The RMP must
+cover all of system memory in order for Linux to enable SEV-SNP.
+
+Segmented RMP
+-------------
+
+Segmented RMP support is a new way of representing the layout of an RMP.
+Initial RMP support required the RMP table to be contiguous in memory.
+RMP accesses from a NUMA node on which the RMP doesn't reside
+can take longer than accesses from a NUMA node on which the RMP resides.
+Segmented RMP support allows the RMP entries to be located on the same
+node as the memory the RMP is covering, potentially reducing latency
+associated with accessing an RMP entry associated with the memory. Each
+RMP segment covers a specific range of system physical addresses.
+
+Support for this form of the RMP can be determined using the CPUID
+instruction::
+
+ 0x8000001f[eax]:
+ Bit[23] indicates support for segmented RMP
+
+If supported, segmented RMP attributes can be found using the CPUID
+instruction::
+
+ 0x80000025[eax]:
+ Bits[5:0] minimum supported RMP segment size
+ Bits[11:6] maximum supported RMP segment size
+
+ 0x80000025[ebx]:
+ Bits[9:0] number of cacheable RMP segment definitions
+ Bit[10] indicates if the number of cacheable RMP segments
+ is a hard limit
+
+To enable a segmented RMP, a new MSR is available::
+
+ 0xc0010136 (RMP_CFG):
+ Bit[0] indicates if segmented RMP is enabled
+ Bits[13:8] contains the size of memory covered by an RMP
+ segment (expressed as a power of 2)
+
+The RMP segment size defined in the RMP_CFG MSR applies to all segments
+of the RMP. Therefore each RMP segment covers a specific range of system
+physical addresses. For example, if the RMP_CFG MSR value is 0x2401, then
+the RMP segment coverage value is 0x24 => 36, meaning the size of memory
+covered by an RMP segment is 64GB (1 << 36). So the first RMP segment
+covers physical addresses from 0 to 0xF_FFFF_FFFF, the second RMP segment
+covers physical addresses from 0x10_0000_0000 to 0x1F_FFFF_FFFF, etc.
+
+When a segmented RMP is enabled, RMP_BASE points to the RMP bookkeeping
+area as it does today (16K in size). However, instead of RMP entries
+beginning immediately after the bookkeeping area, there is a 4K RMP
+segment table (RST). Each entry in the RST is 8-bytes in size and represents
+an RMP segment::
+
+ Bits[19:0] mapped size (in GB)
+ The mapped size can be less than the defined segment size.
+ A value of zero, indicates that no RMP exists for the range
+ of system physical addresses associated with this segment.
+ Bits[51:20] segment physical address
+ This address is left shift 20-bits (or just masked when
+ read) to form the physical address of the segment (1MB
+ alignment).
+
+The RST can hold 512 segment entries but can be limited in size to the number
+of cacheable RMP segments (CPUID 0x80000025_EBX[9:0]) if the number of cacheable
+RMP segments is a hard limit (CPUID 0x80000025_EBX[10]).
+
+The current Linux support relies on BIOS to allocate/reserve the memory for
+the segmented RMP (the bookkeeping area, RST, and all segments), build the RST
+and to set RMP_BASE, RMP_END, and RMP_CFG appropriately. Linux uses the MSR
+values to locate the RMP and determine the size and location of the RMP
+segments. The RMP must cover all of system memory in order for Linux to enable
+SEV-SNP.
+
+More details in the AMD64 APM Vol 2, section "15.36.3 Reverse Map Table",
+docID: 24593.
+
+Secure VM Service Module (SVSM)
+===============================
+
+SNP provides a feature called Virtual Machine Privilege Levels (VMPL) which
+defines four privilege levels at which guest software can run. The most
+privileged level is 0 and numerically higher numbers have lesser privileges.
+More details in the AMD64 APM Vol 2, section "15.35.7 Virtual Machine
+Privilege Levels", docID: 24593.
+
+When using that feature, different services can run at different protection
+levels, apart from the guest OS but still within the secure SNP environment.
+They can provide services to the guest, like a vTPM, for example.
+
+When a guest is not running at VMPL0, it needs to communicate with the software
+running at VMPL0 to perform privileged operations or to interact with secure
+services. An example fur such a privileged operation is PVALIDATE which is
+*required* to be executed at VMPL0.
+
+In this scenario, the software running at VMPL0 is usually called a Secure VM
+Service Module (SVSM). Discovery of an SVSM and the API used to communicate
+with it is documented in "Secure VM Service Module for SEV-SNP Guests", docID:
+58019.
+
+(Latest versions of the above-mentioned documents can be found by using
+a search engine like duckduckgo.com and typing in:
+
+ site:amd.com "Secure VM Service Module for SEV-SNP Guests", docID: 58019
+
+for example.)
diff --git a/Documentation/arch/x86/amd_hsmp.rst b/Documentation/arch/x86/amd_hsmp.rst
index 1e499ecf5f4e..2fd917638e42 100644
--- a/Documentation/arch/x86/amd_hsmp.rst
+++ b/Documentation/arch/x86/amd_hsmp.rst
@@ -4,8 +4,9 @@
AMD HSMP interface
============================================
-Newer Fam19h EPYC server line of processors from AMD support system
-management functionality via HSMP (Host System Management Port).
+Newer Fam19h(model 0x00-0x1f, 0x30-0x3f, 0x90-0x9f, 0xa0-0xaf),
+Fam1Ah(model 0x00-0x1f) EPYC server line of processors from AMD support
+system management functionality via HSMP (Host System Management Port).
The Host System Management Port (HSMP) is an interface to provide
OS-level software with access to system management functions via a
@@ -16,14 +17,25 @@ More details on the interface can be found in chapter
Eg: https://www.amd.com/content/dam/amd/en/documents/epyc-technical-docs/programmer-references/55898_B1_pub_0_50.zip
-HSMP interface is supported on EPYC server CPU models only.
+HSMP interface is supported on EPYC line of server CPUs and MI300A (APU).
HSMP device
============================================
-amd_hsmp driver under the drivers/platforms/x86/ creates miscdevice
-/dev/hsmp to let user space programs run hsmp mailbox commands.
+amd_hsmp driver under drivers/platforms/x86/amd/hsmp/ has separate driver files
+for ACPI object based probing, platform device based probing and for the common
+code for these two drivers.
+
+Kconfig option CONFIG_AMD_HSMP_PLAT compiles plat.c and creates amd_hsmp.ko.
+Kconfig option CONFIG_AMD_HSMP_ACPI compiles acpi.c and creates hsmp_acpi.ko.
+Selecting any of these two configs automatically selects CONFIG_AMD_HSMP. This
+compiles common code hsmp.c and creates hsmp_common.ko module.
+
+Both the ACPI and plat drivers create the miscdevice /dev/hsmp to let
+user space programs run hsmp mailbox commands.
+
+The ACPI object format supported by the driver is defined below.
$ ls -al /dev/hsmp
crw-r--r-- 1 root root 10, 123 Jan 21 21:41 /dev/hsmp
@@ -59,6 +71,51 @@ Note: lseek() is not supported as entire metrics table is read.
Metrics table definitions will be documented as part of Public PPR.
The same is defined in the amd_hsmp.h header.
+ACPI device object format
+=========================
+The ACPI object format expected from the amd_hsmp driver
+for socket with ID00 is given below::
+
+ Device(HSMP)
+ {
+ Name(_HID, "AMDI0097")
+ Name(_UID, "ID00")
+ Name(HSE0, 0x00000001)
+ Name(RBF0, ResourceTemplate()
+ {
+ Memory32Fixed(ReadWrite, 0xxxxxxx, 0x00100000)
+ })
+ Method(_CRS, 0, NotSerialized)
+ {
+ Return(RBF0)
+ }
+ Method(_STA, 0, NotSerialized)
+ {
+ If(LEqual(HSE0, One))
+ {
+ Return(0x0F)
+ }
+ Else
+ {
+ Return(Zero)
+ }
+ }
+ Name(_DSD, Package(2)
+ {
+ Buffer(0x10)
+ {
+ 0x9D, 0x61, 0x4D, 0xB7, 0x07, 0x57, 0xBD, 0x48,
+ 0xA6, 0x9F, 0x4E, 0xA2, 0x87, 0x1F, 0xC2, 0xF6
+ },
+ Package(3)
+ {
+ Package(2) {"MsgIdOffset", 0x00010934},
+ Package(2) {"MsgRspOffset", 0x00010980},
+ Package(2) {"MsgArgOffset", 0x000109E0}
+ }
+ })
+ }
+
An example
==========
diff --git a/Documentation/arch/x86/boot.rst b/Documentation/arch/x86/boot.rst
index 4fd492cb4970..76f53d3450e7 100644
--- a/Documentation/arch/x86/boot.rst
+++ b/Documentation/arch/x86/boot.rst
@@ -77,7 +77,7 @@ Protocol 2.14 BURNT BY INCORRECT COMMIT
Protocol 2.15 (Kernel 5.5) Added the kernel_info and kernel_info.setup_type_max.
============= ============================================================
- .. note::
+.. note::
The protocol version number should be changed only if the setup header
is changed. There is no need to update the version number if boot_params
or kernel_info are changed. Additionally, it is recommended to use
@@ -95,27 +95,27 @@ Memory Layout
The traditional memory map for the kernel loader, used for Image or
zImage kernels, typically looks like::
- | |
- 0A0000 +------------------------+
- | Reserved for BIOS | Do not use. Reserved for BIOS EBDA.
- 09A000 +------------------------+
- | Command line |
- | Stack/heap | For use by the kernel real-mode code.
- 098000 +------------------------+
- | Kernel setup | The kernel real-mode code.
- 090200 +------------------------+
- | Kernel boot sector | The kernel legacy boot sector.
- 090000 +------------------------+
- | Protected-mode kernel | The bulk of the kernel image.
- 010000 +------------------------+
- | Boot loader | <- Boot sector entry point 0000:7C00
- 001000 +------------------------+
- | Reserved for MBR/BIOS |
- 000800 +------------------------+
- | Typically used by MBR |
- 000600 +------------------------+
- | BIOS use only |
- 000000 +------------------------+
+ | |
+ 0A0000 +------------------------+
+ | Reserved for BIOS | Do not use. Reserved for BIOS EBDA.
+ 09A000 +------------------------+
+ | Command line |
+ | Stack/heap | For use by the kernel real-mode code.
+ 098000 +------------------------+
+ | Kernel setup | The kernel real-mode code.
+ 090200 +------------------------+
+ | Kernel boot sector | The kernel legacy boot sector.
+ 090000 +------------------------+
+ | Protected-mode kernel | The bulk of the kernel image.
+ 010000 +------------------------+
+ | Boot loader | <- Boot sector entry point 0000:7C00
+ 001000 +------------------------+
+ | Reserved for MBR/BIOS |
+ 000800 +------------------------+
+ | Typically used by MBR |
+ 000600 +------------------------+
+ | BIOS use only |
+ 000000 +------------------------+
When using bzImage, the protected-mode kernel was relocated to
0x100000 ("high memory"), and the kernel real-mode block (boot sector,
@@ -142,28 +142,28 @@ above the 0x9A000 point; too many BIOSes will break above that point.
For a modern bzImage kernel with boot protocol version >= 2.02, a
memory layout like the following is suggested::
- ~ ~
- | Protected-mode kernel |
- 100000 +------------------------+
- | I/O memory hole |
- 0A0000 +------------------------+
- | Reserved for BIOS | Leave as much as possible unused
- ~ ~
- | Command line | (Can also be below the X+10000 mark)
- X+10000 +------------------------+
- | Stack/heap | For use by the kernel real-mode code.
- X+08000 +------------------------+
- | Kernel setup | The kernel real-mode code.
- | Kernel boot sector | The kernel legacy boot sector.
- X +------------------------+
- | Boot loader | <- Boot sector entry point 0000:7C00
- 001000 +------------------------+
- | Reserved for MBR/BIOS |
- 000800 +------------------------+
- | Typically used by MBR |
- 000600 +------------------------+
- | BIOS use only |
- 000000 +------------------------+
+ ~ ~
+ | Protected-mode kernel |
+ 100000 +------------------------+
+ | I/O memory hole |
+ 0A0000 +------------------------+
+ | Reserved for BIOS | Leave as much as possible unused
+ ~ ~
+ | Command line | (Can also be below the X+10000 mark)
+ X+10000 +------------------------+
+ | Stack/heap | For use by the kernel real-mode code.
+ X+08000 +------------------------+
+ | Kernel setup | The kernel real-mode code.
+ | Kernel boot sector | The kernel legacy boot sector.
+ X +------------------------+
+ | Boot loader | <- Boot sector entry point 0000:7C00
+ 001000 +------------------------+
+ | Reserved for MBR/BIOS |
+ 000800 +------------------------+
+ | Typically used by MBR |
+ 000600 +------------------------+
+ | BIOS use only |
+ 000000 +------------------------+
... where the address X is as low as the design of the boot loader permits.
@@ -229,22 +229,22 @@ Offset/Size Proto Name Meaning
=========== ======== ===================== ============================================
.. note::
- (1) For backwards compatibility, if the setup_sects field contains 0, the
- real value is 4.
+ (1) For backwards compatibility, if the setup_sects field contains 0,
+ the real value is 4.
- (2) For boot protocol prior to 2.04, the upper two bytes of the syssize
- field are unusable, which means the size of a bzImage kernel
- cannot be determined.
+ (2) For boot protocol prior to 2.04, the upper two bytes of the syssize
+ field are unusable, which means the size of a bzImage kernel
+ cannot be determined.
- (3) Ignored, but safe to set, for boot protocols 2.02-2.09.
+ (3) Ignored, but safe to set, for boot protocols 2.02-2.09.
If the "HdrS" (0x53726448) magic number is not found at offset 0x202,
the boot protocol version is "old". Loading an old kernel, the
following parameters should be assumed::
- Image type = zImage
- initrd not supported
- Real-mode kernel must be located at 0x90000.
+ Image type = zImage
+ initrd not supported
+ Real-mode kernel must be located at 0x90000.
Otherwise, the "version" field contains the protocol version,
e.g. protocol version 2.01 will contain 0x0201 in this field. When
@@ -265,7 +265,7 @@ All general purpose boot loaders should write the fields marked
nonstandard address should fill in the fields marked (reloc); other
boot loaders can ignore those fields.
-The byte order of all fields is littleendian (this is x86, after all.)
+The byte order of all fields is little endian (this is x86, after all.)
============ ===========
Field name: setup_sects
@@ -365,7 +365,7 @@ Offset/size: 0x206/2
Protocol: 2.00+
============ =======
- Contains the boot protocol version, in (major << 8)+minor format,
+ Contains the boot protocol version, in (major << 8) + minor format,
e.g. 0x0204 for version 2.04, and 0x0a11 for a hypothetical version
10.17.
@@ -397,17 +397,17 @@ Protocol: 2.00+
If set to a nonzero value, contains a pointer to a NUL-terminated
human-readable kernel version number string, less 0x200. This can
be used to display the kernel version to the user. This value
- should be less than (0x200*setup_sects).
+ should be less than (0x200 * setup_sects).
For example, if this value is set to 0x1c00, the kernel version
number string can be found at offset 0x1e00 in the kernel file.
This is a valid value if and only if the "setup_sects" field
contains the value 15 or higher, as::
- 0x1c00 < 15*0x200 (= 0x1e00) but
- 0x1c00 >= 14*0x200 (= 0x1c00)
+ 0x1c00 < 15 * 0x200 (= 0x1e00) but
+ 0x1c00 >= 14 * 0x200 (= 0x1c00)
- 0x1c00 >> 9 = 14, So the minimum value for setup_secs is 15.
+ 0x1c00 >> 9 = 14, So the minimum value for setup_secs is 15.
============ ==================
Field name: type_of_loader
@@ -427,9 +427,9 @@ Protocol: 2.00+
For example, for T = 0x15, V = 0x234, write::
- type_of_loader <- 0xE4
- ext_loader_type <- 0x05
- ext_loader_ver <- 0x23
+ type_of_loader <- 0xE4
+ ext_loader_type <- 0x05
+ ext_loader_ver <- 0x23
Assigned boot loader ids (hexadecimal):
@@ -686,7 +686,7 @@ Protocol: 2.10+
If a boot loader makes use of this field, it should update the
kernel_alignment field with the alignment unit desired; typically::
- kernel_alignment = 1 << min_alignment
+ kernel_alignment = 1 << min_alignment;
There may be a considerable performance cost with an excessively
misaligned kernel. Therefore, a loader should typically try each
@@ -754,7 +754,7 @@ Protocol: 2.07+
0x00000000 The default x86/PC environment
0x00000001 lguest
0x00000002 Xen
- 0x00000003 Moorestown MID
+ 0x00000003 Intel MID (Moorestown, CloverTrail, Merrifield, Moorefield)
0x00000004 CE4100 TV Platform
========== ==============================
@@ -808,13 +808,13 @@ Protocol: 2.09+
parameters passing mechanism. The definition of struct setup_data is
as follow::
- struct setup_data {
- u64 next;
- u32 type;
- u32 len;
- u8 data[0];
- };
-
+ struct setup_data {
+ __u64 next;
+ __u32 type;
+ __u32 len;
+ __u8 data[];
+ }
+
Where, the next is a 64-bit physical pointer to the next node of
linked list, the next field of the last node is 0; the type is used
to identify the contents of data; the len is the length of data
@@ -834,12 +834,12 @@ Protocol: 2.09+
Thus setup_indirect struct and SETUP_INDIRECT type were introduced in
protocol 2.15::
- struct setup_indirect {
- __u32 type;
- __u32 reserved; /* Reserved, must be set to zero. */
- __u64 len;
- __u64 addr;
- };
+ struct setup_indirect {
+ __u32 type;
+ __u32 reserved; /* Reserved, must be set to zero. */
+ __u64 len;
+ __u64 addr;
+ };
The type member is a SETUP_INDIRECT | SETUP_* type. However, it cannot be
SETUP_INDIRECT itself since making the setup_indirect a tree structure
@@ -849,17 +849,17 @@ Protocol: 2.09+
Let's give an example how to point to SETUP_E820_EXT data using setup_indirect.
In this case setup_data and setup_indirect will look like this::
- struct setup_data {
- __u64 next = 0 or <addr_of_next_setup_data_struct>;
- __u32 type = SETUP_INDIRECT;
- __u32 len = sizeof(setup_indirect);
- __u8 data[sizeof(setup_indirect)] = struct setup_indirect {
- __u32 type = SETUP_INDIRECT | SETUP_E820_EXT;
- __u32 reserved = 0;
- __u64 len = <len_of_SETUP_E820_EXT_data>;
- __u64 addr = <addr_of_SETUP_E820_EXT_data>;
- }
- }
+ struct setup_data {
+ .next = 0, /* or <addr_of_next_setup_data_struct> */
+ .type = SETUP_INDIRECT,
+ .len = sizeof(setup_indirect),
+ .data[sizeof(setup_indirect)] = (struct setup_indirect) {
+ .type = SETUP_INDIRECT | SETUP_E820_EXT,
+ .reserved = 0,
+ .len = <len_of_SETUP_E820_EXT_data>,
+ .addr = <addr_of_SETUP_E820_EXT_data>,
+ },
+ }
.. note::
SETUP_INDIRECT | SETUP_NONE objects cannot be properly distinguished
@@ -896,10 +896,19 @@ Offset/size: 0x260/4
The kernel runtime start address is determined by the following algorithm::
- if (relocatable_kernel)
- runtime_start = align_up(load_address, kernel_alignment)
- else
- runtime_start = pref_address
+ if (relocatable_kernel) {
+ if (load_address < pref_address)
+ load_address = pref_address;
+ runtime_start = align_up(load_address, kernel_alignment);
+ } else {
+ runtime_start = pref_address;
+ }
+
+Hence the necessary memory window location and size can be estimated by
+a boot loader as::
+
+ memory_window_start = runtime_start;
+ memory_window_size = init_size;
============ ===============
Field name: handover_offset
@@ -929,12 +938,12 @@ The kernel_info
===============
The relationships between the headers are analogous to the various data
-sections:
+sections::
setup_header = .data
boot_params/setup_data = .bss
-What is missing from the above list? That's right:
+What is missing from the above list? That's right::
kernel_info = .rodata
@@ -966,22 +975,22 @@ after kernel_info_var_len_data label. Each chunk of variable size data has to
be prefixed with header/magic and its size, e.g.::
kernel_info:
- .ascii "LToP" /* Header, Linux top (structure). */
- .long kernel_info_var_len_data - kernel_info
- .long kernel_info_end - kernel_info
- .long 0x01234567 /* Some fixed size data for the bootloaders. */
+ .ascii "LToP" /* Header, Linux top (structure). */
+ .long kernel_info_var_len_data - kernel_info
+ .long kernel_info_end - kernel_info
+ .long 0x01234567 /* Some fixed size data for the bootloaders. */
kernel_info_var_len_data:
- example_struct: /* Some variable size data for the bootloaders. */
- .ascii "0123" /* Header/Magic. */
- .long example_struct_end - example_struct
- .ascii "Struct"
- .long 0x89012345
+ example_struct: /* Some variable size data for the bootloaders. */
+ .ascii "0123" /* Header/Magic. */
+ .long example_struct_end - example_struct
+ .ascii "Struct"
+ .long 0x89012345
example_struct_end:
- example_strings: /* Some variable size data for the bootloaders. */
- .ascii "ABCD" /* Header/Magic. */
- .long example_strings_end - example_strings
- .asciz "String_0"
- .asciz "String_1"
+ example_strings: /* Some variable size data for the bootloaders. */
+ .ascii "ABCD" /* Header/Magic. */
+ .long example_strings_end - example_strings
+ .asciz "String_0"
+ .asciz "String_1"
example_strings_end:
kernel_info_end:
@@ -1130,67 +1139,63 @@ mode segment.
Such a boot loader should enter the following fields in the header::
- unsigned long base_ptr; /* base address for real-mode segment */
+ unsigned long base_ptr; /* base address for real-mode segment */
- if ( setup_sects == 0 ) {
- setup_sects = 4;
- }
+ if (setup_sects == 0)
+ setup_sects = 4;
- if ( protocol >= 0x0200 ) {
- type_of_loader = <type code>;
- if ( loading_initrd ) {
- ramdisk_image = <initrd_address>;
- ramdisk_size = <initrd_size>;
- }
+ if (protocol >= 0x0200) {
+ type_of_loader = <type code>;
+ if (loading_initrd) {
+ ramdisk_image = <initrd_address>;
+ ramdisk_size = <initrd_size>;
+ }
- if ( protocol >= 0x0202 && loadflags & 0x01 )
- heap_end = 0xe000;
- else
- heap_end = 0x9800;
+ if (protocol >= 0x0202 && loadflags & 0x01)
+ heap_end = 0xe000;
+ else
+ heap_end = 0x9800;
- if ( protocol >= 0x0201 ) {
- heap_end_ptr = heap_end - 0x200;
- loadflags |= 0x80; /* CAN_USE_HEAP */
- }
+ if (protocol >= 0x0201) {
+ heap_end_ptr = heap_end - 0x200;
+ loadflags |= 0x80; /* CAN_USE_HEAP */
+ }
- if ( protocol >= 0x0202 ) {
- cmd_line_ptr = base_ptr + heap_end;
- strcpy(cmd_line_ptr, cmdline);
- } else {
- cmd_line_magic = 0xA33F;
- cmd_line_offset = heap_end;
- setup_move_size = heap_end + strlen(cmdline)+1;
- strcpy(base_ptr+cmd_line_offset, cmdline);
- }
- } else {
- /* Very old kernel */
+ if (protocol >= 0x0202) {
+ cmd_line_ptr = base_ptr + heap_end;
+ strcpy(cmd_line_ptr, cmdline);
+ } else {
+ cmd_line_magic = 0xA33F;
+ cmd_line_offset = heap_end;
+ setup_move_size = heap_end + strlen(cmdline) + 1;
+ strcpy(base_ptr + cmd_line_offset, cmdline);
+ }
+ } else {
+ /* Very old kernel */
- heap_end = 0x9800;
+ heap_end = 0x9800;
- cmd_line_magic = 0xA33F;
- cmd_line_offset = heap_end;
+ cmd_line_magic = 0xA33F;
+ cmd_line_offset = heap_end;
- /* A very old kernel MUST have its real-mode code
- loaded at 0x90000 */
+ /* A very old kernel MUST have its real-mode code loaded at 0x90000 */
+ if (base_ptr != 0x90000) {
+ /* Copy the real-mode kernel */
+ memcpy(0x90000, base_ptr, (setup_sects + 1) * 512);
+ base_ptr = 0x90000; /* Relocated */
+ }
- if ( base_ptr != 0x90000 ) {
- /* Copy the real-mode kernel */
- memcpy(0x90000, base_ptr, (setup_sects+1)*512);
- base_ptr = 0x90000; /* Relocated */
- }
+ strcpy(0x90000 + cmd_line_offset, cmdline);
- strcpy(0x90000+cmd_line_offset, cmdline);
-
- /* It is recommended to clear memory up to the 32K mark */
- memset(0x90000 + (setup_sects+1)*512, 0,
- (64-(setup_sects+1))*512);
- }
+ /* It is recommended to clear memory up to the 32K mark */
+ memset(0x90000 + (setup_sects + 1) * 512, 0, (64 - (setup_sects + 1)) * 512);
+ }
Loading The Rest of The Kernel
==============================
-The 32-bit (non-real-mode) kernel starts at offset (setup_sects+1)*512
+The 32-bit (non-real-mode) kernel starts at offset (setup_sects + 1) * 512
in the kernel file (again, if setup_sects == 0 the real value is 4.)
It should be loaded at address 0x10000 for Image/zImage kernels and
0x100000 for bzImage kernels.
@@ -1198,13 +1203,14 @@ It should be loaded at address 0x10000 for Image/zImage kernels and
The kernel is a bzImage kernel if the protocol >= 2.00 and the 0x01
bit (LOAD_HIGH) in the loadflags field is set::
- is_bzImage = (protocol >= 0x0200) && (loadflags & 0x01);
- load_address = is_bzImage ? 0x100000 : 0x10000;
+ is_bzImage = (protocol >= 0x0200) && (loadflags & 0x01);
+ load_address = is_bzImage ? 0x100000 : 0x10000;
-Note that Image/zImage kernels can be up to 512K in size, and thus use
-the entire 0x10000-0x90000 range of memory. This means it is pretty
-much a requirement for these kernels to load the real-mode part at
-0x90000. bzImage kernels allow much more flexibility.
+.. note::
+ Image/zImage kernels can be up to 512K in size, and thus use the entire
+ 0x10000-0x90000 range of memory. This means it is pretty much a
+ requirement for these kernels to load the real-mode part at 0x90000.
+ bzImage kernels allow much more flexibility.
Special Command Line Options
============================
@@ -1273,19 +1279,20 @@ es = ss.
In our example from above, we would do::
- /* Note: in the case of the "old" kernel protocol, base_ptr must
- be == 0x90000 at this point; see the previous sample code */
+ /*
+ * Note: in the case of the "old" kernel protocol, base_ptr must
+ * be == 0x90000 at this point; see the previous sample code.
+ */
+ seg = base_ptr >> 4;
- seg = base_ptr >> 4;
+ cli(); /* Enter with interrupts disabled! */
- cli(); /* Enter with interrupts disabled! */
+ /* Set up the real-mode kernel stack */
+ _SS = seg;
+ _SP = heap_end;
- /* Set up the real-mode kernel stack */
- _SS = seg;
- _SP = heap_end;
-
- _DS = _ES = _FS = _GS = seg;
- jmp_far(seg+0x20, 0); /* Run the kernel */
+ _DS = _ES = _FS = _GS = seg;
+ jmp_far(seg + 0x20, 0); /* Run the kernel */
If your boot sector accesses a floppy drive, it is recommended to
switch off the floppy motor before running the kernel, since the
@@ -1340,7 +1347,7 @@ from offset 0x01f1 of kernel image on should be loaded into struct
boot_params and examined. The end of setup header can be calculated as
follow::
- 0x0202 + byte value at offset 0x0201
+ 0x0202 + byte value at offset 0x0201
In addition to read/modify/write the setup header of the struct
boot_params as that of 16-bit boot protocol, the boot loader should
@@ -1376,7 +1383,7 @@ Then, the setup header at offset 0x01f1 of kernel image on should be
loaded into struct boot_params and examined. The end of setup header
can be calculated as follows::
- 0x0202 + byte value at offset 0x0201
+ 0x0202 + byte value at offset 0x0201
In addition to read/modify/write the setup header of the struct
boot_params as that of 16-bit boot protocol, the boot loader should
@@ -1418,7 +1425,7 @@ execution context provided by the EFI firmware.
The function prototype for the handover entry point looks like this::
- efi_stub_entry(void *handle, efi_system_table_t *table, struct boot_params *bp)
+ void efi_stub_entry(void *handle, efi_system_table_t *table, struct boot_params *bp);
'handle' is the EFI image handle passed to the boot loader by the EFI
firmware, 'table' is the EFI system table - these are the first two
@@ -1433,12 +1440,13 @@ The boot loader *must* fill out the following fields in bp::
All other fields should be zero.
-NOTE: The EFI Handover Protocol is deprecated in favour of the ordinary PE/COFF
- entry point, combined with the LINUX_EFI_INITRD_MEDIA_GUID based initrd
- loading protocol (refer to [0] for an example of the bootloader side of
- this), which removes the need for any knowledge on the part of the EFI
- bootloader regarding the internal representation of boot_params or any
- requirements/limitations regarding the placement of the command line
- and ramdisk in memory, or the placement of the kernel image itself.
+.. note::
+ The EFI Handover Protocol is deprecated in favour of the ordinary PE/COFF
+ entry point, combined with the LINUX_EFI_INITRD_MEDIA_GUID based initrd
+ loading protocol (refer to [0] for an example of the bootloader side of
+ this), which removes the need for any knowledge on the part of the EFI
+ bootloader regarding the internal representation of boot_params or any
+ requirements/limitations regarding the placement of the command line
+ and ramdisk in memory, or the placement of the kernel image itself.
[0] https://github.com/u-boot/u-boot/commit/ec80b4735a593961fe701cc3a5d717d4739b0fd0
diff --git a/Documentation/arch/x86/buslock.rst b/Documentation/arch/x86/buslock.rst
index 4c5a4822eeb7..31f1bfdff16f 100644
--- a/Documentation/arch/x86/buslock.rst
+++ b/Documentation/arch/x86/buslock.rst
@@ -26,7 +26,8 @@ Detection
=========
Intel processors may support either or both of the following hardware
-mechanisms to detect split locks and bus locks.
+mechanisms to detect split locks and bus locks. Some AMD processors also
+support bus lock detect.
#AC exception for split lock detection
--------------------------------------
diff --git a/Documentation/arch/x86/cpuinfo.rst b/Documentation/arch/x86/cpuinfo.rst
index 8895784d4784..6ef426a52cdc 100644
--- a/Documentation/arch/x86/cpuinfo.rst
+++ b/Documentation/arch/x86/cpuinfo.rst
@@ -112,7 +112,7 @@ conditions are met, the features are enabled by the set_cpu_cap or
setup_force_cpu_cap macros. For example, if bit 5 is set in MSR_IA32_CORE_CAPS,
the feature X86_FEATURE_SPLIT_LOCK_DETECT will be enabled and
"split_lock_detect" will be displayed. The flag "ring3mwait" will be
-displayed only when running on INTEL_FAM6_XEON_PHI_[KNL|KNM] processors.
+displayed only when running on INTEL_XEON_PHI_[KNL|KNM] processors.
d: Flags can represent purely software features.
------------------------------------------------
diff --git a/Documentation/arch/x86/exception-tables.rst b/Documentation/arch/x86/exception-tables.rst
index efde1fef4fbd..6e7177363f8f 100644
--- a/Documentation/arch/x86/exception-tables.rst
+++ b/Documentation/arch/x86/exception-tables.rst
@@ -297,7 +297,7 @@ vma occurs?
c) execution continues at local label 2 (address of the
instruction immediately after the faulting user access).
-The steps 8a to 8c in a certain way emulate the faulting instruction.
+ The steps a to c above in a certain way emulate the faulting instruction.
That's it, mostly. If you look at our example, you might ask why
we set EAX to -EFAULT in the exception handler code. Well, the
diff --git a/Documentation/arch/x86/mds.rst b/Documentation/arch/x86/mds.rst
index c58c72362911..5a2e6c0ef04a 100644
--- a/Documentation/arch/x86/mds.rst
+++ b/Documentation/arch/x86/mds.rst
@@ -162,7 +162,7 @@ Mitigation points
3. It would take a large number of these precisely-timed NMIs to mount
an actual attack. There's presumably not enough bandwidth.
4. The NMI in question occurs after a VERW, i.e. when user state is
- restored and most interesting data is already scrubbed. Whats left
+ restored and most interesting data is already scrubbed. What's left
is only the data that NMI touches, and that may or may not be of
any interest.
diff --git a/Documentation/arch/x86/resctrl.rst b/Documentation/arch/x86/resctrl.rst
index 6c245582d8fb..6768fc1fad16 100644
--- a/Documentation/arch/x86/resctrl.rst
+++ b/Documentation/arch/x86/resctrl.rst
@@ -375,11 +375,25 @@ When monitoring is enabled all MON groups will also contain:
all tasks in the group. In CTRL_MON groups these files provide
the sum for all tasks in the CTRL_MON group and all tasks in
MON groups. Please see example section for more details on usage.
+ On systems with Sub-NUMA Cluster (SNC) enabled there are extra
+ directories for each node (located within the "mon_L3_XX" directory
+ for the L3 cache they occupy). These are named "mon_sub_L3_YY"
+ where "YY" is the node number.
"mon_hw_id":
Available only with debug option. The identifier used by hardware
for the monitor group. On x86 this is the RMID.
+When the "mba_MBps" mount option is used all CTRL_MON groups will also contain:
+
+"mba_MBps_event":
+ Reading this file shows which memory bandwidth event is used
+ as input to the software feedback loop that keeps memory bandwidth
+ below the value specified in the schemata file. Writing the
+ name of one of the supported memory bandwidth events found in
+ /sys/fs/resctrl/info/L3_MON/mon_features changes the input
+ event.
+
Resource allocation rules
-------------------------
@@ -446,6 +460,12 @@ during mkdir.
max_threshold_occupancy is a user configurable value to determine the
occupancy at which an RMID can be freed.
+The mon_llc_occupancy_limbo tracepoint gives the precise occupancy in bytes
+for a subset of RMID that are not immediately available for allocation.
+This can't be relied on to produce output every second, it may be necessary
+to attempt to create an empty monitor group to force an update. Output may
+only be produced if creation of a control or monitor group fails.
+
Schemata files - general concepts
---------------------------------
Each line in the file describes one resource. The line starts with
@@ -478,6 +498,29 @@ if non-contiguous 1s value is supported. On a system with a 20-bit mask
each bit represents 5% of the capacity of the cache. You could partition
the cache into four equal parts with masks: 0x1f, 0x3e0, 0x7c00, 0xf8000.
+Notes on Sub-NUMA Cluster mode
+==============================
+When SNC mode is enabled, Linux may load balance tasks between Sub-NUMA
+nodes much more readily than between regular NUMA nodes since the CPUs
+on Sub-NUMA nodes share the same L3 cache and the system may report
+the NUMA distance between Sub-NUMA nodes with a lower value than used
+for regular NUMA nodes.
+
+The top-level monitoring files in each "mon_L3_XX" directory provide
+the sum of data across all SNC nodes sharing an L3 cache instance.
+Users who bind tasks to the CPUs of a specific Sub-NUMA node can read
+the "llc_occupancy", "mbm_total_bytes", and "mbm_local_bytes" in the
+"mon_sub_L3_YY" directories to get node local data.
+
+Memory bandwidth allocation is still performed at the L3 cache
+level. I.e. throttling controls are applied to all SNC nodes.
+
+L3 cache allocation bitmaps also apply to all SNC nodes. But note that
+the amount of L3 cache represented by each bit is divided by the number
+of SNC nodes per L3 cache. E.g. with a 100MB cache on a system with 10-bit
+allocation masks each bit normally represents 10MB. With SNC mode enabled
+with two SNC nodes per L3 cache, each bit only represents 5MB.
+
Memory bandwidth Allocation and monitoring
==========================================
diff --git a/Documentation/arch/x86/topology.rst b/Documentation/arch/x86/topology.rst
index 7352ab89a55a..c12837e61bda 100644
--- a/Documentation/arch/x86/topology.rst
+++ b/Documentation/arch/x86/topology.rst
@@ -135,6 +135,10 @@ Thread-related topology information in the kernel:
The ID of the core to which a thread belongs. It is also printed in /proc/cpuinfo
"core_id."
+ - topology_logical_core_id();
+
+ The logical core ID to which a thread belongs.
+
System topology examples
diff --git a/Documentation/arch/x86/x86_64/boot-options.rst b/Documentation/arch/x86/x86_64/boot-options.rst
deleted file mode 100644
index 137432d34109..000000000000
--- a/Documentation/arch/x86/x86_64/boot-options.rst
+++ /dev/null
@@ -1,319 +0,0 @@
-.. SPDX-License-Identifier: GPL-2.0
-
-===========================
-AMD64 Specific Boot Options
-===========================
-
-There are many others (usually documented in driver documentation), but
-only the AMD64 specific ones are listed here.
-
-Machine check
-=============
-Please see Documentation/arch/x86/x86_64/machinecheck.rst for sysfs runtime tunables.
-
- mce=off
- Disable machine check
- mce=no_cmci
- Disable CMCI(Corrected Machine Check Interrupt) that
- Intel processor supports. Usually this disablement is
- not recommended, but it might be handy if your hardware
- is misbehaving.
- Note that you'll get more problems without CMCI than with
- due to the shared banks, i.e. you might get duplicated
- error logs.
- mce=dont_log_ce
- Don't make logs for corrected errors. All events reported
- as corrected are silently cleared by OS.
- This option will be useful if you have no interest in any
- of corrected errors.
- mce=ignore_ce
- Disable features for corrected errors, e.g. polling timer
- and CMCI. All events reported as corrected are not cleared
- by OS and remained in its error banks.
- Usually this disablement is not recommended, however if
- there is an agent checking/clearing corrected errors
- (e.g. BIOS or hardware monitoring applications), conflicting
- with OS's error handling, and you cannot deactivate the agent,
- then this option will be a help.
- mce=no_lmce
- Do not opt-in to Local MCE delivery. Use legacy method
- to broadcast MCEs.
- mce=bootlog
- Enable logging of machine checks left over from booting.
- Disabled by default on AMD Fam10h and older because some BIOS
- leave bogus ones.
- If your BIOS doesn't do that it's a good idea to enable though
- to make sure you log even machine check events that result
- in a reboot. On Intel systems it is enabled by default.
- mce=nobootlog
- Disable boot machine check logging.
- mce=monarchtimeout (number)
- monarchtimeout:
- Sets the time in us to wait for other CPUs on machine checks. 0
- to disable.
- mce=bios_cmci_threshold
- Don't overwrite the bios-set CMCI threshold. This boot option
- prevents Linux from overwriting the CMCI threshold set by the
- bios. Without this option, Linux always sets the CMCI
- threshold to 1. Enabling this may make memory predictive failure
- analysis less effective if the bios sets thresholds for memory
- errors since we will not see details for all errors.
- mce=recovery
- Force-enable recoverable machine check code paths
-
- nomce (for compatibility with i386)
- same as mce=off
-
- Everything else is in sysfs now.
-
-APICs
-=====
-
- apic
- Use IO-APIC. Default
-
- noapic
- Don't use the IO-APIC.
-
- disableapic
- Don't use the local APIC
-
- nolapic
- Don't use the local APIC (alias for i386 compatibility)
-
- pirq=...
- See Documentation/arch/x86/i386/IO-APIC.rst
-
- noapictimer
- Don't set up the APIC timer
-
- no_timer_check
- Don't check the IO-APIC timer. This can work around
- problems with incorrect timer initialization on some boards.
-
- apicpmtimer
- Do APIC timer calibration using the pmtimer. Implies
- apicmaintimer. Useful when your PIT timer is totally broken.
-
-Timing
-======
-
- notsc
- Deprecated, use tsc=unstable instead.
-
- nohpet
- Don't use the HPET timer.
-
-Idle loop
-=========
-
- idle=poll
- Don't do power saving in the idle loop using HLT, but poll for rescheduling
- event. This will make the CPUs eat a lot more power, but may be useful
- to get slightly better performance in multiprocessor benchmarks. It also
- makes some profiling using performance counters more accurate.
- Please note that on systems with MONITOR/MWAIT support (like Intel EM64T
- CPUs) this option has no performance advantage over the normal idle loop.
- It may also interact badly with hyperthreading.
-
-Rebooting
-=========
-
- reboot=b[ios] | t[riple] | k[bd] | a[cpi] | e[fi] | p[ci] [, [w]arm | [c]old]
- bios
- Use the CPU reboot vector for warm reset
- warm
- Don't set the cold reboot flag
- cold
- Set the cold reboot flag
- triple
- Force a triple fault (init)
- kbd
- Use the keyboard controller. cold reset (default)
- acpi
- Use the ACPI RESET_REG in the FADT. If ACPI is not configured or
- the ACPI reset does not work, the reboot path attempts the reset
- using the keyboard controller.
- efi
- Use efi reset_system runtime service. If EFI is not configured or
- the EFI reset does not work, the reboot path attempts the reset using
- the keyboard controller.
- pci
- Use a write to the PCI config space register 0xcf9 to trigger reboot.
-
- Using warm reset will be much faster especially on big memory
- systems because the BIOS will not go through the memory check.
- Disadvantage is that not all hardware will be completely reinitialized
- on reboot so there may be boot problems on some systems.
-
- reboot=force
- Don't stop other CPUs on reboot. This can make reboot more reliable
- in some cases.
-
- reboot=default
- There are some built-in platform specific "quirks" - you may see:
- "reboot: <name> series board detected. Selecting <type> for reboots."
- In the case where you think the quirk is in error (e.g. you have
- newer BIOS, or newer board) using this option will ignore the built-in
- quirk table, and use the generic default reboot actions.
-
-NUMA
-====
-
- numa=off
- Only set up a single NUMA node spanning all memory.
-
- numa=noacpi
- Don't parse the SRAT table for NUMA setup
-
- numa=nohmat
- Don't parse the HMAT table for NUMA setup, or soft-reserved memory
- partitioning.
-
- numa=fake=<size>[MG]
- If given as a memory unit, fills all system RAM with nodes of
- size interleaved over physical nodes.
-
- numa=fake=<N>
- If given as an integer, fills all system RAM with N fake nodes
- interleaved over physical nodes.
-
- numa=fake=<N>U
- If given as an integer followed by 'U', it will divide each
- physical node into N emulated nodes.
-
-ACPI
-====
-
- acpi=off
- Don't enable ACPI
- acpi=ht
- Use ACPI boot table parsing, but don't enable ACPI interpreter
- acpi=force
- Force ACPI on (currently not needed)
- acpi=strict
- Disable out of spec ACPI workarounds.
- acpi_sci={edge,level,high,low}
- Set up ACPI SCI interrupt.
- acpi=noirq
- Don't route interrupts
- acpi=nocmcff
- Disable firmware first mode for corrected errors. This
- disables parsing the HEST CMC error source to check if
- firmware has set the FF flag. This may result in
- duplicate corrected error reports.
-
-PCI
-===
-
- pci=off
- Don't use PCI
- pci=conf1
- Use conf1 access.
- pci=conf2
- Use conf2 access.
- pci=rom
- Assign ROMs.
- pci=assign-busses
- Assign busses
- pci=irqmask=MASK
- Set PCI interrupt mask to MASK
- pci=lastbus=NUMBER
- Scan up to NUMBER busses, no matter what the mptable says.
- pci=noacpi
- Don't use ACPI to set up PCI interrupt routing.
-
-IOMMU (input/output memory management unit)
-===========================================
-Multiple x86-64 PCI-DMA mapping implementations exist, for example:
-
- 1. <kernel/dma/direct.c>: use no hardware/software IOMMU at all
- (e.g. because you have < 3 GB memory).
- Kernel boot message: "PCI-DMA: Disabling IOMMU"
-
- 2. <arch/x86/kernel/amd_gart_64.c>: AMD GART based hardware IOMMU.
- Kernel boot message: "PCI-DMA: using GART IOMMU"
-
- 3. <arch/x86_64/kernel/pci-swiotlb.c> : Software IOMMU implementation. Used
- e.g. if there is no hardware IOMMU in the system and it is need because
- you have >3GB memory or told the kernel to us it (iommu=soft))
- Kernel boot message: "PCI-DMA: Using software bounce buffering
- for IO (SWIOTLB)"
-
-::
-
- iommu=[<size>][,noagp][,off][,force][,noforce]
- [,memaper[=<order>]][,merge][,fullflush][,nomerge]
- [,noaperture]
-
-General iommu options:
-
- off
- Don't initialize and use any kind of IOMMU.
- noforce
- Don't force hardware IOMMU usage when it is not needed. (default).
- force
- Force the use of the hardware IOMMU even when it is
- not actually needed (e.g. because < 3 GB memory).
- soft
- Use software bounce buffering (SWIOTLB) (default for
- Intel machines). This can be used to prevent the usage
- of an available hardware IOMMU.
-
-iommu options only relevant to the AMD GART hardware IOMMU:
-
- <size>
- Set the size of the remapping area in bytes.
- allowed
- Overwrite iommu off workarounds for specific chipsets.
- fullflush
- Flush IOMMU on each allocation (default).
- nofullflush
- Don't use IOMMU fullflush.
- memaper[=<order>]
- Allocate an own aperture over RAM with size 32MB<<order.
- (default: order=1, i.e. 64MB)
- merge
- Do scatter-gather (SG) merging. Implies "force" (experimental).
- nomerge
- Don't do scatter-gather (SG) merging.
- noaperture
- Ask the IOMMU not to touch the aperture for AGP.
- noagp
- Don't initialize the AGP driver and use full aperture.
- panic
- Always panic when IOMMU overflows.
-
-iommu options only relevant to the software bounce buffering (SWIOTLB) IOMMU
-implementation:
-
- swiotlb=<slots>[,force,noforce]
- <slots>
- Prereserve that many 2K slots for the software IO bounce buffering.
- force
- Force all IO through the software TLB.
- noforce
- Do not initialize the software TLB.
-
-
-Miscellaneous
-=============
-
- nogbpages
- Do not use GB pages for kernel direct mappings.
- gbpages
- Use GB pages for kernel direct mappings.
-
-
-AMD SEV (Secure Encrypted Virtualization)
-=========================================
-Options relating to AMD SEV, specified via the following format:
-
-::
-
- sev=option1[,option2]
-
-The available options are:
-
- debug
- Enable debug messages.
diff --git a/Documentation/arch/x86/x86_64/fake-numa-for-cpusets.rst b/Documentation/arch/x86/x86_64/fake-numa-for-cpusets.rst
index ba74617d4999..970ee94eb551 100644
--- a/Documentation/arch/x86/x86_64/fake-numa-for-cpusets.rst
+++ b/Documentation/arch/x86/x86_64/fake-numa-for-cpusets.rst
@@ -18,7 +18,7 @@ For more information on the features of cpusets, see
Documentation/admin-guide/cgroup-v1/cpusets.rst.
There are a number of different configurations you can use for your needs. For
more information on the numa=fake command line option and its various ways of
-configuring fake nodes, see Documentation/arch/x86/x86_64/boot-options.rst.
+configuring fake nodes, see Documentation/admin-guide/kernel-parameters.txt
For the purposes of this introduction, we'll assume a very primitive NUMA
emulation setup of "numa=fake=4*512,". This will split our system memory into
diff --git a/Documentation/arch/x86/x86_64/fsgs.rst b/Documentation/arch/x86/x86_64/fsgs.rst
index 50960e09e1f6..d07e445dac5c 100644
--- a/Documentation/arch/x86/x86_64/fsgs.rst
+++ b/Documentation/arch/x86/x86_64/fsgs.rst
@@ -125,7 +125,7 @@ FSGSBASE instructions enablement
FSGSBASE instructions compiler support
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-GCC version 4.6.4 and newer provide instrinsics for the FSGSBASE
+GCC version 4.6.4 and newer provide intrinsics for the FSGSBASE
instructions. Clang 5 supports them as well.
=================== ===========================
@@ -135,7 +135,7 @@ instructions. Clang 5 supports them as well.
_writegsbase_u64() Write the GS base register
=================== ===========================
-To utilize these instrinsics <immintrin.h> must be included in the source
+To utilize these intrinsics <immintrin.h> must be included in the source
code and the compiler option -mfsgsbase has to be added.
Compiler support for FS/GS based addressing
diff --git a/Documentation/arch/x86/x86_64/index.rst b/Documentation/arch/x86/x86_64/index.rst
index ad15e9bd623f..a0261957a08a 100644
--- a/Documentation/arch/x86/x86_64/index.rst
+++ b/Documentation/arch/x86/x86_64/index.rst
@@ -7,7 +7,6 @@ x86_64 Support
.. toctree::
:maxdepth: 2
- boot-options
uefi
mm
5level-paging
diff --git a/Documentation/arch/x86/x86_64/mm.rst b/Documentation/arch/x86/x86_64/mm.rst
index 35e5e18c83d0..f2db178b353f 100644
--- a/Documentation/arch/x86/x86_64/mm.rst
+++ b/Documentation/arch/x86/x86_64/mm.rst
@@ -29,15 +29,27 @@ Complete virtual memory map with 4-level page tables
Start addr | Offset | End addr | Size | VM area description
========================================================================================================================
| | | |
- 0000000000000000 | 0 | 00007fffffffffff | 128 TB | user-space virtual memory, different per mm
+ 0000000000000000 | 0 | 00007fffffffefff | ~128 TB | user-space virtual memory, different per mm
+ 00007ffffffff000 | ~128 TB | 00007fffffffffff | 4 kB | ... guard hole
__________________|____________|__________________|_________|___________________________________________________________
| | | |
- 0000800000000000 | +128 TB | ffff7fffffffffff | ~16M TB | ... huge, almost 64 bits wide hole of non-canonical
- | | | | virtual memory addresses up to the -128 TB
+ 0000800000000000 | +128 TB | 7fffffffffffffff | ~8 EB | ... huge, almost 63 bits wide hole of non-canonical
+ | | | | virtual memory addresses up to the -8 EB
| | | | starting offset of kernel mappings.
+ | | | |
+ | | | | LAM relaxes canonicallity check allowing to create aliases
+ | | | | for userspace memory here.
__________________|____________|__________________|_________|___________________________________________________________
|
| Kernel-space virtual memory, shared between all processes:
+ __________________|____________|__________________|_________|___________________________________________________________
+ | | | |
+ 8000000000000000 | -8 EB | ffff7fffffffffff | ~8 EB | ... huge, almost 63 bits wide hole of non-canonical
+ | | | | virtual memory addresses up to the -128 TB
+ | | | | starting offset of kernel mappings.
+ | | | |
+ | | | | LAM_SUP relaxes canonicallity check allowing to create
+ | | | | aliases for kernel memory here.
____________________________________________________________|___________________________________________________________
| | | |
ffff800000000000 | -128 TB | ffff87ffffffffff | 8 TB | ... guard hole, also reserved for hypervisor
@@ -88,16 +100,27 @@ Complete virtual memory map with 5-level page tables
Start addr | Offset | End addr | Size | VM area description
========================================================================================================================
| | | |
- 0000000000000000 | 0 | 00ffffffffffffff | 64 PB | user-space virtual memory, different per mm
+ 0000000000000000 | 0 | 00fffffffffff000 | ~64 PB | user-space virtual memory, different per mm
+ 00fffffffffff000 | ~64 PB | 00ffffffffffffff | 4 kB | ... guard hole
__________________|____________|__________________|_________|___________________________________________________________
| | | |
- 0100000000000000 | +64 PB | feffffffffffffff | ~16K PB | ... huge, still almost 64 bits wide hole of non-canonical
- | | | | virtual memory addresses up to the -64 PB
+ 0100000000000000 | +64 PB | 7fffffffffffffff | ~8 EB | ... huge, almost 63 bits wide hole of non-canonical
+ | | | | virtual memory addresses up to the -8EB TB
| | | | starting offset of kernel mappings.
+ | | | |
+ | | | | LAM relaxes canonicallity check allowing to create aliases
+ | | | | for userspace memory here.
__________________|____________|__________________|_________|___________________________________________________________
|
| Kernel-space virtual memory, shared between all processes:
____________________________________________________________|___________________________________________________________
+ 8000000000000000 | -8 EB | feffffffffffffff | ~8 EB | ... huge, almost 63 bits wide hole of non-canonical
+ | | | | virtual memory addresses up to the -64 PB
+ | | | | starting offset of kernel mappings.
+ | | | |
+ | | | | LAM_SUP relaxes canonicallity check allowing to create
+ | | | | aliases for kernel memory here.
+ ____________________________________________________________|___________________________________________________________
| | | |
ff00000000000000 | -64 PB | ff0fffffffffffff | 4 PB | ... guard hole, also reserved for hypervisor
ff10000000000000 | -60 PB | ff10ffffffffffff | 0.25 PB | LDT remap for PTI
diff --git a/Documentation/arch/x86/x86_64/uefi.rst b/Documentation/arch/x86/x86_64/uefi.rst
index fbc30c9a071d..e84592dbd6c1 100644
--- a/Documentation/arch/x86/x86_64/uefi.rst
+++ b/Documentation/arch/x86/x86_64/uefi.rst
@@ -12,14 +12,20 @@ with EFI firmware and specifications are listed below.
1. UEFI specification: http://www.uefi.org
-2. Booting Linux kernel on UEFI x86_64 platform requires bootloader
- support. Elilo with x86_64 support can be used.
+2. Booting Linux kernel on UEFI x86_64 platform can either be
+ done using the <Documentation/admin-guide/efi-stub.rst> or using a
+ separate bootloader.
3. x86_64 platform with EFI/UEFI firmware.
Mechanics
---------
+Refer to <Documentation/admin-guide/efi-stub.rst> to learn how to use the EFI stub.
+
+Below are general EFI setup guidelines on the x86_64 platform,
+regardless of whether you use the EFI stub or a separate bootloader.
+
- Build the kernel with the following configuration::
CONFIG_FB_EFI=y
@@ -31,16 +37,27 @@ Mechanics
CONFIG_EFI=y
CONFIG_EFIVAR_FS=y or m # optional
-- Create a VFAT partition on the disk
-- Copy the following to the VFAT partition:
+- Create a VFAT partition on the disk with the EFI System flag
+ You can do this with fdisk with the following commands:
+
+ 1. g - initialize a GPT partition table
+ 2. n - create a new partition
+ 3. t - change the partition type to "EFI System" (number 1)
+ 4. w - write and save the changes
+
+ Afterwards, initialize the VFAT filesystem by running mkfs::
+
+ mkfs.fat /dev/<your-partition>
+
+- Copy the boot files to the VFAT partition:
+ If you use the EFI stub method, the kernel acts also as an EFI executable.
+
+ You can just copy the bzImage to the EFI/boot/bootx64.efi path on the partition
+ so that it will automatically get booted, see the <Documentation/admin-guide/efi-stub.rst> page
+ for additional instructions regarding passage of kernel parameters and initramfs.
- elilo bootloader with x86_64 support, elilo configuration file,
- kernel image built in first step and corresponding
- initrd. Instructions on building elilo and its dependencies
- can be found in the elilo sourceforge project.
+ If you use a custom bootloader, refer to the relevant documentation for help on this part.
-- Boot to EFI shell and invoke elilo choosing the kernel image built
- in first step.
- If some or all EFI runtime services don't work, you can try following
kernel command line parameters to turn off some or all EFI runtime
services.
diff --git a/Documentation/arch/x86/xstate.rst b/Documentation/arch/x86/xstate.rst
index ae5c69e48b11..cec05ac464c1 100644
--- a/Documentation/arch/x86/xstate.rst
+++ b/Documentation/arch/x86/xstate.rst
@@ -138,7 +138,7 @@ Note this example does not include the sigaltstack preparation.
Dynamic features in signal frames
---------------------------------
-Dynamcally enabled features are not written to the signal frame upon signal
+Dynamically enabled features are not written to the signal frame upon signal
entry if the feature is in its initial configuration. This differs from
non-dynamic features which are always written regardless of their
configuration. Signal handlers can examine the XSAVE buffer's XSTATE_BV