diff options
Diffstat (limited to 'Documentation/arch/x86')
-rw-r--r-- | Documentation/arch/x86/amd-memory-encryption.rst | 147 | ||||
-rw-r--r-- | Documentation/arch/x86/amd_hsmp.rst | 67 | ||||
-rw-r--r-- | Documentation/arch/x86/boot.rst | 368 | ||||
-rw-r--r-- | Documentation/arch/x86/buslock.rst | 3 | ||||
-rw-r--r-- | Documentation/arch/x86/cpuinfo.rst | 2 | ||||
-rw-r--r-- | Documentation/arch/x86/exception-tables.rst | 2 | ||||
-rw-r--r-- | Documentation/arch/x86/mds.rst | 2 | ||||
-rw-r--r-- | Documentation/arch/x86/resctrl.rst | 43 | ||||
-rw-r--r-- | Documentation/arch/x86/topology.rst | 4 | ||||
-rw-r--r-- | Documentation/arch/x86/x86_64/boot-options.rst | 319 | ||||
-rw-r--r-- | Documentation/arch/x86/x86_64/fake-numa-for-cpusets.rst | 2 | ||||
-rw-r--r-- | Documentation/arch/x86/x86_64/fsgs.rst | 4 | ||||
-rw-r--r-- | Documentation/arch/x86/x86_64/index.rst | 1 | ||||
-rw-r--r-- | Documentation/arch/x86/x86_64/mm.rst | 35 | ||||
-rw-r--r-- | Documentation/arch/x86/x86_64/uefi.rst | 37 | ||||
-rw-r--r-- | Documentation/arch/x86/xstate.rst | 2 |
16 files changed, 508 insertions, 530 deletions
diff --git a/Documentation/arch/x86/amd-memory-encryption.rst b/Documentation/arch/x86/amd-memory-encryption.rst index 414bc7402ae7..bd840df708ea 100644 --- a/Documentation/arch/x86/amd-memory-encryption.rst +++ b/Documentation/arch/x86/amd-memory-encryption.rst @@ -130,4 +130,149 @@ SNP feature support. More details in AMD64 APM[1] Vol 2: 15.34.10 SEV_STATUS MSR -[1] https://www.amd.com/content/dam/amd/en/documents/processor-tech-docs/programmer-references/24593.pdf +Reverse Map Table (RMP) +======================= + +The RMP is a structure in system memory that is used to ensure a one-to-one +mapping between system physical addresses and guest physical addresses. Each +page of memory that is potentially assignable to guests has one entry within +the RMP. + +The RMP table can be either contiguous in memory or a collection of segments +in memory. + +Contiguous RMP +-------------- + +Support for this form of the RMP is present when support for SEV-SNP is +present, which can be determined using the CPUID instruction:: + + 0x8000001f[eax]: + Bit[4] indicates support for SEV-SNP + +The location of the RMP is identified to the hardware through two MSRs:: + + 0xc0010132 (RMP_BASE): + System physical address of the first byte of the RMP + + 0xc0010133 (RMP_END): + System physical address of the last byte of the RMP + +Hardware requires that RMP_BASE and (RPM_END + 1) be 8KB aligned, but SEV +firmware increases the alignment requirement to require a 1MB alignment. + +The RMP consists of a 16KB region used for processor bookkeeping followed +by the RMP entries, which are 16 bytes in size. The size of the RMP +determines the range of physical memory that the hypervisor can assign to +SEV-SNP guests. The RMP covers the system physical address from:: + + 0 to ((RMP_END + 1 - RMP_BASE - 16KB) / 16B) x 4KB. + +The current Linux support relies on BIOS to allocate/reserve the memory for +the RMP and to set RMP_BASE and RMP_END appropriately. Linux uses the MSR +values to locate the RMP and determine the size of the RMP. The RMP must +cover all of system memory in order for Linux to enable SEV-SNP. + +Segmented RMP +------------- + +Segmented RMP support is a new way of representing the layout of an RMP. +Initial RMP support required the RMP table to be contiguous in memory. +RMP accesses from a NUMA node on which the RMP doesn't reside +can take longer than accesses from a NUMA node on which the RMP resides. +Segmented RMP support allows the RMP entries to be located on the same +node as the memory the RMP is covering, potentially reducing latency +associated with accessing an RMP entry associated with the memory. Each +RMP segment covers a specific range of system physical addresses. + +Support for this form of the RMP can be determined using the CPUID +instruction:: + + 0x8000001f[eax]: + Bit[23] indicates support for segmented RMP + +If supported, segmented RMP attributes can be found using the CPUID +instruction:: + + 0x80000025[eax]: + Bits[5:0] minimum supported RMP segment size + Bits[11:6] maximum supported RMP segment size + + 0x80000025[ebx]: + Bits[9:0] number of cacheable RMP segment definitions + Bit[10] indicates if the number of cacheable RMP segments + is a hard limit + +To enable a segmented RMP, a new MSR is available:: + + 0xc0010136 (RMP_CFG): + Bit[0] indicates if segmented RMP is enabled + Bits[13:8] contains the size of memory covered by an RMP + segment (expressed as a power of 2) + +The RMP segment size defined in the RMP_CFG MSR applies to all segments +of the RMP. Therefore each RMP segment covers a specific range of system +physical addresses. For example, if the RMP_CFG MSR value is 0x2401, then +the RMP segment coverage value is 0x24 => 36, meaning the size of memory +covered by an RMP segment is 64GB (1 << 36). So the first RMP segment +covers physical addresses from 0 to 0xF_FFFF_FFFF, the second RMP segment +covers physical addresses from 0x10_0000_0000 to 0x1F_FFFF_FFFF, etc. + +When a segmented RMP is enabled, RMP_BASE points to the RMP bookkeeping +area as it does today (16K in size). However, instead of RMP entries +beginning immediately after the bookkeeping area, there is a 4K RMP +segment table (RST). Each entry in the RST is 8-bytes in size and represents +an RMP segment:: + + Bits[19:0] mapped size (in GB) + The mapped size can be less than the defined segment size. + A value of zero, indicates that no RMP exists for the range + of system physical addresses associated with this segment. + Bits[51:20] segment physical address + This address is left shift 20-bits (or just masked when + read) to form the physical address of the segment (1MB + alignment). + +The RST can hold 512 segment entries but can be limited in size to the number +of cacheable RMP segments (CPUID 0x80000025_EBX[9:0]) if the number of cacheable +RMP segments is a hard limit (CPUID 0x80000025_EBX[10]). + +The current Linux support relies on BIOS to allocate/reserve the memory for +the segmented RMP (the bookkeeping area, RST, and all segments), build the RST +and to set RMP_BASE, RMP_END, and RMP_CFG appropriately. Linux uses the MSR +values to locate the RMP and determine the size and location of the RMP +segments. The RMP must cover all of system memory in order for Linux to enable +SEV-SNP. + +More details in the AMD64 APM Vol 2, section "15.36.3 Reverse Map Table", +docID: 24593. + +Secure VM Service Module (SVSM) +=============================== + +SNP provides a feature called Virtual Machine Privilege Levels (VMPL) which +defines four privilege levels at which guest software can run. The most +privileged level is 0 and numerically higher numbers have lesser privileges. +More details in the AMD64 APM Vol 2, section "15.35.7 Virtual Machine +Privilege Levels", docID: 24593. + +When using that feature, different services can run at different protection +levels, apart from the guest OS but still within the secure SNP environment. +They can provide services to the guest, like a vTPM, for example. + +When a guest is not running at VMPL0, it needs to communicate with the software +running at VMPL0 to perform privileged operations or to interact with secure +services. An example fur such a privileged operation is PVALIDATE which is +*required* to be executed at VMPL0. + +In this scenario, the software running at VMPL0 is usually called a Secure VM +Service Module (SVSM). Discovery of an SVSM and the API used to communicate +with it is documented in "Secure VM Service Module for SEV-SNP Guests", docID: +58019. + +(Latest versions of the above-mentioned documents can be found by using +a search engine like duckduckgo.com and typing in: + + site:amd.com "Secure VM Service Module for SEV-SNP Guests", docID: 58019 + +for example.) diff --git a/Documentation/arch/x86/amd_hsmp.rst b/Documentation/arch/x86/amd_hsmp.rst index 1e499ecf5f4e..2fd917638e42 100644 --- a/Documentation/arch/x86/amd_hsmp.rst +++ b/Documentation/arch/x86/amd_hsmp.rst @@ -4,8 +4,9 @@ AMD HSMP interface ============================================ -Newer Fam19h EPYC server line of processors from AMD support system -management functionality via HSMP (Host System Management Port). +Newer Fam19h(model 0x00-0x1f, 0x30-0x3f, 0x90-0x9f, 0xa0-0xaf), +Fam1Ah(model 0x00-0x1f) EPYC server line of processors from AMD support +system management functionality via HSMP (Host System Management Port). The Host System Management Port (HSMP) is an interface to provide OS-level software with access to system management functions via a @@ -16,14 +17,25 @@ More details on the interface can be found in chapter Eg: https://www.amd.com/content/dam/amd/en/documents/epyc-technical-docs/programmer-references/55898_B1_pub_0_50.zip -HSMP interface is supported on EPYC server CPU models only. +HSMP interface is supported on EPYC line of server CPUs and MI300A (APU). HSMP device ============================================ -amd_hsmp driver under the drivers/platforms/x86/ creates miscdevice -/dev/hsmp to let user space programs run hsmp mailbox commands. +amd_hsmp driver under drivers/platforms/x86/amd/hsmp/ has separate driver files +for ACPI object based probing, platform device based probing and for the common +code for these two drivers. + +Kconfig option CONFIG_AMD_HSMP_PLAT compiles plat.c and creates amd_hsmp.ko. +Kconfig option CONFIG_AMD_HSMP_ACPI compiles acpi.c and creates hsmp_acpi.ko. +Selecting any of these two configs automatically selects CONFIG_AMD_HSMP. This +compiles common code hsmp.c and creates hsmp_common.ko module. + +Both the ACPI and plat drivers create the miscdevice /dev/hsmp to let +user space programs run hsmp mailbox commands. + +The ACPI object format supported by the driver is defined below. $ ls -al /dev/hsmp crw-r--r-- 1 root root 10, 123 Jan 21 21:41 /dev/hsmp @@ -59,6 +71,51 @@ Note: lseek() is not supported as entire metrics table is read. Metrics table definitions will be documented as part of Public PPR. The same is defined in the amd_hsmp.h header. +ACPI device object format +========================= +The ACPI object format expected from the amd_hsmp driver +for socket with ID00 is given below:: + + Device(HSMP) + { + Name(_HID, "AMDI0097") + Name(_UID, "ID00") + Name(HSE0, 0x00000001) + Name(RBF0, ResourceTemplate() + { + Memory32Fixed(ReadWrite, 0xxxxxxx, 0x00100000) + }) + Method(_CRS, 0, NotSerialized) + { + Return(RBF0) + } + Method(_STA, 0, NotSerialized) + { + If(LEqual(HSE0, One)) + { + Return(0x0F) + } + Else + { + Return(Zero) + } + } + Name(_DSD, Package(2) + { + Buffer(0x10) + { + 0x9D, 0x61, 0x4D, 0xB7, 0x07, 0x57, 0xBD, 0x48, + 0xA6, 0x9F, 0x4E, 0xA2, 0x87, 0x1F, 0xC2, 0xF6 + }, + Package(3) + { + Package(2) {"MsgIdOffset", 0x00010934}, + Package(2) {"MsgRspOffset", 0x00010980}, + Package(2) {"MsgArgOffset", 0x000109E0} + } + }) + } + An example ========== diff --git a/Documentation/arch/x86/boot.rst b/Documentation/arch/x86/boot.rst index 4fd492cb4970..76f53d3450e7 100644 --- a/Documentation/arch/x86/boot.rst +++ b/Documentation/arch/x86/boot.rst @@ -77,7 +77,7 @@ Protocol 2.14 BURNT BY INCORRECT COMMIT Protocol 2.15 (Kernel 5.5) Added the kernel_info and kernel_info.setup_type_max. ============= ============================================================ - .. note:: +.. note:: The protocol version number should be changed only if the setup header is changed. There is no need to update the version number if boot_params or kernel_info are changed. Additionally, it is recommended to use @@ -95,27 +95,27 @@ Memory Layout The traditional memory map for the kernel loader, used for Image or zImage kernels, typically looks like:: - | | - 0A0000 +------------------------+ - | Reserved for BIOS | Do not use. Reserved for BIOS EBDA. - 09A000 +------------------------+ - | Command line | - | Stack/heap | For use by the kernel real-mode code. - 098000 +------------------------+ - | Kernel setup | The kernel real-mode code. - 090200 +------------------------+ - | Kernel boot sector | The kernel legacy boot sector. - 090000 +------------------------+ - | Protected-mode kernel | The bulk of the kernel image. - 010000 +------------------------+ - | Boot loader | <- Boot sector entry point 0000:7C00 - 001000 +------------------------+ - | Reserved for MBR/BIOS | - 000800 +------------------------+ - | Typically used by MBR | - 000600 +------------------------+ - | BIOS use only | - 000000 +------------------------+ + | | + 0A0000 +------------------------+ + | Reserved for BIOS | Do not use. Reserved for BIOS EBDA. + 09A000 +------------------------+ + | Command line | + | Stack/heap | For use by the kernel real-mode code. + 098000 +------------------------+ + | Kernel setup | The kernel real-mode code. + 090200 +------------------------+ + | Kernel boot sector | The kernel legacy boot sector. + 090000 +------------------------+ + | Protected-mode kernel | The bulk of the kernel image. + 010000 +------------------------+ + | Boot loader | <- Boot sector entry point 0000:7C00 + 001000 +------------------------+ + | Reserved for MBR/BIOS | + 000800 +------------------------+ + | Typically used by MBR | + 000600 +------------------------+ + | BIOS use only | + 000000 +------------------------+ When using bzImage, the protected-mode kernel was relocated to 0x100000 ("high memory"), and the kernel real-mode block (boot sector, @@ -142,28 +142,28 @@ above the 0x9A000 point; too many BIOSes will break above that point. For a modern bzImage kernel with boot protocol version >= 2.02, a memory layout like the following is suggested:: - ~ ~ - | Protected-mode kernel | - 100000 +------------------------+ - | I/O memory hole | - 0A0000 +------------------------+ - | Reserved for BIOS | Leave as much as possible unused - ~ ~ - | Command line | (Can also be below the X+10000 mark) - X+10000 +------------------------+ - | Stack/heap | For use by the kernel real-mode code. - X+08000 +------------------------+ - | Kernel setup | The kernel real-mode code. - | Kernel boot sector | The kernel legacy boot sector. - X +------------------------+ - | Boot loader | <- Boot sector entry point 0000:7C00 - 001000 +------------------------+ - | Reserved for MBR/BIOS | - 000800 +------------------------+ - | Typically used by MBR | - 000600 +------------------------+ - | BIOS use only | - 000000 +------------------------+ + ~ ~ + | Protected-mode kernel | + 100000 +------------------------+ + | I/O memory hole | + 0A0000 +------------------------+ + | Reserved for BIOS | Leave as much as possible unused + ~ ~ + | Command line | (Can also be below the X+10000 mark) + X+10000 +------------------------+ + | Stack/heap | For use by the kernel real-mode code. + X+08000 +------------------------+ + | Kernel setup | The kernel real-mode code. + | Kernel boot sector | The kernel legacy boot sector. + X +------------------------+ + | Boot loader | <- Boot sector entry point 0000:7C00 + 001000 +------------------------+ + | Reserved for MBR/BIOS | + 000800 +------------------------+ + | Typically used by MBR | + 000600 +------------------------+ + | BIOS use only | + 000000 +------------------------+ ... where the address X is as low as the design of the boot loader permits. @@ -229,22 +229,22 @@ Offset/Size Proto Name Meaning =========== ======== ===================== ============================================ .. note:: - (1) For backwards compatibility, if the setup_sects field contains 0, the - real value is 4. + (1) For backwards compatibility, if the setup_sects field contains 0, + the real value is 4. - (2) For boot protocol prior to 2.04, the upper two bytes of the syssize - field are unusable, which means the size of a bzImage kernel - cannot be determined. + (2) For boot protocol prior to 2.04, the upper two bytes of the syssize + field are unusable, which means the size of a bzImage kernel + cannot be determined. - (3) Ignored, but safe to set, for boot protocols 2.02-2.09. + (3) Ignored, but safe to set, for boot protocols 2.02-2.09. If the "HdrS" (0x53726448) magic number is not found at offset 0x202, the boot protocol version is "old". Loading an old kernel, the following parameters should be assumed:: - Image type = zImage - initrd not supported - Real-mode kernel must be located at 0x90000. + Image type = zImage + initrd not supported + Real-mode kernel must be located at 0x90000. Otherwise, the "version" field contains the protocol version, e.g. protocol version 2.01 will contain 0x0201 in this field. When @@ -265,7 +265,7 @@ All general purpose boot loaders should write the fields marked nonstandard address should fill in the fields marked (reloc); other boot loaders can ignore those fields. -The byte order of all fields is littleendian (this is x86, after all.) +The byte order of all fields is little endian (this is x86, after all.) ============ =========== Field name: setup_sects @@ -365,7 +365,7 @@ Offset/size: 0x206/2 Protocol: 2.00+ ============ ======= - Contains the boot protocol version, in (major << 8)+minor format, + Contains the boot protocol version, in (major << 8) + minor format, e.g. 0x0204 for version 2.04, and 0x0a11 for a hypothetical version 10.17. @@ -397,17 +397,17 @@ Protocol: 2.00+ If set to a nonzero value, contains a pointer to a NUL-terminated human-readable kernel version number string, less 0x200. This can be used to display the kernel version to the user. This value - should be less than (0x200*setup_sects). + should be less than (0x200 * setup_sects). For example, if this value is set to 0x1c00, the kernel version number string can be found at offset 0x1e00 in the kernel file. This is a valid value if and only if the "setup_sects" field contains the value 15 or higher, as:: - 0x1c00 < 15*0x200 (= 0x1e00) but - 0x1c00 >= 14*0x200 (= 0x1c00) + 0x1c00 < 15 * 0x200 (= 0x1e00) but + 0x1c00 >= 14 * 0x200 (= 0x1c00) - 0x1c00 >> 9 = 14, So the minimum value for setup_secs is 15. + 0x1c00 >> 9 = 14, So the minimum value for setup_secs is 15. ============ ================== Field name: type_of_loader @@ -427,9 +427,9 @@ Protocol: 2.00+ For example, for T = 0x15, V = 0x234, write:: - type_of_loader <- 0xE4 - ext_loader_type <- 0x05 - ext_loader_ver <- 0x23 + type_of_loader <- 0xE4 + ext_loader_type <- 0x05 + ext_loader_ver <- 0x23 Assigned boot loader ids (hexadecimal): @@ -686,7 +686,7 @@ Protocol: 2.10+ If a boot loader makes use of this field, it should update the kernel_alignment field with the alignment unit desired; typically:: - kernel_alignment = 1 << min_alignment + kernel_alignment = 1 << min_alignment; There may be a considerable performance cost with an excessively misaligned kernel. Therefore, a loader should typically try each @@ -754,7 +754,7 @@ Protocol: 2.07+ 0x00000000 The default x86/PC environment 0x00000001 lguest 0x00000002 Xen - 0x00000003 Moorestown MID + 0x00000003 Intel MID (Moorestown, CloverTrail, Merrifield, Moorefield) 0x00000004 CE4100 TV Platform ========== ============================== @@ -808,13 +808,13 @@ Protocol: 2.09+ parameters passing mechanism. The definition of struct setup_data is as follow:: - struct setup_data { - u64 next; - u32 type; - u32 len; - u8 data[0]; - }; - + struct setup_data { + __u64 next; + __u32 type; + __u32 len; + __u8 data[]; + } + Where, the next is a 64-bit physical pointer to the next node of linked list, the next field of the last node is 0; the type is used to identify the contents of data; the len is the length of data @@ -834,12 +834,12 @@ Protocol: 2.09+ Thus setup_indirect struct and SETUP_INDIRECT type were introduced in protocol 2.15:: - struct setup_indirect { - __u32 type; - __u32 reserved; /* Reserved, must be set to zero. */ - __u64 len; - __u64 addr; - }; + struct setup_indirect { + __u32 type; + __u32 reserved; /* Reserved, must be set to zero. */ + __u64 len; + __u64 addr; + }; The type member is a SETUP_INDIRECT | SETUP_* type. However, it cannot be SETUP_INDIRECT itself since making the setup_indirect a tree structure @@ -849,17 +849,17 @@ Protocol: 2.09+ Let's give an example how to point to SETUP_E820_EXT data using setup_indirect. In this case setup_data and setup_indirect will look like this:: - struct setup_data { - __u64 next = 0 or <addr_of_next_setup_data_struct>; - __u32 type = SETUP_INDIRECT; - __u32 len = sizeof(setup_indirect); - __u8 data[sizeof(setup_indirect)] = struct setup_indirect { - __u32 type = SETUP_INDIRECT | SETUP_E820_EXT; - __u32 reserved = 0; - __u64 len = <len_of_SETUP_E820_EXT_data>; - __u64 addr = <addr_of_SETUP_E820_EXT_data>; - } - } + struct setup_data { + .next = 0, /* or <addr_of_next_setup_data_struct> */ + .type = SETUP_INDIRECT, + .len = sizeof(setup_indirect), + .data[sizeof(setup_indirect)] = (struct setup_indirect) { + .type = SETUP_INDIRECT | SETUP_E820_EXT, + .reserved = 0, + .len = <len_of_SETUP_E820_EXT_data>, + .addr = <addr_of_SETUP_E820_EXT_data>, + }, + } .. note:: SETUP_INDIRECT | SETUP_NONE objects cannot be properly distinguished @@ -896,10 +896,19 @@ Offset/size: 0x260/4 The kernel runtime start address is determined by the following algorithm:: - if (relocatable_kernel) - runtime_start = align_up(load_address, kernel_alignment) - else - runtime_start = pref_address + if (relocatable_kernel) { + if (load_address < pref_address) + load_address = pref_address; + runtime_start = align_up(load_address, kernel_alignment); + } else { + runtime_start = pref_address; + } + +Hence the necessary memory window location and size can be estimated by +a boot loader as:: + + memory_window_start = runtime_start; + memory_window_size = init_size; ============ =============== Field name: handover_offset @@ -929,12 +938,12 @@ The kernel_info =============== The relationships between the headers are analogous to the various data -sections: +sections:: setup_header = .data boot_params/setup_data = .bss -What is missing from the above list? That's right: +What is missing from the above list? That's right:: kernel_info = .rodata @@ -966,22 +975,22 @@ after kernel_info_var_len_data label. Each chunk of variable size data has to be prefixed with header/magic and its size, e.g.:: kernel_info: - .ascii "LToP" /* Header, Linux top (structure). */ - .long kernel_info_var_len_data - kernel_info - .long kernel_info_end - kernel_info - .long 0x01234567 /* Some fixed size data for the bootloaders. */ + .ascii "LToP" /* Header, Linux top (structure). */ + .long kernel_info_var_len_data - kernel_info + .long kernel_info_end - kernel_info + .long 0x01234567 /* Some fixed size data for the bootloaders. */ kernel_info_var_len_data: - example_struct: /* Some variable size data for the bootloaders. */ - .ascii "0123" /* Header/Magic. */ - .long example_struct_end - example_struct - .ascii "Struct" - .long 0x89012345 + example_struct: /* Some variable size data for the bootloaders. */ + .ascii "0123" /* Header/Magic. */ + .long example_struct_end - example_struct + .ascii "Struct" + .long 0x89012345 example_struct_end: - example_strings: /* Some variable size data for the bootloaders. */ - .ascii "ABCD" /* Header/Magic. */ - .long example_strings_end - example_strings - .asciz "String_0" - .asciz "String_1" + example_strings: /* Some variable size data for the bootloaders. */ + .ascii "ABCD" /* Header/Magic. */ + .long example_strings_end - example_strings + .asciz "String_0" + .asciz "String_1" example_strings_end: kernel_info_end: @@ -1130,67 +1139,63 @@ mode segment. Such a boot loader should enter the following fields in the header:: - unsigned long base_ptr; /* base address for real-mode segment */ + unsigned long base_ptr; /* base address for real-mode segment */ - if ( setup_sects == 0 ) { - setup_sects = 4; - } + if (setup_sects == 0) + setup_sects = 4; - if ( protocol >= 0x0200 ) { - type_of_loader = <type code>; - if ( loading_initrd ) { - ramdisk_image = <initrd_address>; - ramdisk_size = <initrd_size>; - } + if (protocol >= 0x0200) { + type_of_loader = <type code>; + if (loading_initrd) { + ramdisk_image = <initrd_address>; + ramdisk_size = <initrd_size>; + } - if ( protocol >= 0x0202 && loadflags & 0x01 ) - heap_end = 0xe000; - else - heap_end = 0x9800; + if (protocol >= 0x0202 && loadflags & 0x01) + heap_end = 0xe000; + else + heap_end = 0x9800; - if ( protocol >= 0x0201 ) { - heap_end_ptr = heap_end - 0x200; - loadflags |= 0x80; /* CAN_USE_HEAP */ - } + if (protocol >= 0x0201) { + heap_end_ptr = heap_end - 0x200; + loadflags |= 0x80; /* CAN_USE_HEAP */ + } - if ( protocol >= 0x0202 ) { - cmd_line_ptr = base_ptr + heap_end; - strcpy(cmd_line_ptr, cmdline); - } else { - cmd_line_magic = 0xA33F; - cmd_line_offset = heap_end; - setup_move_size = heap_end + strlen(cmdline)+1; - strcpy(base_ptr+cmd_line_offset, cmdline); - } - } else { - /* Very old kernel */ + if (protocol >= 0x0202) { + cmd_line_ptr = base_ptr + heap_end; + strcpy(cmd_line_ptr, cmdline); + } else { + cmd_line_magic = 0xA33F; + cmd_line_offset = heap_end; + setup_move_size = heap_end + strlen(cmdline) + 1; + strcpy(base_ptr + cmd_line_offset, cmdline); + } + } else { + /* Very old kernel */ - heap_end = 0x9800; + heap_end = 0x9800; - cmd_line_magic = 0xA33F; - cmd_line_offset = heap_end; + cmd_line_magic = 0xA33F; + cmd_line_offset = heap_end; - /* A very old kernel MUST have its real-mode code - loaded at 0x90000 */ + /* A very old kernel MUST have its real-mode code loaded at 0x90000 */ + if (base_ptr != 0x90000) { + /* Copy the real-mode kernel */ + memcpy(0x90000, base_ptr, (setup_sects + 1) * 512); + base_ptr = 0x90000; /* Relocated */ + } - if ( base_ptr != 0x90000 ) { - /* Copy the real-mode kernel */ - memcpy(0x90000, base_ptr, (setup_sects+1)*512); - base_ptr = 0x90000; /* Relocated */ - } + strcpy(0x90000 + cmd_line_offset, cmdline); - strcpy(0x90000+cmd_line_offset, cmdline); - - /* It is recommended to clear memory up to the 32K mark */ - memset(0x90000 + (setup_sects+1)*512, 0, - (64-(setup_sects+1))*512); - } + /* It is recommended to clear memory up to the 32K mark */ + memset(0x90000 + (setup_sects + 1) * 512, 0, (64 - (setup_sects + 1)) * 512); + } Loading The Rest of The Kernel ============================== -The 32-bit (non-real-mode) kernel starts at offset (setup_sects+1)*512 +The 32-bit (non-real-mode) kernel starts at offset (setup_sects + 1) * 512 in the kernel file (again, if setup_sects == 0 the real value is 4.) It should be loaded at address 0x10000 for Image/zImage kernels and 0x100000 for bzImage kernels. @@ -1198,13 +1203,14 @@ It should be loaded at address 0x10000 for Image/zImage kernels and The kernel is a bzImage kernel if the protocol >= 2.00 and the 0x01 bit (LOAD_HIGH) in the loadflags field is set:: - is_bzImage = (protocol >= 0x0200) && (loadflags & 0x01); - load_address = is_bzImage ? 0x100000 : 0x10000; + is_bzImage = (protocol >= 0x0200) && (loadflags & 0x01); + load_address = is_bzImage ? 0x100000 : 0x10000; -Note that Image/zImage kernels can be up to 512K in size, and thus use -the entire 0x10000-0x90000 range of memory. This means it is pretty -much a requirement for these kernels to load the real-mode part at -0x90000. bzImage kernels allow much more flexibility. +.. note:: + Image/zImage kernels can be up to 512K in size, and thus use the entire + 0x10000-0x90000 range of memory. This means it is pretty much a + requirement for these kernels to load the real-mode part at 0x90000. + bzImage kernels allow much more flexibility. Special Command Line Options ============================ @@ -1273,19 +1279,20 @@ es = ss. In our example from above, we would do:: - /* Note: in the case of the "old" kernel protocol, base_ptr must - be == 0x90000 at this point; see the previous sample code */ + /* + * Note: in the case of the "old" kernel protocol, base_ptr must + * be == 0x90000 at this point; see the previous sample code. + */ + seg = base_ptr >> 4; - seg = base_ptr >> 4; + cli(); /* Enter with interrupts disabled! */ - cli(); /* Enter with interrupts disabled! */ + /* Set up the real-mode kernel stack */ + _SS = seg; + _SP = heap_end; - /* Set up the real-mode kernel stack */ - _SS = seg; - _SP = heap_end; - - _DS = _ES = _FS = _GS = seg; - jmp_far(seg+0x20, 0); /* Run the kernel */ + _DS = _ES = _FS = _GS = seg; + jmp_far(seg + 0x20, 0); /* Run the kernel */ If your boot sector accesses a floppy drive, it is recommended to switch off the floppy motor before running the kernel, since the @@ -1340,7 +1347,7 @@ from offset 0x01f1 of kernel image on should be loaded into struct boot_params and examined. The end of setup header can be calculated as follow:: - 0x0202 + byte value at offset 0x0201 + 0x0202 + byte value at offset 0x0201 In addition to read/modify/write the setup header of the struct boot_params as that of 16-bit boot protocol, the boot loader should @@ -1376,7 +1383,7 @@ Then, the setup header at offset 0x01f1 of kernel image on should be loaded into struct boot_params and examined. The end of setup header can be calculated as follows:: - 0x0202 + byte value at offset 0x0201 + 0x0202 + byte value at offset 0x0201 In addition to read/modify/write the setup header of the struct boot_params as that of 16-bit boot protocol, the boot loader should @@ -1418,7 +1425,7 @@ execution context provided by the EFI firmware. The function prototype for the handover entry point looks like this:: - efi_stub_entry(void *handle, efi_system_table_t *table, struct boot_params *bp) + void efi_stub_entry(void *handle, efi_system_table_t *table, struct boot_params *bp); 'handle' is the EFI image handle passed to the boot loader by the EFI firmware, 'table' is the EFI system table - these are the first two @@ -1433,12 +1440,13 @@ The boot loader *must* fill out the following fields in bp:: All other fields should be zero. -NOTE: The EFI Handover Protocol is deprecated in favour of the ordinary PE/COFF - entry point, combined with the LINUX_EFI_INITRD_MEDIA_GUID based initrd - loading protocol (refer to [0] for an example of the bootloader side of - this), which removes the need for any knowledge on the part of the EFI - bootloader regarding the internal representation of boot_params or any - requirements/limitations regarding the placement of the command line - and ramdisk in memory, or the placement of the kernel image itself. +.. note:: + The EFI Handover Protocol is deprecated in favour of the ordinary PE/COFF + entry point, combined with the LINUX_EFI_INITRD_MEDIA_GUID based initrd + loading protocol (refer to [0] for an example of the bootloader side of + this), which removes the need for any knowledge on the part of the EFI + bootloader regarding the internal representation of boot_params or any + requirements/limitations regarding the placement of the command line + and ramdisk in memory, or the placement of the kernel image itself. [0] https://github.com/u-boot/u-boot/commit/ec80b4735a593961fe701cc3a5d717d4739b0fd0 diff --git a/Documentation/arch/x86/buslock.rst b/Documentation/arch/x86/buslock.rst index 4c5a4822eeb7..31f1bfdff16f 100644 --- a/Documentation/arch/x86/buslock.rst +++ b/Documentation/arch/x86/buslock.rst @@ -26,7 +26,8 @@ Detection ========= Intel processors may support either or both of the following hardware -mechanisms to detect split locks and bus locks. +mechanisms to detect split locks and bus locks. Some AMD processors also +support bus lock detect. #AC exception for split lock detection -------------------------------------- diff --git a/Documentation/arch/x86/cpuinfo.rst b/Documentation/arch/x86/cpuinfo.rst index 8895784d4784..6ef426a52cdc 100644 --- a/Documentation/arch/x86/cpuinfo.rst +++ b/Documentation/arch/x86/cpuinfo.rst @@ -112,7 +112,7 @@ conditions are met, the features are enabled by the set_cpu_cap or setup_force_cpu_cap macros. For example, if bit 5 is set in MSR_IA32_CORE_CAPS, the feature X86_FEATURE_SPLIT_LOCK_DETECT will be enabled and "split_lock_detect" will be displayed. The flag "ring3mwait" will be -displayed only when running on INTEL_FAM6_XEON_PHI_[KNL|KNM] processors. +displayed only when running on INTEL_XEON_PHI_[KNL|KNM] processors. d: Flags can represent purely software features. ------------------------------------------------ diff --git a/Documentation/arch/x86/exception-tables.rst b/Documentation/arch/x86/exception-tables.rst index efde1fef4fbd..6e7177363f8f 100644 --- a/Documentation/arch/x86/exception-tables.rst +++ b/Documentation/arch/x86/exception-tables.rst @@ -297,7 +297,7 @@ vma occurs? c) execution continues at local label 2 (address of the instruction immediately after the faulting user access). -The steps 8a to 8c in a certain way emulate the faulting instruction. + The steps a to c above in a certain way emulate the faulting instruction. That's it, mostly. If you look at our example, you might ask why we set EAX to -EFAULT in the exception handler code. Well, the diff --git a/Documentation/arch/x86/mds.rst b/Documentation/arch/x86/mds.rst index c58c72362911..5a2e6c0ef04a 100644 --- a/Documentation/arch/x86/mds.rst +++ b/Documentation/arch/x86/mds.rst @@ -162,7 +162,7 @@ Mitigation points 3. It would take a large number of these precisely-timed NMIs to mount an actual attack. There's presumably not enough bandwidth. 4. The NMI in question occurs after a VERW, i.e. when user state is - restored and most interesting data is already scrubbed. Whats left + restored and most interesting data is already scrubbed. What's left is only the data that NMI touches, and that may or may not be of any interest. diff --git a/Documentation/arch/x86/resctrl.rst b/Documentation/arch/x86/resctrl.rst index 6c245582d8fb..6768fc1fad16 100644 --- a/Documentation/arch/x86/resctrl.rst +++ b/Documentation/arch/x86/resctrl.rst @@ -375,11 +375,25 @@ When monitoring is enabled all MON groups will also contain: all tasks in the group. In CTRL_MON groups these files provide the sum for all tasks in the CTRL_MON group and all tasks in MON groups. Please see example section for more details on usage. + On systems with Sub-NUMA Cluster (SNC) enabled there are extra + directories for each node (located within the "mon_L3_XX" directory + for the L3 cache they occupy). These are named "mon_sub_L3_YY" + where "YY" is the node number. "mon_hw_id": Available only with debug option. The identifier used by hardware for the monitor group. On x86 this is the RMID. +When the "mba_MBps" mount option is used all CTRL_MON groups will also contain: + +"mba_MBps_event": + Reading this file shows which memory bandwidth event is used + as input to the software feedback loop that keeps memory bandwidth + below the value specified in the schemata file. Writing the + name of one of the supported memory bandwidth events found in + /sys/fs/resctrl/info/L3_MON/mon_features changes the input + event. + Resource allocation rules ------------------------- @@ -446,6 +460,12 @@ during mkdir. max_threshold_occupancy is a user configurable value to determine the occupancy at which an RMID can be freed. +The mon_llc_occupancy_limbo tracepoint gives the precise occupancy in bytes +for a subset of RMID that are not immediately available for allocation. +This can't be relied on to produce output every second, it may be necessary +to attempt to create an empty monitor group to force an update. Output may +only be produced if creation of a control or monitor group fails. + Schemata files - general concepts --------------------------------- Each line in the file describes one resource. The line starts with @@ -478,6 +498,29 @@ if non-contiguous 1s value is supported. On a system with a 20-bit mask each bit represents 5% of the capacity of the cache. You could partition the cache into four equal parts with masks: 0x1f, 0x3e0, 0x7c00, 0xf8000. +Notes on Sub-NUMA Cluster mode +============================== +When SNC mode is enabled, Linux may load balance tasks between Sub-NUMA +nodes much more readily than between regular NUMA nodes since the CPUs +on Sub-NUMA nodes share the same L3 cache and the system may report +the NUMA distance between Sub-NUMA nodes with a lower value than used +for regular NUMA nodes. + +The top-level monitoring files in each "mon_L3_XX" directory provide +the sum of data across all SNC nodes sharing an L3 cache instance. +Users who bind tasks to the CPUs of a specific Sub-NUMA node can read +the "llc_occupancy", "mbm_total_bytes", and "mbm_local_bytes" in the +"mon_sub_L3_YY" directories to get node local data. + +Memory bandwidth allocation is still performed at the L3 cache +level. I.e. throttling controls are applied to all SNC nodes. + +L3 cache allocation bitmaps also apply to all SNC nodes. But note that +the amount of L3 cache represented by each bit is divided by the number +of SNC nodes per L3 cache. E.g. with a 100MB cache on a system with 10-bit +allocation masks each bit normally represents 10MB. With SNC mode enabled +with two SNC nodes per L3 cache, each bit only represents 5MB. + Memory bandwidth Allocation and monitoring ========================================== diff --git a/Documentation/arch/x86/topology.rst b/Documentation/arch/x86/topology.rst index 7352ab89a55a..c12837e61bda 100644 --- a/Documentation/arch/x86/topology.rst +++ b/Documentation/arch/x86/topology.rst @@ -135,6 +135,10 @@ Thread-related topology information in the kernel: The ID of the core to which a thread belongs. It is also printed in /proc/cpuinfo "core_id." + - topology_logical_core_id(); + + The logical core ID to which a thread belongs. + System topology examples diff --git a/Documentation/arch/x86/x86_64/boot-options.rst b/Documentation/arch/x86/x86_64/boot-options.rst deleted file mode 100644 index 137432d34109..000000000000 --- a/Documentation/arch/x86/x86_64/boot-options.rst +++ /dev/null @@ -1,319 +0,0 @@ -.. SPDX-License-Identifier: GPL-2.0 - -=========================== -AMD64 Specific Boot Options -=========================== - -There are many others (usually documented in driver documentation), but -only the AMD64 specific ones are listed here. - -Machine check -============= -Please see Documentation/arch/x86/x86_64/machinecheck.rst for sysfs runtime tunables. - - mce=off - Disable machine check - mce=no_cmci - Disable CMCI(Corrected Machine Check Interrupt) that - Intel processor supports. Usually this disablement is - not recommended, but it might be handy if your hardware - is misbehaving. - Note that you'll get more problems without CMCI than with - due to the shared banks, i.e. you might get duplicated - error logs. - mce=dont_log_ce - Don't make logs for corrected errors. All events reported - as corrected are silently cleared by OS. - This option will be useful if you have no interest in any - of corrected errors. - mce=ignore_ce - Disable features for corrected errors, e.g. polling timer - and CMCI. All events reported as corrected are not cleared - by OS and remained in its error banks. - Usually this disablement is not recommended, however if - there is an agent checking/clearing corrected errors - (e.g. BIOS or hardware monitoring applications), conflicting - with OS's error handling, and you cannot deactivate the agent, - then this option will be a help. - mce=no_lmce - Do not opt-in to Local MCE delivery. Use legacy method - to broadcast MCEs. - mce=bootlog - Enable logging of machine checks left over from booting. - Disabled by default on AMD Fam10h and older because some BIOS - leave bogus ones. - If your BIOS doesn't do that it's a good idea to enable though - to make sure you log even machine check events that result - in a reboot. On Intel systems it is enabled by default. - mce=nobootlog - Disable boot machine check logging. - mce=monarchtimeout (number) - monarchtimeout: - Sets the time in us to wait for other CPUs on machine checks. 0 - to disable. - mce=bios_cmci_threshold - Don't overwrite the bios-set CMCI threshold. This boot option - prevents Linux from overwriting the CMCI threshold set by the - bios. Without this option, Linux always sets the CMCI - threshold to 1. Enabling this may make memory predictive failure - analysis less effective if the bios sets thresholds for memory - errors since we will not see details for all errors. - mce=recovery - Force-enable recoverable machine check code paths - - nomce (for compatibility with i386) - same as mce=off - - Everything else is in sysfs now. - -APICs -===== - - apic - Use IO-APIC. Default - - noapic - Don't use the IO-APIC. - - disableapic - Don't use the local APIC - - nolapic - Don't use the local APIC (alias for i386 compatibility) - - pirq=... - See Documentation/arch/x86/i386/IO-APIC.rst - - noapictimer - Don't set up the APIC timer - - no_timer_check - Don't check the IO-APIC timer. This can work around - problems with incorrect timer initialization on some boards. - - apicpmtimer - Do APIC timer calibration using the pmtimer. Implies - apicmaintimer. Useful when your PIT timer is totally broken. - -Timing -====== - - notsc - Deprecated, use tsc=unstable instead. - - nohpet - Don't use the HPET timer. - -Idle loop -========= - - idle=poll - Don't do power saving in the idle loop using HLT, but poll for rescheduling - event. This will make the CPUs eat a lot more power, but may be useful - to get slightly better performance in multiprocessor benchmarks. It also - makes some profiling using performance counters more accurate. - Please note that on systems with MONITOR/MWAIT support (like Intel EM64T - CPUs) this option has no performance advantage over the normal idle loop. - It may also interact badly with hyperthreading. - -Rebooting -========= - - reboot=b[ios] | t[riple] | k[bd] | a[cpi] | e[fi] | p[ci] [, [w]arm | [c]old] - bios - Use the CPU reboot vector for warm reset - warm - Don't set the cold reboot flag - cold - Set the cold reboot flag - triple - Force a triple fault (init) - kbd - Use the keyboard controller. cold reset (default) - acpi - Use the ACPI RESET_REG in the FADT. If ACPI is not configured or - the ACPI reset does not work, the reboot path attempts the reset - using the keyboard controller. - efi - Use efi reset_system runtime service. If EFI is not configured or - the EFI reset does not work, the reboot path attempts the reset using - the keyboard controller. - pci - Use a write to the PCI config space register 0xcf9 to trigger reboot. - - Using warm reset will be much faster especially on big memory - systems because the BIOS will not go through the memory check. - Disadvantage is that not all hardware will be completely reinitialized - on reboot so there may be boot problems on some systems. - - reboot=force - Don't stop other CPUs on reboot. This can make reboot more reliable - in some cases. - - reboot=default - There are some built-in platform specific "quirks" - you may see: - "reboot: <name> series board detected. Selecting <type> for reboots." - In the case where you think the quirk is in error (e.g. you have - newer BIOS, or newer board) using this option will ignore the built-in - quirk table, and use the generic default reboot actions. - -NUMA -==== - - numa=off - Only set up a single NUMA node spanning all memory. - - numa=noacpi - Don't parse the SRAT table for NUMA setup - - numa=nohmat - Don't parse the HMAT table for NUMA setup, or soft-reserved memory - partitioning. - - numa=fake=<size>[MG] - If given as a memory unit, fills all system RAM with nodes of - size interleaved over physical nodes. - - numa=fake=<N> - If given as an integer, fills all system RAM with N fake nodes - interleaved over physical nodes. - - numa=fake=<N>U - If given as an integer followed by 'U', it will divide each - physical node into N emulated nodes. - -ACPI -==== - - acpi=off - Don't enable ACPI - acpi=ht - Use ACPI boot table parsing, but don't enable ACPI interpreter - acpi=force - Force ACPI on (currently not needed) - acpi=strict - Disable out of spec ACPI workarounds. - acpi_sci={edge,level,high,low} - Set up ACPI SCI interrupt. - acpi=noirq - Don't route interrupts - acpi=nocmcff - Disable firmware first mode for corrected errors. This - disables parsing the HEST CMC error source to check if - firmware has set the FF flag. This may result in - duplicate corrected error reports. - -PCI -=== - - pci=off - Don't use PCI - pci=conf1 - Use conf1 access. - pci=conf2 - Use conf2 access. - pci=rom - Assign ROMs. - pci=assign-busses - Assign busses - pci=irqmask=MASK - Set PCI interrupt mask to MASK - pci=lastbus=NUMBER - Scan up to NUMBER busses, no matter what the mptable says. - pci=noacpi - Don't use ACPI to set up PCI interrupt routing. - -IOMMU (input/output memory management unit) -=========================================== -Multiple x86-64 PCI-DMA mapping implementations exist, for example: - - 1. <kernel/dma/direct.c>: use no hardware/software IOMMU at all - (e.g. because you have < 3 GB memory). - Kernel boot message: "PCI-DMA: Disabling IOMMU" - - 2. <arch/x86/kernel/amd_gart_64.c>: AMD GART based hardware IOMMU. - Kernel boot message: "PCI-DMA: using GART IOMMU" - - 3. <arch/x86_64/kernel/pci-swiotlb.c> : Software IOMMU implementation. Used - e.g. if there is no hardware IOMMU in the system and it is need because - you have >3GB memory or told the kernel to us it (iommu=soft)) - Kernel boot message: "PCI-DMA: Using software bounce buffering - for IO (SWIOTLB)" - -:: - - iommu=[<size>][,noagp][,off][,force][,noforce] - [,memaper[=<order>]][,merge][,fullflush][,nomerge] - [,noaperture] - -General iommu options: - - off - Don't initialize and use any kind of IOMMU. - noforce - Don't force hardware IOMMU usage when it is not needed. (default). - force - Force the use of the hardware IOMMU even when it is - not actually needed (e.g. because < 3 GB memory). - soft - Use software bounce buffering (SWIOTLB) (default for - Intel machines). This can be used to prevent the usage - of an available hardware IOMMU. - -iommu options only relevant to the AMD GART hardware IOMMU: - - <size> - Set the size of the remapping area in bytes. - allowed - Overwrite iommu off workarounds for specific chipsets. - fullflush - Flush IOMMU on each allocation (default). - nofullflush - Don't use IOMMU fullflush. - memaper[=<order>] - Allocate an own aperture over RAM with size 32MB<<order. - (default: order=1, i.e. 64MB) - merge - Do scatter-gather (SG) merging. Implies "force" (experimental). - nomerge - Don't do scatter-gather (SG) merging. - noaperture - Ask the IOMMU not to touch the aperture for AGP. - noagp - Don't initialize the AGP driver and use full aperture. - panic - Always panic when IOMMU overflows. - -iommu options only relevant to the software bounce buffering (SWIOTLB) IOMMU -implementation: - - swiotlb=<slots>[,force,noforce] - <slots> - Prereserve that many 2K slots for the software IO bounce buffering. - force - Force all IO through the software TLB. - noforce - Do not initialize the software TLB. - - -Miscellaneous -============= - - nogbpages - Do not use GB pages for kernel direct mappings. - gbpages - Use GB pages for kernel direct mappings. - - -AMD SEV (Secure Encrypted Virtualization) -========================================= -Options relating to AMD SEV, specified via the following format: - -:: - - sev=option1[,option2] - -The available options are: - - debug - Enable debug messages. diff --git a/Documentation/arch/x86/x86_64/fake-numa-for-cpusets.rst b/Documentation/arch/x86/x86_64/fake-numa-for-cpusets.rst index ba74617d4999..970ee94eb551 100644 --- a/Documentation/arch/x86/x86_64/fake-numa-for-cpusets.rst +++ b/Documentation/arch/x86/x86_64/fake-numa-for-cpusets.rst @@ -18,7 +18,7 @@ For more information on the features of cpusets, see Documentation/admin-guide/cgroup-v1/cpusets.rst. There are a number of different configurations you can use for your needs. For more information on the numa=fake command line option and its various ways of -configuring fake nodes, see Documentation/arch/x86/x86_64/boot-options.rst. +configuring fake nodes, see Documentation/admin-guide/kernel-parameters.txt For the purposes of this introduction, we'll assume a very primitive NUMA emulation setup of "numa=fake=4*512,". This will split our system memory into diff --git a/Documentation/arch/x86/x86_64/fsgs.rst b/Documentation/arch/x86/x86_64/fsgs.rst index 50960e09e1f6..d07e445dac5c 100644 --- a/Documentation/arch/x86/x86_64/fsgs.rst +++ b/Documentation/arch/x86/x86_64/fsgs.rst @@ -125,7 +125,7 @@ FSGSBASE instructions enablement FSGSBASE instructions compiler support ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ -GCC version 4.6.4 and newer provide instrinsics for the FSGSBASE +GCC version 4.6.4 and newer provide intrinsics for the FSGSBASE instructions. Clang 5 supports them as well. =================== =========================== @@ -135,7 +135,7 @@ instructions. Clang 5 supports them as well. _writegsbase_u64() Write the GS base register =================== =========================== -To utilize these instrinsics <immintrin.h> must be included in the source +To utilize these intrinsics <immintrin.h> must be included in the source code and the compiler option -mfsgsbase has to be added. Compiler support for FS/GS based addressing diff --git a/Documentation/arch/x86/x86_64/index.rst b/Documentation/arch/x86/x86_64/index.rst index ad15e9bd623f..a0261957a08a 100644 --- a/Documentation/arch/x86/x86_64/index.rst +++ b/Documentation/arch/x86/x86_64/index.rst @@ -7,7 +7,6 @@ x86_64 Support .. toctree:: :maxdepth: 2 - boot-options uefi mm 5level-paging diff --git a/Documentation/arch/x86/x86_64/mm.rst b/Documentation/arch/x86/x86_64/mm.rst index 35e5e18c83d0..f2db178b353f 100644 --- a/Documentation/arch/x86/x86_64/mm.rst +++ b/Documentation/arch/x86/x86_64/mm.rst @@ -29,15 +29,27 @@ Complete virtual memory map with 4-level page tables Start addr | Offset | End addr | Size | VM area description ======================================================================================================================== | | | | - 0000000000000000 | 0 | 00007fffffffffff | 128 TB | user-space virtual memory, different per mm + 0000000000000000 | 0 | 00007fffffffefff | ~128 TB | user-space virtual memory, different per mm + 00007ffffffff000 | ~128 TB | 00007fffffffffff | 4 kB | ... guard hole __________________|____________|__________________|_________|___________________________________________________________ | | | | - 0000800000000000 | +128 TB | ffff7fffffffffff | ~16M TB | ... huge, almost 64 bits wide hole of non-canonical - | | | | virtual memory addresses up to the -128 TB + 0000800000000000 | +128 TB | 7fffffffffffffff | ~8 EB | ... huge, almost 63 bits wide hole of non-canonical + | | | | virtual memory addresses up to the -8 EB | | | | starting offset of kernel mappings. + | | | | + | | | | LAM relaxes canonicallity check allowing to create aliases + | | | | for userspace memory here. __________________|____________|__________________|_________|___________________________________________________________ | | Kernel-space virtual memory, shared between all processes: + __________________|____________|__________________|_________|___________________________________________________________ + | | | | + 8000000000000000 | -8 EB | ffff7fffffffffff | ~8 EB | ... huge, almost 63 bits wide hole of non-canonical + | | | | virtual memory addresses up to the -128 TB + | | | | starting offset of kernel mappings. + | | | | + | | | | LAM_SUP relaxes canonicallity check allowing to create + | | | | aliases for kernel memory here. ____________________________________________________________|___________________________________________________________ | | | | ffff800000000000 | -128 TB | ffff87ffffffffff | 8 TB | ... guard hole, also reserved for hypervisor @@ -88,16 +100,27 @@ Complete virtual memory map with 5-level page tables Start addr | Offset | End addr | Size | VM area description ======================================================================================================================== | | | | - 0000000000000000 | 0 | 00ffffffffffffff | 64 PB | user-space virtual memory, different per mm + 0000000000000000 | 0 | 00fffffffffff000 | ~64 PB | user-space virtual memory, different per mm + 00fffffffffff000 | ~64 PB | 00ffffffffffffff | 4 kB | ... guard hole __________________|____________|__________________|_________|___________________________________________________________ | | | | - 0100000000000000 | +64 PB | feffffffffffffff | ~16K PB | ... huge, still almost 64 bits wide hole of non-canonical - | | | | virtual memory addresses up to the -64 PB + 0100000000000000 | +64 PB | 7fffffffffffffff | ~8 EB | ... huge, almost 63 bits wide hole of non-canonical + | | | | virtual memory addresses up to the -8EB TB | | | | starting offset of kernel mappings. + | | | | + | | | | LAM relaxes canonicallity check allowing to create aliases + | | | | for userspace memory here. __________________|____________|__________________|_________|___________________________________________________________ | | Kernel-space virtual memory, shared between all processes: ____________________________________________________________|___________________________________________________________ + 8000000000000000 | -8 EB | feffffffffffffff | ~8 EB | ... huge, almost 63 bits wide hole of non-canonical + | | | | virtual memory addresses up to the -64 PB + | | | | starting offset of kernel mappings. + | | | | + | | | | LAM_SUP relaxes canonicallity check allowing to create + | | | | aliases for kernel memory here. + ____________________________________________________________|___________________________________________________________ | | | | ff00000000000000 | -64 PB | ff0fffffffffffff | 4 PB | ... guard hole, also reserved for hypervisor ff10000000000000 | -60 PB | ff10ffffffffffff | 0.25 PB | LDT remap for PTI diff --git a/Documentation/arch/x86/x86_64/uefi.rst b/Documentation/arch/x86/x86_64/uefi.rst index fbc30c9a071d..e84592dbd6c1 100644 --- a/Documentation/arch/x86/x86_64/uefi.rst +++ b/Documentation/arch/x86/x86_64/uefi.rst @@ -12,14 +12,20 @@ with EFI firmware and specifications are listed below. 1. UEFI specification: http://www.uefi.org -2. Booting Linux kernel on UEFI x86_64 platform requires bootloader - support. Elilo with x86_64 support can be used. +2. Booting Linux kernel on UEFI x86_64 platform can either be + done using the <Documentation/admin-guide/efi-stub.rst> or using a + separate bootloader. 3. x86_64 platform with EFI/UEFI firmware. Mechanics --------- +Refer to <Documentation/admin-guide/efi-stub.rst> to learn how to use the EFI stub. + +Below are general EFI setup guidelines on the x86_64 platform, +regardless of whether you use the EFI stub or a separate bootloader. + - Build the kernel with the following configuration:: CONFIG_FB_EFI=y @@ -31,16 +37,27 @@ Mechanics CONFIG_EFI=y CONFIG_EFIVAR_FS=y or m # optional -- Create a VFAT partition on the disk -- Copy the following to the VFAT partition: +- Create a VFAT partition on the disk with the EFI System flag + You can do this with fdisk with the following commands: + + 1. g - initialize a GPT partition table + 2. n - create a new partition + 3. t - change the partition type to "EFI System" (number 1) + 4. w - write and save the changes + + Afterwards, initialize the VFAT filesystem by running mkfs:: + + mkfs.fat /dev/<your-partition> + +- Copy the boot files to the VFAT partition: + If you use the EFI stub method, the kernel acts also as an EFI executable. + + You can just copy the bzImage to the EFI/boot/bootx64.efi path on the partition + so that it will automatically get booted, see the <Documentation/admin-guide/efi-stub.rst> page + for additional instructions regarding passage of kernel parameters and initramfs. - elilo bootloader with x86_64 support, elilo configuration file, - kernel image built in first step and corresponding - initrd. Instructions on building elilo and its dependencies - can be found in the elilo sourceforge project. + If you use a custom bootloader, refer to the relevant documentation for help on this part. -- Boot to EFI shell and invoke elilo choosing the kernel image built - in first step. - If some or all EFI runtime services don't work, you can try following kernel command line parameters to turn off some or all EFI runtime services. diff --git a/Documentation/arch/x86/xstate.rst b/Documentation/arch/x86/xstate.rst index ae5c69e48b11..cec05ac464c1 100644 --- a/Documentation/arch/x86/xstate.rst +++ b/Documentation/arch/x86/xstate.rst @@ -138,7 +138,7 @@ Note this example does not include the sigaltstack preparation. Dynamic features in signal frames --------------------------------- -Dynamcally enabled features are not written to the signal frame upon signal +Dynamically enabled features are not written to the signal frame upon signal entry if the feature is in its initial configuration. This differs from non-dynamic features which are always written regardless of their configuration. Signal handlers can examine the XSAVE buffer's XSTATE_BV |