Diffstat (limited to 'Documentation/admin-guide')
21 files changed, 521 insertions, 330 deletions
diff --git a/Documentation/admin-guide/cputopology.rst b/Documentation/admin-guide/cputopology.rst index b90dafcc8237..8632a1db36e4 100644 --- a/Documentation/admin-guide/cputopology.rst +++ b/Documentation/admin-guide/cputopology.rst @@ -2,87 +2,10 @@ How CPU topology info is exported via sysfs =========================================== -Export CPU topology info via sysfs. Items (attributes) are similar -to /proc/cpuinfo output of some architectures. They reside in -/sys/devices/system/cpu/cpuX/topology/: - -physical_package_id: - - physical package id of cpuX. Typically corresponds to a physical - socket number, but the actual value is architecture and platform - dependent. - -die_id: - - the CPU die ID of cpuX. Typically it is the hardware platform's - identifier (rather than the kernel's). The actual value is - architecture and platform dependent. - -core_id: - - the CPU core ID of cpuX. Typically it is the hardware platform's - identifier (rather than the kernel's). The actual value is - architecture and platform dependent. - -book_id: - - the book ID of cpuX. Typically it is the hardware platform's - identifier (rather than the kernel's). The actual value is - architecture and platform dependent. - -drawer_id: - - the drawer ID of cpuX. Typically it is the hardware platform's - identifier (rather than the kernel's). The actual value is - architecture and platform dependent. - -core_cpus: - - internal kernel map of CPUs within the same core. - (deprecated name: "thread_siblings") - -core_cpus_list: - - human-readable list of CPUs within the same core. - (deprecated name: "thread_siblings_list"); - -package_cpus: - - internal kernel map of the CPUs sharing the same physical_package_id. - (deprecated name: "core_siblings") - -package_cpus_list: - - human-readable list of CPUs sharing the same physical_package_id. - (deprecated name: "core_siblings_list") - -die_cpus: - - internal kernel map of CPUs within the same die. - -die_cpus_list: - - human-readable list of CPUs within the same die. - -book_siblings: - - internal kernel map of cpuX's hardware threads within the same - book_id. - -book_siblings_list: - - human-readable list of cpuX's hardware threads within the same - book_id. - -drawer_siblings: - - internal kernel map of cpuX's hardware threads within the same - drawer_id. - -drawer_siblings_list: - - human-readable list of cpuX's hardware threads within the same - drawer_id. +CPU topology info is exported via sysfs. Items (attributes) are similar +to /proc/cpuinfo output of some architectures. They reside in +/sys/devices/system/cpu/cpuX/topology/. Please refer to the ABI file: +Documentation/ABI/stable/sysfs-devices-system-cpu. Architecture-neutral, drivers/base/topology.c, exports these attributes. However, the book and drawer related sysfs files will only be created if diff --git a/Documentation/admin-guide/ext4.rst b/Documentation/admin-guide/ext4.rst index d2795ca6821e..4c559e08d11e 100644 --- a/Documentation/admin-guide/ext4.rst +++ b/Documentation/admin-guide/ext4.rst @@ -392,7 +392,7 @@ When mounting an ext4 filesystem, the following option are accepted: dax Use direct access (no page cache). See - Documentation/filesystems/dax.txt. Note that this option is + Documentation/filesystems/dax.rst. Note that this option is incompatible with data=journal. 
inlinecrypt diff --git a/Documentation/admin-guide/hw-vuln/core-scheduling.rst b/Documentation/admin-guide/hw-vuln/core-scheduling.rst new file mode 100644 index 000000000000..7b410aef9c5c --- /dev/null +++ b/Documentation/admin-guide/hw-vuln/core-scheduling.rst @@ -0,0 +1,223 @@ +.. SPDX-License-Identifier: GPL-2.0 + +=============== +Core Scheduling +=============== +Core scheduling support allows userspace to define groups of tasks that can +share a core. These groups can be specified either for security usecases (one +group of tasks doesn't trust another), or for performance usecases (some +workloads may benefit from running on the same core as they don't need the same +hardware resources of the shared core, or may prefer different cores if they +do share hardware resource needs). This document only describes the security +usecase. + +Security usecase +---------------- +A cross-HT attack involves the attacker and victim running on different Hyper +Threads of the same core. MDS and L1TF are examples of such attacks. The only +full mitigation of cross-HT attacks is to disable Hyper Threading (HT). Core +scheduling is a scheduler feature that can mitigate some (not all) cross-HT +attacks. It allows HT to be turned on safely by ensuring that only tasks in a +user-designated trusted group can share a core. This increase in core sharing +can also improve performance; however, it is not guaranteed that performance +will always improve, though that is seen to be the case with a number of real +world workloads. In theory, core scheduling aims to perform at least as well as +when Hyper Threading is disabled. In practice, this is mostly the case though +not always: synchronizing scheduling decisions across 2 or more CPUs in a +core involves additional overhead - especially when the system is lightly +loaded. When ``total_threads <= N_CPUS/2``, the extra overhead may cause core +scheduling to perform more poorly compared to SMT-disabled, where N_CPUS is the +total number of CPUs. Please always measure the performance of your workloads. + +Usage +----- +Core scheduling support is enabled via the ``CONFIG_SCHED_CORE`` config option. +Using this feature, userspace defines groups of tasks that can be co-scheduled +on the same core. The core scheduler uses this information to make sure that +tasks that are not in the same group never run simultaneously on a core, while +doing its best to satisfy the system's scheduling requirements. + +Core scheduling can be enabled via the ``PR_SCHED_CORE`` prctl interface. +This interface provides support for the creation of core scheduling groups, as +well as admission and removal of tasks from created groups:: + + #include <sys/prctl.h> + + int prctl(int option, unsigned long arg2, unsigned long arg3, + unsigned long arg4, unsigned long arg5); + +option: + ``PR_SCHED_CORE`` + +arg2: + Command for operation, must be one of: + + - ``PR_SCHED_CORE_GET`` -- get core_sched cookie of ``pid``. + - ``PR_SCHED_CORE_CREATE`` -- create a new unique cookie for ``pid``. + - ``PR_SCHED_CORE_SHARE_TO`` -- push core_sched cookie to ``pid``. + - ``PR_SCHED_CORE_SHARE_FROM`` -- pull core_sched cookie from ``pid``. + +arg3: + ``pid`` of the task for which the operation applies. + +arg4: + ``pid_type`` for which the operation applies. It is of type ``enum pid_type``. + For example, if arg4 is ``PIDTYPE_TGID``, then the operation of this command + will be performed for all tasks in the task group of ``pid``.
+ +arg5: + userspace pointer to an unsigned long for storing the cookie returned by + the ``PR_SCHED_CORE_GET`` command. Should be 0 for all other commands. + +In order for a process to push a cookie to, or pull a cookie from, another process, it +is required to have the ptrace access mode `PTRACE_MODE_READ_REALCREDS` to that +process. + +Building hierarchies of tasks +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +The simplest way to build hierarchies of threads/processes which share a +cookie and thus a core is to rely on the fact that the core-sched cookie is +inherited across forks/clones and execs, thus setting a cookie for the +'initial' script/executable/daemon will place every spawned child in the +same core-sched group. + +Cookie Transferral +~~~~~~~~~~~~~~~~~~ +Transferring a cookie between the current and other tasks is possible using +PR_SCHED_CORE_SHARE_FROM and PR_SCHED_CORE_SHARE_TO to inherit a cookie from a +specified task or share a cookie with a task. In combination, this allows a +simple helper program to pull a cookie from a task in an existing core +scheduling group and share it with already running tasks. + +Design/Implementation +--------------------- +Each task that is tagged is assigned a cookie internally in the kernel. As +mentioned in `Usage`_, tasks with the same cookie value are assumed to trust +each other and share a core. + +The basic idea is that every schedule event tries to select tasks for all the +siblings of a core such that all the selected tasks running on a core are +trusted (same cookie) at any point in time. Kernel threads are assumed trusted. +The idle task is considered special, as it trusts everything and everything +trusts it. + +During a schedule() event on any sibling of a core, the highest priority task on +the sibling's core is picked and assigned to the sibling calling schedule(), if +the sibling has the task enqueued. For the rest of the siblings in the core, +highest priority task with the same cookie is selected if there is one runnable +in their individual run queues. If a task with the same cookie is not available, +the idle task is selected. The idle task is globally trusted. + +Once a task has been selected for all the siblings in the core, an IPI is sent to +siblings for whom a new task was selected. Siblings on receiving the IPI will +switch to the new task immediately. If an idle task is selected for a sibling, +then the sibling is considered to be in a `forced idle` state. I.e., it may +have tasks on its own runqueue to run, however it will still have to run idle. +More on this in the next section. + +Forced-idling of hyperthreads +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +The scheduler tries its best to find tasks that trust each other such that all +tasks selected to be scheduled are of the highest priority in a core. However, +it is possible that some runqueues had tasks that were incompatible with the +highest priority ones in the core. Favoring security over fairness, one or more +siblings could be forced to select a lower priority task if the highest +priority task is not trusted with respect to the core wide highest priority +task. If a sibling does not have a trusted task to run, it will be forced idle +by the scheduler (idle thread is scheduled to run). + +When the highest priority task is selected to run, a reschedule-IPI is sent to +the sibling to force it into idle.
This results in 4 cases which need to be +considered depending on whether a VM or a regular usermode process was running +on either HT:: + + HT1 (attack) HT2 (victim) + A idle -> user space user space -> idle + B idle -> user space guest -> idle + C idle -> guest user space -> idle + D idle -> guest guest -> idle + +Note that for better performance, we do not wait for the destination CPU +(victim) to enter idle mode. This is because the sending of the IPI would bring +the destination CPU immediately into kernel mode from user space, or cause a VMEXIT +in the case of guests. At best, this would only leak some scheduler metadata +which may not be worth protecting. It is also possible that the IPI is received +too late on some architectures, but this has not been observed in the case of +x86. + +Trust model +~~~~~~~~~~~ +Core scheduling maintains trust relationships amongst groups of tasks by +assigning them a tag that is the same cookie value. +When a system with core scheduling boots, all tasks are considered to trust +each other. This is because the core scheduler does not have information about +trust relationships until userspace uses the above mentioned interfaces to +communicate them. In other words, all tasks have a default cookie value of 0 +and are considered system-wide trusted. The forced-idling of siblings running +cookie-0 tasks is also avoided. + +Once userspace uses the above mentioned interfaces to group sets of tasks, tasks +within such groups are considered to trust each other, but do not trust those +outside. Tasks outside the group also don't trust tasks within. + +Limitations of core-scheduling +------------------------------ +Core scheduling tries to guarantee that only trusted tasks run concurrently on a +core. But there could be a small window of time during which untrusted tasks run +concurrently or the kernel could be running concurrently with a task not trusted by +the kernel. + +IPI processing delays +~~~~~~~~~~~~~~~~~~~~~ +Core scheduling selects only trusted tasks to run together. An IPI is used to notify +the siblings to switch to the new task. But there could be hardware delays in +receiving the IPI on some architectures (on x86, this has not been observed). This may +cause an attacker task to start running on a CPU before its siblings receive the +IPI. Even though the cache is flushed on entry to user mode, victim tasks on siblings +may populate data in the cache and microarchitectural buffers after the attacker +starts to run, and this is a possibility for data leakage. + +Open cross-HT issues that core scheduling does not solve +-------------------------------------------------------- +1. For MDS +~~~~~~~~~~ +Core scheduling cannot protect against MDS attacks between an HT running in +user mode and another running in kernel mode. Even though both HTs run tasks +which trust each other, kernel memory is still considered untrusted. Such +attacks are possible for any combination of sibling CPU modes (host or guest mode). + +2. For L1TF +~~~~~~~~~~~ +Core scheduling cannot protect against an L1TF guest attacker exploiting a +guest or host victim. This is because the guest attacker can craft invalid +PTEs which are not inverted due to a vulnerable guest kernel. The only +solution is to disable EPT (Extended Page Tables). + +For both MDS and L1TF, if the guest vCPUs are configured to not trust each +other (by tagging them separately), then the guest to guest attacks would go away. +Or it could be a system admin policy which considers guest to guest attacks as +a guest problem.
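For illustration, the following rough, untested userspace sketch applies the per-vCPU tagging idea above: it gives every task ID passed on its command line a separate cookie through the ``PR_SCHED_CORE`` interface described under Usage. The ``LOCAL_PIDTYPE_*`` constants are local stand-ins for the kernel's non-exported ``enum pid_type``, and the ``PR_SCHED_CORE_*`` fallback values are assumed to match include/uapi/linux/prctl.h::

  /*
   * Illustrative sketch only: give each task ID named on the command line
   * its own core scheduling cookie, so the tagged tasks (for example vCPU
   * threads) no longer trust one another. Requires CONFIG_SCHED_CORE and
   * PTRACE_MODE_READ_REALCREDS access to the target tasks.
   */
  #include <stdio.h>
  #include <stdlib.h>
  #include <sys/types.h>
  #include <sys/prctl.h>

  #ifndef PR_SCHED_CORE
  #define PR_SCHED_CORE            62
  #define PR_SCHED_CORE_GET        0
  #define PR_SCHED_CORE_CREATE     1
  #define PR_SCHED_CORE_SHARE_TO   2
  #define PR_SCHED_CORE_SHARE_FROM 3
  #endif

  /* Mirrors the kernel's enum pid_type, which is not exported to userspace. */
  #define LOCAL_PIDTYPE_PID        0UL    /* operate on a single task */
  #define LOCAL_PIDTYPE_TGID       1UL    /* operate on a whole thread group */

  int main(int argc, char **argv)
  {
          int i;

          for (i = 1; i < argc; i++) {
                  pid_t tid = (pid_t)atoi(argv[i]);
                  unsigned long cookie = 0;

                  /* Create a new, unique cookie covering only this task. */
                  if (prctl(PR_SCHED_CORE, PR_SCHED_CORE_CREATE,
                            (unsigned long)tid, LOCAL_PIDTYPE_PID, 0UL)) {
                          perror("PR_SCHED_CORE_CREATE");
                          continue;
                  }

                  /* Read the cookie back; arg5 points to an unsigned long. */
                  if (prctl(PR_SCHED_CORE, PR_SCHED_CORE_GET,
                            (unsigned long)tid, LOCAL_PIDTYPE_PID, &cookie))
                          perror("PR_SCHED_CORE_GET");
                  else
                          printf("task %d: cookie %#lx\n", (int)tid, cookie);
          }
          return 0;
  }

A management daemon or VMM wrapper could run such a helper over its vCPU thread IDs; ``PR_SCHED_CORE_SHARE_TO`` and ``PR_SCHED_CORE_SHARE_FROM`` can be used in the same way to copy an existing cookie to or from a task instead of creating a new one.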
+ +Another approach to resolve these would be to make every untrusted task on the +system not trust every other untrusted task. While this could reduce +parallelism of the untrusted tasks, it would still solve the above issues while +allowing system processes (trusted tasks) to share a core. + +3. Protecting the kernel (IRQ, syscall, VMEXIT) +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +Unfortunately, core scheduling does not protect kernel contexts running on +sibling hyperthreads from one another. Prototypes of mitigations have been posted +to LKML to solve this, but it is debatable whether such windows are practically +exploitable, and whether the performance overhead of the prototypes is worth +it (not to mention the added code complexity). + +Other Use cases +--------------- +The main use case for core scheduling is mitigating the cross-HT vulnerabilities +with SMT enabled. There are other use cases where this feature could be used: + +- Isolating tasks that need a whole core: Examples include realtime tasks, tasks + that use SIMD instructions, etc. +- Gang scheduling: Requirements for a group of tasks that need to be scheduled + together could also be realized using core scheduling. One example is vCPUs of + a VM. diff --git a/Documentation/admin-guide/hw-vuln/index.rst b/Documentation/admin-guide/hw-vuln/index.rst index ca4dbdd9016d..f12cda55538b 100644 --- a/Documentation/admin-guide/hw-vuln/index.rst +++ b/Documentation/admin-guide/hw-vuln/index.rst @@ -15,3 +15,4 @@ are configurable at compile, boot or run time. tsx_async_abort multihit.rst special-register-buffer-data-sampling.rst + core-scheduling.rst diff --git a/Documentation/admin-guide/hw-vuln/special-register-buffer-data-sampling.rst b/Documentation/admin-guide/hw-vuln/special-register-buffer-data-sampling.rst index 3b1ce68d2456..966c9b3296ea 100644 --- a/Documentation/admin-guide/hw-vuln/special-register-buffer-data-sampling.rst +++ b/Documentation/admin-guide/hw-vuln/special-register-buffer-data-sampling.rst @@ -3,7 +3,8 @@ SRBDS - Special Register Buffer Data Sampling ============================================= -SRBDS is a hardware vulnerability that allows MDS :doc:`mds` techniques to +SRBDS is a hardware vulnerability that allows MDS +Documentation/admin-guide/hw-vuln/mds.rst techniques to infer values returned from special register accesses. Special register accesses are accesses to off core registers. According to Intel's evaluation, the special register reads that have a security expectation of privacy are diff --git a/Documentation/admin-guide/kdump/kdump.rst b/Documentation/admin-guide/kdump/kdump.rst index 75a9dd98e76e..cb30ca3df27c 100644 --- a/Documentation/admin-guide/kdump/kdump.rst +++ b/Documentation/admin-guide/kdump/kdump.rst @@ -2,7 +2,7 @@ Documentation for Kdump - The kexec-based Crash Dumping Solution ================================================================ -This document includes overview, setup and installation, and analysis +This document includes overview, setup, installation, and analysis information. Overview @@ -13,9 +13,9 @@ dump of the system kernel's memory needs to be taken (for example, when the system panics). The system kernel's memory image is preserved across the reboot and is accessible to the dump-capture kernel. -You can use common commands, such as cp and scp, to copy the -memory image to a dump file on the local disk, or across the network to -a remote system.
+You can use common commands, such as cp, scp or makedumpfile to copy +the memory image to a dump file on the local disk, or across the network +to a remote system. Kdump and kexec are currently supported on the x86, x86_64, ppc64, ia64, s390x, arm and arm64 architectures. @@ -26,13 +26,15 @@ the dump-capture kernel. This ensures that ongoing Direct Memory Access The kexec -p command loads the dump-capture kernel into this reserved memory. -On x86 machines, the first 640 KB of physical memory is needed to boot, -regardless of where the kernel loads. Therefore, kexec backs up this -region just before rebooting into the dump-capture kernel. +On x86 machines, the first 640 KB of physical memory is needed for boot, +regardless of where the kernel loads. For simpler handling, the whole +low 1M is reserved to avoid any later kernel or device driver writing +data into this area. Like this, the low 1M can be reused as system RAM +by kdump kernel without extra handling. -Similarly on PPC64 machines first 32KB of physical memory is needed for -booting regardless of where the kernel is loaded and to support 64K page -size kexec backs up the first 64KB memory. +On PPC64 machines first 32KB of physical memory is needed for booting +regardless of where the kernel is loaded and to support 64K page size +kexec backs up the first 64KB memory. For s390x, when kdump is triggered, the crashkernel region is exchanged with the region [0, crashkernel region size] and then the kdump kernel @@ -46,14 +48,14 @@ passed to the dump-capture kernel through the elfcorehdr= boot parameter. Optionally the size of the ELF header can also be passed when using the elfcorehdr=[size[KMG]@]offset[KMG] syntax. - With the dump-capture kernel, you can access the memory image through /proc/vmcore. This exports the dump as an ELF-format file that you can -write out using file copy commands such as cp or scp. Further, you can -use analysis tools such as the GNU Debugger (GDB) and the Crash tool to -debug the dump file. This method ensures that the dump pages are correctly -ordered. - +write out using file copy commands such as cp or scp. You can also use +makedumpfile utility to analyze and write out filtered contents with +options, e.g with '-d 31' it will only write out kernel data. Further, +you can use analysis tools such as the GNU Debugger (GDB) and the Crash +tool to debug the dump file. This method ensures that the dump pages are +correctly ordered. Setup and Installation ====================== @@ -125,9 +127,18 @@ dump-capture kernels for enabling kdump support. System kernel config options ---------------------------- -1) Enable "kexec system call" in "Processor type and features.":: +1) Enable "kexec system call" or "kexec file based system call" in + "Processor type and features.":: + + CONFIG_KEXEC=y or CONFIG_KEXEC_FILE=y + + And both of them will select KEXEC_CORE:: - CONFIG_KEXEC=y + CONFIG_KEXEC_CORE=y + + Subsequently, CRASH_CORE is selected by KEXEC_CORE:: + + CONFIG_CRASH_CORE=y 2) Enable "sysfs file system support" in "Filesystem" -> "Pseudo filesystems." This is usually enabled by default:: @@ -175,17 +186,19 @@ Dump-capture kernel config options (Arch Dependent, i386 and x86_64) CONFIG_HIGHMEM4G -2) On i386 and x86_64, disable symmetric multi-processing support - under "Processor type and features":: +2) With CONFIG_SMP=y, usually nr_cpus=1 need specified on the kernel + command line when loading the dump-capture kernel because one + CPU is enough for kdump kernel to dump vmcore on most of systems. 
- CONFIG_SMP=n + However, you can also specify nr_cpus=X to enable multiple processors + in kdump kernel. In this case, "disable_cpu_apicid=" is needed to + tell kdump kernel which cpu is 1st kernel's BSP. Please refer to + admin-guide/kernel-parameters.txt for more details. - (If CONFIG_SMP=y, then specify maxcpus=1 on the kernel command line - when loading the dump-capture kernel, see section "Load the Dump-capture - Kernel".) + With CONFIG_SMP=n, the above things are not related. -3) If one wants to build and use a relocatable kernel, - Enable "Build a relocatable kernel" support under "Processor type and +3) A relocatable kernel is suggested to be built by default. If not yet, + enable "Build a relocatable kernel" support under "Processor type and features":: CONFIG_RELOCATABLE=y @@ -232,7 +245,7 @@ Dump-capture kernel config options (Arch Dependent, ia64) as a dump-capture kernel if desired. The crashkernel region can be automatically placed by the system - kernel at run time. This is done by specifying the base address as 0, + kernel at runtime. This is done by specifying the base address as 0, or omitting it all together:: crashkernel=256M@0 @@ -241,10 +254,6 @@ Dump-capture kernel config options (Arch Dependent, ia64) crashkernel=256M - If the start address is specified, note that the start address of the - kernel will be aligned to 64Mb, so if the start address is not then - any space below the alignment point will be wasted. - Dump-capture kernel config options (Arch Dependent, arm) ---------------------------------------------------------- @@ -260,46 +269,82 @@ Dump-capture kernel config options (Arch Dependent, arm64) on non-VHE systems even if it is configured. This is because the CPU will not be reset to EL2 on panic. -Extended crashkernel syntax +crashkernel syntax =========================== +1) crashkernel=size@offset -While the "crashkernel=size[@offset]" syntax is sufficient for most -configurations, sometimes it's handy to have the reserved memory dependent -on the value of System RAM -- that's mostly for distributors that pre-setup -the kernel command line to avoid a unbootable system after some memory has -been removed from the machine. + Here 'size' specifies how much memory to reserve for the dump-capture kernel + and 'offset' specifies the beginning of this reserved memory. For example, + "crashkernel=64M@16M" tells the system kernel to reserve 64 MB of memory + starting at physical address 0x01000000 (16MB) for the dump-capture kernel. -The syntax is:: + The crashkernel region can be automatically placed by the system + kernel at run time. This is done by specifying the base address as 0, + or omitting it all together:: - crashkernel=<range1>:<size1>[,<range2>:<size2>,...][@offset] - range=start-[end] + crashkernel=256M@0 -For example:: + or:: - crashkernel=512M-2G:64M,2G-:128M + crashkernel=256M -This would mean: + If the start address is specified, note that the start address of the + kernel will be aligned to a value (which is Arch dependent), so if the + start address is not then any space below the alignment point will be + wasted. 
- 1) if the RAM is smaller than 512M, then don't reserve anything - (this is the "rescue" case) - 2) if the RAM size is between 512M and 2G (exclusive), then reserve 64M - 3) if the RAM size is larger than 2G, then reserve 128M +2) range1:size1[,range2:size2,...][@offset] + While the "crashkernel=size[@offset]" syntax is sufficient for most + configurations, sometimes it's handy to have the reserved memory dependent + on the value of System RAM -- that's mostly for distributors that pre-setup + the kernel command line to avoid a unbootable system after some memory has + been removed from the machine. + The syntax is:: -Boot into System Kernel -======================= + crashkernel=<range1>:<size1>[,<range2>:<size2>,...][@offset] + range=start-[end] + + For example:: + + crashkernel=512M-2G:64M,2G-:128M + This would mean: + + 1) if the RAM is smaller than 512M, then don't reserve anything + (this is the "rescue" case) + 2) if the RAM size is between 512M and 2G (exclusive), then reserve 64M + 3) if the RAM size is larger than 2G, then reserve 128M + +3) crashkernel=size,high and crashkernel=size,low + + If memory above 4G is preferred, crashkernel=size,high can be used to + fulfill that. With it, physical memory is allowed to be allocated from top, + so could be above 4G if system has more than 4G RAM installed. Otherwise, + memory region will be allocated below 4G if available. + + When crashkernel=X,high is passed, kernel could allocate physical memory + region above 4G, low memory under 4G is needed in this case. There are + three ways to get low memory: + + 1) Kernel will allocate at least 256M memory below 4G automatically + if crashkernel=Y,low is not specified. + 2) Let user specify low memory size instead. + 3) Specified value 0 will disable low memory allocation:: + + crashkernel=0,low + +Boot into System Kernel +----------------------- 1) Update the boot loader (such as grub, yaboot, or lilo) configuration files as necessary. -2) Boot the system kernel with the boot parameter "crashkernel=Y@X", - where Y specifies how much memory to reserve for the dump-capture kernel - and X specifies the beginning of this reserved memory. For example, - "crashkernel=64M@16M" tells the system kernel to reserve 64 MB of memory - starting at physical address 0x01000000 (16MB) for the dump-capture kernel. +2) Boot the system kernel with the boot parameter "crashkernel=Y@X". - On x86 and x86_64, use "crashkernel=64M@16M". + On x86 and x86_64, use "crashkernel=Y[@X]". Most of the time, the + start address 'X' is not necessary, kernel will search a suitable + area. Unless an explicit start address is expected. On ppc64, use "crashkernel=128M@32M". @@ -331,8 +376,8 @@ of dump-capture kernel. Following is the summary. For i386 and x86_64: - - Use vmlinux if kernel is not relocatable. - Use bzImage/vmlinuz if kernel is relocatable. + - Use vmlinux if kernel is not relocatable. For ppc64: @@ -392,7 +437,7 @@ loading dump-capture kernel. 
For i386, x86_64 and ia64: - "1 irqpoll maxcpus=1 reset_devices" + "1 irqpoll nr_cpus=1 reset_devices" For ppc64: @@ -400,7 +445,7 @@ For ppc64: For s390x: - "1 maxcpus=1 cgroup_disable=memory" + "1 nr_cpus=1 cgroup_disable=memory" For arm: @@ -408,7 +453,7 @@ For arm: For arm64: - "1 maxcpus=1 reset_devices" + "1 nr_cpus=1 reset_devices" Notes on loading the dump-capture kernel: @@ -488,6 +533,10 @@ the following command:: cp /proc/vmcore <dump-file> +You can also use makedumpfile utility to write out the dump file +with specified options to filter out unwanted contents, e.g:: + + makedumpfile -l --message-level 1 -d 31 /proc/vmcore <dump-file> Analysis ======== @@ -535,8 +584,7 @@ This will cause a kdump to occur at the add_taint()->panic() call. Contact ======= -- Vivek Goyal (vgoyal@redhat.com) -- Maneesh Soni (maneesh@in.ibm.com) +- kexec@lists.infradead.org GDB macros ========== diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt index cb89dbdedc46..2991f6e692bd 100644 --- a/Documentation/admin-guide/kernel-parameters.txt +++ b/Documentation/admin-guide/kernel-parameters.txt @@ -113,7 +113,7 @@ the GPE dispatcher. This facility can be used to prevent such uncontrolled GPE floodings. - Format: <byte> + Format: <byte> or <bitmap-list> acpi_no_auto_serialize [HW,ACPI] Disable auto-serialization of AML methods @@ -581,6 +581,28 @@ loops can be debugged more effectively on production systems. + clocksource.max_cswd_read_retries= [KNL] + Number of clocksource_watchdog() retries due to + external delays before the clock will be marked + unstable. Defaults to three retries, that is, + four attempts to read the clock under test. + + clocksource.verify_n_cpus= [KNL] + Limit the number of CPUs checked for clocksources + marked with CLOCK_SOURCE_VERIFY_PERCPU that + are marked unstable due to excessive skew. + A negative value says to check all CPUs, while + zero says not to check any. Values larger than + nr_cpu_ids are silently truncated to nr_cpu_ids. + The actual CPUs are chosen randomly, with + no replacement if the same CPU is chosen twice. + + clocksource-wdtest.holdoff= [KNL] + Set the time in seconds that the clocksource + watchdog test waits before commencing its tests. + Defaults to zero when built as a module and to + 10 seconds when built into the kernel. + clearcpuid=BITNUM[,BITNUM...] [X86] Disable CPUID feature X for the kernel. See arch/x86/include/asm/cpufeatures.h for the valid bit @@ -3244,7 +3266,7 @@ noclflush [BUGS=X86] Don't use the CLFLUSH instruction - nodelayacct [KNL] Disable per-task delay accounting + delayacct [KNL] Enable per-task delay accounting nodsp [SH] Disable hardware DSP at boot time. @@ -3513,6 +3535,9 @@ nr_uarts= [SERIAL] maximum number of UARTs to be registered. + numa=off [KNL, ARM64, PPC, RISCV, SPARC, X86] Disable NUMA, Only + set up a single NUMA node spanning all memory. + numa_balancing= [KNL,ARM64,PPC,RISCV,S390,X86] Enable or disable automatic NUMA balancing. Allowed values are enable and disable @@ -3566,6 +3591,12 @@ off: turn off poisoning (default) on: turn on poisoning + page_reporting.page_reporting_order= + [KNL] Minimal page reporting order + Format: <integer> + Adjust the minimal page reporting order. The page + reporting is disabled when it exceeds (MAX_ORDER-1). + panic= [KNL] Kernel behaviour on panic: delay <timeout> timeout > 0: seconds before rebooting timeout = 0: wait forever @@ -4775,11 +4806,6 @@ Reserves a hole at the top of the kernel virtual address space. 
- reservelow= [X86] - Format: nn[K] - Set the amount of memory to reserve for BIOS at - the bottom of the address space. - reset_devices [KNL] Force drivers to reset the underlying device during initialization. @@ -5283,6 +5309,14 @@ exception. Default behavior is by #AC if both features are enabled in hardware. + ratelimit:N - + Set system wide rate limit to N bus locks + per second for bus lock detection. + 0 < N <= 1000. + + N/A for split lock detection. + + If an #AC exception is hit in the kernel or in firmware (i.e. not while executing in user mode) the kernel will oops in either "warn" or "fatal" diff --git a/Documentation/admin-guide/lockup-watchdogs.rst b/Documentation/admin-guide/lockup-watchdogs.rst index 290840c160af..3e09284a8b9b 100644 --- a/Documentation/admin-guide/lockup-watchdogs.rst +++ b/Documentation/admin-guide/lockup-watchdogs.rst @@ -39,7 +39,7 @@ in principle, they should work in any architecture where these subsystems are present. A periodic hrtimer runs to generate interrupts and kick the watchdog -task. An NMI perf event is generated every "watchdog_thresh" +job. An NMI perf event is generated every "watchdog_thresh" (compile-time initialized to 10 and configurable through sysctl of the same name) seconds to check for hardlockups. If any CPU in the system does not receive any hrtimer interrupt during that time the @@ -47,7 +47,7 @@ does not receive any hrtimer interrupt during that time the generate a kernel warning or call panic, depending on the configuration. -The watchdog task is a high priority kernel thread that updates a +The watchdog job runs in a stop scheduling thread that updates a timestamp every time it is scheduled. If that timestamp is not updated for 2*watchdog_thresh seconds (the softlockup threshold) the 'softlockup detector' (coded inside the hrtimer callback function) diff --git a/Documentation/admin-guide/media/bt8xx.rst b/Documentation/admin-guide/media/bt8xx.rst index 1382ada1e38e..3589f6ab7e46 100644 --- a/Documentation/admin-guide/media/bt8xx.rst +++ b/Documentation/admin-guide/media/bt8xx.rst @@ -15,11 +15,12 @@ Authors: General information ------------------- -This class of cards has a bt878a as the PCI interface, and require the bttv driver -for accessing the i2c bus and the gpio pins of the bt8xx chipset. +This class of cards has a bt878a as the PCI interface, and require the bttv +driver for accessing the i2c bus and the gpio pins of the bt8xx chipset. -Please see :doc:`bttv-cardlist` for a complete list of Cards based on the -Conexant Bt8xx PCI bridge supported by the Linux Kernel. +Please see Documentation/admin-guide/media/bttv-cardlist.rst for a complete +list of Cards based on the Conexant Bt8xx PCI bridge supported by the +Linux Kernel. In order to be able to compile the kernel, some config options should be enabled:: @@ -80,7 +81,7 @@ for dvb-bt8xx drivers by passing modprobe parameters may be necessary. Running TwinHan and Clones ~~~~~~~~~~~~~~~~~~~~~~~~~~ -As shown at :doc:`bttv-cardlist`, TwinHan and +As shown at Documentation/admin-guide/media/bttv-cardlist.rst, TwinHan and clones use ``card=113`` modprobe parameter. So, in order to properly detect it for devices without EEPROM, you should use:: @@ -105,12 +106,12 @@ The autodetected values are determined by the cards' "response string". In your logs see f. ex.: dst_get_device_id: Recognize [DSTMCI]. For bug reports please send in a complete log with verbose=4 activated. -Please also see :doc:`ci`. +Please also see Documentation/admin-guide/media/ci.rst. 
Running multiple cards ~~~~~~~~~~~~~~~~~~~~~~ -See :doc:`bttv-cardlist` for a complete list of +See Documentation/admin-guide/media/bttv-cardlist.rst for a complete list of Card ID. Some examples: =========================== === diff --git a/Documentation/admin-guide/media/bttv.rst b/Documentation/admin-guide/media/bttv.rst index 0ef1f203104d..125f6f47123d 100644 --- a/Documentation/admin-guide/media/bttv.rst +++ b/Documentation/admin-guide/media/bttv.rst @@ -24,7 +24,8 @@ If your board has digital TV, you'll also need:: ./scripts/config -m DVB_BT8XX -In this case, please see :doc:`bt8xx` for additional notes. +In this case, please see Documentation/admin-guide/media/bt8xx.rst +for additional notes. Make bttv work with your card ----------------------------- @@ -39,7 +40,7 @@ If it doesn't bttv likely could not autodetect your card and needs some insmod options. The most important insmod option for bttv is "card=n" to select the correct card type. If you get video but no sound you've very likely specified the wrong (or no) card type. A list of supported -cards is in :doc:`bttv-cardlist`. +cards is in Documentation/admin-guide/media/bttv-cardlist.rst. If bttv takes very long to load (happens sometimes with the cheap cards which have no tuner), try adding this to your modules configuration @@ -57,8 +58,8 @@ directory should be enough for it to be autoload during the driver's probing mode (e. g. when the Kernel boots or when the driver is manually loaded via ``modprobe`` command). -If your card isn't listed in :doc:`bttv-cardlist` or if you have -trouble making audio work, please read :ref:`still_doesnt_work`. +If your card isn't listed in Documentation/admin-guide/media/bttv-cardlist.rst +or if you have trouble making audio work, please read :ref:`still_doesnt_work`. Autodetecting cards @@ -77,8 +78,8 @@ the Subsystem ID in the second line, looks like this: only bt878-based cards can have a subsystem ID (which does not mean that every card really has one). bt848 cards can't have a Subsystem ID and therefore can't be autodetected. There is a list with the ID's -at :doc:`bttv-cardlist` (in case you are interested or want to mail -patches with updates). +at Documentation/admin-guide/media/bttv-cardlist.rst +(in case you are interested or want to mail patches with updates). .. _still_doesnt_work: @@ -259,15 +260,15 @@ bug. It is very helpful if you can tell where exactly it broke With a hard freeze you probably doesn't find anything in the logfiles. The only way to capture any kernel messages is to hook up a serial console and let some terminal application log the messages. /me uses -screen. See :doc:`/admin-guide/serial-console` for details on setting -up a serial console. +screen. See Documentation/admin-guide/serial-console.rst for details on +setting up a serial console. -Read :doc:`/admin-guide/bug-hunting` to learn how to get any useful +Read Documentation/admin-guide/bug-hunting.rst to learn how to get any useful information out of a register+stack dump printed by the kernel on protection faults (so-called "kernel oops"). If you run into some kind of deadlock, you can try to dump a call trace -for each process using sysrq-t (see :doc:`/admin-guide/sysrq`). +for each process using sysrq-t (see Documentation/admin-guide/sysrq.rst). This way it is possible to figure where *exactly* some process in "D" state is stuck. 
diff --git a/Documentation/admin-guide/media/index.rst b/Documentation/admin-guide/media/index.rst index 6e0d2bae7154..c676af665111 100644 --- a/Documentation/admin-guide/media/index.rst +++ b/Documentation/admin-guide/media/index.rst @@ -11,12 +11,14 @@ its supported drivers. Please see: -- :doc:`/userspace-api/media/index` - for the userspace APIs used on media devices. +Documentation/userspace-api/media/index.rst -- :doc:`/driver-api/media/index` - for driver development information and Kernel APIs used by - media devices; + - for the userspace APIs used on media devices. + +Documentation/driver-api/media/index.rst + + - for driver development information and Kernel APIs used by + media devices; The media subsystem =================== diff --git a/Documentation/admin-guide/media/ipu3.rst b/Documentation/admin-guide/media/ipu3.rst index f59697c7b374..52c1c04173da 100644 --- a/Documentation/admin-guide/media/ipu3.rst +++ b/Documentation/admin-guide/media/ipu3.rst @@ -234,22 +234,23 @@ The IPU3 ImgU pipelines can be configured using the Media Controller, defined at Running mode and firmware binary selection ------------------------------------------ -ImgU works based on firmware, currently the ImgU firmware support run 2 pipes in -time-sharing with single input frame data. Each pipe can run at certain mode - -"VIDEO" or "STILL", "VIDEO" mode is commonly used for video frames capture, and -"STILL" is used for still frame capture. However, you can also select "VIDEO" to -capture still frames if you want to capture images with less system load and -power. For "STILL" mode, ImgU will try to use smaller BDS factor and output -larger bayer frame for further YUV processing than "VIDEO" mode to get high -quality images. Besides, "STILL" mode need XNR3 to do noise reduction, hence -"STILL" mode will need more power and memory bandwidth than "VIDEO" mode. TNR -will be enabled in "VIDEO" mode and bypassed by "STILL" mode. ImgU is running at -“VIDEO” mode by default, the user can use v4l2 control V4L2_CID_INTEL_IPU3_MODE -(currently defined in drivers/staging/media/ipu3/include/intel-ipu3.h) to query -and set the running mode. For user, there is no difference for buffer queueing -between the "VIDEO" and "STILL" mode, mandatory input and main output node -should be enabled and buffers need be queued, the statistics and the view-finder -queues are optional. +ImgU works based on firmware, currently the ImgU firmware support run 2 pipes +in time-sharing with single input frame data. Each pipe can run at certain mode +- "VIDEO" or "STILL", "VIDEO" mode is commonly used for video frames capture, +and "STILL" is used for still frame capture. However, you can also select +"VIDEO" to capture still frames if you want to capture images with less system +load and power. For "STILL" mode, ImgU will try to use smaller BDS factor and +output larger bayer frame for further YUV processing than "VIDEO" mode to get +high quality images. Besides, "STILL" mode need XNR3 to do noise reduction, +hence "STILL" mode will need more power and memory bandwidth than "VIDEO" mode. +TNR will be enabled in "VIDEO" mode and bypassed by "STILL" mode. ImgU is +running at "VIDEO" mode by default, the user can use v4l2 control +V4L2_CID_INTEL_IPU3_MODE (currently defined in +drivers/staging/media/ipu3/include/uapi/intel-ipu3.h) to query and set the +running mode. 
For user, there is no difference for buffer queueing between the +"VIDEO" and "STILL" mode, mandatory input and main output node should be +enabled and buffers need be queued, the statistics and the view-finder queues +are optional. The firmware binary will be selected according to current running mode, such log "using binary if_to_osys_striped " or "using binary if_to_osys_primary_striped" @@ -586,7 +587,7 @@ preserved. References ========== -.. [#f5] drivers/staging/media/ipu3/include/intel-ipu3.h +.. [#f5] drivers/staging/media/ipu3/include/uapi/intel-ipu3.h .. [#f1] https://github.com/intel/nvt diff --git a/Documentation/admin-guide/media/saa7134.rst b/Documentation/admin-guide/media/saa7134.rst index 7ab9c70b9abe..51eae7eb5ab7 100644 --- a/Documentation/admin-guide/media/saa7134.rst +++ b/Documentation/admin-guide/media/saa7134.rst @@ -50,7 +50,8 @@ To build and install, you should run:: Once the new Kernel is booted, saa7134 driver should be loaded automatically. Depending on the card you might have to pass ``card=<nr>`` as insmod option. -If so, please check :doc:`saa7134-cardlist` for valid choices. +If so, please check Documentation/admin-guide/media/saa7134-cardlist.rst +for valid choices. Once you have your card type number, you can pass a modules configuration via a file (usually, it is either ``/etc/modules.conf`` or some file at diff --git a/Documentation/admin-guide/pm/cpuidle.rst b/Documentation/admin-guide/pm/cpuidle.rst index 10fde58d0869..aec2cd2aaea7 100644 --- a/Documentation/admin-guide/pm/cpuidle.rst +++ b/Documentation/admin-guide/pm/cpuidle.rst @@ -347,81 +347,8 @@ for tickless systems. It follows the same basic strategy as the ``menu`` `one <menu-gov_>`_: it always tries to find the deepest idle state suitable for the given conditions. However, it applies a different approach to that problem. -First, it does not use sleep length correction factors, but instead it attempts -to correlate the observed idle duration values with the available idle states -and use that information to pick up the idle state that is most likely to -"match" the upcoming CPU idle interval. Second, it does not take the tasks -that were running on the given CPU in the past and are waiting on some I/O -operations to complete now at all (there is no guarantee that they will run on -the same CPU when they become runnable again) and the pattern detection code in -it avoids taking timer wakeups into account. It also only uses idle duration -values less than the current time till the closest timer (with the scheduler -tick excluded) for that purpose. - -Like in the ``menu`` governor `case <menu-gov_>`_, the first step is to obtain -the *sleep length*, which is the time until the closest timer event with the -assumption that the scheduler tick will be stopped (that also is the upper bound -on the time until the next CPU wakeup). That value is then used to preselect an -idle state on the basis of three metrics maintained for each idle state provided -by the ``CPUIdle`` driver: ``hits``, ``misses`` and ``early_hits``. - -The ``hits`` and ``misses`` metrics measure the likelihood that a given idle -state will "match" the observed (post-wakeup) idle duration if it "matches" the -sleep length. 
They both are subject to decay (after a CPU wakeup) every time -the target residency of the idle state corresponding to them is less than or -equal to the sleep length and the target residency of the next idle state is -greater than the sleep length (that is, when the idle state corresponding to -them "matches" the sleep length). The ``hits`` metric is increased if the -former condition is satisfied and the target residency of the given idle state -is less than or equal to the observed idle duration and the target residency of -the next idle state is greater than the observed idle duration at the same time -(that is, it is increased when the given idle state "matches" both the sleep -length and the observed idle duration). In turn, the ``misses`` metric is -increased when the given idle state "matches" the sleep length only and the -observed idle duration is too short for its target residency. - -The ``early_hits`` metric measures the likelihood that a given idle state will -"match" the observed (post-wakeup) idle duration if it does not "match" the -sleep length. It is subject to decay on every CPU wakeup and it is increased -when the idle state corresponding to it "matches" the observed (post-wakeup) -idle duration and the target residency of the next idle state is less than or -equal to the sleep length (i.e. the idle state "matching" the sleep length is -deeper than the given one). - -The governor walks the list of idle states provided by the ``CPUIdle`` driver -and finds the last (deepest) one with the target residency less than or equal -to the sleep length. Then, the ``hits`` and ``misses`` metrics of that idle -state are compared with each other and it is preselected if the ``hits`` one is -greater (which means that that idle state is likely to "match" the observed idle -duration after CPU wakeup). If the ``misses`` one is greater, the governor -preselects the shallower idle state with the maximum ``early_hits`` metric -(or if there are multiple shallower idle states with equal ``early_hits`` -metric which also is the maximum, the shallowest of them will be preselected). -[If there is a wakeup latency constraint coming from the `PM QoS framework -<cpu-pm-qos_>`_ which is hit before reaching the deepest idle state with the -target residency within the sleep length, the deepest idle state with the exit -latency within the constraint is preselected without consulting the ``hits``, -``misses`` and ``early_hits`` metrics.] - -Next, the governor takes several idle duration values observed most recently -into consideration and if at least a half of them are greater than or equal to -the target residency of the preselected idle state, that idle state becomes the -final candidate to ask for. Otherwise, the average of the most recent idle -duration values below the target residency of the preselected idle state is -computed and the governor walks the idle states shallower than the preselected -one and finds the deepest of them with the target residency within that average. -That idle state is then taken as the final candidate to ask for. - -Still, at this point the governor may need to refine the idle state selection if -it has not decided to `stop the scheduler tick <idle-cpus-and-tick_>`_. That -generally happens if the target residency of the idle state selected so far is -less than the tick period and the tick has not been stopped already (in a -previous iteration of the idle loop). 
Then, like in the ``menu`` governor -`case <menu-gov_>`_, the sleep length used in the previous computations may not -reflect the real time until the closest timer event and if it really is greater -than that time, a shallower state with a suitable target residency may need to -be selected. - +.. kernel-doc:: drivers/cpuidle/governors/teo.c + :doc: teo-description .. _idle-states-representation: diff --git a/Documentation/admin-guide/pm/intel_idle.rst b/Documentation/admin-guide/pm/intel_idle.rst index 89309e1b0e48..b799a43da62e 100644 --- a/Documentation/admin-guide/pm/intel_idle.rst +++ b/Documentation/admin-guide/pm/intel_idle.rst @@ -20,8 +20,8 @@ Nehalem and later generations of Intel processors, but the level of support for a particular processor model in it depends on whether or not it recognizes that processor model and may also depend on information coming from the platform firmware. [To understand ``intel_idle`` it is necessary to know how ``CPUIdle`` -works in general, so this is the time to get familiar with :doc:`cpuidle` if you -have not done that yet.] +works in general, so this is the time to get familiar with +Documentation/admin-guide/pm/cpuidle.rst if you have not done that yet.] ``intel_idle`` uses the ``MWAIT`` instruction to inform the processor that the logical CPU executing it is idle and so it may be possible to put some of the @@ -53,7 +53,8 @@ processor) corresponding to them depends on the processor model and it may also depend on the configuration of the platform. In order to create a list of available idle states required by the ``CPUIdle`` -subsystem (see :ref:`idle-states-representation` in :doc:`cpuidle`), +subsystem (see :ref:`idle-states-representation` in +Documentation/admin-guide/pm/cpuidle.rst), ``intel_idle`` can use two sources of information: static tables of idle states for different processor models included in the driver itself and the ACPI tables of the system. The former are always used if the processor model at hand is @@ -98,7 +99,8 @@ states may not be enabled by default if there are no matching entries in the preliminary list of idle states coming from the ACPI tables. In that case user space still can enable them later (on a per-CPU basis) with the help of the ``disable`` idle state attribute in ``sysfs`` (see -:ref:`idle-states-representation` in :doc:`cpuidle`). This basically means that +:ref:`idle-states-representation` in +Documentation/admin-guide/pm/cpuidle.rst). This basically means that the idle states "known" to the driver may not be enabled by default if they have not been exposed by the platform firmware (through the ACPI tables). @@ -186,7 +188,8 @@ be desirable. In practice, it is only really necessary to do that if the idle states in question cannot be enabled during system startup, because in the working state of the system the CPU power management quality of service (PM QoS) feature can be used to prevent ``CPUIdle`` from touching those idle states -even if they have been enumerated (see :ref:`cpu-pm-qos` in :doc:`cpuidle`). +even if they have been enumerated (see :ref:`cpu-pm-qos` in +Documentation/admin-guide/pm/cpuidle.rst). Setting ``max_cstate`` to 0 causes the ``intel_idle`` initialization to fail. 
The ``no_acpi`` and ``use_acpi`` module parameters (recognized by ``intel_idle`` @@ -202,7 +205,8 @@ Namely, the positions of the bits that are set in the ``states_off`` value are the indices of idle states to be disabled by default (as reflected by the names of the corresponding idle state directories in ``sysfs``, :file:`state0`, :file:`state1` ... :file:`state<i>` ..., where ``<i>`` is the index of the given -idle state; see :ref:`idle-states-representation` in :doc:`cpuidle`). +idle state; see :ref:`idle-states-representation` in +Documentation/admin-guide/pm/cpuidle.rst). For example, if ``states_off`` is equal to 3, the driver will disable idle states 0 and 1 by default, and if it is equal to 8, idle state 3 will be diff --git a/Documentation/admin-guide/pm/intel_pstate.rst b/Documentation/admin-guide/pm/intel_pstate.rst index df29b4f1f219..d5043cd8d2f5 100644 --- a/Documentation/admin-guide/pm/intel_pstate.rst +++ b/Documentation/admin-guide/pm/intel_pstate.rst @@ -18,8 +18,8 @@ General Information (``CPUFreq``). It is a scaling driver for the Sandy Bridge and later generations of Intel processors. Note, however, that some of those processors may not be supported. [To understand ``intel_pstate`` it is necessary to know -how ``CPUFreq`` works in general, so this is the time to read :doc:`cpufreq` if -you have not done that yet.] +how ``CPUFreq`` works in general, so this is the time to read +Documentation/admin-guide/pm/cpufreq.rst if you have not done that yet.] For the processors supported by ``intel_pstate``, the P-state concept is broader than just an operating frequency or an operating performance point (see the @@ -365,6 +365,9 @@ argument is passed to the kernel in the command line. inclusive) including both turbo and non-turbo P-states (see `Turbo P-states Support`_). + This attribute is present only if the value exposed by it is the same + for all of the CPUs in the system. + The value of this attribute is not affected by the ``no_turbo`` setting described `below <no_turbo_attr_>`_. @@ -374,6 +377,9 @@ argument is passed to the kernel in the command line. Ratio of the `turbo range <turbo_>`_ size to the size of the entire range of supported P-states, in percent. + This attribute is present only if the value exposed by it is the same + for all of the CPUs in the system. + This attribute is read-only. .. _no_turbo_attr: @@ -445,8 +451,9 @@ Interpretation of Policy Attributes ----------------------------------- The interpretation of some ``CPUFreq`` policy attributes described in -:doc:`cpufreq` is special with ``intel_pstate`` as the current scaling driver -and it generally depends on the driver's `operation mode <Operation Modes_>`_. +Documentation/admin-guide/pm/cpufreq.rst is special with ``intel_pstate`` +as the current scaling driver and it generally depends on the driver's +`operation mode <Operation Modes_>`_. First of all, the values of the ``cpuinfo_max_freq``, ``cpuinfo_min_freq`` and ``scaling_cur_freq`` attributes are produced by applying a processor-specific diff --git a/Documentation/admin-guide/pstore-blk.rst b/Documentation/admin-guide/pstore-blk.rst index 49d8149f8d32..2d22ead9520e 100644 --- a/Documentation/admin-guide/pstore-blk.rst +++ b/Documentation/admin-guide/pstore-blk.rst @@ -45,15 +45,18 @@ blkdev The block device to use. Most of the time, it is a partition of block device. It's required for pstore/blk. It is also used for MTD device. 
-It accepts the following variants for block device: +When pstore/blk is built as a module, "blkdev" accepts the following variants: -1. <hex_major><hex_minor> device number in hexadecimal represents itself; no - leading 0x, for example b302. -#. /dev/<disk_name> represents the device number of disk +1. /dev/<disk_name> represents the device number of disk #. /dev/<disk_name><decimal> represents the device number of partition - device number of disk plus the partition number #. /dev/<disk_name>p<decimal> - same as the above; this form is used when disk name of partitioned disk ends with a digit. + +When pstore/blk is built into the kernel, "blkdev" accepts the following variants: + +#. <hex_major><hex_minor> device number in hexadecimal representation, + with no leading 0x, for example b302. #. PARTUUID=00112233-4455-6677-8899-AABBCCDDEEFF represents the unique id of a partition if the partition table provides it. The UUID may be either an EFI/GPT UUID, or refer to an MSDOS partition using the format SSSSSSSS-PP, @@ -227,8 +230,5 @@ For developer reference, here are all the important structures and APIs: .. kernel-doc:: include/linux/pstore_zone.h :internal: -.. kernel-doc:: fs/pstore/blk.c - :internal: - .. kernel-doc:: include/linux/pstore_blk.h :internal: diff --git a/Documentation/admin-guide/reporting-issues.rst b/Documentation/admin-guide/reporting-issues.rst index 18d8e25ba9df..d7ac13f789cc 100644 --- a/Documentation/admin-guide/reporting-issues.rst +++ b/Documentation/admin-guide/reporting-issues.rst @@ -1248,7 +1248,7 @@ paragraph makes the severeness obvious. In case you performed a successful bisection, use the title of the change that introduced the regression as the second part of your subject. Make the report -also mention the commit id of the culprit. In case of an unsuccessful bisection, +also mention the commit id of the culprit. In case of an unsuccessful bisection, make your report mention the latest tested version that's working fine (say 5.7) and the oldest where the issue occurs (say 5.8-rc1). diff --git a/Documentation/admin-guide/sysctl/abi.rst b/Documentation/admin-guide/sysctl/abi.rst index 77b1d1b2ad42..4e6db0a2a4c0 100644 --- a/Documentation/admin-guide/sysctl/abi.rst +++ b/Documentation/admin-guide/sysctl/abi.rst @@ -11,7 +11,7 @@ Documentation for /proc/sys/abi/ Copyright (c) 2020, Stephen Kitt -For general info, see :doc:`index`. +For general info, see Documentation/admin-guide/sysctl/index.rst. ------------------------------------------------------------------------------ diff --git a/Documentation/admin-guide/sysctl/kernel.rst b/Documentation/admin-guide/sysctl/kernel.rst index 68b21395a743..426162009ce9 100644 --- a/Documentation/admin-guide/sysctl/kernel.rst +++ b/Documentation/admin-guide/sysctl/kernel.rst @@ -9,7 +9,8 @@ Copyright (c) 1998, 1999, Rik van Riel <riel@nl.linux.org> Copyright (c) 2009, Shen Feng<shen@cn.fujitsu.com> -For general info and legal blurb, please look in :doc:`index`. +For general info and legal blurb, please look in +Documentation/admin-guide/sysctl/index.rst. ------------------------------------------------------------------------------ @@ -54,7 +55,7 @@ free space valid for 30 seconds. acpi_video_flags ================ -See :doc:`/power/video`. This allows the video resume mode to be set, +See Documentation/power/video.rst. 
 in a similar fashion to the ``acpi_sleep`` kernel parameter, by combining
 the following values:
@@ -89,7 +90,7 @@ is 0x15 and the full version number is 0x234, this file will contain
 the value 340 = 0x154.
 
 See the ``type_of_loader`` and ``ext_loader_type`` fields in
-:doc:`/x86/boot` for additional information.
+Documentation/x86/boot.rst for additional information.
 
 
 bootloader_version (x86 only)
@@ -99,7 +100,7 @@ The complete bootloader version number. In the example above, this file
 will contain the value 564 = 0x234.
 
 See the ``type_of_loader`` and ``ext_loader_ver`` fields in
-:doc:`/x86/boot` for additional information.
+Documentation/x86/boot.rst for additional information.
 
 
 bpf_stats_enabled
@@ -269,7 +270,7 @@ see the ``hostname(1)`` man page.
 firmware_config
 ===============
 
-See :doc:`/driver-api/firmware/fallback-mechanisms`.
+See Documentation/driver-api/firmware/fallback-mechanisms.rst.
 
 The entries in this directory allow the firmware loader helper fallback to
 be controlled:
@@ -297,7 +298,7 @@ crashes and outputting them to a serial console.
 ftrace_enabled, stack_tracer_enabled
 ====================================
 
-See :doc:`/trace/ftrace`.
+See Documentation/trace/ftrace.rst.
 
 
 hardlockup_all_cpu_backtrace
@@ -325,7 +326,7 @@ when a hard lockup is detected.
 1 Panic on hard lockup.
 = ===========================
 
-See :doc:`/admin-guide/lockup-watchdogs` for more information.
+See Documentation/admin-guide/lockup-watchdogs.rst for more information.
 This can also be set using the nmi_watchdog kernel parameter.
 
 
@@ -333,7 +334,12 @@ hotplug
 =======
 
 Path for the hotplug policy agent.
-Default value is "``/sbin/hotplug``".
+Default value is ``CONFIG_UEVENT_HELPER_PATH``, which in turn defaults
+to the empty string.
+
+This file only exists when ``CONFIG_UEVENT_HELPER`` is enabled. Most
+modern systems rely exclusively on the netlink-based uevent source and
+don't need this.
 
 
 hung_task_all_cpu_backtrace
@@ -582,7 +588,8 @@ in a KVM virtual machine. This default can be overridden by adding::
 
    nmi_watchdog=1
 
-to the guest kernel command line (see :doc:`/admin-guide/kernel-parameters`).
+to the guest kernel command line (see
+Documentation/admin-guide/kernel-parameters.rst).
 
 
 numa_balancing
@@ -1067,7 +1074,7 @@ that support this feature.
 real-root-dev
 =============
 
-See :doc:`/admin-guide/initrd`.
+See Documentation/admin-guide/initrd.rst.
 
 
 reboot-cmd (SPARC only)
@@ -1088,6 +1095,13 @@ Model available). If your platform happens to meet the requirements for EAS
 but you do not want to use it, change this value to 0.
 
+task_delayacct
+===============
+
+Enables/disables task delay accounting (see
+Documentation/accounting/delay-accounting.rst). Enabling this feature incurs
+a small amount of overhead in the scheduler but is useful for debugging
+and performance tuning. It is required by some tools such as iotop.
 
 sched_schedstats
 ================
@@ -1154,7 +1168,7 @@ will take effect.
 seccomp
 =======
 
-See :doc:`/userspace-api/seccomp_filter`.
+See Documentation/userspace-api/seccomp_filter.rst.
 
 
 sg-big-buff
@@ -1283,11 +1297,11 @@ This parameter can be used to control the soft lockup detector.
 = =================================
 
 The soft lockup detector monitors CPUs for threads that are hogging the CPUs
-without rescheduling voluntarily, and thus prevent the 'watchdog/N' threads
-from running. The mechanism depends on the CPUs ability to respond to timer
-interrupts which are needed for the 'watchdog/N' threads to be woken up by
-the watchdog timer function, otherwise the NMI watchdog — if enabled — can
-detect a hard lockup condition.
+without rescheduling voluntarily, and thus prevent the 'migration/N' threads
+from running, causing the watchdog work to fail to execute. The mechanism
+depends on the CPUs ability to respond to timer interrupts which are needed
+for the watchdog work to be queued by the watchdog timer function, otherwise
+the NMI watchdog — if enabled — can detect a hard lockup condition.
 
 
 stack_erasing
@@ -1325,7 +1339,7 @@ the boot PROM.
 sysrq
 =====
 
-See :doc:`/admin-guide/sysrq`.
+See Documentation/admin-guide/sysrq.rst.
 
 
 tainted
@@ -1355,15 +1369,16 @@ ORed together. The letters are seen in "Tainted" line of Oops reports.
 131072  `(T)`  The kernel was built with the struct randomization plugin
 ======  =====  ==============================================================
 
-See :doc:`/admin-guide/tainted-kernels` for more information.
+See Documentation/admin-guide/tainted-kernels.rst for more information.
 
 Note: writes to this sysctl interface will fail with ``EINVAL`` if the kernel
 is booted with the command line option ``panic_on_taint=<bitmask>,nousertaint``
 and any of the ORed together values being written to ``tainted`` match with
 the bitmask declared on panic_on_taint.
- See :doc:`/admin-guide/kernel-parameters` for more details on that particular
- kernel command line option and its optional ``nousertaint`` switch.
+ See Documentation/admin-guide/kernel-parameters.rst for more details on
+ that particular kernel command line option and its optional
+ ``nousertaint`` switch.
 
 
 threads-max
 ===========
@@ -1387,7 +1402,7 @@ If a value outside of this range is written to ``threads-max`` an
 traceoff_on_warning
 ===================
 
-When set, disables tracing (see :doc:`/trace/ftrace`) when a
+When set, disables tracing (see Documentation/trace/ftrace.rst) when a
 ``WARN()`` is hit.
 
 
@@ -1407,8 +1422,8 @@ will send them to printk() again.
 
 This only works if the kernel was booted with ``tp_printk`` enabled.
 
-See :doc:`/admin-guide/kernel-parameters` and
-:doc:`/trace/boottime-trace`.
+See Documentation/admin-guide/kernel-parameters.rst and
+Documentation/trace/boottime-trace.rst.
 
 .. _unaligned-dump-stack:
diff --git a/Documentation/admin-guide/sysctl/vm.rst b/Documentation/admin-guide/sysctl/vm.rst
index 586cd4b86428..8387ad0b0b83 100644
--- a/Documentation/admin-guide/sysctl/vm.rst
+++ b/Documentation/admin-guide/sysctl/vm.rst
@@ -64,7 +64,7 @@ Currently, these files are in /proc/sys/vm:
 - overcommit_ratio
 - page-cluster
 - panic_on_oom
-- percpu_pagelist_fraction
+- percpu_pagelist_high_fraction
 - stat_interval
 - stat_refresh
 - numa_stat
@@ -790,22 +790,24 @@ panic_on_oom=2+kdump gives you very strong tool to investigate why oom happens.
 You can get snapshot.
 
 
-percpu_pagelist_fraction
-========================
+percpu_pagelist_high_fraction
+=============================
 
-This is the fraction of pages at most (high mark pcp->high) in each zone that
-are allocated for each per cpu page list. The min value for this is 8. It
-means that we don't allow more than 1/8th of pages in each zone to be
-allocated in any single per_cpu_pagelist. This entry only changes the value
-of hot per cpu pagelists. User can specify a number like 100 to allocate
-1/100th of each zone to each per cpu page list.
+This is the fraction of pages in each zone that can be stored to
+per-cpu page lists. It is an upper boundary that is divided depending
+on the number of online CPUs. The min value for this is 8 which means
+that we do not allow more than 1/8th of pages in each zone to be stored
+on per-cpu page lists. This entry only changes the value of hot per-cpu
+page lists. A user can specify a number like 100 to allocate 1/100th of
+each zone between per-cpu lists.
 
-The batch value of each per cpu pagelist is also updated as a result. It is
-set to pcp->high/4. The upper limit of batch is (PAGE_SHIFT * 8)
+The batch value of each per-cpu page list remains the same regardless of
+the value of the high fraction so allocation latencies are unaffected.
 
-The initial value is zero. Kernel does not use this value at boot time to set
-the high water marks for each per cpu page list. If the user writes '0' to this
-sysctl, it will revert to this default behavior.
+The initial value is zero. Kernel uses this value to set the pcp->high
+mark based on the low watermark for the zone and the number of local
+online CPUs. If the user writes '0' to this sysctl, it will revert to
+this default behavior.
 
 
 stat_interval
@@ -936,12 +938,12 @@ allocations, THP and hugetlbfs pages.
 
 To make it sensible with respect to the watermark_scale_factor parameter,
 the unit is in fractions of 10,000. The default value of
-15,000 on !DISCONTIGMEM configurations means that up to 150% of the high
-watermark will be reclaimed in the event of a pageblock being mixed due
-to fragmentation. The level of reclaim is determined by the number of
-fragmentation events that occurred in the recent past. If this value is
-smaller than a pageblock then a pageblocks worth of pages will be reclaimed
-(e.g. 2MB on 64-bit x86). A boost factor of 0 will disable the feature.
+15,000 means that up to 150% of the high watermark will be reclaimed in the
+event of a pageblock being mixed due to fragmentation. The level of reclaim
+is determined by the number of fragmentation events that occurred in the
+recent past. If this value is smaller than a pageblock then a pageblocks
+worth of pages will be reclaimed (e.g. 2MB on 64-bit x86). A boost factor
+of 0 will disable the feature.
 
 
 watermark_scale_factor
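The vm.rst hunk above renames ``percpu_pagelist_fraction`` to
``percpu_pagelist_high_fraction`` and changes its semantics. Below is a hedged
sketch of exercising the renamed sysctl from userspace; the /proc/sys path
follows the standard sysctl layout and the value 100 mirrors the example in
the documentation text, but the snippet itself is not part of the patch::

  /*
   * Hedged sketch: cap per-CPU page lists at 1/100th of each zone, then
   * read the value back. Writing "0" restores the default behaviour.
   * Needs root and a kernel that exposes this sysctl.
   */
  #include <stdio.h>

  int main(void)
  {
          const char *path = "/proc/sys/vm/percpu_pagelist_high_fraction";
          FILE *f = fopen(path, "w");
          char buf[16];

          if (!f) {
                  perror(path);
                  return 1;
          }
          fputs("100\n", f);
          if (fclose(f) != 0) {   /* write errors surface on close for sysctls */
                  perror(path);
                  return 1;
          }

          f = fopen(path, "r");
          if (f && fgets(buf, sizeof(buf), f))
                  printf("percpu_pagelist_high_fraction = %s", buf);
          if (f)
                  fclose(f);
          return 0;
  }

Equivalently, ``sysctl vm.percpu_pagelist_high_fraction=100`` from a root
shell should have the same effect on kernels that provide this file.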