diff options
Diffstat (limited to 'Documentation/admin-guide')
44 files changed, 1771 insertions, 162 deletions
diff --git a/Documentation/admin-guide/LSM/SELinux.rst b/Documentation/admin-guide/LSM/SELinux.rst index 520a1c2c6fd2..cdd65164ca96 100644 --- a/Documentation/admin-guide/LSM/SELinux.rst +++ b/Documentation/admin-guide/LSM/SELinux.rst @@ -2,6 +2,17 @@ SELinux ======= +Information about the SELinux kernel subsystem can be found at the +following links: + + https://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/selinux.git/tree/README.md + + https://github.com/selinuxproject/selinux-kernel/wiki + +Information about the SELinux userspace can be found at: + + https://github.com/SELinuxProject/selinux/wiki + If you want to use SELinux, chances are you will want to use the distro-provided policies, or install the latest reference policy release from diff --git a/Documentation/admin-guide/LSM/ipe.rst b/Documentation/admin-guide/LSM/ipe.rst index f93a467db628..dc7088451f9d 100644 --- a/Documentation/admin-guide/LSM/ipe.rst +++ b/Documentation/admin-guide/LSM/ipe.rst @@ -423,7 +423,7 @@ Field descriptions: Event Example:: - type=1422 audit(1653425529.927:53): policy_name="boot_verified" policy_version=0.0.0 policy_digest=sha256:820EEA5B40CA42B51F68962354BA083122A20BB846F26765076DD8EED7B8F4DB auid=4294967295 ses=4294967295 lsm=ipe res=1 + type=1422 audit(1653425529.927:53): policy_name="boot_verified" policy_version=0.0.0 policy_digest=sha256:820EEA5B40CA42B51F68962354BA083122A20BB846F26765076DD8EED7B8F4DB auid=4294967295 ses=4294967295 lsm=ipe res=1 errno=0 type=1300 audit(1653425529.927:53): arch=c000003e syscall=1 success=yes exit=2567 a0=3 a1=5596fcae1fb0 a2=a07 a3=2 items=0 ppid=184 pid=229 auid=4294967295 uid=0 gid=0 euid=0 suid=0 fsuid=0 egid=0 sgid=0 fsgid=0 tty=pts0 ses=4294967295 comm="python3" exe="/usr/bin/python3.10" key=(null) type=1327 audit(1653425529.927:53): PROCTITLE proctitle=707974686F6E3300746573742F6D61696E2E7079002D66002E2E @@ -433,24 +433,55 @@ This record will always be emitted in conjunction with a ``AUDITSYSCALL`` record Field descriptions: -+----------------+------------+-----------+---------------------------------------------------+ -| Field | Value Type | Optional? | Description of Value | -+================+============+===========+===================================================+ -| policy_name | string | No | The policy_name | -+----------------+------------+-----------+---------------------------------------------------+ -| policy_version | string | No | The policy_version | -+----------------+------------+-----------+---------------------------------------------------+ -| policy_digest | string | No | The policy hash | -+----------------+------------+-----------+---------------------------------------------------+ -| auid | integer | No | The login user ID | -+----------------+------------+-----------+---------------------------------------------------+ -| ses | integer | No | The login session ID | -+----------------+------------+-----------+---------------------------------------------------+ -| lsm | string | No | The lsm name associated with the event | -+----------------+------------+-----------+---------------------------------------------------+ -| res | integer | No | The result of the audited operation(success/fail) | -+----------------+------------+-----------+---------------------------------------------------+ - ++----------------+------------+-----------+-------------------------------------------------------------+ +| Field | Value Type | Optional? | Description of Value | ++================+============+===========+=============================================================+ +| policy_name | string | Yes | The policy_name | ++----------------+------------+-----------+-------------------------------------------------------------+ +| policy_version | string | Yes | The policy_version | ++----------------+------------+-----------+-------------------------------------------------------------+ +| policy_digest | string | Yes | The policy hash | ++----------------+------------+-----------+-------------------------------------------------------------+ +| auid | integer | No | The login user ID | ++----------------+------------+-----------+-------------------------------------------------------------+ +| ses | integer | No | The login session ID | ++----------------+------------+-----------+-------------------------------------------------------------+ +| lsm | string | No | The lsm name associated with the event | ++----------------+------------+-----------+-------------------------------------------------------------+ +| res | integer | No | The result of the audited operation(success/fail) | ++----------------+------------+-----------+-------------------------------------------------------------+ +| errno | integer | No | Error code from policy loading operations (see table below) | ++----------------+------------+-----------+-------------------------------------------------------------+ + +Policy error codes (errno): + +The following table lists the error codes that may appear in the errno field while loading or updating the policy: + ++----------------+--------------------------------------------------------+ +| Error Code | Description | ++================+========================================================+ +| 0 | Success | ++----------------+--------------------------------------------------------+ +| -EPERM | Insufficient permission | ++----------------+--------------------------------------------------------+ +| -EEXIST | Same name policy already deployed | ++----------------+--------------------------------------------------------+ +| -EBADMSG | Policy is invalid | ++----------------+--------------------------------------------------------+ +| -ENOMEM | Out of memory (OOM) | ++----------------+--------------------------------------------------------+ +| -ERANGE | Policy version number overflow | ++----------------+--------------------------------------------------------+ +| -EINVAL | Policy version parsing error | ++----------------+--------------------------------------------------------+ +| -ENOKEY | Key used to sign the IPE policy not found in keyring | ++----------------+--------------------------------------------------------+ +| -EKEYREJECTED | Policy signature verification failed | ++----------------+--------------------------------------------------------+ +| -ESTALE | Attempting to update an IPE policy with older version | ++----------------+--------------------------------------------------------+ +| -ENOENT | Policy was deleted while updating | ++----------------+--------------------------------------------------------+ 1404 AUDIT_MAC_STATUS ^^^^^^^^^^^^^^^^^^^^^ diff --git a/Documentation/admin-guide/README.rst b/Documentation/admin-guide/README.rst index 70b02f30013a..05301f03b717 100644 --- a/Documentation/admin-guide/README.rst +++ b/Documentation/admin-guide/README.rst @@ -259,7 +259,7 @@ Configuring the kernel Compiling the kernel -------------------- - - Make sure you have at least gcc 5.1 available. + - Make sure you have at least gcc 8.1 available. For more information, refer to :ref:`Documentation/process/changes.rst <changes>`. - Do a ``make`` to create a compressed kernel image. It is also possible to do diff --git a/Documentation/admin-guide/blockdev/zram.rst b/Documentation/admin-guide/blockdev/zram.rst index 9bdb30901a93..3e273c1bb749 100644 --- a/Documentation/admin-guide/blockdev/zram.rst +++ b/Documentation/admin-guide/blockdev/zram.rst @@ -317,6 +317,26 @@ a single line of text and contains the following stats separated by whitespace: Optional Feature ================ +IDLE pages tracking +------------------- + +zram has built-in support for idle pages tracking (that is, allocated but +not used pages). This feature is useful for e.g. zram writeback and +recompression. In order to mark pages as idle, execute the following command:: + + echo all > /sys/block/zramX/idle + +This will mark all allocated zram pages as idle. The idle mark will be +removed only when the page (block) is accessed (e.g. overwritten or freed). +Additionally, when CONFIG_ZRAM_TRACK_ENTRY_ACTIME is enabled, pages can be +marked as idle based on how many seconds have passed since the last access to +a particular zram page:: + + echo 86400 > /sys/block/zramX/idle + +In this example, all pages which haven't been accessed in more than 86400 +seconds (one day) will be marked idle. + writeback --------- @@ -331,24 +351,7 @@ If admin wants to use incompressible page writeback, they could do it via:: echo huge > /sys/block/zramX/writeback -To use idle page writeback, first, user need to declare zram pages -as idle:: - - echo all > /sys/block/zramX/idle - -From now on, any pages on zram are idle pages. The idle mark -will be removed until someone requests access of the block. -IOW, unless there is access request, those pages are still idle pages. -Additionally, when CONFIG_ZRAM_TRACK_ENTRY_ACTIME is enabled pages can be -marked as idle based on how long (in seconds) it's been since they were -last accessed:: - - echo 86400 > /sys/block/zramX/idle - -In this example all pages which haven't been accessed in more than 86400 -seconds (one day) will be marked idle. - -Admin can request writeback of those idle pages at right timing via:: +Admin can request writeback of idle pages at right timing via:: echo idle > /sys/block/zramX/writeback @@ -369,6 +372,23 @@ they could write a page index into the interface:: echo "page_index=1251" > /sys/block/zramX/writeback +In Linux 6.16 this interface underwent some rework. First, the interface +now supports `key=value` format for all of its parameters (`type=huge_idle`, +etc.) Second, the support for `page_indexes` was introduced, which specify +`LOW-HIGH` range (or ranges) of pages to be written-back. This reduces the +number of syscalls, but more importantly this enables optimal post-processing +target selection strategy. Usage example:: + + echo "type=idle" > /sys/block/zramX/writeback + echo "page_indexes=1-100 page_indexes=200-300" > \ + /sys/block/zramX/writeback + +We also now permit multiple page_index params per call and a mix of +single pages and page ranges:: + + echo page_index=42 page_index=99 page_indexes=100-200 \ + page_indexes=500-700 > /sys/block/zramX/writeback + If there are lots of write IO with flash device, potentially, it has flash wearout problem so that admin needs to design write limitation to guarantee storage health for entire product life. @@ -482,8 +502,6 @@ attempt to recompress::: echo "type=huge_idle max_pages=42" > /sys/block/zramX/recompress -Recompression of idle pages requires memory tracking. - During re-compression for every page, that matches re-compression criteria, ZRAM iterates the list of registered alternative compression algorithms in order of their priorities. ZRAM stops either when re-compression was diff --git a/Documentation/admin-guide/bug-hunting.rst b/Documentation/admin-guide/bug-hunting.rst index ce6f4e8ca487..30858757c9f2 100644 --- a/Documentation/admin-guide/bug-hunting.rst +++ b/Documentation/admin-guide/bug-hunting.rst @@ -196,7 +196,7 @@ will see the assembler code for the routine shown, but if your kernel has debug symbols the C code will also be available. (Debug symbols can be enabled in the kernel hacking menu of the menu configuration.) For example:: - $ objdump -r -S -l --disassemble net/dccp/ipv4.o + $ objdump -r -S -l --disassemble net/ipv4/tcp.o .. note:: diff --git a/Documentation/admin-guide/cgroup-v1/cgroups.rst b/Documentation/admin-guide/cgroup-v1/cgroups.rst index a3e2edb3d274..463f98453323 100644 --- a/Documentation/admin-guide/cgroup-v1/cgroups.rst +++ b/Documentation/admin-guide/cgroup-v1/cgroups.rst @@ -13,7 +13,7 @@ Portions Copyright (c) 2004-2006 Silicon Graphics, Inc. Modified by Paul Jackson <pj@sgi.com> -Modified by Christoph Lameter <cl@linux.com> +Modified by Christoph Lameter <cl@gentwo.org> .. CONTENTS: diff --git a/Documentation/admin-guide/cgroup-v1/cpusets.rst b/Documentation/admin-guide/cgroup-v1/cpusets.rst index f401af5e2f09..c7909e5ac136 100644 --- a/Documentation/admin-guide/cgroup-v1/cpusets.rst +++ b/Documentation/admin-guide/cgroup-v1/cpusets.rst @@ -10,7 +10,7 @@ Written by Simon.Derr@bull.net - Portions Copyright (c) 2004-2006 Silicon Graphics, Inc. - Modified by Paul Jackson <pj@sgi.com> -- Modified by Christoph Lameter <cl@linux.com> +- Modified by Christoph Lameter <cl@gentwo.org> - Modified by Paul Menage <menage@google.com> - Modified by Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com> diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst index 9e7de8e70048..bd98ea3175ec 100644 --- a/Documentation/admin-guide/cgroup-v2.rst +++ b/Documentation/admin-guide/cgroup-v2.rst @@ -1076,7 +1076,7 @@ cpufreq governor about the minimum desired frequency which should always be provided by a CPU, as well as the maximum desired frequency, which should not be exceeded by a CPU. -WARNING: cgroup2 cpu controller doesn't yet fully support the control of +WARNING: cgroup2 cpu controller doesn't yet support the (bandwidth) control of realtime processes. For a kernel built with the CONFIG_RT_GROUP_SCHED option enabled for group scheduling of realtime processes, the cpu controller can only be enabled when all RT processes are in the root cgroup. Be aware that system @@ -1095,19 +1095,34 @@ realtime processes irrespective of CONFIG_RT_GROUP_SCHED. CPU Interface Files ~~~~~~~~~~~~~~~~~~~ -All time durations are in microseconds. +The interaction of a process with the cpu controller depends on its scheduling +policy and the underlying scheduler. From the point of view of the cpu controller, +processes can be categorized as follows: + +* Processes under the fair-class scheduler +* Processes under a BPF scheduler with the ``cgroup_set_weight`` callback +* Everything else: ``SCHED_{FIFO,RR,DEADLINE}`` and processes under a BPF scheduler + without the ``cgroup_set_weight`` callback + +For details on when a process is under the fair-class scheduler or a BPF scheduler, +check out :ref:`Documentation/scheduler/sched-ext.rst <sched-ext>`. + +For each of the following interface files, the above categories +will be referred to. All time durations are in microseconds. cpu.stat A read-only flat-keyed file. This file exists whether the controller is enabled or not. - It always reports the following three stats: + It always reports the following three stats, which account for all the + processes in the cgroup: - usage_usec - user_usec - system_usec - and the following five when the controller is enabled: + and the following five when the controller is enabled, which account for + only the processes under the fair-class scheduler: - nr_periods - nr_throttled @@ -1125,6 +1140,10 @@ All time durations are in microseconds. If the cgroup has been configured to be SCHED_IDLE (cpu.idle = 1), then the weight will show as a 0. + This file affects only processes under the fair-class scheduler and a BPF + scheduler with the ``cgroup_set_weight`` callback depending on what the + callback actually does. + cpu.weight.nice A read-write single value file which exists on non-root cgroups. The default is "0". @@ -1137,6 +1156,10 @@ All time durations are in microseconds. granularity is coarser for the nice values, the read value is the closest approximation of the current weight. + This file affects only processes under the fair-class scheduler and a BPF + scheduler with the ``cgroup_set_weight`` callback depending on what the + callback actually does. + cpu.max A read-write two value file which exists on non-root cgroups. The default is "max 100000". @@ -1149,43 +1172,55 @@ All time durations are in microseconds. $PERIOD duration. "max" for $MAX indicates no limit. If only one number is written, $MAX is updated. + This file affects only processes under the fair-class scheduler. + cpu.max.burst A read-write single value file which exists on non-root cgroups. The default is "0". The burst in the range [0, $MAX]. + This file affects only processes under the fair-class scheduler. + cpu.pressure A read-write nested-keyed file. Shows pressure stall information for CPU. See :ref:`Documentation/accounting/psi.rst <psi>` for details. + This file accounts for all the processes in the cgroup. + cpu.uclamp.min - A read-write single value file which exists on non-root cgroups. - The default is "0", i.e. no utilization boosting. + A read-write single value file which exists on non-root cgroups. + The default is "0", i.e. no utilization boosting. - The requested minimum utilization (protection) as a percentage - rational number, e.g. 12.34 for 12.34%. + The requested minimum utilization (protection) as a percentage + rational number, e.g. 12.34 for 12.34%. - This interface allows reading and setting minimum utilization clamp - values similar to the sched_setattr(2). This minimum utilization - value is used to clamp the task specific minimum utilization clamp. + This interface allows reading and setting minimum utilization clamp + values similar to the sched_setattr(2). This minimum utilization + value is used to clamp the task specific minimum utilization clamp, + including those of realtime processes. - The requested minimum utilization (protection) is always capped by - the current value for the maximum utilization (limit), i.e. - `cpu.uclamp.max`. + The requested minimum utilization (protection) is always capped by + the current value for the maximum utilization (limit), i.e. + `cpu.uclamp.max`. + + This file affects all the processes in the cgroup. cpu.uclamp.max - A read-write single value file which exists on non-root cgroups. - The default is "max". i.e. no utilization capping + A read-write single value file which exists on non-root cgroups. + The default is "max". i.e. no utilization capping + + The requested maximum utilization (limit) as a percentage rational + number, e.g. 98.76 for 98.76%. - The requested maximum utilization (limit) as a percentage rational - number, e.g. 98.76 for 98.76%. + This interface allows reading and setting maximum utilization clamp + values similar to the sched_setattr(2). This maximum utilization + value is used to clamp the task specific maximum utilization clamp, + including those of realtime processes. - This interface allows reading and setting maximum utilization clamp - values similar to the sched_setattr(2). This maximum utilization - value is used to clamp the task specific maximum utilization clamp. + This file affects all the processes in the cgroup. cpu.idle A read-write single value file which exists on non-root cgroups. @@ -1197,7 +1232,7 @@ All time durations are in microseconds. own relative priorities, but the cgroup itself will be treated as very low priority relative to its peers. - + This file affects only processes under the fair-class scheduler. Memory ------ @@ -1299,6 +1334,18 @@ PAGE_SIZE multiple when read back. monitors the limited cgroup to alleviate heavy reclaim pressure. + If memory.high is opened with O_NONBLOCK then the synchronous + reclaim is bypassed. This is useful for admin processes that + need to dynamically adjust the job's memory limits without + expending their own CPU resources on memory reclamation. The + job will trigger the reclaim and/or get throttled on its + next charge request. + + Please note that with O_NONBLOCK, there is a chance that the + target memory cgroup may take indefinite amount of time to + reduce usage below the limit due to delayed charge request or + busy-hitting its memory to slow down reclaim. + memory.max A read-write single value file which exists on non-root cgroups. The default is "max". @@ -1316,6 +1363,18 @@ PAGE_SIZE multiple when read back. Caller could retry them differently, return into userspace as -ENOMEM or silently ignore in cases like disk readahead. + If memory.max is opened with O_NONBLOCK, then the synchronous + reclaim and oom-kill are bypassed. This is useful for admin + processes that need to dynamically adjust the job's memory limits + without expending their own CPU resources on memory reclamation. + The job will trigger the reclaim and/or oom-kill on its next + charge request. + + Please note that with O_NONBLOCK, there is a chance that the + target memory cgroup may take indefinite amount of time to + reduce usage below the limit due to delayed charge request or + busy-hitting its memory to slow down reclaim. + memory.reclaim A write-only nested-keyed file which exists for all cgroups. @@ -1348,6 +1407,9 @@ The following nested keys are defined. same semantics as vm.swappiness applied to memcg reclaim with all the existing limitations and potential future extensions. + The valid range for swappiness is [0-200, max], setting + swappiness=max exclusively reclaims anonymous memory. + memory.peak A read-write single value file which exists on non-root cgroups. diff --git a/Documentation/admin-guide/cifs/usage.rst b/Documentation/admin-guide/cifs/usage.rst index c09674a75a9e..d989ae5778ba 100644 --- a/Documentation/admin-guide/cifs/usage.rst +++ b/Documentation/admin-guide/cifs/usage.rst @@ -270,6 +270,8 @@ configured for Unix Extensions (and the client has not disabled illegal Windows/NTFS/SMB characters to a remap range (this mount parameter is the default for SMB3). This remap (``mapposix``) range is also compatible with Mac (and "Services for Mac" on some older Windows). +When POSIX Extensions for SMB 3.1.1 are negotiated, remapping is automatically +disabled. CIFS VFS Mount Options ====================== diff --git a/Documentation/admin-guide/gpio/gpio-aggregator.rst b/Documentation/admin-guide/gpio/gpio-aggregator.rst index 5cd1e7221756..8374a9df9105 100644 --- a/Documentation/admin-guide/gpio/gpio-aggregator.rst +++ b/Documentation/admin-guide/gpio/gpio-aggregator.rst @@ -69,6 +69,113 @@ write-only attribute files in sysfs. $ echo gpio-aggregator.0 > delete_device +Aggregating GPIOs using Configfs +-------------------------------- + +**Group:** ``/config/gpio-aggregator`` + + This is the root directory of the gpio-aggregator configfs tree. + +**Group:** ``/config/gpio-aggregator/<example-name>`` + + This directory represents a GPIO aggregator device. You can assign any + name to ``<example-name>`` (e.g. ``agg0``), except names starting with + ``_sysfs`` prefix, which are reserved for auto-generated configfs + entries corresponding to devices created via Sysfs. + +**Attribute:** ``/config/gpio-aggregator/<example-name>/live`` + + The ``live`` attribute allows to trigger the actual creation of the device + once it's fully configured. Accepted values are: + + * ``1``, ``yes``, ``true`` : enable the virtual device + * ``0``, ``no``, ``false`` : disable the virtual device + +**Attribute:** ``/config/gpio-aggregator/<example-name>/dev_name`` + + The read-only ``dev_name`` attribute exposes the name of the device as it + will appear in the system on the platform bus (e.g. ``gpio-aggregator.0``). + This is useful for identifying a character device for the newly created + aggregator. If it's ``gpio-aggregator.0``, + ``/sys/devices/platform/gpio-aggregator.0/gpiochipX`` path tells you that the + GPIO device id is ``X``. + +You must create subdirectories for each virtual line you want to +instantiate, named exactly as ``line0``, ``line1``, ..., ``lineY``, when +you want to instantiate ``Y+1`` (Y >= 0) lines. Configure all lines before +activating the device by setting ``live`` to 1. + +**Group:** ``/config/gpio-aggregator/<example-name>/<lineY>/`` + + This directory represents a GPIO line to include in the aggregator. + +**Attribute:** ``/config/gpio-aggregator/<example-name>/<lineY>/key`` + +**Attribute:** ``/config/gpio-aggregator/<example-name>/<lineY>/offset`` + + The default values after creating the ``<lineY>`` directory are: + + * ``key`` : <empty> + * ``offset`` : -1 + + ``key`` must always be explicitly configured, while ``offset`` depends. + Two configuration patterns exist for each ``<lineY>``: + + (a). For lookup by GPIO line name: + + * Set ``key`` to the line name. + * Ensure ``offset`` remains -1 (the default). + + (b). For lookup by GPIO chip name and the line offset within the chip: + + * Set ``key`` to the chip name. + * Set ``offset`` to the line offset (0 <= ``offset`` < 65535). + +**Attribute:** ``/config/gpio-aggregator/<example-name>/<lineY>/name`` + + The ``name`` attribute sets a custom name for lineY. If left unset, the + line will remain unnamed. + +Once the configuration is done, the ``'live'`` attribute must be set to 1 +in order to instantiate the aggregator device. It can be set back to 0 to +destroy the virtual device. The module will synchronously wait for the new +aggregator device to be successfully probed and if this doesn't happen, writing +to ``'live'`` will result in an error. This is a different behaviour from the +case when you create it using sysfs ``new_device`` interface. + +.. note:: + + For aggregators created via Sysfs, the configfs entries are + auto-generated and appear as ``/config/gpio-aggregator/_sysfs.<N>/``. You + cannot add or remove line directories with mkdir(2)/rmdir(2). To modify + lines, you must use the "delete_device" interface to tear down the + existing device and reconfigure it from scratch. However, you can still + toggle the aggregator with the ``live`` attribute and adjust the + ``key``, ``offset``, and ``name`` attributes for each line when ``live`` + is set to 0 by hand (i.e. it's not waiting for deferred probe). + +Sample configuration commands +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +.. code-block:: sh + + # Create a directory for an aggregator device + $ mkdir /sys/kernel/config/gpio-aggregator/agg0 + + # Configure each line + $ mkdir /sys/kernel/config/gpio-aggregator/agg0/line0 + $ echo gpiochip0 > /sys/kernel/config/gpio-aggregator/agg0/line0/key + $ echo 6 > /sys/kernel/config/gpio-aggregator/agg0/line0/offset + $ echo test0 > /sys/kernel/config/gpio-aggregator/agg0/line0/name + $ mkdir /sys/kernel/config/gpio-aggregator/agg0/line1 + $ echo gpiochip0 > /sys/kernel/config/gpio-aggregator/agg0/line1/key + $ echo 7 > /sys/kernel/config/gpio-aggregator/agg0/line1/offset + $ echo test1 > /sys/kernel/config/gpio-aggregator/agg0/line1/name + + # Activate the aggregator device + $ echo 1 > /sys/kernel/config/gpio-aggregator/agg0/live + + Generic GPIO Driver ------------------- diff --git a/Documentation/admin-guide/gpio/gpio-sim.rst b/Documentation/admin-guide/gpio/gpio-sim.rst index 35d49ccd49e0..f5135a14ef2e 100644 --- a/Documentation/admin-guide/gpio/gpio-sim.rst +++ b/Documentation/admin-guide/gpio/gpio-sim.rst @@ -50,8 +50,11 @@ the number of lines exposed by this bank. **Attribute:** ``/config/gpio-sim/gpio-device/gpio-bankX/lineY/name`` -This group represents a single line at the offset Y. The 'name' attribute -allows to set the line name as represented by the 'gpio-line-names' property. +**Attribute:** ``/config/gpio-sim/gpio-device/gpio-bankX/lineY/valid`` + +This group represents a single line at the offset Y. The ``valid`` attribute +indicates whether the line can be used as GPIO. The ``name`` attribute allows +to set the line name as represented by the 'gpio-line-names' property. **Item:** ``/config/gpio-sim/gpio-device/gpio-bankX/lineY/hog`` diff --git a/Documentation/admin-guide/hw-vuln/attack_vector_controls.rst b/Documentation/admin-guide/hw-vuln/attack_vector_controls.rst new file mode 100644 index 000000000000..b4de16f5ec44 --- /dev/null +++ b/Documentation/admin-guide/hw-vuln/attack_vector_controls.rst @@ -0,0 +1,238 @@ +.. SPDX-License-Identifier: GPL-2.0 + +Attack Vector Controls +====================== + +Attack vector controls provide a simple method to configure only the mitigations +for CPU vulnerabilities which are relevant given the intended use of a system. +Administrators are encouraged to consider which attack vectors are relevant and +disable all others in order to recoup system performance. + +When new relevant CPU vulnerabilities are found, they will be added to these +attack vector controls so administrators will likely not need to reconfigure +their command line parameters as mitigations will continue to be correctly +applied based on the chosen attack vector controls. + +Attack Vectors +-------------- + +There are 5 sets of attack-vector mitigations currently supported by the kernel: + +#. :ref:`user_kernel` +#. :ref:`user_user` +#. :ref:`guest_host` +#. :ref:`guest_guest` +#. :ref:`smt` + +To control the enabled attack vectors, see :ref:`cmdline`. + +.. _user_kernel: + +User-to-Kernel +^^^^^^^^^^^^^^ + +The user-to-kernel attack vector involves a malicious userspace program +attempting to leak kernel data into userspace by exploiting a CPU vulnerability. +The kernel data involved might be limited to certain kernel memory, or include +all memory in the system, depending on the vulnerability exploited. + +If no untrusted userspace applications are being run, such as with single-user +systems, consider disabling user-to-kernel mitigations. + +Note that the CPU vulnerabilities mitigated by Linux have generally not been +shown to be exploitable from browser-based sandboxes. User-to-kernel +mitigations are therefore mostly relevant if unknown userspace applications may +be run by untrusted users. + +*user-to-kernel mitigations are enabled by default* + +.. _user_user: + +User-to-User +^^^^^^^^^^^^ + +The user-to-user attack vector involves a malicious userspace program attempting +to influence the behavior of another unsuspecting userspace program in order to +exfiltrate data. The vulnerability of a userspace program is based on the +program itself and the interfaces it provides. + +If no untrusted userspace applications are being run, consider disabling +user-to-user mitigations. + +Note that because the Linux kernel contains a mapping of all physical memory, +preventing a malicious userspace program from leaking data from another +userspace program requires mitigating user-to-kernel attacks as well for +complete protection. + +*user-to-user mitigations are enabled by default* + +.. _guest_host: + +Guest-to-Host +^^^^^^^^^^^^^ + +The guest-to-host attack vector involves a malicious VM attempting to leak +hypervisor data into the VM. The data involved may be limited, or may +potentially include all memory in the system, depending on the vulnerability +exploited. + +If no untrusted VMs are being run, consider disabling guest-to-host mitigations. + +*guest-to-host mitigations are enabled by default if KVM support is present* + +.. _guest_guest: + +Guest-to-Guest +^^^^^^^^^^^^^^ + +The guest-to-guest attack vector involves a malicious VM attempting to influence +the behavior of another unsuspecting VM in order to exfiltrate data. The +vulnerability of a VM is based on the code inside the VM itself and the +interfaces it provides. + +If no untrusted VMs, or only a single VM is being run, consider disabling +guest-to-guest mitigations. + +Similar to the user-to-user attack vector, preventing a malicious VM from +leaking data from another VM requires mitigating guest-to-host attacks as well +due to the Linux kernel phys map. + +*guest-to-guest mitigations are enabled by default if KVM support is present* + +.. _smt: + +Cross-Thread +^^^^^^^^^^^^ + +The cross-thread attack vector involves a malicious userspace program or +malicious VM either observing or attempting to influence the behavior of code +running on the SMT sibling thread in order to exfiltrate data. + +Many cross-thread attacks can only be mitigated if SMT is disabled, which will +result in reduced CPU core count and reduced performance. + +If cross-thread mitigations are fully enabled ('auto,nosmt'), all mitigations +for cross-thread attacks will be enabled. SMT may be disabled depending on +which vulnerabilities are present in the CPU. + +If cross-thread mitigations are partially enabled ('auto'), mitigations for +cross-thread attacks will be enabled but SMT will not be disabled. + +If cross-thread mitigations are disabled, no mitigations for cross-thread +attacks will be enabled. + +Cross-thread mitigation may not be required if core-scheduling or similar +techniques are used to prevent untrusted workloads from running on SMT siblings. + +*cross-thread mitigations default to partially enabled* + +.. _cmdline: + +Command Line Controls +--------------------- + +Attack vectors are controlled through the mitigations= command line option. The +value provided begins with a global option and then may optionally include one +or more options to disable various attack vectors. + +Format: + | ``mitigations=[global]`` + | ``mitigations=[global],[attack vectors]`` + +Global options: + +============ ============================================================= +Option Description +============ ============================================================= +'off' All attack vectors disabled. +'auto' All attack vectors enabled, partial cross-thread mitigations. +'auto,nosmt' All attack vectors enabled, full cross-thread mitigations. +============ ============================================================= + +Attack vector options: + +================= ======================================= +Option Description +================= ======================================= +'no_user_kernel' Disables user-to-kernel mitigations. +'no_user_user' Disables user-to-user mitigations. +'no_guest_host' Disables guest-to-host mitigations. +'no_guest_guest' Disables guest-to-guest mitigations +'no_cross_thread' Disables all cross-thread mitigations. +================= ======================================= + +Multiple attack vector options may be specified in a comma-separated list. If +the global option is not specified, it defaults to 'auto'. The global option +'off' is equivalent to disabling all attack vectors. + +Examples: + | ``mitigations=auto,no_user_kernel`` + + Enable all attack vectors except user-to-kernel. Partial cross-thread + mitigations. + + | ``mitigations=auto,nosmt,no_guest_host,no_guest_guest`` + + Enable all attack vectors and cross-thread mitigations except for + guest-to-host and guest-to-guest mitigations. + + | ``mitigations=,no_cross_thread`` + + Enable all attack vectors but not cross-thread mitigations. + +Interactions with command-line options +-------------------------------------- + +Vulnerability-specific controls (e.g. "retbleed=off") take precedence over all +attack vector controls. Mitigations for individual vulnerabilities may be +turned on or off via their command-line options regardless of the attack vector +controls. + +Summary of attack-vector mitigations +------------------------------------ + +When a vulnerability is mitigated due to an attack-vector control, the default +mitigation option for that particular vulnerability is used. To use a different +mitigation, please use the vulnerability-specific command line option. + +The table below summarizes which vulnerabilities are mitigated when different +attack vectors are enabled and assuming the CPU is vulnerable. + +=============== ============== ============ ============= ============== ============ ======== +Vulnerability User-to-Kernel User-to-User Guest-to-Host Guest-to-Guest Cross-Thread Notes +=============== ============== ============ ============= ============== ============ ======== +BHI X X +ITS X X +GDS X X X X * (Note 1) +L1TF X X * (Note 2) +MDS X X X X * (Note 2) +MMIO X X X X * (Note 2) +Meltdown X +Retbleed X X * (Note 3) +RFDS X X X X +Spectre_v1 X +Spectre_v2 X X +Spectre_v2_user X X * (Note 1) +SRBDS X X X X +SRSO X X +SSB (Note 4) +TAA X X X X * (Note 2) +TSA X X X X +=============== ============== ============ ============= ============== ============ ======== + +Notes: + 1 -- Can be mitigated without disabling SMT. + + 2 -- Disables SMT if cross-thread mitigations are fully enabled and the CPU + is vulnerable + + 3 -- Disables SMT if cross-thread mitigations are fully enabled, the CPU is + vulnerable, and STIBP is not supported + + 4 -- Speculative store bypass is always enabled by default (no kernel + mitigation applied) unless overridden with spec_store_bypass_disable option + +When an attack-vector is disabled, all mitigations for the vulnerabilities +listed in the above table are disabled, unless mitigation is required for a +different enabled attack-vector or a mitigation is explicitly selected via a +vulnerability-specific command line option. diff --git a/Documentation/admin-guide/hw-vuln/index.rst b/Documentation/admin-guide/hw-vuln/index.rst index 09890a8f3ee9..89ca636081b7 100644 --- a/Documentation/admin-guide/hw-vuln/index.rst +++ b/Documentation/admin-guide/hw-vuln/index.rst @@ -9,6 +9,7 @@ are configurable at compile, boot or run time. .. toctree:: :maxdepth: 1 + attack_vector_controls spectre l1tf mds diff --git a/Documentation/admin-guide/hw-vuln/processor_mmio_stale_data.rst b/Documentation/admin-guide/hw-vuln/processor_mmio_stale_data.rst index 1302fd1b55e8..6dba18dbb9ab 100644 --- a/Documentation/admin-guide/hw-vuln/processor_mmio_stale_data.rst +++ b/Documentation/admin-guide/hw-vuln/processor_mmio_stale_data.rst @@ -157,9 +157,7 @@ This is achieved by using the otherwise unused and obsolete VERW instruction in combination with a microcode update. The microcode clears the affected CPU buffers when the VERW instruction is executed. -Kernel reuses the MDS function to invoke the buffer clearing: - - mds_clear_cpu_buffers() +Kernel does the buffer clearing with x86_clear_cpu_buffers(). On MDS affected CPUs, the kernel already invokes CPU buffer clear on kernel/userspace, hypervisor/guest and C-state (idle) transitions. No diff --git a/Documentation/admin-guide/kdump/kdump.rst b/Documentation/admin-guide/kdump/kdump.rst index 1f7f14c6e184..20fabdf6567e 100644 --- a/Documentation/admin-guide/kdump/kdump.rst +++ b/Documentation/admin-guide/kdump/kdump.rst @@ -547,6 +547,38 @@ from within add_taint() whenever the value set in this bitmask matches with the bit flag being set by add_taint(). This will cause a kdump to occur at the add_taint()->panic() call. +Write the dump file to encrypted disk volume +============================================ + +CONFIG_CRASH_DM_CRYPT can be enabled to support saving the dump file to an +encrypted disk volume (only x86_64 supported for now). User space can interact +with /sys/kernel/config/crash_dm_crypt_keys for setup, + +1. Tell the first kernel what logon keys are needed to unlock the disk volumes, + # Add key #1 + mkdir /sys/kernel/config/crash_dm_crypt_keys/7d26b7b4-e342-4d2d-b660-7426b0996720 + # Add key #1's description + echo cryptsetup:7d26b7b4-e342-4d2d-b660-7426b0996720 > /sys/kernel/config/crash_dm_crypt_keys/description + + # how many keys do we have now? + cat /sys/kernel/config/crash_dm_crypt_keys/count + 1 + + # Add key #2 in the same way + + # how many keys do we have now? + cat /sys/kernel/config/crash_dm_crypt_keys/count + 2 + + # To support CPU/memory hot-plugging, re-use keys already saved to reserved + # memory + echo true > /sys/kernel/config/crash_dm_crypt_key/reuse + +2. Load the dump-capture kernel + +3. After the dump-capture kerne get booted, restore the keys to user keyring + echo yes > /sys/kernel/crash_dm_crypt_keys/restore + Contact ======= diff --git a/Documentation/admin-guide/kdump/vmcoreinfo.rst b/Documentation/admin-guide/kdump/vmcoreinfo.rst index 0f714fc945ac..404a15f6782c 100644 --- a/Documentation/admin-guide/kdump/vmcoreinfo.rst +++ b/Documentation/admin-guide/kdump/vmcoreinfo.rst @@ -325,14 +325,14 @@ NR_FREE_PAGES On linux-2.6.21 or later, the number of free pages is in vm_stat[NR_FREE_PAGES]. Used to get the number of free pages. -PG_lru|PG_private|PG_swapcache|PG_swapbacked|PG_slab|PG_hwpoision|PG_head_mask|PG_hugetlb ------------------------------------------------------------------------------------------ +PG_lru|PG_private|PG_swapcache|PG_swapbacked|PG_hwpoison|PG_head_mask +-------------------------------------------------------------------------- Page attributes. These flags are used to filter various unnecessary for dumping pages. -PAGE_BUDDY_MAPCOUNT_VALUE(~PG_buddy)|PAGE_OFFLINE_MAPCOUNT_VALUE(~PG_offline) ------------------------------------------------------------------------------ +PAGE_SLAB_MAPCOUNT_VALUE|PAGE_BUDDY_MAPCOUNT_VALUE|PAGE_OFFLINE_MAPCOUNT_VALUE|PAGE_HUGETLB_MAPCOUNT_VALUE|PAGE_UNACCEPTED_MAPCOUNT_VALUE +------------------------------------------------------------------------------------------------------------------------------------------ More page attributes. These flags are used to filter various unnecessary for dumping pages. diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt index 1d6d39f4c68d..4943fc845a15 100644 --- a/Documentation/admin-guide/kernel-parameters.txt +++ b/Documentation/admin-guide/kernel-parameters.txt @@ -458,6 +458,9 @@ arm64.nomops [ARM64] Unconditionally disable Memory Copy and Memory Set instructions support + arm64.nompam [ARM64] Unconditionally disable Memory Partitioning And + Monitoring support + arm64.nomte [ARM64] Unconditionally disable Memory Tagging Extension support @@ -1828,6 +1831,13 @@ lz4: Select LZ4 compression algorithm to compress/decompress hibernation image. + hibernate.pm_test_delay= + [HIBERNATION] + Sets the number of seconds to remain in a hibernation test + mode before resuming the system (see + /sys/power/pm_test). Only available when CONFIG_PM_DEBUG + is set. Default value is 5. + highmem=nn[KMG] [KNL,BOOT,EARLY] forces the highmem zone to have an exact size of <nn>. This works even on boxes that have no highmem otherwise. This also works to reduce highmem @@ -2528,6 +2538,13 @@ requires the kernel to be built with CONFIG_ARM64_PSEUDO_NMI. + irqchip.riscv_imsic_noipi + [RISC-V,EARLY] + Force the kernel to not use IMSIC software injected MSIs + as IPIs. Intended for system where IMSIC is trap-n-emulated, + and thus want to reduce MMIO traps when triggering IPIs + to multiple harts. + irqfixup [HW] When an interrupt is not handled search all handlers for it. Intended to get systems with badly broken @@ -2742,6 +2759,31 @@ kgdbwait [KGDB,EARLY] Stop kernel execution and enter the kernel debugger at the earliest opportunity. + kho= [KEXEC,EARLY] + Format: { "0" | "1" | "off" | "on" | "y" | "n" } + Enables or disables Kexec HandOver. + "0" | "off" | "n" - kexec handover is disabled + "1" | "on" | "y" - kexec handover is enabled + + kho_scratch= [KEXEC,EARLY] + Format: ll[KMG],mm[KMG],nn[KMG] | nn% + Defines the size of the KHO scratch region. The KHO + scratch regions are physically contiguous memory + ranges that can only be used for non-kernel + allocations. That way, even when memory is heavily + fragmented with handed over memory, the kexeced + kernel will always have enough contiguous ranges to + bootstrap itself. + + It is possible to specify the exact amount of + memory in the form of "ll[KMG],mm[KMG],nn[KMG]" + where the first parameter defines the size of a low + memory scratch area, the second parameter defines + the size of a global scratch area and the third + parameter defines the size of additional per-node + scratch areas. The form "nn%" defines scale factor + (in percents) of memory that was used during boot. + kmac= [MIPS] Korina ethernet MAC address. Configure the RouterBoard 532 series on-chip Ethernet adapter MAC address. @@ -3755,6 +3797,10 @@ mmio_stale_data=full,nosmt [X86] retbleed=auto,nosmt [X86] + [X86] After one of the above options, additionally + supports attack-vector based controls as documented in + Documentation/admin-guide/hw-vuln/attack_vector_controls.rst + mminit_loglevel= [KNL,EARLY] When CONFIG_DEBUG_MEMORY_INIT is set, this parameter allows control of the logging verbosity for @@ -4965,6 +5011,18 @@ that number, otherwise (e.g., 'pmu_override=on'), MMCR1 remains 0. + pm_async= [PM] + Format: off + This parameter sets the initial value of the + /sys/power/pm_async sysfs knob at boot time. + If set to "off", disables asynchronous suspend and + resume of devices during system-wide power transitions. + This can be useful on platforms where device + dependencies are not well-defined, or for debugging + power management issues. Asynchronous operations are + enabled by default. + + pm_debug_messages [SUSPEND,KNL] Enable suspend/resume debug messages during boot up. @@ -5450,7 +5508,8 @@ echo 1 > /sys/module/rcutree/parameters/rcu_normal_wake_from_gp or pass a boot parameter "rcutree.rcu_normal_wake_from_gp=1" - Default is 0. + Default is 1 if num_possible_cpus() <= 16 and it is not explicitly + disabled by the boot parameter passing 0. rcuscale.gp_async= [KNL] Measure performance of asynchronous @@ -6352,6 +6411,11 @@ sa1100ir [NET] See drivers/net/irda/sa1100_ir.c. + sched_proxy_exec= [KNL] + Enables or disables "proxy execution" style + solution to mutex-based priority inversion. + Format: <bool> + sched_verbose [KNL,EARLY] Enables verbose scheduler debug messages. schedstats= [KNL,X86] Enable or disable scheduled statistics. @@ -6523,14 +6587,14 @@ slab_debug can create guard zones around objects and may poison objects when not in use. Also tracks the last alloc / free. For more information see - Documentation/mm/slub.rst. + Documentation/admin-guide/mm/slab.rst. (slub_debug legacy name also accepted for now) slab_max_order= [MM] Determines the maximum allowed order for slabs. A high setting may cause OOMs due to memory fragmentation. For more information see - Documentation/mm/slub.rst. + Documentation/admin-guide/mm/slab.rst. (slub_max_order legacy name also accepted for now) slab_merge [MM] @@ -6545,13 +6609,14 @@ the number of objects indicated. The higher the number of objects the smaller the overhead of tracking slabs and the less frequently locks need to be acquired. - For more information see Documentation/mm/slub.rst. + For more information see + Documentation/admin-guide/mm/slab.rst. (slub_min_objects legacy name also accepted for now) slab_min_order= [MM] Determines the minimum page order for slabs. Must be lower or equal to slab_max_order. For more information see - Documentation/mm/slub.rst. + Documentation/admin-guide/mm/slab.rst. (slub_min_order legacy name also accepted for now) slab_nomerge [MM] @@ -6565,7 +6630,8 @@ cache (risks via metadata attacks are mostly unchanged). Debug options disable merging on their own. - For more information see Documentation/mm/slub.rst. + For more information see + Documentation/admin-guide/mm/slab.rst. (slub_nomerge legacy name also accepted for now) slab_strict_numa [MM] @@ -7179,6 +7245,14 @@ causing a major performance hit, and the space where machines are deployed is by other means guarded. + tpm_crb_ffa.busy_timeout_ms= [ARM64,TPM] + Maximum time in milliseconds to retry sending a message + to the TPM service before giving up. This parameter controls + how long the system will continue retrying when the TPM + service is busy. + Format: <unsigned int> + Default: 2000 (2 seconds) + tpm_suspend_pcr=[HW,TPM] Format: integer pcr id Specify that at suspend time, the tpm driver @@ -7453,6 +7527,19 @@ having this key zero'ed is acceptable. E.g. in testing scenarios. + tsa= [X86] Control mitigation for Transient Scheduler + Attacks on AMD CPUs. Search the following in your + favourite search engine for more details: + + "Technical guidance for mitigating transient scheduler + attacks". + + off - disable the mitigation + on - enable the mitigation (default) + user - mitigate only user/kernel transitions + vm - mitigate only guest/host transitions + + tsc= Disable clocksource stability checks for TSC. Format: <string> [x86] reliable: mark tsc clocksource as reliable, this diff --git a/Documentation/admin-guide/laptops/alienware-wmi.rst b/Documentation/admin-guide/laptops/alienware-wmi.rst new file mode 100644 index 000000000000..27a32a8057da --- /dev/null +++ b/Documentation/admin-guide/laptops/alienware-wmi.rst @@ -0,0 +1,127 @@ +.. SPDX-License-Identifier: GPL-2.0-or-later + +==================== +Alienware WMI Driver +==================== + +Kurt Borja <kuurtb@gmail.com> + +This is a driver for the "WMAX" WMI device, which is found in most Dell gaming +laptops and controls various special features. + +Before the launch of M-Series laptops (~2018), the "WMAX" device controlled +basic RGB lighting, deep sleep mode, HDMI mode and amplifier status. + +Later, this device was completely repurpused. Now it mostly deals with thermal +profiles, sensor monitoring and overclocking. This interface is named "AWCC" and +is known to be used by the AWCC OEM application to control these features. + +The alienware-wmi driver controls both interfaces. + +AWCC Interface +============== + +WMI device documentation: Documentation/wmi/devices/alienware-wmi.rst + +Supported devices +----------------- + +- Alienware M-Series laptops +- Alienware X-Series laptops +- Alienware Aurora Desktops +- Dell G-Series laptops + +If you believe your device supports the AWCC interface and you don't have any of +the features described in this document, try the following alienware-wmi module +parameters: + +- ``force_platform_profile=1``: Forces probing for platform profile support +- ``force_hwmon=1``: Forces probing for HWMON support + +If the module loads successfully with these parameters, consider submitting a +patch adding your model to the ``awcc_dmi_table`` located in +``drivers/platform/x86/dell/alienware-wmi-wmax.c`` or contacting the maintainer +for further guidance. + +Status +------ + +The following features are currently supported: + +- :ref:`Platform Profile <platform-profile>`: + + - Thermal profile control + + - G-Mode toggling + +- :ref:`HWMON <hwmon>`: + + - Sensor monitoring + + - Manual fan control + +.. _platform-profile: + +Platform Profile +---------------- + +The AWCC interface exposes various firmware defined thermal profiles. These are +exposed to user-space through the Platform Profile class interface. Refer to +:ref:`sysfs-class-platform-profile <abi_file_testing_sysfs_class_platform_profile>` +for more information. + +The name of the platform-profile class device exported by this driver is +"alienware-wmi" and it's path can be found with: + +:: + + grep -l "alienware-wmi" /sys/class/platform-profile/platform-profile-*/name | sed 's|/[^/]*$||' + +If the device supports G-Mode, it is also toggled when selecting the +``performance`` profile. + +.. note:: + You may set the ``force_gmode`` module parameter to always try to toggle this + feature, without checking if your model supports it. + +.. _hwmon: + +HWMON +----- + +The AWCC interface also supports sensor monitoring and manual fan control. Both +of these features are exposed to user-space through the HWMON interface. + +The name of the hwmon class device exported by this driver is "alienware_wmi" +and it's path can be found with: + +:: + + grep -l "alienware_wmi" /sys/class/hwmon/hwmon*/name | sed 's|/[^/]*$||' + +Sensor monitoring is done through the standard HWMON interface. Refer to +:ref:`sysfs-class-hwmon <abi_file_testing_sysfs_class_hwmon>` for more +information. + +Manual fan control on the other hand, is not exposed directly by the AWCC +interface. Instead it let's us control a fan `boost` value. This `boost` value +has the following aproximate behavior over the fan pwm: + +:: + + pwm = pwm_base + (fan_boost / 255) * (pwm_max - pwm_base) + +Due to the above behavior, the fan `boost` control is exposed to user-space +through the following, custom hwmon sysfs attribute: + +=============================== ======= ======================================= +Name Perm Description +=============================== ======= ======================================= +fan[1-4]_boost RW Fan boost value. + + Integer value between 0 and 255 +=============================== ======= ======================================= + +.. note:: + In some devices, manual fan control only works reliably if the ``custom`` + platform profile is selected. diff --git a/Documentation/admin-guide/laptops/index.rst b/Documentation/admin-guide/laptops/index.rst index e71c8984c23e..db842b629303 100644 --- a/Documentation/admin-guide/laptops/index.rst +++ b/Documentation/admin-guide/laptops/index.rst @@ -7,6 +7,7 @@ Laptop Drivers .. toctree:: :maxdepth: 1 + alienware-wmi asus-laptop disk-shock-protection laptop-mode diff --git a/Documentation/admin-guide/media/c3-isp.dot b/Documentation/admin-guide/media/c3-isp.dot new file mode 100644 index 000000000000..42dc931ee84a --- /dev/null +++ b/Documentation/admin-guide/media/c3-isp.dot @@ -0,0 +1,26 @@ +digraph board { + rankdir=TB + n00000001 [label="{{<port0> 0 | <port1> 1} | c3-isp-core\n/dev/v4l-subdev0 | {<port2> 2 | <port3> 3 | <port4> 4 | <port5> 5}}", shape=Mrecord, style=filled, fillcolor=green] + n00000001:port3 -> n00000008:port0 + n00000001:port4 -> n0000000b:port0 + n00000001:port5 -> n0000000e:port0 + n00000001:port2 -> n00000027 + n00000008 [label="{{<port0> 0} | c3-isp-resizer0\n/dev/v4l-subdev1 | {<port1> 1}}", shape=Mrecord, style=filled, fillcolor=green] + n00000008:port1 -> n00000016 [style=bold] + n0000000b [label="{{<port0> 0} | c3-isp-resizer1\n/dev/v4l-subdev2 | {<port1> 1}}", shape=Mrecord, style=filled, fillcolor=green] + n0000000b:port1 -> n0000001a [style=bold] + n0000000e [label="{{<port0> 0} | c3-isp-resizer2\n/dev/v4l-subdev3 | {<port1> 1}}", shape=Mrecord, style=filled, fillcolor=green] + n0000000e:port1 -> n00000023 [style=bold] + n00000011 [label="{{<port0> 0} | c3-mipi-adapter\n/dev/v4l-subdev4 | {<port1> 1}}", shape=Mrecord, style=filled, fillcolor=green] + n00000011:port1 -> n00000001:port0 [style=bold] + n00000016 [label="c3-isp-cap0\n/dev/video0", shape=box, style=filled, fillcolor=yellow] + n0000001a [label="c3-isp-cap1\n/dev/video1", shape=box, style=filled, fillcolor=yellow] + n0000001e [label="{{<port0> 0} | c3-mipi-csi2\n/dev/v4l-subdev5 | {<port1> 1}}", shape=Mrecord, style=filled, fillcolor=green] + n0000001e:port1 -> n00000011:port0 [style=bold] + n00000023 [label="c3-isp-cap2\n/dev/video2", shape=box, style=filled, fillcolor=yellow] + n00000027 [label="c3-isp-stats\n/dev/video3", shape=box, style=filled, fillcolor=yellow] + n0000002b [label="c3-isp-params\n/dev/video4", shape=box, style=filled, fillcolor=yellow] + n0000002b -> n00000001:port1 + n0000003f [label="{{} | imx290 2-001a\n/dev/v4l-subdev6 | {<port0> 0}}", shape=Mrecord, style=filled, fillcolor=green] + n0000003f:port0 -> n0000001e:port0 [style=bold] +} diff --git a/Documentation/admin-guide/media/c3-isp.rst b/Documentation/admin-guide/media/c3-isp.rst new file mode 100644 index 000000000000..ac508b8c6831 --- /dev/null +++ b/Documentation/admin-guide/media/c3-isp.rst @@ -0,0 +1,101 @@ +.. SPDX-License-Identifier: (GPL-2.0-only OR MIT) + +.. include:: <isonum.txt> + +================================================= +Amlogic C3 Image Signal Processing (C3ISP) driver +================================================= + +Introduction +============ + +This file documents the Amlogic C3ISP driver located under +drivers/media/platform/amlogic/c3/isp. + +The current version of the driver supports the C3ISP found on +Amlogic C308L processor. + +The driver implements V4L2, Media controller and V4L2 subdev interfaces. +Camera sensor using V4L2 subdev interface in the kernel is supported. + +The driver has been tested on AW419-C308L-Socket platform. + +Amlogic C3 ISP +============== + +The Camera hardware found on C308L processors and supported by +the driver consists of: + +- 1 MIPI-CSI-2 module: handles the physical layer of the MIPI CSI-2 receiver and + receives data from the connected camera sensor. +- 1 MIPI-ADAPTER module: organizes MIPI data to meet ISP input requirements and + send MIPI data to ISP. +- 1 ISP (Image Signal Processing) module: contains a pipeline of image processing + hardware blocks. The ISP pipeline contains three resizers at the end each of + them connected to a DMA interface which writes the output data to memory. + +A high-level functional view of the C3 ISP is presented below.:: + + +----------+ +-------+ + | Resizer |--->| WRMIF | + +---------+ +------------+ +--------------+ +-------+ |----------+ +-------+ + | Sensor |--->| MIPI CSI-2 |--->| MIPI ADAPTER |--->| ISP |---|----------+ +-------+ + +---------+ +------------+ +--------------+ +-------+ | Resizer |--->| WRMIF | + +----------+ +-------+ + |----------+ +-------+ + | Resizer |--->| WRMIF | + +----------+ +-------+ + +Driver architecture and design +============================== + +With the goal to model the hardware links between the modules and to expose a +clean, logical and usable interface, the driver registers the following V4L2 +sub-devices: + +- 1 `c3-mipi-csi2` sub-device - the MIPI CSI-2 receiver +- 1 `c3-mipi-adapter` sub-device - the MIPI adapter +- 1 `c3-isp-core` sub-device - the ISP core +- 3 `c3-isp-resizer` sub-devices - the ISP resizers + +The `c3-isp-core` sub-device is linked to 2 video device nodes for statistics +capture and parameters programming: + +- the `c3-isp-stats` capture video device node for statistics capture +- the `c3-isp-params` output video device for parameters programming + +Each `c3-isp-resizer` sub-device is linked to a capture video device node where +frames are captured from: + +- `c3-isp-resizer0` is linked to the `c3-isp-cap0` capture video device +- `c3-isp-resizer1` is linked to the `c3-isp-cap1` capture video device +- `c3-isp-resizer2` is linked to the `c3-isp-cap2` capture video device + +The media controller pipeline graph is as follows (with connected a +IMX290 camera sensor): + +.. _isp_topology_graph: + +.. kernel-figure:: c3-isp.dot + :alt: c3-isp.dot + :align: center + + Media pipeline topology + +Implementation +============== + +Runtime configuration of the ISP hardware is performed on the `c3-isp-params` +video device node using the :ref:`V4L2_META_FMT_C3ISP_PARAMS +<v4l2-meta-fmt-c3isp-params>` as data format. The buffer structure is defined by +:c:type:`c3_isp_params_cfg`. + +Statistics are captured from the `c3-isp-stats` video device node using the +:ref:`V4L2_META_FMT_C3ISP_STATS <v4l2-meta-fmt-c3isp-stats>` data format. + +The final picture size and format is configured using the V4L2 video +capture interface on the `c3-isp-cap[0, 2]` video device nodes. + +The Amlogic C3 ISP is supported by `libcamera <https://libcamera.org>`_ with a +dedicated pipeline handler and algorithms that perform run-time image correction +and enhancement. diff --git a/Documentation/admin-guide/media/mgb4.rst b/Documentation/admin-guide/media/mgb4.rst index f69d331e3cb1..5ac69b833a7a 100644 --- a/Documentation/admin-guide/media/mgb4.rst +++ b/Documentation/admin-guide/media/mgb4.rst @@ -1,8 +1,17 @@ .. SPDX-License-Identifier: GPL-2.0 +.. include:: <isonum.txt> + The mgb4 driver =============== +Copyright |copy| 2023 - 2025 Digiteq Automotive + author: Martin Tůma <martin.tuma@digiteqautomotive.com> + +This is a v4l2 device driver for the Digiteq Automotive FrameGrabber 4, a PCIe +card capable of capturing and generating FPD-Link III and GMSL2/3 video streams +as used in the automotive industry. + sysfs interface --------------- diff --git a/Documentation/admin-guide/media/pci-cardlist.rst b/Documentation/admin-guide/media/pci-cardlist.rst index 7d8e3c8987db..239879634ea5 100644 --- a/Documentation/admin-guide/media/pci-cardlist.rst +++ b/Documentation/admin-guide/media/pci-cardlist.rst @@ -86,7 +86,6 @@ saa7134 Philips SAA7134 saa7164 NXP SAA7164 smipcie SMI PCIe DVBSky cards solo6x10 Bluecherry / Softlogic 6x10 capture cards (MPEG-4/H.264) -sta2x11_vip STA2X11 VIP Video For Linux tw5864 Techwell TW5864 video/audio grabber and encoder tw686x Intersil/Techwell TW686x tw68 Techwell tw68x Video For Linux diff --git a/Documentation/admin-guide/media/v4l-drivers.rst b/Documentation/admin-guide/media/v4l-drivers.rst index e8761561b2fe..3bac5165b134 100644 --- a/Documentation/admin-guide/media/v4l-drivers.rst +++ b/Documentation/admin-guide/media/v4l-drivers.rst @@ -10,6 +10,7 @@ Video4Linux (V4L) driver-specific documentation :maxdepth: 2 bttv + c3-isp cafe_ccic cx88 fimc diff --git a/Documentation/admin-guide/mm/damon/index.rst b/Documentation/admin-guide/mm/damon/index.rst index 33d37bb2fb4e..bc7e976120e0 100644 --- a/Documentation/admin-guide/mm/damon/index.rst +++ b/Documentation/admin-guide/mm/damon/index.rst @@ -1,12 +1,11 @@ .. SPDX-License-Identifier: GPL-2.0 -========================== -DAMON: Data Access MONitor -========================== +================================================================ +DAMON: Data Access MONitoring and Access-aware System Operations +================================================================ -:doc:`DAMON </mm/damon/index>` allows light-weight data access monitoring. -Using DAMON, users can analyze the memory access patterns of their systems and -optimize those. +:doc:`DAMON </mm/damon/index>` is a Linux kernel subsystem for efficient data +access monitoring and access-aware system operations. .. toctree:: :maxdepth: 2 diff --git a/Documentation/admin-guide/mm/damon/usage.rst b/Documentation/admin-guide/mm/damon/usage.rst index ced2013db3df..d960aba72b82 100644 --- a/Documentation/admin-guide/mm/damon/usage.rst +++ b/Documentation/admin-guide/mm/damon/usage.rst @@ -81,7 +81,7 @@ comma (","). │ │ │ │ │ │ │ :ref:`quotas <sysfs_quotas>`/ms,bytes,reset_interval_ms,effective_bytes │ │ │ │ │ │ │ │ weights/sz_permil,nr_accesses_permil,age_permil │ │ │ │ │ │ │ │ :ref:`goals <sysfs_schemes_quota_goals>`/nr_goals - │ │ │ │ │ │ │ │ │ 0/target_metric,target_value,current_value + │ │ │ │ │ │ │ │ │ 0/target_metric,target_value,current_value,nid │ │ │ │ │ │ │ :ref:`watermarks <sysfs_watermarks>`/metric,interval_us,high,mid,low │ │ │ │ │ │ │ :ref:`{core_,ops_,}filters <sysfs_filters>`/nr_filters │ │ │ │ │ │ │ │ 0/type,matching,allow,memcg_path,addr_start,addr_end,target_idx,min,max @@ -390,11 +390,11 @@ number (``N``) to the file creates the number of child directories named ``0`` to ``N-1``. Each directory represents each goal and current achievement. Among the multiple feedback, the best one is used. -Each goal directory contains three files, namely ``target_metric``, -``target_value`` and ``current_value``. Users can set and get the three -parameters for the quota auto-tuning goals that specified on the :ref:`design -doc <damon_design_damos_quotas_auto_tuning>` by writing to and reading from each -of the files. Note that users should further write +Each goal directory contains four files, namely ``target_metric``, +``target_value``, ``current_value`` and ``nid``. Users can set and get the +four parameters for the quota auto-tuning goals that specified on the +:ref:`design doc <damon_design_damos_quotas_auto_tuning>` by writing to and +reading from each of the files. Note that users should further write ``commit_schemes_quota_goals`` to the ``state`` file of the :ref:`kdamond directory <sysfs_kdamond>` to pass the feedback to DAMON. diff --git a/Documentation/admin-guide/mm/index.rst b/Documentation/admin-guide/mm/index.rst index 8b35795b664b..ebc83ca20fdc 100644 --- a/Documentation/admin-guide/mm/index.rst +++ b/Documentation/admin-guide/mm/index.rst @@ -37,8 +37,10 @@ the Linux memory management. numaperf pagemap shrinker_debugfs + slab soft-dirty swap_numa transhuge userfaultfd zswap + kho diff --git a/Documentation/admin-guide/mm/kho.rst b/Documentation/admin-guide/mm/kho.rst new file mode 100644 index 000000000000..6dc18ed4b886 --- /dev/null +++ b/Documentation/admin-guide/mm/kho.rst @@ -0,0 +1,115 @@ +.. SPDX-License-Identifier: GPL-2.0-or-later + +==================== +Kexec Handover Usage +==================== + +Kexec HandOver (KHO) is a mechanism that allows Linux to preserve memory +regions, which could contain serialized system states, across kexec. + +This document expects that you are familiar with the base KHO +:ref:`concepts <kho-concepts>`. If you have not read +them yet, please do so now. + +Prerequisites +============= + +KHO is available when the kernel is compiled with ``CONFIG_KEXEC_HANDOVER`` +set to y. Every KHO producer may have its own config option that you +need to enable if you would like to preserve their respective state across +kexec. + +To use KHO, please boot the kernel with the ``kho=on`` command line +parameter. You may use ``kho_scratch`` parameter to define size of the +scratch regions. For example ``kho_scratch=16M,512M,256M`` will reserve a +16 MiB low memory scratch area, a 512 MiB global scratch region, and 256 MiB +per NUMA node scratch regions on boot. + +Perform a KHO kexec +=================== + +First, before you perform a KHO kexec, you need to move the system into +the :ref:`KHO finalization phase <kho-finalization-phase>` :: + + $ echo 1 > /sys/kernel/debug/kho/out/finalize + +After this command, the KHO FDT is available in +``/sys/kernel/debug/kho/out/fdt``. Other subsystems may also register +their own preserved sub FDTs under +``/sys/kernel/debug/kho/out/sub_fdts/``. + +Next, load the target payload and kexec into it. It is important that you +use the ``-s`` parameter to use the in-kernel kexec file loader, as user +space kexec tooling currently has no support for KHO with the user space +based file loader :: + + # kexec -l /path/to/bzImage --initrd /path/to/initrd -s + # kexec -e + +The new kernel will boot up and contain some of the previous kernel's state. + +For example, if you used ``reserve_mem`` command line parameter to create +an early memory reservation, the new kernel will have that memory at the +same physical address as the old kernel. + +Abort a KHO exec +================ + +You can move the system out of KHO finalization phase again by calling :: + + $ echo 0 > /sys/kernel/debug/kho/out/active + +After this command, the KHO FDT is no longer available in +``/sys/kernel/debug/kho/out/fdt``. + +debugfs Interfaces +================== + +Currently KHO creates the following debugfs interfaces. Notice that these +interfaces may change in the future. They will be moved to sysfs once KHO is +stabilized. + +``/sys/kernel/debug/kho/out/finalize`` + Kexec HandOver (KHO) allows Linux to transition the state of + compatible drivers into the next kexec'ed kernel. To do so, + device drivers will instruct KHO to preserve memory regions, + which could contain serialized kernel state. + While the state is serialized, they are unable to perform + any modifications to state that was serialized, such as + handed over memory allocations. + + When this file contains "1", the system is in the transition + state. When contains "0", it is not. To switch between the + two states, echo the respective number into this file. + +``/sys/kernel/debug/kho/out/fdt`` + When KHO state tree is finalized, the kernel exposes the + flattened device tree blob that carries its current KHO + state in this file. Kexec user space tooling can use this + as input file for the KHO payload image. + +``/sys/kernel/debug/kho/out/scratch_len`` + Lengths of KHO scratch regions, which are physically contiguous + memory regions that will always stay available for future kexec + allocations. Kexec user space tools can use this file to determine + where it should place its payload images. + +``/sys/kernel/debug/kho/out/scratch_phys`` + Physical locations of KHO scratch regions. Kexec user space tools + can use this file in conjunction to scratch_phys to determine where + it should place its payload images. + +``/sys/kernel/debug/kho/out/sub_fdts/`` + In the KHO finalization phase, KHO producers register their own + FDT blob under this directory. + +``/sys/kernel/debug/kho/in/fdt`` + When the kernel was booted with Kexec HandOver (KHO), + the state tree that carries metadata about the previous + kernel's state is in this file in the format of flattened + device tree. This file may disappear when all consumers of + it finished to interpret their metadata. + +``/sys/kernel/debug/kho/in/sub_fdts/`` + Similar to ``kho/out/sub_fdts/``, but contains sub FDT blobs + of KHO producers passed from the old kernel. diff --git a/Documentation/admin-guide/mm/multigen_lru.rst b/Documentation/admin-guide/mm/multigen_lru.rst index 33e068830497..9cb54b4ff5d9 100644 --- a/Documentation/admin-guide/mm/multigen_lru.rst +++ b/Documentation/admin-guide/mm/multigen_lru.rst @@ -151,8 +151,9 @@ generations less than or equal to ``min_gen_nr``. ``min_gen_nr`` should be less than ``max_gen_nr-1``, since ``max_gen_nr`` and ``max_gen_nr-1`` are not fully aged (equivalent to the active list) and therefore cannot be evicted. ``swappiness`` -overrides the default value in ``/proc/sys/vm/swappiness``. -``nr_to_reclaim`` limits the number of pages to evict. +overrides the default value in ``/proc/sys/vm/swappiness`` and the valid +range is [0-200, max], with max being exclusively used for the reclamation +of anonymous memory. ``nr_to_reclaim`` limits the number of pages to evict. A typical use case is that a job scheduler runs this command before it tries to land a new job on a server. If it fails to materialize enough diff --git a/Documentation/admin-guide/mm/pagemap.rst b/Documentation/admin-guide/mm/pagemap.rst index afce291649dd..e60e9211fd9b 100644 --- a/Documentation/admin-guide/mm/pagemap.rst +++ b/Documentation/admin-guide/mm/pagemap.rst @@ -250,6 +250,7 @@ Following flags about pages are currently supported: - ``PAGE_IS_PFNZERO`` - Page has zero PFN - ``PAGE_IS_HUGE`` - Page is PMD-mapped THP or Hugetlb backed - ``PAGE_IS_SOFT_DIRTY`` - Page is soft-dirty +- ``PAGE_IS_GUARD`` - Page is a part of a guard region The ``struct pm_scan_arg`` is used as the argument of the IOCTL. diff --git a/Documentation/admin-guide/mm/slab.rst b/Documentation/admin-guide/mm/slab.rst new file mode 100644 index 000000000000..14429ab90611 --- /dev/null +++ b/Documentation/admin-guide/mm/slab.rst @@ -0,0 +1,469 @@ +======================================== +Short users guide for the slab allocator +======================================== + +The slab allocator includes full debugging support (when built with +CONFIG_SLUB_DEBUG=y) but it is off by default (unless built with +CONFIG_SLUB_DEBUG_ON=y). You can enable debugging only for selected +slabs in order to avoid an impact on overall system performance which +may make a bug more difficult to find. + +In order to switch debugging on one can add an option ``slab_debug`` +to the kernel command line. That will enable full debugging for +all slabs. + +Typically one would then use the ``slabinfo`` command to get statistical +data and perform operation on the slabs. By default ``slabinfo`` only lists +slabs that have data in them. See "slabinfo -h" for more options when +running the command. ``slabinfo`` can be compiled with +:: + + gcc -o slabinfo tools/mm/slabinfo.c + +Some of the modes of operation of ``slabinfo`` require that slub debugging +be enabled on the command line. F.e. no tracking information will be +available without debugging on and validation can only partially +be performed if debugging was not switched on. + +Some more sophisticated uses of slab_debug: +------------------------------------------- + +Parameters may be given to ``slab_debug``. If none is specified then full +debugging is enabled. Format: + +slab_debug=<Debug-Options> + Enable options for all slabs + +slab_debug=<Debug-Options>,<slab name1>,<slab name2>,... + Enable options only for select slabs (no spaces + after a comma) + +Multiple blocks of options for all slabs or selected slabs can be given, with +blocks of options delimited by ';'. The last of "all slabs" blocks is applied +to all slabs except those that match one of the "select slabs" block. Options +of the first "select slabs" blocks that matches the slab's name are applied. + +Possible debug options are:: + + F Sanity checks on (enables SLAB_DEBUG_CONSISTENCY_CHECKS + Sorry SLAB legacy issues) + Z Red zoning + P Poisoning (object and padding) + U User tracking (free and alloc) + T Trace (please only use on single slabs) + A Enable failslab filter mark for the cache + O Switch debugging off for caches that would have + caused higher minimum slab orders + - Switch all debugging off (useful if the kernel is + configured with CONFIG_SLUB_DEBUG_ON) + +F.e. in order to boot just with sanity checks and red zoning one would specify:: + + slab_debug=FZ + +Trying to find an issue in the dentry cache? Try:: + + slab_debug=,dentry + +to only enable debugging on the dentry cache. You may use an asterisk at the +end of the slab name, in order to cover all slabs with the same prefix. For +example, here's how you can poison the dentry cache as well as all kmalloc +slabs:: + + slab_debug=P,kmalloc-*,dentry + +Red zoning and tracking may realign the slab. We can just apply sanity checks +to the dentry cache with:: + + slab_debug=F,dentry + +Debugging options may require the minimum possible slab order to increase as +a result of storing the metadata (for example, caches with PAGE_SIZE object +sizes). This has a higher likelihood of resulting in slab allocation errors +in low memory situations or if there's high fragmentation of memory. To +switch off debugging for such caches by default, use:: + + slab_debug=O + +You can apply different options to different list of slab names, using blocks +of options. This will enable red zoning for dentry and user tracking for +kmalloc. All other slabs will not get any debugging enabled:: + + slab_debug=Z,dentry;U,kmalloc-* + +You can also enable options (e.g. sanity checks and poisoning) for all caches +except some that are deemed too performance critical and don't need to be +debugged by specifying global debug options followed by a list of slab names +with "-" as options:: + + slab_debug=FZ;-,zs_handle,zspage + +The state of each debug option for a slab can be found in the respective files +under:: + + /sys/kernel/slab/<slab name>/ + +If the file contains 1, the option is enabled, 0 means disabled. The debug +options from the ``slab_debug`` parameter translate to the following files:: + + F sanity_checks + Z red_zone + P poison + U store_user + T trace + A failslab + +failslab file is writable, so writing 1 or 0 will enable or disable +the option at runtime. Write returns -EINVAL if cache is an alias. +Careful with tracing: It may spew out lots of information and never stop if +used on the wrong slab. + +Slab merging +============ + +If no debug options are specified then SLUB may merge similar slabs together +in order to reduce overhead and increase cache hotness of objects. +``slabinfo -a`` displays which slabs were merged together. + +Slab validation +=============== + +SLUB can validate all object if the kernel was booted with slab_debug. In +order to do so you must have the ``slabinfo`` tool. Then you can do +:: + + slabinfo -v + +which will test all objects. Output will be generated to the syslog. + +This also works in a more limited way if boot was without slab debug. +In that case ``slabinfo -v`` simply tests all reachable objects. Usually +these are in the cpu slabs and the partial slabs. Full slabs are not +tracked by SLUB in a non debug situation. + +Getting more performance +======================== + +To some degree SLUB's performance is limited by the need to take the +list_lock once in a while to deal with partial slabs. That overhead is +governed by the order of the allocation for each slab. The allocations +can be influenced by kernel parameters: + +.. slab_min_objects=x (default: automatically scaled by number of cpus) +.. slab_min_order=x (default 0) +.. slab_max_order=x (default 3 (PAGE_ALLOC_COSTLY_ORDER)) + +``slab_min_objects`` + allows to specify how many objects must at least fit into one + slab in order for the allocation order to be acceptable. In + general slub will be able to perform this number of + allocations on a slab without consulting centralized resources + (list_lock) where contention may occur. + +``slab_min_order`` + specifies a minimum order of slabs. A similar effect like + ``slab_min_objects``. + +``slab_max_order`` + specified the order at which ``slab_min_objects`` should no + longer be checked. This is useful to avoid SLUB trying to + generate super large order pages to fit ``slab_min_objects`` + of a slab cache with large object sizes into one high order + page. Setting command line parameter + ``debug_guardpage_minorder=N`` (N > 0), forces setting + ``slab_max_order`` to 0, what cause minimum possible order of + slabs allocation. + +``slab_strict_numa`` + Enables the application of memory policies on each + allocation. This results in more accurate placement of + objects which may result in the reduction of accesses + to remote nodes. The default is to only apply memory + policies at the folio level when a new folio is acquired + or a folio is retrieved from the lists. Enabling this + option reduces the fastpath performance of the slab allocator. + +SLUB Debug output +================= + +Here is a sample of slub debug output:: + + ==================================================================== + BUG kmalloc-8: Right Redzone overwritten + -------------------------------------------------------------------- + + INFO: 0xc90f6d28-0xc90f6d2b. First byte 0x00 instead of 0xcc + INFO: Slab 0xc528c530 flags=0x400000c3 inuse=61 fp=0xc90f6d58 + INFO: Object 0xc90f6d20 @offset=3360 fp=0xc90f6d58 + INFO: Allocated in get_modalias+0x61/0xf5 age=53 cpu=1 pid=554 + + Bytes b4 (0xc90f6d10): 00 00 00 00 00 00 00 00 5a 5a 5a 5a 5a 5a 5a 5a ........ZZZZZZZZ + Object (0xc90f6d20): 31 30 31 39 2e 30 30 35 1019.005 + Redzone (0xc90f6d28): 00 cc cc cc . + Padding (0xc90f6d50): 5a 5a 5a 5a 5a 5a 5a 5a ZZZZZZZZ + + [<c010523d>] dump_trace+0x63/0x1eb + [<c01053df>] show_trace_log_lvl+0x1a/0x2f + [<c010601d>] show_trace+0x12/0x14 + [<c0106035>] dump_stack+0x16/0x18 + [<c017e0fa>] object_err+0x143/0x14b + [<c017e2cc>] check_object+0x66/0x234 + [<c017eb43>] __slab_free+0x239/0x384 + [<c017f446>] kfree+0xa6/0xc6 + [<c02e2335>] get_modalias+0xb9/0xf5 + [<c02e23b7>] dmi_dev_uevent+0x27/0x3c + [<c027866a>] dev_uevent+0x1ad/0x1da + [<c0205024>] kobject_uevent_env+0x20a/0x45b + [<c020527f>] kobject_uevent+0xa/0xf + [<c02779f1>] store_uevent+0x4f/0x58 + [<c027758e>] dev_attr_store+0x29/0x2f + [<c01bec4f>] sysfs_write_file+0x16e/0x19c + [<c0183ba7>] vfs_write+0xd1/0x15a + [<c01841d7>] sys_write+0x3d/0x72 + [<c0104112>] sysenter_past_esp+0x5f/0x99 + [<b7f7b410>] 0xb7f7b410 + ======================= + + FIX kmalloc-8: Restoring Redzone 0xc90f6d28-0xc90f6d2b=0xcc + +If SLUB encounters a corrupted object (full detection requires the kernel +to be booted with slab_debug) then the following output will be dumped +into the syslog: + +1. Description of the problem encountered + + This will be a message in the system log starting with:: + + =============================================== + BUG <slab cache affected>: <What went wrong> + ----------------------------------------------- + + INFO: <corruption start>-<corruption_end> <more info> + INFO: Slab <address> <slab information> + INFO: Object <address> <object information> + INFO: Allocated in <kernel function> age=<jiffies since alloc> cpu=<allocated by + cpu> pid=<pid of the process> + INFO: Freed in <kernel function> age=<jiffies since free> cpu=<freed by cpu> + pid=<pid of the process> + + (Object allocation / free information is only available if SLAB_STORE_USER is + set for the slab. slab_debug sets that option) + +2. The object contents if an object was involved. + + Various types of lines can follow the BUG SLUB line: + + Bytes b4 <address> : <bytes> + Shows a few bytes before the object where the problem was detected. + Can be useful if the corruption does not stop with the start of the + object. + + Object <address> : <bytes> + The bytes of the object. If the object is inactive then the bytes + typically contain poison values. Any non-poison value shows a + corruption by a write after free. + + Redzone <address> : <bytes> + The Redzone following the object. The Redzone is used to detect + writes after the object. All bytes should always have the same + value. If there is any deviation then it is due to a write after + the object boundary. + + (Redzone information is only available if SLAB_RED_ZONE is set. + slab_debug sets that option) + + Padding <address> : <bytes> + Unused data to fill up the space in order to get the next object + properly aligned. In the debug case we make sure that there are + at least 4 bytes of padding. This allows the detection of writes + before the object. + +3. A stackdump + + The stackdump describes the location where the error was detected. The cause + of the corruption is may be more likely found by looking at the function that + allocated or freed the object. + +4. Report on how the problem was dealt with in order to ensure the continued + operation of the system. + + These are messages in the system log beginning with:: + + FIX <slab cache affected>: <corrective action taken> + + In the above sample SLUB found that the Redzone of an active object has + been overwritten. Here a string of 8 characters was written into a slab that + has the length of 8 characters. However, a 8 character string needs a + terminating 0. That zero has overwritten the first byte of the Redzone field. + After reporting the details of the issue encountered the FIX SLUB message + tells us that SLUB has restored the Redzone to its proper value and then + system operations continue. + +Emergency operations +==================== + +Minimal debugging (sanity checks alone) can be enabled by booting with:: + + slab_debug=F + +This will be generally be enough to enable the resiliency features of slub +which will keep the system running even if a bad kernel component will +keep corrupting objects. This may be important for production systems. +Performance will be impacted by the sanity checks and there will be a +continual stream of error messages to the syslog but no additional memory +will be used (unlike full debugging). + +No guarantees. The kernel component still needs to be fixed. Performance +may be optimized further by locating the slab that experiences corruption +and enabling debugging only for that cache + +I.e.:: + + slab_debug=F,dentry + +If the corruption occurs by writing after the end of the object then it +may be advisable to enable a Redzone to avoid corrupting the beginning +of other objects:: + + slab_debug=FZ,dentry + +Extended slabinfo mode and plotting +=================================== + +The ``slabinfo`` tool has a special 'extended' ('-X') mode that includes: + - Slabcache Totals + - Slabs sorted by size (up to -N <num> slabs, default 1) + - Slabs sorted by loss (up to -N <num> slabs, default 1) + +Additionally, in this mode ``slabinfo`` does not dynamically scale +sizes (G/M/K) and reports everything in bytes (this functionality is +also available to other slabinfo modes via '-B' option) which makes +reporting more precise and accurate. Moreover, in some sense the `-X' +mode also simplifies the analysis of slabs' behaviour, because its +output can be plotted using the ``slabinfo-gnuplot.sh`` script. So it +pushes the analysis from looking through the numbers (tons of numbers) +to something easier -- visual analysis. + +To generate plots: + +a) collect slabinfo extended records, for example:: + + while [ 1 ]; do slabinfo -X >> FOO_STATS; sleep 1; done + +b) pass stats file(-s) to ``slabinfo-gnuplot.sh`` script:: + + slabinfo-gnuplot.sh FOO_STATS [FOO_STATS2 .. FOO_STATSN] + + The ``slabinfo-gnuplot.sh`` script will pre-processes the collected records + and generates 3 png files (and 3 pre-processing cache files) per STATS + file: + - Slabcache Totals: FOO_STATS-totals.png + - Slabs sorted by size: FOO_STATS-slabs-by-size.png + - Slabs sorted by loss: FOO_STATS-slabs-by-loss.png + +Another use case, when ``slabinfo-gnuplot.sh`` can be useful, is when you +need to compare slabs' behaviour "prior to" and "after" some code +modification. To help you out there, ``slabinfo-gnuplot.sh`` script +can 'merge' the `Slabcache Totals` sections from different +measurements. To visually compare N plots: + +a) Collect as many STATS1, STATS2, .. STATSN files as you need:: + + while [ 1 ]; do slabinfo -X >> STATS<X>; sleep 1; done + +b) Pre-process those STATS files:: + + slabinfo-gnuplot.sh STATS1 STATS2 .. STATSN + +c) Execute ``slabinfo-gnuplot.sh`` in '-t' mode, passing all of the + generated pre-processed \*-totals:: + + slabinfo-gnuplot.sh -t STATS1-totals STATS2-totals .. STATSN-totals + + This will produce a single plot (png file). + + Plots, expectedly, can be large so some fluctuations or small spikes + can go unnoticed. To deal with that, ``slabinfo-gnuplot.sh`` has two + options to 'zoom-in'/'zoom-out': + + a) ``-s %d,%d`` -- overwrites the default image width and height + b) ``-r %d,%d`` -- specifies a range of samples to use (for example, + in ``slabinfo -X >> FOO_STATS; sleep 1;`` case, using a ``-r + 40,60`` range will plot only samples collected between 40th and + 60th seconds). + + +DebugFS files for SLUB +====================== + +For more information about current state of SLUB caches with the user tracking +debug option enabled, debugfs files are available, typically under +/sys/kernel/debug/slab/<cache>/ (created only for caches with enabled user +tracking). There are 2 types of these files with the following debug +information: + +1. alloc_traces:: + + Prints information about unique allocation traces of the currently + allocated objects. The output is sorted by frequency of each trace. + + Information in the output: + Number of objects, allocating function, possible memory wastage of + kmalloc objects(total/per-object), minimal/average/maximal jiffies + since alloc, pid range of the allocating processes, cpu mask of + allocating cpus, numa node mask of origins of memory, and stack trace. + + Example::: + + 338 pci_alloc_dev+0x2c/0xa0 waste=521872/1544 age=290837/291891/293509 pid=1 cpus=106 nodes=0-1 + __kmem_cache_alloc_node+0x11f/0x4e0 + kmalloc_trace+0x26/0xa0 + pci_alloc_dev+0x2c/0xa0 + pci_scan_single_device+0xd2/0x150 + pci_scan_slot+0xf7/0x2d0 + pci_scan_child_bus_extend+0x4e/0x360 + acpi_pci_root_create+0x32e/0x3b0 + pci_acpi_scan_root+0x2b9/0x2d0 + acpi_pci_root_add.cold.11+0x110/0xb0a + acpi_bus_attach+0x262/0x3f0 + device_for_each_child+0xb7/0x110 + acpi_dev_for_each_child+0x77/0xa0 + acpi_bus_attach+0x108/0x3f0 + device_for_each_child+0xb7/0x110 + acpi_dev_for_each_child+0x77/0xa0 + acpi_bus_attach+0x108/0x3f0 + +2. free_traces:: + + Prints information about unique freeing traces of the currently allocated + objects. The freeing traces thus come from the previous life-cycle of the + objects and are reported as not available for objects allocated for the first + time. The output is sorted by frequency of each trace. + + Information in the output: + Number of objects, freeing function, minimal/average/maximal jiffies since free, + pid range of the freeing processes, cpu mask of freeing cpus, and stack trace. + + Example::: + + 1980 <not-available> age=4294912290 pid=0 cpus=0 + 51 acpi_ut_update_ref_count+0x6a6/0x782 age=236886/237027/237772 pid=1 cpus=1 + kfree+0x2db/0x420 + acpi_ut_update_ref_count+0x6a6/0x782 + acpi_ut_update_object_reference+0x1ad/0x234 + acpi_ut_remove_reference+0x7d/0x84 + acpi_rs_get_prt_method_data+0x97/0xd6 + acpi_get_irq_routing_table+0x82/0xc4 + acpi_pci_irq_find_prt_entry+0x8e/0x2e0 + acpi_pci_irq_lookup+0x3a/0x1e0 + acpi_pci_irq_enable+0x77/0x240 + pcibios_enable_device+0x39/0x40 + do_pci_enable_device.part.0+0x5d/0xe0 + pci_enable_device_flags+0xfc/0x120 + pci_enable_device+0x13/0x20 + virtio_pci_probe+0x9e/0x170 + local_pci_probe+0x48/0x80 + pci_device_probe+0x105/0x1c0 + +Christoph Lameter, May 30, 2007 +Sergey Senozhatsky, October 23, 2015 diff --git a/Documentation/admin-guide/namespaces/resource-control.rst b/Documentation/admin-guide/namespaces/resource-control.rst index 369556e00f0c..553a44803231 100644 --- a/Documentation/admin-guide/namespaces/resource-control.rst +++ b/Documentation/admin-guide/namespaces/resource-control.rst @@ -1,17 +1,17 @@ -=========================== -Namespaces research control -=========================== +==================================== +User namespaces and resource control +==================================== -There are a lot of kinds of objects in the kernel that don't have -individual limits or that have limits that are ineffective when a set -of processes is allowed to switch user ids. With user namespaces -enabled in a kernel for people who don't trust their users or their -users programs to play nice this problems becomes more acute. +The kernel contains many kinds of objects that either don't have +individual limits or that have limits which are ineffective when +a set of processes is allowed to switch their UID. On a system +where the admins don't trust their users or their users' programs, +user namespaces expose the system to potential misuse of resources. -Therefore it is recommended that memory control groups be enabled in -kernels that enable user namespaces, and it is further recommended -that userspace configure memory control groups to limit how much -memory user's they don't trust to play nice can use. +In order to mitigate this, we recommend that admins enable memory +control groups on any system that enables user namespaces. +Furthermore, we recommend that admins configure the memory control +groups to limit the maximum memory usable by any untrusted user. Memory control groups can be configured by installing the libcgroup package present on most distros editing /etc/cgrules.conf, diff --git a/Documentation/admin-guide/pm/amd-pstate.rst b/Documentation/admin-guide/pm/amd-pstate.rst index 412423c54f25..e1771f2225d5 100644 --- a/Documentation/admin-guide/pm/amd-pstate.rst +++ b/Documentation/admin-guide/pm/amd-pstate.rst @@ -72,7 +72,7 @@ to manage each performance update behavior. :: Lowest non- | | | | linear perf ------>+-----------------------+ +-----------------------+ | | | | - | | Lowest perf ---->| | + | | Min perf ---->| | | | | | Lowest perf ------>+-----------------------+ +-----------------------+ | | | | diff --git a/Documentation/admin-guide/pm/cpufreq.rst b/Documentation/admin-guide/pm/cpufreq.rst index 3950583f2b15..cacb9f0307dd 100644 --- a/Documentation/admin-guide/pm/cpufreq.rst +++ b/Documentation/admin-guide/pm/cpufreq.rst @@ -231,7 +231,7 @@ are the following: present). The existence of the limit may be a result of some (often unintentional) - BIOS settings, restrictions coming from a service processor or another + BIOS settings, restrictions coming from a service processor or other BIOS/HW-based mechanisms. This does not cover ACPI thermal limitations which can be discovered @@ -258,8 +258,8 @@ are the following: extension on ARM). If one cannot be determined, this attribute should not be present. - Note, that failed attempt to retrieve current frequency for a given - CPU(s) will result in an appropriate error, i.e: EAGAIN for CPU that + Note that failed attempt to retrieve current frequency for a given + CPU(s) will result in an appropriate error, i.e.: EAGAIN for CPU that remains idle (raised on ARM). ``cpuinfo_max_freq`` @@ -398,7 +398,9 @@ policy limits change after that. This governor does not do anything by itself. Instead, it allows user space to set the CPU frequency for the policy it is attached to by writing to the -``scaling_setspeed`` attribute of that policy. +``scaling_setspeed`` attribute of that policy. Though the intention may be to +set an exact frequency for the policy, the actual frequency may vary depending +on hardware coordination, thermal and power limits, and other factors. ``schedutil`` ------------- @@ -499,7 +501,7 @@ This governor exposes the following tunables: represented by it to be 1.5 times as high as the transition latency (the default):: - # echo `$(($(cat cpuinfo_transition_latency) * 3 / 2)) > ondemand/sampling_rate + # echo `$(($(cat cpuinfo_transition_latency) * 3 / 2))` > ondemand/sampling_rate ``up_threshold`` If the estimated CPU load is above this value (in percent), the governor diff --git a/Documentation/admin-guide/pm/intel_idle.rst b/Documentation/admin-guide/pm/intel_idle.rst index 5940528146eb..ed6f055d4b14 100644 --- a/Documentation/admin-guide/pm/intel_idle.rst +++ b/Documentation/admin-guide/pm/intel_idle.rst @@ -38,6 +38,27 @@ instruction at all. only way to pass early-configuration-time parameters to it is via the kernel command line. +Sysfs Interface +=============== + +The ``intel_idle`` driver exposes the following ``sysfs`` attributes in +``/sys/devices/system/cpu/cpuidle/``: + +``intel_c1_demotion`` + Enable or disable C1 demotion for all CPUs in the system. This file is + only exposed on platforms that support the C1 demotion feature and where + it was tested. Value 0 means that C1 demotion is disabled, value 1 means + that it is enabled. Write 0 or 1 to disable or enable C1 demotion for + all CPUs. + + The C1 demotion feature involves the platform firmware demoting deep + C-state requests from the OS (e.g., C6 requests) to C1. The idea is that + firmware monitors CPU wake-up rate, and if it is higher than a + platform-specific threshold, the firmware demotes deep C-state requests + to C1. For example, Linux requests C6, but firmware noticed too many + wake-ups per second, and it keeps the CPU in C1. When the CPU stays in + C1 long enough, the platform promotes it back to C6. This may improve + some workloads' performance, but it may also increase power consumption. .. _intel-idle-enumeration-of-states: diff --git a/Documentation/admin-guide/pm/intel_pstate.rst b/Documentation/admin-guide/pm/intel_pstate.rst index 78fc83ed2a7e..26e702c7016e 100644 --- a/Documentation/admin-guide/pm/intel_pstate.rst +++ b/Documentation/admin-guide/pm/intel_pstate.rst @@ -329,6 +329,106 @@ information listed above is the same for all of the processors supporting the HWP feature, which is why ``intel_pstate`` works with all of them.] +Support for Hybrid Processors +============================= + +Some processors supported by ``intel_pstate`` contain two or more types of CPU +cores differing by the maximum turbo P-state, performance vs power characteristics, +cache sizes, and possibly other properties. They are commonly referred to as +hybrid processors. To support them, ``intel_pstate`` requires HWP to be enabled +and it assumes the HWP performance units to be the same for all CPUs in the +system, so a given HWP performance level always represents approximately the +same physical performance regardless of the core (CPU) type. + +Hybrid Processors with SMT +-------------------------- + +On systems where SMT (Simultaneous Multithreading), also referred to as +HyperThreading (HT) in the context of Intel processors, is enabled on at least +one core, ``intel_pstate`` assigns performance-based priorities to CPUs. Namely, +the priority of a given CPU reflects its highest HWP performance level which +causes the CPU scheduler to generally prefer more performant CPUs, so the less +performant CPUs are used when the other ones are fully loaded. However, SMT +siblings (that is, logical CPUs sharing one physical core) are treated in a +special way such that if one of them is in use, the effective priority of the +other ones is lowered below the priorities of the CPUs located in the other +physical cores. + +This approach maximizes performance in the majority of cases, but unfortunately +it also leads to excessive energy usage in some important scenarios, like video +playback, which is not generally desirable. While there is no other viable +choice with SMT enabled because the effective capacity and utilization of SMT +siblings are hard to determine, hybrid processors without SMT can be handled in +more energy-efficient ways. + +.. _CAS: + +Capacity-Aware Scheduling Support +--------------------------------- + +The capacity-aware scheduling (CAS) support in the CPU scheduler is enabled by +``intel_pstate`` by default on hybrid processors without SMT. CAS generally +causes the scheduler to put tasks on a CPU so long as there is a sufficient +amount of spare capacity on it, and if the utilization of a given task is too +high for it, the task will need to go somewhere else. + +Since CAS takes CPU capacities into account, it does not require CPU +prioritization and it allows tasks to be distributed more symmetrically among +the more performant and less performant CPUs. Once placed on a CPU with enough +capacity to accommodate it, a task may just continue to run there regardless of +whether or not the other CPUs are fully loaded, so on average CAS reduces the +utilization of the more performant CPUs which causes the energy usage to be more +balanced because the more performant CPUs are generally less energy-efficient +than the less performant ones. + +In order to use CAS, the scheduler needs to know the capacity of each CPU in +the system and it needs to be able to compute scale-invariant utilization of +CPUs, so ``intel_pstate`` provides it with the requisite information. + +First of all, the capacity of each CPU is represented by the ratio of its highest +HWP performance level, multiplied by 1024, to the highest HWP performance level +of the most performant CPU in the system, which works because the HWP performance +units are the same for all CPUs. Second, the frequency-invariance computations, +carried out by the scheduler to always express CPU utilization in the same units +regardless of the frequency it is currently running at, are adjusted to take the +CPU capacity into account. All of this happens when ``intel_pstate`` has +registered itself with the ``CPUFreq`` core and it has figured out that it is +running on a hybrid processor without SMT. + +Energy-Aware Scheduling Support +------------------------------- + +If ``CONFIG_ENERGY_MODEL`` has been set during kernel configuration and +``intel_pstate`` runs on a hybrid processor without SMT, in addition to enabling +`CAS <CAS_>`_ it registers an Energy Model for the processor. This allows the +Energy-Aware Scheduling (EAS) support to be enabled in the CPU scheduler if +``schedutil`` is used as the ``CPUFreq`` governor which requires ``intel_pstate`` +to operate in the `passive mode <Passive Mode_>`_. + +The Energy Model registered by ``intel_pstate`` is artificial (that is, it is +based on abstract cost values and it does not include any real power numbers) +and it is relatively simple to avoid unnecessary computations in the scheduler. +There is a performance domain in it for every CPU in the system and the cost +values for these performance domains have been chosen so that running a task on +a less performant (small) CPU appears to be always cheaper than running that +task on a more performant (big) CPU. However, for two CPUs of the same type, +the cost difference depends on their current utilization, and the CPU whose +current utilization is higher generally appears to be a more expensive +destination for a given task. This helps to balance the load among CPUs of the +same type. + +Since EAS works on top of CAS, high-utilization tasks are always migrated to +CPUs with enough capacity to accommodate them, but thanks to EAS, low-utilization +tasks tend to be placed on the CPUs that look less expensive to the scheduler. +Effectively, this causes the less performant and less loaded CPUs to be +preferred as long as they have enough spare capacity to run the given task +which generally leads to reduced energy usage. + +The Energy Model created by ``intel_pstate`` can be inspected by looking at +the ``energy_model`` directory in ``debugfs`` (typlically mounted on +``/sys/kernel/debug/``). + + User Space Interface in ``sysfs`` ================================= @@ -697,8 +797,8 @@ of them have to be prepended with the ``intel_pstate=`` prefix. Limits`_ for details). ``no_cas`` - Do not enable capacity-aware scheduling (CAS) which is enabled by - default on hybrid systems. + Do not enable `capacity-aware scheduling <CAS_>`_ which is enabled by + default on hybrid systems without SMT. Diagnostics and Tuning ====================== diff --git a/Documentation/admin-guide/pm/intel_uncore_frequency_scaling.rst b/Documentation/admin-guide/pm/intel_uncore_frequency_scaling.rst index 5151ec312dc0..d367ba4d744a 100644 --- a/Documentation/admin-guide/pm/intel_uncore_frequency_scaling.rst +++ b/Documentation/admin-guide/pm/intel_uncore_frequency_scaling.rst @@ -91,12 +91,22 @@ Attributes in each directory: ``domain_id`` This attribute is used to get the power domain id of this instance. +``die_id`` + This attribute is used to get the Linux die id of this instance. + This attribute is only present for domains with core agents and + when the CPUID leaf 0x1f presents die ID. + ``fabric_cluster_id`` This attribute is used to get the fabric cluster id of this instance. ``package_id`` This attribute is used to get the package id of this instance. +``agent_types`` + This attribute displays all the hardware agents present within the + domain. Each agent has the capability to control one or more hardware + subsystems, which include: core, cache, memory, and I/O. + The other attributes are same as presented at package_*_die_* level. In most of current use cases, the "max_freq_khz" and "min_freq_khz" diff --git a/Documentation/admin-guide/quickly-build-trimmed-linux.rst b/Documentation/admin-guide/quickly-build-trimmed-linux.rst index 07cfd8863b46..4a5ffb0996a3 100644 --- a/Documentation/admin-guide/quickly-build-trimmed-linux.rst +++ b/Documentation/admin-guide/quickly-build-trimmed-linux.rst @@ -347,7 +347,7 @@ again. [:ref:`details<uninstall>`] -.. _submit_improvements: +.. _submit_improvements_qbtl: Did you run into trouble following any of the above steps that is not cleared up by the reference section below? Or do you have ideas how to improve the text? @@ -1070,7 +1070,7 @@ complicated, and harder to follow. That being said: this of course is a balancing act. Hence, if you think an additional use-case is worth describing, suggest it to the maintainers of this -document, as :ref:`described above <submit_improvements>`. +document, as :ref:`described above <submit_improvements_qbtl>`. .. diff --git a/Documentation/admin-guide/reporting-issues.rst b/Documentation/admin-guide/reporting-issues.rst index 2fd5a030235a..9a847506f6ec 100644 --- a/Documentation/admin-guide/reporting-issues.rst +++ b/Documentation/admin-guide/reporting-issues.rst @@ -41,7 +41,7 @@ If you are facing multiple issues with the Linux kernel at once, report each separately. While writing your report, include all information relevant to the issue, like the kernel and the distro used. In case of a regression, CC the regressions mailing list (regressions@lists.linux.dev) to your report. Also try -to pin-point the culprit with a bisection; if you succeed, include its +to pinpoint the culprit with a bisection; if you succeed, include its commit-id and CC everyone in the sign-off-by chain. Once the report is out, answer any questions that come up and help where you @@ -206,7 +206,7 @@ Reporting issues only occurring in older kernel version lines This subsection is for you, if you tried the latest mainline kernel as outlined above, but failed to reproduce your issue there; at the same time you want to see the issue fixed in a still supported stable or longterm series or vendor -kernels regularly rebased on those. If that the case, follow these steps: +kernels regularly rebased on those. If that is the case, follow these steps: * Prepare yourself for the possibility that going through the next few steps might not get the issue solved in older releases: the fix might be too big @@ -312,7 +312,7 @@ small modifications to a kernel based on a recent Linux version; that for example often holds true for the mainline kernels shipped by Debian GNU/Linux Sid or Fedora Rawhide. Some developers will also accept reports about issues with kernels from distributions shipping the latest stable kernel, as long as -its only slightly modified; that for example is often the case for Arch Linux, +it's only slightly modified; that for example is often the case for Arch Linux, regular Fedora releases, and openSUSE Tumbleweed. But keep in mind, you better want to use a mainline Linux and avoid using a stable kernel for this process, as outlined in the section 'Install a fresh kernel for testing' in more diff --git a/Documentation/admin-guide/syscall-user-dispatch.rst b/Documentation/admin-guide/syscall-user-dispatch.rst index e3cfffef5a63..c1768d9e80fa 100644 --- a/Documentation/admin-guide/syscall-user-dispatch.rst +++ b/Documentation/admin-guide/syscall-user-dispatch.rst @@ -53,20 +53,25 @@ following prctl: prctl(PR_SET_SYSCALL_USER_DISPATCH, <op>, <offset>, <length>, [selector]) -<op> is either PR_SYS_DISPATCH_ON or PR_SYS_DISPATCH_OFF, to enable and -disable the mechanism globally for that thread. When -PR_SYS_DISPATCH_OFF is used, the other fields must be zero. - -[<offset>, <offset>+<length>) delimit a memory region interval -from which syscalls are always executed directly, regardless of the -userspace selector. This provides a fast path for the C library, which -includes the most common syscall dispatchers in the native code -applications, and also provides a way for the signal handler to return +<op> is either PR_SYS_DISPATCH_EXCLUSIVE_ON/PR_SYS_DISPATCH_INCLUSIVE_ON +or PR_SYS_DISPATCH_OFF, to enable and disable the mechanism globally for +that thread. When PR_SYS_DISPATCH_OFF is used, the other fields must be zero. + +For PR_SYS_DISPATCH_EXCLUSIVE_ON [<offset>, <offset>+<length>) delimit +a memory region interval from which syscalls are always executed directly, +regardless of the userspace selector. This provides a fast path for the +C library, which includes the most common syscall dispatchers in the native +code applications, and also provides a way for the signal handler to return without triggering a nested SIGSYS on (rt\_)sigreturn. Users of this interface should make sure that at least the signal trampoline code is included in this region. In addition, for syscalls that implement the trampoline code on the vDSO, that trampoline is never intercepted. +For PR_SYS_DISPATCH_INCLUSIVE_ON [<offset>, <offset>+<length>) delimit +a memory region interval from which syscalls are dispatched based on +the userspace selector. Syscalls from outside of the range are always +executed directly. + [selector] is a pointer to a char-sized region in the process memory region, that provides a quick way to enable disable syscall redirection thread-wide, without the need to invoke the kernel directly. selector diff --git a/Documentation/admin-guide/sysctl/kernel.rst b/Documentation/admin-guide/sysctl/kernel.rst index dd49a89a62d3..c04e6b8eb2b1 100644 --- a/Documentation/admin-guide/sysctl/kernel.rst +++ b/Documentation/admin-guide/sysctl/kernel.rst @@ -1014,30 +1014,26 @@ perf_user_access (arm64 and riscv only) Controls user space access for reading perf event counters. -arm64 -===== - -The default value is 0 (access disabled). +* for arm64 + The default value is 0 (access disabled). -When set to 1, user space can read performance monitor counter registers -directly. + When set to 1, user space can read performance monitor counter registers + directly. -See Documentation/arch/arm64/perf.rst for more information. - -riscv -===== + See Documentation/arch/arm64/perf.rst for more information. -When set to 0, user space access is disabled. +* for riscv + When set to 0, user space access is disabled. -The default value is 1, user space can read performance monitor counter -registers through perf, any direct access without perf intervention will trigger -an illegal instruction. + The default value is 1, user space can read performance monitor counter + registers through perf, any direct access without perf intervention will trigger + an illegal instruction. -When set to 2, which enables legacy mode (user space has direct access to cycle -and insret CSRs only). Note that this legacy value is deprecated and will be -removed once all user space applications are fixed. + When set to 2, which enables legacy mode (user space has direct access to cycle + and insret CSRs only). Note that this legacy value is deprecated and will be + removed once all user space applications are fixed. -Note that the time CSR is always directly accessible to all modes. + Note that the time CSR is always directly accessible to all modes. pid_max ======= @@ -1465,7 +1461,7 @@ stack_erasing ============= This parameter can be used to control kernel stack erasing at the end -of syscalls for kernels built with ``CONFIG_GCC_PLUGIN_STACKLEAK``. +of syscalls for kernels built with ``CONFIG_KSTACK_ERASE``. That erasing reduces the information which kernel stack leak bugs can reveal and blocks some uninitialized stack variable attacks. @@ -1473,7 +1469,7 @@ The tradeoff is the performance impact: on a single CPU system kernel compilation sees a 1% slowdown, other systems and workloads may vary. = ==================================================================== -0 Kernel stack erasing is disabled, STACKLEAK_METRICS are not updated. +0 Kernel stack erasing is disabled, KSTACK_ERASE_METRICS are not updated. 1 Kernel stack erasing is enabled (default), it is performed before returning to the userspace at the end of syscalls. = ==================================================================== diff --git a/Documentation/admin-guide/sysctl/vm.rst b/Documentation/admin-guide/sysctl/vm.rst index d385985b305f..4d71211fdad8 100644 --- a/Documentation/admin-guide/sysctl/vm.rst +++ b/Documentation/admin-guide/sysctl/vm.rst @@ -132,6 +132,12 @@ to latency spikes in unsuspecting applications. The kernel employs various heuristics to avoid wasting CPU cycles if it detects that proactive compaction is not being effective. +Setting the value above 80 will, in addition to lowering the acceptable level +of fragmentation, make the compaction code more sensitive to increases in +fragmentation, i.e. compaction will trigger more often, but reduce +fragmentation by a smaller amount. +This makes the fragmentation level more stable over time. + Be careful when setting it to extreme values like 100, as that may cause excessive background compaction activity. @@ -459,8 +465,8 @@ The minimum value is 1 (1/1 -> 100%). The value less than 1 completely disables protection of the pages. -max_map_count: -============== +max_map_count +============= This file contains the maximum number of memory map areas a process may have. Memory map areas are used as a side-effect of calling @@ -489,8 +495,8 @@ memory allocations. The default value depends on CONFIG_MEM_ALLOC_PROFILING_ENABLED_BY_DEFAULT. -memory_failure_early_kill: -========================== +memory_failure_early_kill +========================= Control how to kill processes when uncorrected memory error (typically a 2bit error in a memory module) is detected in the background by hardware diff --git a/Documentation/admin-guide/thunderbolt.rst b/Documentation/admin-guide/thunderbolt.rst index d0502691dfa1..102c693c8f81 100644 --- a/Documentation/admin-guide/thunderbolt.rst +++ b/Documentation/admin-guide/thunderbolt.rst @@ -296,6 +296,39 @@ information is missing. To recover from this mode, one needs to flash a valid NVM image to the host controller in the same way it is done in the previous chapter. +Tunneling events +---------------- +The driver sends ``KOBJ_CHANGE`` events to userspace when there is a +tunneling change in the ``thunderbolt_domain``. The notification carries +following environment variables:: + + TUNNEL_EVENT=<EVENT> + TUNNEL_DETAILS=0:12 <-> 1:20 (USB3) + +Possible values for ``<EVENT>`` are: + + activated + The tunnel was activated (created). + + changed + There is a change in this tunnel. For example bandwidth allocation was + changed. + + deactivated + The tunnel was torn down. + + low bandwidth + The tunnel is not getting optimal bandwidth. + + insufficient bandwidth + There is not enough bandwidth for the current tunnel requirements. + +The ``TUNNEL_DETAILS`` is only provided if the tunnel is known. For +example, in case of Firmware Connection Manager this is missing or does +not provide full tunnel information. In case of Software Connection Manager +this includes full tunnel details. The format currently matches what the +driver uses when logging. This may change over time. + Networking over Thunderbolt cable --------------------------------- Thunderbolt technology allows software communication between two hosts @@ -325,12 +358,7 @@ Forcing power Many OEMs include a method that can be used to force the power of a Thunderbolt controller to an "On" state even if nothing is connected. If supported by your machine this will be exposed by the WMI bus with -a sysfs attribute called "force_power". - -For example the intel-wmi-thunderbolt driver exposes this attribute in: - /sys/bus/wmi/devices/86CCFD48-205E-4A77-9C48-2021CBEDE341/force_power - - To force the power to on, write 1 to this attribute file. - To disable force power, write 0 to this attribute file. +a sysfs attribute called "force_power", see +Documentation/ABI/testing/sysfs-platform-intel-wmi-thunderbolt for details. Note: it's currently not possible to query the force power state of a platform. diff --git a/Documentation/admin-guide/verify-bugs-and-bisect-regressions.rst b/Documentation/admin-guide/verify-bugs-and-bisect-regressions.rst index 03c55151346c..d8946b084b1e 100644 --- a/Documentation/admin-guide/verify-bugs-and-bisect-regressions.rst +++ b/Documentation/admin-guide/verify-bugs-and-bisect-regressions.rst @@ -267,7 +267,7 @@ culprit might be known already. For further details on what actually qualifies as a regression check out Documentation/admin-guide/reporting-regressions.rst. If you run into any problems while following this guide or have ideas how to -improve it, :ref:`please let the kernel developers know <submit_improvements>`. +improve it, :ref:`please let the kernel developers know <submit_improvements_vbbr>`. .. _introprep_bissbs: @@ -1055,7 +1055,7 @@ follow these instructions. [:ref:`details <introoptional_bisref>`] -.. _submit_improvements: +.. _submit_improvements_vbbr: Conclusion ---------- |