Diffstat (limited to 'Documentation/admin-guide')
-rw-r--r-- Documentation/admin-guide/LSM/apparmor.rst 7
-rw-r--r-- Documentation/admin-guide/LSM/index.rst 1
-rw-r--r-- Documentation/admin-guide/LSM/ipe.rst 793
-rw-r--r-- Documentation/admin-guide/LSM/tomoyo.rst 35
-rw-r--r-- Documentation/admin-guide/RAS/address-translation.rst 24
-rw-r--r-- Documentation/admin-guide/RAS/error-decoding.rst 21
-rw-r--r-- Documentation/admin-guide/RAS/index.rst 7
-rw-r--r-- Documentation/admin-guide/RAS/main.rst (renamed from Documentation/admin-guide/ras.rst) 10
-rw-r--r-- Documentation/admin-guide/README.rst 75
-rw-r--r-- Documentation/admin-guide/blockdev/zram.rst 73
-rw-r--r-- Documentation/admin-guide/braille-console.rst 4
-rw-r--r-- Documentation/admin-guide/bug-bisect.rst 219
-rw-r--r-- Documentation/admin-guide/bug-hunting.rst 26
-rw-r--r-- Documentation/admin-guide/cgroup-v1/cgroups.rst 2
-rw-r--r-- Documentation/admin-guide/cgroup-v1/cpusets.rst 9
-rw-r--r-- Documentation/admin-guide/cgroup-v1/hugetlb.rst 20
-rw-r--r-- Documentation/admin-guide/cgroup-v1/memcg_test.rst 2
-rw-r--r-- Documentation/admin-guide/cgroup-v1/memory.rst 116
-rw-r--r-- Documentation/admin-guide/cgroup-v1/pids.rst 3
-rw-r--r-- Documentation/admin-guide/cgroup-v2.rst 245
-rw-r--r-- Documentation/admin-guide/cifs/introduction.rst 2
-rw-r--r-- Documentation/admin-guide/cifs/usage.rst 36
-rw-r--r-- Documentation/admin-guide/device-mapper/delay.rst 41
-rw-r--r-- Documentation/admin-guide/device-mapper/dm-crypt.rst 26
-rw-r--r-- Documentation/admin-guide/device-mapper/index.rst 2
-rw-r--r-- Documentation/admin-guide/device-mapper/vdo-design.rst 633
-rw-r--r-- Documentation/admin-guide/device-mapper/vdo.rst 412
-rw-r--r-- Documentation/admin-guide/dynamic-debug-howto.rst 5
-rw-r--r-- Documentation/admin-guide/edid.rst 35
-rw-r--r-- Documentation/admin-guide/ext4.rst 10
-rw-r--r-- Documentation/admin-guide/gpio/gpio-mockup.rst 8
-rw-r--r-- Documentation/admin-guide/gpio/gpio-virtuser.rst 177
-rw-r--r-- Documentation/admin-guide/gpio/index.rst 7
-rw-r--r-- Documentation/admin-guide/gpio/obsolete.rst 13
-rw-r--r-- Documentation/admin-guide/gpio/sysfs.rst 167
-rw-r--r-- Documentation/admin-guide/hw-vuln/core-scheduling.rst 4
-rw-r--r-- Documentation/admin-guide/hw-vuln/index.rst 1
-rw-r--r-- Documentation/admin-guide/hw-vuln/reg-file-data-sampling.rst 104
-rw-r--r-- Documentation/admin-guide/hw-vuln/spectre.rst 110
-rw-r--r-- Documentation/admin-guide/hw-vuln/srso.rst 71
-rw-r--r-- Documentation/admin-guide/index.rst 165
-rw-r--r-- Documentation/admin-guide/kdump/kdump.rst 15
-rw-r--r-- Documentation/admin-guide/kdump/vmcoreinfo.rst 8
-rw-r--r-- Documentation/admin-guide/kernel-parameters.rst 42
-rw-r--r-- Documentation/admin-guide/kernel-parameters.txt 1638
-rw-r--r-- Documentation/admin-guide/kernel-per-CPU-kthreads.rst 2
-rw-r--r-- Documentation/admin-guide/laptops/thinkpad-acpi.rst 11
-rw-r--r-- Documentation/admin-guide/media/building.rst 2
-rw-r--r-- Documentation/admin-guide/media/cec.rst 87
-rw-r--r-- Documentation/admin-guide/media/em28xx-cardlist.rst 8
-rw-r--r-- Documentation/admin-guide/media/index.rst 5
-rw-r--r-- Documentation/admin-guide/media/ipu3.rst 6
-rw-r--r-- Documentation/admin-guide/media/ipu6-isys.rst 161
-rw-r--r-- Documentation/admin-guide/media/ipu6_isys_graph.svg 548
-rw-r--r-- Documentation/admin-guide/media/mgb4.rst 58
-rw-r--r-- Documentation/admin-guide/media/omap4_camera.rst 62
-rw-r--r-- Documentation/admin-guide/media/raspberrypi-pisp-be.dot 20
-rw-r--r-- Documentation/admin-guide/media/raspberrypi-pisp-be.rst 109
-rw-r--r-- Documentation/admin-guide/media/raspberrypi-rp1-cfe.dot 27
-rw-r--r-- Documentation/admin-guide/media/raspberrypi-rp1-cfe.rst 78
-rw-r--r-- Documentation/admin-guide/media/rkisp1.rst 11
-rw-r--r-- Documentation/admin-guide/media/saa7134.rst 2
-rw-r--r-- Documentation/admin-guide/media/tuner-cardlist.rst 2
-rw-r--r-- Documentation/admin-guide/media/v4l-drivers.rst 4
-rw-r--r-- Documentation/admin-guide/media/visl.rst 12
-rw-r--r-- Documentation/admin-guide/media/vivid.rst 193
-rw-r--r-- Documentation/admin-guide/mm/damon/reclaim.rst 27
-rw-r--r-- Documentation/admin-guide/mm/damon/start.rst 71
-rw-r--r-- Documentation/admin-guide/mm/damon/usage.rst 502
-rw-r--r-- Documentation/admin-guide/mm/hugetlbpage.rst 7
-rw-r--r-- Documentation/admin-guide/mm/index.rst 2
-rw-r--r-- Documentation/admin-guide/mm/ksm.rst 2
-rw-r--r-- Documentation/admin-guide/mm/memory-hotplug.rst 9
-rw-r--r-- Documentation/admin-guide/mm/numa_memory_policy.rst 9
-rw-r--r-- Documentation/admin-guide/mm/pagemap.rst 25
-rw-r--r-- Documentation/admin-guide/mm/transhuge.rst 283
-rw-r--r-- Documentation/admin-guide/mm/zswap.rst 33
-rw-r--r-- Documentation/admin-guide/nvme-multipath.rst 72
-rw-r--r-- Documentation/admin-guide/perf/arm-ni.rst 17
-rw-r--r-- Documentation/admin-guide/perf/dwc_pcie_pmu.rst 22
-rw-r--r-- Documentation/admin-guide/perf/hisi-pcie-pmu.rst 36
-rw-r--r-- Documentation/admin-guide/perf/hisi-pmu.rst 6
-rw-r--r-- Documentation/admin-guide/perf/hns3-pmu.rst 8
-rw-r--r-- Documentation/admin-guide/perf/index.rst 5
-rw-r--r-- Documentation/admin-guide/perf/mrvl-odyssey-ddr-pmu.rst 80
-rw-r--r-- Documentation/admin-guide/perf/mrvl-odyssey-tad-pmu.rst 37
-rw-r--r-- Documentation/admin-guide/perf/mrvl-pem-pmu.rst 56
-rw-r--r-- Documentation/admin-guide/perf/nvidia-pmu.rst 52
-rw-r--r-- Documentation/admin-guide/perf/qcom_l2_pmu.rst 2
-rw-r--r-- Documentation/admin-guide/perf/qcom_l3_pmu.rst 2
-rw-r--r-- Documentation/admin-guide/perf/starfive_starlink_pmu.rst 46
-rw-r--r-- Documentation/admin-guide/perf/thunderx2-pmu.rst 2
-rw-r--r-- Documentation/admin-guide/perf/xgene-pmu.rst 2
-rw-r--r-- Documentation/admin-guide/pm/amd-pstate.rst 88
-rw-r--r-- Documentation/admin-guide/pm/cpufreq.rst 24
-rw-r--r-- Documentation/admin-guide/pm/cpuidle.rst 72
-rw-r--r-- Documentation/admin-guide/pm/intel_uncore_frequency_scaling.rst 59
-rw-r--r-- Documentation/admin-guide/pmf.rst 24
-rw-r--r-- Documentation/admin-guide/quickly-build-trimmed-linux.rst 2
-rw-r--r-- Documentation/admin-guide/ramoops.rst 13
-rw-r--r-- Documentation/admin-guide/reporting-regressions.rst 12
-rw-r--r-- Documentation/admin-guide/sysctl/fs.rst 15
-rw-r--r-- Documentation/admin-guide/sysctl/kernel.rst 72
-rw-r--r-- Documentation/admin-guide/sysctl/net.rst 6
-rw-r--r-- Documentation/admin-guide/sysctl/vm.rst 54
-rw-r--r-- Documentation/admin-guide/sysrq.rst 29
-rw-r--r-- Documentation/admin-guide/tainted-kernels.rst 6
-rw-r--r-- Documentation/admin-guide/verify-bugs-and-bisect-regressions.rst 2222
-rw-r--r-- Documentation/admin-guide/workload-tracing.rst 2
109 files changed, 8860 insertions, 2087 deletions
diff --git a/Documentation/admin-guide/LSM/apparmor.rst b/Documentation/admin-guide/LSM/apparmor.rst
index 6cf81bbd7ce8..47939ee89d74 100644
--- a/Documentation/admin-guide/LSM/apparmor.rst
+++ b/Documentation/admin-guide/LSM/apparmor.rst
@@ -18,8 +18,11 @@ set ``CONFIG_SECURITY_APPARMOR=y``
If AppArmor should be selected as the default security module then set::
- CONFIG_DEFAULT_SECURITY="apparmor"
- CONFIG_SECURITY_APPARMOR_BOOTPARAM_VALUE=1
+ CONFIG_DEFAULT_SECURITY_APPARMOR=y
+
+The CONFIG_LSM parameter manages the order and selection of LSMs.
+Specify apparmor as the first "major" module (e.g. AppArmor, SELinux, Smack)
+in the list.
Build the kernel
diff --git a/Documentation/admin-guide/LSM/index.rst b/Documentation/admin-guide/LSM/index.rst
index a6ba95fbaa9f..ce63be6d64ad 100644
--- a/Documentation/admin-guide/LSM/index.rst
+++ b/Documentation/admin-guide/LSM/index.rst
@@ -47,3 +47,4 @@ subdirectories.
tomoyo
Yama
SafeSetID
+ ipe
diff --git a/Documentation/admin-guide/LSM/ipe.rst b/Documentation/admin-guide/LSM/ipe.rst
new file mode 100644
index 000000000000..f93a467db628
--- /dev/null
+++ b/Documentation/admin-guide/LSM/ipe.rst
@@ -0,0 +1,793 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+Integrity Policy Enforcement (IPE)
+==================================
+
+.. NOTE::
+
+ This is the documentation for admins, system builders, or individuals
+ attempting to use IPE. If you're looking for more developer-focused
+ documentation about IPE please see :doc:`the design docs </security/ipe>`.
+
+Overview
+--------
+
+Integrity Policy Enforcement (IPE) is a Linux Security Module that takes a
+complementary approach to access control. Unlike traditional access control
+mechanisms that rely on labels and paths for decision-making, IPE focuses
+on the immutable security properties inherent to system components. These
+properties are fundamental attributes or features of a system component
+that cannot be altered, ensuring a consistent and reliable basis for
+security decisions.
+
+To elaborate, in the context of IPE, system components primarily refer to
+files or the devices these files reside on. However, this is just a
+starting point. The concept of system components is flexible and can be
+extended to include new elements as the system evolves. The immutable
+properties include the origin of a file, which remains constant and
+unchangeable over time. For example, IPE policies can be crafted to trust
+files originating from the initramfs. Since initramfs is typically verified
+by the bootloader, its files are deemed trustworthy; "file is from
+initramfs" becomes an immutable property under IPE's consideration.
+
+The immutable property concept extends to the security features enabled on
+a file's origin, such as dm-verity or fs-verity, which provide a layer of
+integrity and trust. For example, IPE allows the definition of policies
+that trust files from a dm-verity protected device. dm-verity ensures the
+integrity of an entire device by providing a verifiable and immutable state
+of its contents. Similarly, fs-verity offers filesystem-level integrity
+checks, allowing IPE to enforce policies that trust files protected by
+fs-verity. These two features cannot be turned off once established, so
+they are considered immutable properties. These examples demonstrate how
+IPE leverages immutable properties, such as a file's origin and its
+integrity protection mechanisms, to make access control decisions.
+
+IPE policy, specifically, grants the ability to enforce
+stringent access controls by assessing security properties against
+reference values defined within the policy. This assessment can be based on
+the existence of a security property (e.g., verifying if a file originates
+from initramfs) or evaluating the internal state of an immutable security
+property. The latter includes checking the roothash of a dm-verity
+protected device, determining whether dm-verity possesses a valid
+signature, assessing the digest of a fs-verity protected file, or
+determining whether fs-verity possesses a valid built-in signature. This
+nuanced approach to policy enforcement enables a highly secure and
+customizable system defense mechanism, tailored to specific security
+requirements and trust models.
+
+To enable IPE, ensure that the ``CONFIG_SECURITY_IPE`` config option (under
+:menuselection:`Security -> Integrity Policy Enforcement (IPE)`) is
+enabled.
+
+Use Cases
+---------
+
+IPE works best in fixed-function devices: devices whose purpose
+is clearly defined and not supposed to be changed (e.g. network firewall
+device in a data center, an IoT device, etcetera), where all software and
+configuration is built and provisioned by the system owner.
+
+IPE is a long way off for use in general-purpose computing: the Linux
+community as a whole tends to follow a decentralized trust model (known as
+the web of trust), which IPE does not yet support. Instead, IPE
+supports PKI (public key infrastructure), which generally designates a
+set of trusted entities that provide a measure of absolute trust.
+
+Additionally, while most packages are signed today, the files inside
+the packages (for instance, the executables), tend to be unsigned. This
+makes it difficult to utilize IPE in systems where a package manager is
+expected to be functional, without major changes to the package manager
+and ecosystem behind it.
+
+The digest_cache LSM [#digest_cache_lsm]_ is a system that, when combined with
+IPE, could be used to enable and support general-purpose computing use cases.
+
+Known Limitations
+-----------------
+
+IPE cannot verify the integrity of anonymous executable memory, such as
+the trampolines created by gcc closures and libffi (<3.4.2), or JIT'd code.
+Unfortunately, as this is dynamically generated code, there is no way
+for IPE to ensure the integrity of this code to form a trust basis.
+
+IPE cannot verify the integrity of programs written in interpreted
+languages when these scripts are invoked by passing these program files
+to the interpreter. This is because of the way interpreters execute these
+files; the scripts themselves are not evaluated as executable code
+through one of IPE's hooks, but they are merely text files that are read
+(as opposed to compiled executables) [#interpreters]_.
+
+Threat Model
+------------
+
+IPE specifically targets the risk of tampering with user-space executable
+code after the kernel has initially booted, including the kernel modules
+loaded from userspace via ``modprobe`` or ``insmod``.
+
+To illustrate, consider a scenario where an untrusted binary, possibly
+malicious, is downloaded along with all necessary dependencies, including a
+loader and libc. The primary function of IPE in this context is to prevent
+the execution of such binaries and their dependencies.
+
+IPE achieves this by verifying the integrity and authenticity of all
+executable code before allowing it to run. It conducts a thorough
+check to ensure that the code's integrity is intact and that it matches an
+authorized reference value (digest, signature, etc.) as per the defined
+policy. If a binary does not pass this verification process, either
+because its integrity has been compromised or it does not meet the
+authorization criteria, IPE will deny its execution. Additionally, IPE
+generates audit logs which may be utilized to detect and analyze failures
+resulting from policy violation.
+
+Tampering threat scenarios include modification or replacement of
+executable code by a range of actors including:
+
+- Actors with physical access to the hardware
+- Actors with local network access to the system
+- Actors with access to the deployment system
+- Compromised internal systems under external control
+- Malicious end users of the system
+- Compromised end users of the system
+- Remote (external) compromise of the system
+
+IPE does not mitigate threats arising from malicious but authorized
+developers (with access to a signing certificate), or compromised
+developer tools used by them (i.e. return-oriented programming attacks).
+Additionally, IPE draws a hard security boundary between userspace and
+kernelspace. As a result, kernel-level exploits are considered outside
+the scope of IPE and mitigation is left to other mechanisms.
+
+Policy
+------
+
+IPE policy is a plain-text [#devdoc]_ policy composed of multiple statements
+over several lines. There is one required line, at the top of the
+policy, indicating the policy name, and the policy version, for
+instance::
+
+ policy_name=Ex_Policy policy_version=0.0.0
+
+The policy name is a unique key identifying this policy in a human
+readable name. This is used to create nodes under securityfs as well as
+uniquely identify policies to deploy new policies vs update existing
+policies.
+
+The policy version indicates the current version of the policy (NOT the
+policy syntax version). This is used to prevent rollback of policy to
+potentially insecure previous versions of the policy.
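The header requirement above can be checked before a policy is signed. The following is a minimal sketch, not part of IPE: ``check_policy_header`` is a hypothetical helper, and it assumes that the policy name contains no whitespace and that the version is three dot-separated integers, as in the ``0.0.0`` example. The kernel parser may accept or reject more than this pattern does.

```shell
# Hypothetical lint helper, not part of IPE or the kernel tree.
# Succeeds only if the first line of a plain-text policy declares both
# policy_name and policy_version.
check_policy_header() {
    policy_file=$1
    header=
    # Only the first line matters; tolerate a missing trailing newline.
    IFS= read -r header < "$policy_file" || [ -n "$header" ] || return 1
    # Assumption: the name has no whitespace and the version is three
    # dot-separated integers, as in the 0.0.0 example.
    printf '%s\n' "$header" | grep -Eq \
        '^policy_name=[^[:space:]]+[[:space:]]+policy_version=[0-9]+\.[0-9]+\.[0-9]+[[:space:]]*$'
}
```

A caller might run ``check_policy_header "$MY_POLICY" || echo "bad header"`` before invoking the signing step.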
+
+The next portion of IPE policy is its rules. Rules are formed by key=value
+pairs, known as properties. IPE rules require two properties: ``action``,
+which determines what IPE does when it encounters a match against the
+rule, and ``op``, which determines when the rule should be evaluated.
+The ordering is significant: a rule must start with ``op``, and end with
+``action``. Thus, a minimal rule is::
+
+ op=EXECUTE action=ALLOW
+
+This example will allow any execution. Additional properties are used to
+assess immutable security properties about the files being evaluated.
+These properties are intended to be descriptions of systems within the
+kernel that can provide a measure of integrity verification, such that IPE
+can determine the trust of the resource based on the value of the property.
+
+Rules are evaluated top-to-bottom. As a result, any revocation rules
+or denies should be placed early in the file to ensure that these rules
+are evaluated before a rule with ``action=ALLOW``.
+
+IPE policy supports comments. The character '#' will function as a
+comment, ignoring all characters to the right of '#' until the newline.
+
+The default behavior of IPE evaluations can also be expressed in policy,
+through the ``DEFAULT`` statement. This can be done at a global level,
+or a per-operation level::
+
+ # Global
+ DEFAULT action=ALLOW
+
+ # Operation Specific
+ DEFAULT op=EXECUTE action=ALLOW
+
+A default must be set for all known operations in IPE. To keep older
+policies compatible with newer kernels, which may introduce new
+operations, set a global default of ``ALLOW``, then override the
+defaults on a per-operation basis (as above).
+
+With configurable policy-based LSMs, there are several issues with
+enforcing the configurable policies at startup, around reading and
+parsing the policy:
+
+1. The kernel *should* not read files from userspace, so directly reading
+ the policy file is prohibited.
+2. The kernel command line has a character limit, and one kernel module
+ should not reserve the entire character limit for its own
+ configuration.
+3. There are various boot loaders in the kernel ecosystem, so handing
+ off a memory block would be costly to maintain.
+
+As a result, IPE has addressed this problem through a concept of a "boot
+policy". A boot policy is a minimal policy which is compiled into the
+kernel. This policy is intended to get the system to a state where
+userspace is set up and ready to receive commands, at which point a more
+complex policy can be deployed via securityfs. The boot policy can be
+specified via the ``SECURITY_IPE_BOOT_POLICY`` config option, which accepts
+a path to a plain-text version of the IPE policy to apply. This policy
+will be compiled into the kernel. If not specified, IPE will be disabled
+until a policy is deployed and activated through securityfs.
+
+Deploying Policies
+~~~~~~~~~~~~~~~~~~
+
+Policies can be deployed from userspace through securityfs. These policies
+are signed through the PKCS#7 message format to enforce some level of
+authorization of the policies (prohibiting an attacker from gaining
+unconstrained root, and deploying an "allow all" policy). These
+policies must be signed by a certificate that chains to the
+``SYSTEM_TRUSTED_KEYRING``, or to the secondary and/or platform keyrings if
+``CONFIG_IPE_POLICY_SIG_SECONDARY_KEYRING`` and/or
+``CONFIG_IPE_POLICY_SIG_PLATFORM_KEYRING`` are enabled, respectively.
+With openssl, the policy can be signed by::
+
+ openssl smime -sign \
+ -in "$MY_POLICY" \
+ -signer "$MY_CERTIFICATE" \
+ -inkey "$MY_PRIVATE_KEY" \
+ -noattr \
+ -nodetach \
+ -nosmimecap \
+ -outform der \
+ -out "$MY_POLICY.p7b"
+
+Deploying the policies is done through securityfs, through the
+``new_policy`` node. To deploy a policy, simply cat the file into the
+securityfs node::
+
+ cat "$MY_POLICY.p7b" > /sys/kernel/security/ipe/new_policy
+
+Upon success, this will create one subdirectory under
+``/sys/kernel/security/ipe/policies/``. The subdirectory will be the
+``policy_name`` field of the policy deployed, so for the example above,
+the directory will be ``/sys/kernel/security/ipe/policies/Ex_Policy``.
+Within this directory, there will be seven files: ``pkcs7``, ``policy``,
+``name``, ``version``, ``active``, ``update``, and ``delete``.
+
+The ``pkcs7`` file is read-only. Reading it returns the raw PKCS#7 data
+that was provided to the kernel, representing the policy. If the policy being
+read is the boot policy, this will return ``ENOENT``, as it is not signed.
+
+The ``policy`` file is read-only. Reading it returns the PKCS#7 inner
+content of the policy, which will be the plain text policy.
+
+The ``active`` file is used to set a policy as the currently active policy.
+This file is rw, and accepts a value of ``"1"`` to set the policy as active.
+Since only a single policy can be active at one time, all other policies
+will be marked inactive. The policy being marked active must have a policy
+version greater than or equal to the currently-running version.
+
+The ``update`` file is used to update a policy that is already present
+in the kernel. This file is write-only and accepts a PKCS#7 signed
+policy. Two checks will always be performed on this policy: First, the
+``policy_name`` of the updated version must match that of the existing
+version. Second, the updated policy must have a policy version greater than
+the currently-running version. This is to prevent rollback attacks.
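The two version rules above (greater than or equal to for activation, strictly greater for update) can be illustrated with a small comparison sketch. The helper names are hypothetical. It assumes versions are three dot-separated non-negative integers compared numerically from left to right, which matches the ``0.0.0`` example but is an assumption about how the kernel orders versions, not a statement of its implementation.

```shell
# Hypothetical helpers, not part of IPE.
# Print -1, 0 or 1 depending on how version $1 compares to version $2.
policy_version_cmp() {
    # Reject anything that is not digits and dots.
    case "$1$2" in *[!0-9.]*) return 2 ;; esac
    # Split both "major.minor.rev" strings into six positional parameters.
    old_ifs=$IFS
    IFS=.
    set -- $1 $2
    IFS=$old_ifs
    [ "$#" -eq 6 ] || return 2
    for pair in "$1:$4" "$2:$5" "$3:$6"; do
        a=${pair%%:*}
        b=${pair##*:}
        { [ -n "$a" ] && [ -n "$b" ]; } || return 2
        if [ "$a" -gt "$b" ]; then printf '%s\n' 1; return 0; fi
        if [ "$a" -lt "$b" ]; then printf '%s\n' -1; return 0; fi
    done
    printf '%s\n' 0
}

# Activation: candidate version must be >= the running version.
can_activate() {
    result=$(policy_version_cmp "$1" "$2") || return 2
    [ "$result" -ge 0 ]
}

# Update: candidate version must be strictly > the running version.
can_update() {
    result=$(policy_version_cmp "$1" "$2") || return 2
    [ "$result" -gt 0 ]
}
```

Both predicates take the candidate version first and the currently-running version second, for example ``can_update 1.0.1 1.0.0``.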
+
+The ``delete`` file is used to remove a policy that is no longer needed.
+This file is write-only and accepts a value of ``1`` to delete the policy.
+On deletion, the securityfs node representing the policy will be removed.
+However, deleting the currently active policy is not allowed and will return
+an operation not permitted error.
+
+Similarly, writing to either ``update`` or ``new_policy`` could result in a
+bad message (policy syntax error) or file exists error. The latter error happens
+when trying to deploy a policy with a ``policy_name`` while the kernel already
+has a deployed policy with the same ``policy_name``.
+
+Deploying a policy will *not* cause IPE to start enforcing the policy. IPE will
+only enforce the policy marked active. Note that only one policy can be active
+at a time.
+
+Once deployment is successful, the policy can be activated by writing to the file
+``/sys/kernel/security/ipe/policies/$policy_name/active``.
+For example, the ``Ex_Policy`` can be activated by::
+
+ echo 1 > "/sys/kernel/security/ipe/policies/Ex_Policy/active"
+
+From this point on, ``Ex_Policy`` is now the enforced policy on the
+system.
+
+IPE also provides a way to delete policies. This can be done via the
+``delete`` securityfs node,
+``/sys/kernel/security/ipe/policies/$policy_name/delete``.
+Writing ``1`` to that file deletes the policy::
+
+ echo 1 > "/sys/kernel/security/ipe/policies/$policy_name/delete"
+
+There is only one requirement to delete a policy: the policy being deleted
+must be inactive.
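The deploy, activate and delete steps above can be strung together as follows. This is a sketch only: the function names and the ``IPE_ROOT`` variable are hypothetical conveniences, and the paths are the securityfs nodes named in the text. ``IPE_ROOT`` exists so the flow can be exercised against a scratch directory instead of a live system.

```shell
# Hypothetical wrappers around the securityfs nodes described above.
# IPE_ROOT is an assumption of this sketch; override it for dry runs.
IPE_ROOT=${IPE_ROOT:-/sys/kernel/security/ipe}

# Load a signed (PKCS#7, DER) policy into the kernel.
ipe_deploy() {
    cat "$1" > "$IPE_ROOT/new_policy"
}

# Mark an already-deployed policy as the active one.
ipe_activate() {
    echo 1 > "$IPE_ROOT/policies/$1/active"
}

# Remove a deployed policy; the text states it must be inactive.
ipe_delete() {
    echo 1 > "$IPE_ROOT/policies/$1/delete"
}

# Deploy a new policy, activate it, then optionally delete the old one.
# Usage: ipe_rollout SIGNED_FILE NEW_POLICY_NAME [OLD_POLICY_NAME]
ipe_rollout() {
    ipe_deploy "$1" || return 1
    ipe_activate "$2" || return 1
    if [ -n "${3:-}" ]; then
        ipe_delete "$3" || return 1
    fi
}
```

Deleting the old policy only after the new one is active respects the rule that the policy being deleted must be inactive.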
+
+.. NOTE::
+
+ If a traditional MAC system is enabled (SELinux, apparmor, smack), all
+ writes to ipe's securityfs nodes require ``CAP_MAC_ADMIN``.
+
+Modes
+~~~~~
+
+IPE supports two modes of operation: permissive (similar to SELinux's
+permissive mode) and enforced. In permissive mode, all events are
+checked and policy violations are logged, but the policy is not really
+enforced. This allows users to test policies before enforcing them.
+
+The default mode is enforced, and can be changed via the kernel command
+line parameter ``ipe.enforce=(0|1)``, or the securityfs node
+``/sys/kernel/security/ipe/enforce``.
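A script can report the current mode by reading the ``enforce`` node. The sketch below uses a hypothetical helper, ``ipe_mode``, and assumes that reading the node yields ``1`` for enforced and ``0`` for permissive, mirroring the values accepted by ``ipe.enforce``. That read format is an assumption and is not stated in the text above.

```shell
# Hypothetical helper, not part of IPE.
# Print "enforcing" or "permissive" based on the enforce node's content.
ipe_mode() {
    enforce_node=${1:-/sys/kernel/security/ipe/enforce}
    value=
    # Tolerate a node that does not end its content with a newline.
    IFS= read -r value < "$enforce_node" || [ -n "$value" ] || return 1
    case $value in
        1) echo enforcing ;;
        0) echo permissive ;;
        *) return 1 ;;   # unexpected content
    esac
}
```

The optional argument lets the helper be pointed at a different file for testing.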
+
+.. NOTE::
+
+ If a traditional MAC system is enabled (SELinux, apparmor, smack, etcetera),
+ all writes to ipe's securityfs nodes require ``CAP_MAC_ADMIN``.
+
+Audit Events
+~~~~~~~~~~~~
+
+1420 AUDIT_IPE_ACCESS
+^^^^^^^^^^^^^^^^^^^^^
+Event Examples::
+
+ type=1420 audit(1653364370.067:61): ipe_op=EXECUTE ipe_hook=MMAP enforcing=1 pid=2241 comm="ld-linux.so" path="/deny/lib/libc.so.6" dev="sda2" ino=14549020 rule="DEFAULT action=DENY"
+ type=1300 audit(1653364370.067:61): SYSCALL arch=c000003e syscall=9 success=no exit=-13 a0=7f1105a28000 a1=195000 a2=5 a3=812 items=0 ppid=2219 pid=2241 auid=0 uid=0 gid=0 euid=0 suid=0 fsuid=0 egid=0 sgid=0 fsgid=0 tty=pts0 ses=2 comm="ld-linux.so" exe="/tmp/ipe-test/lib/ld-linux.so" subj=unconfined key=(null)
+ type=1327 audit(1653364370.067:61): 707974686F6E3300746573742F6D61696E2E7079002D6E00
+
+ type=1420 audit(1653364735.161:64): ipe_op=EXECUTE ipe_hook=MMAP enforcing=1 pid=2472 comm="mmap_test" path=? dev=? ino=? rule="DEFAULT action=DENY"
+ type=1300 audit(1653364735.161:64): SYSCALL arch=c000003e syscall=9 success=no exit=-13 a0=0 a1=1000 a2=4 a3=21 items=0 ppid=2219 pid=2472 auid=0 uid=0 gid=0 euid=0 suid=0 fsuid=0 egid=0 sgid=0 fsgid=0 tty=pts0 ses=2 comm="mmap_test" exe="/root/overlake_test/upstream_test/vol_fsverity/bin/mmap_test" subj=unconfined key=(null)
+ type=1327 audit(1653364735.161:64): 707974686F6E3300746573742F6D61696E2E7079002D6E00
+
+This event indicates that IPE made an access control decision; the IPE
+specific record (1420) is always emitted in conjunction with an
+``AUDITSYSCALL`` record.
+
+Whether IPE is in permissive or enforced mode can be derived from the
+``success`` property and exit code of the ``AUDITSYSCALL`` record.
+
+
+Field descriptions:
+
++-----------+------------+-----------+---------------------------------------------------------------------------------+
+| Field     | Value Type | Optional? | Description of Value                                                            |
++===========+============+===========+=================================================================================+
+| ipe_op    | string     | No        | The IPE operation name associated with the log                                  |
++-----------+------------+-----------+---------------------------------------------------------------------------------+
+| ipe_hook  | string     | No        | The name of the LSM hook that triggered the IPE event                           |
++-----------+------------+-----------+---------------------------------------------------------------------------------+
+| enforcing | integer    | No        | The current IPE enforcing state 1 is in enforcing mode, 0 is in permissive mode |
++-----------+------------+-----------+---------------------------------------------------------------------------------+
+| pid       | integer    | No        | The pid of the process that triggered the IPE event.                            |
++-----------+------------+-----------+---------------------------------------------------------------------------------+
+| comm      | string     | No        | The command line program name of the process that triggered the IPE event       |
++-----------+------------+-----------+---------------------------------------------------------------------------------+
+| path      | string     | Yes       | The absolute path to the evaluated file                                         |
++-----------+------------+-----------+---------------------------------------------------------------------------------+
+| ino       | integer    | Yes       | The inode number of the evaluated file                                          |
++-----------+------------+-----------+---------------------------------------------------------------------------------+
+| dev       | string     | Yes       | The device name of the evaluated file, e.g. vda                                 |
++-----------+------------+-----------+---------------------------------------------------------------------------------+
+| rule      | string     | No        | The matched policy rule                                                         |
++-----------+------------+-----------+---------------------------------------------------------------------------------+
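The fields of a 1420 record can be pulled out with standard text tools. The sketch below is a hypothetical helper, not an audit-subsystem utility: ``audit_field`` reads records on standard input and prints the value of one ``key=value`` field, stripping surrounding double quotes. It assumes each record is on a single line in the format shown in the examples above.

```shell
# Hypothetical helper, not part of the audit tooling.
# Usage: audit_field KEY < records
# Prints the value of KEY for every input line that contains it.
audit_field() {
    key=$1
    # Match " key=" followed by either a quoted string or a bare token,
    # then drop the surrounding quotes if present.
    sed -nE "s/.*[[:space:]]${key}=(\"[^\"]*\"|[^[:space:]]*).*/\1/p" |
        sed 's/^"\(.*\)"$/\1/'
}
```

For instance, piping the first example record into ``audit_field path`` would print ``/deny/lib/libc.so.6``.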
+
+1421 AUDIT_IPE_CONFIG_CHANGE
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Event Example::
+
+ type=1421 audit(1653425583.136:54): old_active_pol_name="Allow_All" old_active_pol_version=0.0.0 old_policy_digest=sha256:E3B0C44298FC1C149AFBF4C8996FB92427AE41E4649B934CA495991B7852B855 new_active_pol_name="boot_verified" new_active_pol_version=0.0.0 new_policy_digest=sha256:820EEA5B40CA42B51F68962354BA083122A20BB846F26765076DD8EED7B8F4DB auid=4294967295 ses=4294967295 lsm=ipe res=1
+ type=1300 audit(1653425583.136:54): SYSCALL arch=c000003e syscall=1 success=yes exit=2 a0=3 a1=5596fcae1fb0 a2=2 a3=2 items=0 ppid=184 pid=229 auid=4294967295 uid=0 gid=0 euid=0 suid=0 fsuid=0 egid=0 sgid=0 fsgid=0 tty=pts0 ses=4294967295 comm="python3" exe="/usr/bin/python3.10" key=(null)
+ type=1327 audit(1653425583.136:54): PROCTITLE proctitle=707974686F6E3300746573742F6D61696E2E7079002D66002E2
+
+This event indicates that IPE switched the active policy from one to another,
+and records the version and the hash digest of both policies.
+Note that IPE can only have one policy active at a time; all access decision
+evaluation is based on the currently active policy.
+The normal procedure to deploy a new policy is to first load the policy
+into the kernel, then switch the active policy to it.
+
+This record will always be emitted in conjunction with an ``AUDITSYSCALL`` record for the ``write`` syscall.
+
+Field descriptions:
+
++------------------------+------------+-----------+---------------------------------------------------+
+| Field                  | Value Type | Optional? | Description of Value                              |
++========================+============+===========+===================================================+
+| old_active_pol_name    | string     | Yes       | The name of previous active policy                |
++------------------------+------------+-----------+---------------------------------------------------+
+| old_active_pol_version | string     | Yes       | The version of previous active policy             |
++------------------------+------------+-----------+---------------------------------------------------+
+| old_policy_digest      | string     | Yes       | The hash of previous active policy                |
++------------------------+------------+-----------+---------------------------------------------------+
+| new_active_pol_name    | string     | No        | The name of current active policy                 |
++------------------------+------------+-----------+---------------------------------------------------+
+| new_active_pol_version | string     | No        | The version of current active policy              |
++------------------------+------------+-----------+---------------------------------------------------+
+| new_policy_digest      | string     | No        | The hash of current active policy                 |
++------------------------+------------+-----------+---------------------------------------------------+
+| auid                   | integer    | No        | The login user ID                                 |
++------------------------+------------+-----------+---------------------------------------------------+
+| ses                    | integer    | No        | The login session ID                              |
++------------------------+------------+-----------+---------------------------------------------------+
+| lsm                    | string     | No        | The lsm name associated with the event            |
++------------------------+------------+-----------+---------------------------------------------------+
+| res                    | integer    | No        | The result of the audited operation(success/fail) |
++------------------------+------------+-----------+---------------------------------------------------+
+
+1422 AUDIT_IPE_POLICY_LOAD
+^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Event Example::
+
+ type=1422 audit(1653425529.927:53): policy_name="boot_verified" policy_version=0.0.0 policy_digest=sha256:820EEA5B40CA42B51F68962354BA083122A20BB846F26765076DD8EED7B8F4DB auid=4294967295 ses=4294967295 lsm=ipe res=1
+ type=1300 audit(1653425529.927:53): arch=c000003e syscall=1 success=yes exit=2567 a0=3 a1=5596fcae1fb0 a2=a07 a3=2 items=0 ppid=184 pid=229 auid=4294967295 uid=0 gid=0 euid=0 suid=0 fsuid=0 egid=0 sgid=0 fsgid=0 tty=pts0 ses=4294967295 comm="python3" exe="/usr/bin/python3.10" key=(null)
+ type=1327 audit(1653425529.927:53): PROCTITLE proctitle=707974686F6E3300746573742F6D61696E2E7079002D66002E2E
+
+This record indicates a new policy has been loaded into the kernel with the policy name, policy version and policy hash.
+
+This record will always be emitted in conjunction with an ``AUDITSYSCALL`` record for the ``write`` syscall.
+
+Field descriptions:
+
++----------------+------------+-----------+---------------------------------------------------+
+| Field | Value Type | Optional? | Description of Value |
++================+============+===========+===================================================+
+| policy_name | string | No | The policy_name |
++----------------+------------+-----------+---------------------------------------------------+
+| policy_version | string | No | The policy_version |
++----------------+------------+-----------+---------------------------------------------------+
+| policy_digest | string | No | The policy hash |
++----------------+------------+-----------+---------------------------------------------------+
+| auid | integer | No | The login user ID |
++----------------+------------+-----------+---------------------------------------------------+
+| ses | integer | No | The login session ID |
++----------------+------------+-----------+---------------------------------------------------+
+| lsm | string | No | The LSM name associated with the event |
++----------------+------------+-----------+---------------------------------------------------+
+| res | integer | No | The result of the audited operation (success/fail) |
++----------------+------------+-----------+---------------------------------------------------+
+
+
+1404 AUDIT_MAC_STATUS
+^^^^^^^^^^^^^^^^^^^^^
+
+Event Examples::
+
+ type=1404 audit(1653425689.008:55): enforcing=0 old_enforcing=1 auid=4294967295 ses=4294967295 enabled=1 old-enabled=1 lsm=ipe res=1
+ type=1300 audit(1653425689.008:55): arch=c000003e syscall=1 success=yes exit=2 a0=1 a1=55c1065e5c60 a2=2 a3=0 items=0 ppid=405 pid=441 auid=0 uid=0 gid=0 euid=0 suid=0 fsuid=0 egid=0 sgid=)
+ type=1327 audit(1653425689.008:55): proctitle="-bash"
+
+ type=1404 audit(1653425689.008:55): enforcing=1 old_enforcing=0 auid=4294967295 ses=4294967295 enabled=1 old-enabled=1 lsm=ipe res=1
+ type=1300 audit(1653425689.008:55): arch=c000003e syscall=1 success=yes exit=2 a0=1 a1=55c1065e5c60 a2=2 a3=0 items=0 ppid=405 pid=441 auid=0 uid=0 gid=0 euid=0 suid=0 fsuid=0 egid=0 sgid=)
+ type=1327 audit(1653425689.008:55): proctitle="-bash"
+
+This record will always be emitted in conjunction with an ``AUDITSYSCALL`` record for the ``write`` syscall.
+
+Field descriptions:
+
++---------------+------------+-----------+-------------------------------------------------------------------------------------------------+
+| Field | Value Type | Optional? | Description of Value |
++===============+============+===========+=================================================================================================+
+| enforcing | integer | No | The enforcing state IPE is being switched to, 1 is in enforcing mode, 0 is in permissive mode |
++---------------+------------+-----------+-------------------------------------------------------------------------------------------------+
+| old_enforcing | integer | No | The enforcing state IPE is being switched from, 1 is in enforcing mode, 0 is in permissive mode |
++---------------+------------+-----------+-------------------------------------------------------------------------------------------------+
+| auid | integer | No | The login user ID |
++---------------+------------+-----------+-------------------------------------------------------------------------------------------------+
+| ses | integer | No | The login session ID |
++---------------+------------+-----------+-------------------------------------------------------------------------------------------------+
+| enabled | integer | No | The new TTY audit enabled setting |
++---------------+------------+-----------+-------------------------------------------------------------------------------------------------+
+| old-enabled | integer | No | The old TTY audit enabled setting |
++---------------+------------+-----------+-------------------------------------------------------------------------------------------------+
+| lsm | string | No | The LSM name associated with the event |
++---------------+------------+-----------+-------------------------------------------------------------------------------------------------+
+| res | integer | No | The result of the audited operation (success/fail) |
++---------------+------------+-----------+-------------------------------------------------------------------------------------------------+
+
+
+Success Auditing
+^^^^^^^^^^^^^^^^
+
+IPE supports success auditing. When enabled, all events that pass IPE
+policy and are not blocked will emit an audit event. This is disabled by
+default, and can be enabled via the kernel command line
+``ipe.success_audit=(0|1)`` or the
+``/sys/kernel/security/ipe/success_audit`` securityfs file.
+
+This is *very* noisy, as IPE will check every userspace binary on the
+system, but is useful for debugging policies.
+
+.. NOTE::
+
+ If a traditional MAC system is enabled (SELinux, AppArmor, Smack, etc.),
+ all writes to IPE's securityfs nodes require ``CAP_MAC_ADMIN``.
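+
+As a minimal sketch, assuming securityfs is mounted at
+``/sys/kernel/security`` and the shell runs as root (with ``CAP_MAC_ADMIN``
+where the note above applies), success auditing could be toggled at runtime
+like this::
+
+   # enable success auditing while debugging a policy
+   echo 1 > /sys/kernel/security/ipe/success_audit
+
+   # disable it again to stop the extra audit traffic
+   echo 0 > /sys/kernel/security/ipe/success_audit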
+
+Properties
+----------
+
+As explained above, IPE properties are ``key=value`` pairs expressed in IPE
+policy. Two properties are built into the policy parser: 'op' and 'action'.
+The other properties are used to restrict immutable security properties
+about the files being evaluated. Currently those properties are:
+'``boot_verified``', '``dmverity_signature``', '``dmverity_roothash``',
+'``fsverity_signature``', '``fsverity_digest``'. Descriptions of all
+properties supported by IPE are listed below:
+
+op
+~~
+
+Indicates the operation for a rule to apply to. Must be in every rule,
+as the first token. IPE supports the following operations:
+
+ ``EXECUTE``:
+
+ Pertains to any file being executed, or loaded as an
+ executable.
+
+ ``FIRMWARE``:
+
+ Pertains to firmware being loaded via the firmware_class interface.
+ This covers both the preallocated buffer and the firmware file
+ itself.
+
+ ``KMODULE``:
+
+ Pertains to loading kernel modules via ``modprobe`` or ``insmod``.
+
+ ``KEXEC_IMAGE``:
+
+ Pertains to kernel images being loaded via ``kexec``.
+
+ ``KEXEC_INITRAMFS``:
+
+ Pertains to initrd images being loaded via ``kexec --initrd``.
+
+ ``POLICY``:
+
+ Controls loading policies via a kernel-space initiated read.
+
+ An example of such is loading IMA policies by writing the path
+ to the policy file to ``$securityfs/ima/policy``.
+
+ ``X509_CERT``:
+
+ Controls loading IMA certificates through the Kconfigs,
+ ``CONFIG_IMA_X509_PATH`` and ``CONFIG_EVM_X509_PATH``.
+
+action
+~~~~~~
+
+ Determines what IPE should do when a rule matches. Must be in every
+ rule, as the final clause. Can be one of:
+
+ ``ALLOW``:
+
+ If the rule matches, explicitly allow access to the resource to proceed
+ without executing any more rules.
+
+ ``DENY``:
+
+ If the rule matches, explicitly prohibit access to the resource,
+ without executing any more rules.
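+
+Taken together with ``op``, a rule therefore starts with the operation, ends
+with the action, and carries any properties in between. A minimal sketch
+using only tokens described in this document (it assumes the property is
+usable with the chosen operation)::
+
+   op=EXECUTE boot_verified=TRUE action=ALLOW
+   op=EXECUTE boot_verified=FALSE action=DENY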
+
+boot_verified
+~~~~~~~~~~~~~
+
+ This property can be utilized for authorization of files from initramfs.
+ The format of this property is::
+
+ boot_verified=(TRUE|FALSE)
+
+
+ .. WARNING::
+
+ This property will trust files from initramfs (rootfs). It should
+ only be used during the early boot stage. Before mounting the real
+ rootfs on top of the initramfs, the initramfs script will recursively
+ remove all files and directories on the initramfs. This is typically
+ implemented by using switch_root(8) [#switch_root]_. Therefore the
+ initramfs will be empty and not accessible after the real
+ rootfs takes over. It is advised to switch to a different policy
+ that doesn't rely on the property after this point.
+ This ensures that the trust policies remain relevant and effective
+ throughout the system's operation.
+
+dmverity_roothash
+~~~~~~~~~~~~~~~~~
+
+ This property can be utilized for authorization or revocation of
+ specific dm-verity volumes, identified via their root hashes. It has a
+ dependency on the DM_VERITY module. This property is controlled by
+ the ``IPE_PROP_DM_VERITY`` config option, which is automatically
+ selected when ``SECURITY_IPE`` and ``DM_VERITY`` are both enabled.
+ The format of this property is::
+
+ dmverity_roothash=DigestName:HexadecimalString
+
+ The supported DigestNames for dmverity_roothash are [#dmveritydigests]_
+
+ + blake2b-512
+ + blake2s-256
+ + sha256
+ + sha384
+ + sha512
+ + sha3-224
+ + sha3-256
+ + sha3-384
+ + sha3-512
+ + sm3
+ + rmd160
+
+dmverity_signature
+~~~~~~~~~~~~~~~~~~
+
+ This property can be utilized for authorization of all dm-verity
+ volumes that have a signed roothash that was validated by a keyring
+ specified by dm-verity's configuration, either the system trusted
+ keyring, or the secondary keyring. It depends on the
+ ``DM_VERITY_VERIFY_ROOTHASH_SIG`` config option and is controlled by
+ the ``IPE_PROP_DM_VERITY_SIGNATURE`` config option, which is automatically
+ selected when ``SECURITY_IPE``, ``DM_VERITY`` and
+ ``DM_VERITY_VERIFY_ROOTHASH_SIG`` are all enabled.
+ The format of this property is::
+
+ dmverity_signature=(TRUE|FALSE)
+
+fsverity_digest
+~~~~~~~~~~~~~~~
+
+ This property can be utilized for authorization of specific fsverity
+ enabled files, identified via their fsverity digests.
+ It depends on the ``FS_VERITY`` config option and is controlled by
+ the ``IPE_PROP_FS_VERITY`` config option, which is automatically
+ selected when ``SECURITY_IPE`` and ``FS_VERITY`` are both enabled.
+ The format of this property is::
+
+ fsverity_digest=DigestName:HexadecimalString
+
+ The supported DigestNames for fsverity_digest are [#fsveritydigest]_
+
+ + sha256
+ + sha512
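+
+As a sketch of where such a digest might come from, assuming the
+``fsverity`` tool from fsverity-utils is installed and that
+``/usr/bin/example`` (a hypothetical path) already has fs-verity enabled,
+the following should print the file's digest in
+``DigestName:HexadecimalString`` form, ready to be pasted into a rule::
+
+   fsverity measure /usr/bin/example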
+
+fsverity_signature
+~~~~~~~~~~~~~~~~~~
+
+ This property is used to authorize all fs-verity enabled files that have
+ been verified by fs-verity's built-in signature mechanism. The signature
+ verification relies on a key stored within the ".fs-verity" keyring. It
+ depends on the ``FS_VERITY_BUILTIN_SIGNATURES`` config option and
+ is controlled by the ``IPE_PROP_FS_VERITY`` config option,
+ which is automatically selected when ``SECURITY_IPE``, ``FS_VERITY``
+ and ``FS_VERITY_BUILTIN_SIGNATURES`` are all enabled.
+ The format of this property is::
+
+ fsverity_signature=(TRUE|FALSE)
+
+Policy Examples
+---------------
+
+Allow all
+~~~~~~~~~
+
+::
+
+ policy_name=Allow_All policy_version=0.0.0
+ DEFAULT action=ALLOW
+
+Allow only initramfs
+~~~~~~~~~~~~~~~~~~~~
+
+::
+
+ policy_name=Allow_Initramfs policy_version=0.0.0
+ DEFAULT action=DENY
+
+ op=EXECUTE boot_verified=TRUE action=ALLOW
+
+Allow any signed and validated dm-verity volume and the initramfs
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+::
+
+ policy_name=Allow_Signed_DMV_And_Initramfs policy_version=0.0.0
+ DEFAULT action=DENY
+
+ op=EXECUTE boot_verified=TRUE action=ALLOW
+ op=EXECUTE dmverity_signature=TRUE action=ALLOW
+
+Prohibit execution from a specific dm-verity volume
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+::
+
+ policy_name=Deny_DMV_By_Roothash policy_version=0.0.0
+ DEFAULT action=DENY
+
+ op=EXECUTE dmverity_roothash=sha256:cd2c5bae7c6c579edaae4353049d58eb5f2e8be0244bf05345bc8e5ed257baff action=DENY
+
+ op=EXECUTE boot_verified=TRUE action=ALLOW
+ op=EXECUTE dmverity_signature=TRUE action=ALLOW
+
+Allow only a specific dm-verity volume
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+::
+
+ policy_name=Allow_DMV_By_Roothash policy_version=0.0.0
+ DEFAULT action=DENY
+
+ op=EXECUTE dmverity_roothash=sha256:401fcec5944823ae12f62726e8184407a5fa9599783f030dec146938 action=ALLOW
+
+Allow any fs-verity file with a valid built-in signature
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+::
+
+ policy_name=Allow_Signed_And_Validated_FSVerity policy_version=0.0.0
+ DEFAULT action=DENY
+
+ op=EXECUTE fsverity_signature=TRUE action=ALLOW
+
+Allow execution of a specific fs-verity file
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+::
+
+ policy_name=ALLOW_FSV_By_Digest policy_version=0.0.0
+ DEFAULT action=DENY
+
+ op=EXECUTE fsverity_digest=sha256:fd88f2b8824e197f850bf4c5109bea5cf0ee38104f710843bb72da796ba5af9e action=ALLOW
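+
+The properties shown in the examples above can also appear side by side in
+one policy. As an illustrative sketch (the policy name is hypothetical, and
+both signature properties are assumed to be enabled in the kernel
+configuration), the following would allow execution from any signed
+dm-verity volume or any fs-verity file with a valid built-in signature::
+
+   policy_name=Allow_Signed_DMV_Or_FSV policy_version=0.0.0
+   DEFAULT action=DENY
+
+   op=EXECUTE dmverity_signature=TRUE action=ALLOW
+   op=EXECUTE fsverity_signature=TRUE action=ALLOW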
+
+Additional Information
+----------------------
+
+- `GitHub Repository <https://github.com/microsoft/ipe>`_
+- :doc:`Developer and design docs for IPE </security/ipe>`
+
+FAQ
+---
+
+Q:
+ What's the difference between IPE and other LSMs which provide a measure of
+ trust-based access control?
+
+A:
+
+ In general, there are two other LSMs that can provide similar functionality:
+ IMA, and Loadpin.
+
+ IMA and IPE are functionally very similar. The significant difference between
+ the two is the policy. [#devdoc]_
+
+ Loadpin and IPE differ fairly dramatically, as Loadpin only covers IPE's
+ kernel read operations, whereas IPE is capable of controlling execution
+ on top of kernel read. The trust model is also different; Loadpin roots its
+ trust in the initial super-block, whereas trust in IPE stems from the kernel
+ itself (via ``SYSTEM_TRUSTED_KEYS``).
+
+-----------
+
+.. [#digest_cache_lsm] https://lore.kernel.org/lkml/20240415142436.2545003-1-roberto.sassu@huaweicloud.com/
+
+.. [#interpreters] There is `some interest in solving this issue <https://lore.kernel.org/lkml/20220321161557.495388-1-mic@digikod.net/>`_.
+
+.. [#devdoc] Please see :doc:`the design docs </security/ipe>` for more on
+ this topic.
+
+.. [#switch_root] https://man7.org/linux/man-pages/man8/switch_root.8.html
+
+.. [#dmveritydigests] These hash algorithms are based on values accepted by
+ the Linux crypto API; IPE does not impose any
+ restrictions on the digest algorithm itself;
+ thus, this list may be out of date.
+
+.. [#fsveritydigest] These hash algorithms are based on values accepted by the
+ kernel's fsverity support; IPE does not impose any
+ restrictions on the digest algorithm itself;
+ thus, this list may be out of date.
diff --git a/Documentation/admin-guide/LSM/tomoyo.rst b/Documentation/admin-guide/LSM/tomoyo.rst
index 4bc9c2b4da6f..bdb2c2e2a1b2 100644
--- a/Documentation/admin-guide/LSM/tomoyo.rst
+++ b/Documentation/admin-guide/LSM/tomoyo.rst
@@ -9,8 +9,8 @@ TOMOYO is a name-based MAC extension (LSM module) for the Linux kernel.
LiveCD-based tutorials are available at
-http://tomoyo.sourceforge.jp/1.8/ubuntu12.04-live.html
-http://tomoyo.sourceforge.jp/1.8/centos6-live.html
+https://tomoyo.sourceforge.net/1.8/ubuntu12.04-live.html
+https://tomoyo.sourceforge.net/1.8/centos6-live.html
Though these tutorials use non-LSM version of TOMOYO, they are useful for you
to know what TOMOYO is.
@@ -21,45 +21,32 @@ How to enable TOMOYO?
Build the kernel with ``CONFIG_SECURITY_TOMOYO=y`` and pass ``security=tomoyo`` on
kernel's command line.
-Please see http://tomoyo.osdn.jp/2.5/ for details.
+Please see https://tomoyo.sourceforge.net/2.6/ for details.
Where is documentation?
=======================
User <-> Kernel interface documentation is available at
-https://tomoyo.osdn.jp/2.5/policy-specification/index.html .
+https://tomoyo.sourceforge.net/2.6/policy-specification/index.html .
Materials we prepared for seminars and symposiums are available at
-https://osdn.jp/projects/tomoyo/docs/?category_id=532&language_id=1 .
+https://sourceforge.net/projects/tomoyo/files/docs/ .
Below lists are chosen from three aspects.
What is TOMOYO?
TOMOYO Linux Overview
- https://osdn.jp/projects/tomoyo/docs/lca2009-takeda.pdf
+ https://sourceforge.net/projects/tomoyo/files/docs/lca2009-takeda.pdf
TOMOYO Linux: pragmatic and manageable security for Linux
- https://osdn.jp/projects/tomoyo/docs/freedomhectaipei-tomoyo.pdf
+ https://sourceforge.net/projects/tomoyo/files/docs/freedomhectaipei-tomoyo.pdf
TOMOYO Linux: A Practical Method to Understand and Protect Your Own Linux Box
- https://osdn.jp/projects/tomoyo/docs/PacSec2007-en-no-demo.pdf
+ https://sourceforge.net/projects/tomoyo/files/docs/PacSec2007-en-no-demo.pdf
What can TOMOYO do?
Deep inside TOMOYO Linux
- https://osdn.jp/projects/tomoyo/docs/lca2009-kumaneko.pdf
+ https://sourceforge.net/projects/tomoyo/files/docs/lca2009-kumaneko.pdf
The role of "pathname based access control" in security.
- https://osdn.jp/projects/tomoyo/docs/lfj2008-bof.pdf
+ https://sourceforge.net/projects/tomoyo/files/docs/lfj2008-bof.pdf
History of TOMOYO?
Realities of Mainlining
- https://osdn.jp/projects/tomoyo/docs/lfj2008.pdf
-
-What is future plan?
-====================
-
-We believe that inode based security and name based security are complementary
-and both should be used together. But unfortunately, so far, we cannot enable
-multiple LSM modules at the same time. We feel sorry that you have to give up
-SELinux/SMACK/AppArmor etc. when you want to use TOMOYO.
-
-We hope that LSM becomes stackable in future. Meanwhile, you can use non-LSM
-version of TOMOYO, available at http://tomoyo.osdn.jp/1.8/ .
-LSM version of TOMOYO is a subset of non-LSM version of TOMOYO. We are planning
-to port non-LSM version's functionalities to LSM versions.
+ https://sourceforge.net/projects/tomoyo/files/docs/lfj2008.pdf
diff --git a/Documentation/admin-guide/RAS/address-translation.rst b/Documentation/admin-guide/RAS/address-translation.rst
new file mode 100644
index 000000000000..f0ca17b43cd3
--- /dev/null
+++ b/Documentation/admin-guide/RAS/address-translation.rst
@@ -0,0 +1,24 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+Address translation
+===================
+
+x86 AMD
+-------
+
+Zen-based AMD systems include a Data Fabric that manages the layout of
+physical memory. Devices attached to the Fabric, like memory controllers,
+I/O, etc., may not have a complete view of the system physical memory map.
+These devices may provide a "normalized", i.e. device physical, address
+when reporting memory errors. Normalized addresses must be translated to
+a system physical address before the kernel can act on the memory.
+
+The AMD Address Translation Library (CONFIG_AMD_ATL) provides translation for
+this case.
+
+Glossary of acronyms used in address translation for Zen-based systems:
+
+* CCM = Cache Coherent Moderator
+* COD = Cluster-on-Die
+* COH_ST = Coherent Station
+* DF = Data Fabric
diff --git a/Documentation/admin-guide/RAS/error-decoding.rst b/Documentation/admin-guide/RAS/error-decoding.rst
new file mode 100644
index 000000000000..26a72f3fe5de
--- /dev/null
+++ b/Documentation/admin-guide/RAS/error-decoding.rst
@@ -0,0 +1,21 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+Error decoding
+==============
+
+x86
+---
+
+Error decoding on AMD systems should be done using the rasdaemon tool:
+https://github.com/mchehab/rasdaemon/
+
+While the daemon is running, it automatically logs and decodes
+errors. If it is not running, one can still decode such errors by supplying
+the hardware information from the error::
+
+ $ rasdaemon -p --status <STATUS> --ipid <IPID> --smca
+
+Also, the user can pass a particular family and model to decode the error
+string::
+
+ $ rasdaemon -p --status <STATUS> --ipid <IPID> --smca --family <CPU Family> --model <CPU Model> --bank <BANK_NUM>
diff --git a/Documentation/admin-guide/RAS/index.rst b/Documentation/admin-guide/RAS/index.rst
new file mode 100644
index 000000000000..f4087040a7c0
--- /dev/null
+++ b/Documentation/admin-guide/RAS/index.rst
@@ -0,0 +1,7 @@
+.. SPDX-License-Identifier: GPL-2.0
+.. toctree::
+ :maxdepth: 2
+
+ main
+ error-decoding
+ address-translation
diff --git a/Documentation/admin-guide/ras.rst b/Documentation/admin-guide/RAS/main.rst
index 8e03751d126d..7ac1d4ccc509 100644
--- a/Documentation/admin-guide/ras.rst
+++ b/Documentation/admin-guide/RAS/main.rst
@@ -1,8 +1,12 @@
+.. SPDX-License-Identifier: GPL-2.0
.. include:: <isonum.txt>
-============================================
-Reliability, Availability and Serviceability
-============================================
+==================================================
+Reliability, Availability and Serviceability (RAS)
+==================================================
+
+This documents different aspects of the RAS functionality present in the
+kernel.
RAS concepts
************
diff --git a/Documentation/admin-guide/README.rst b/Documentation/admin-guide/README.rst
index 9a969c0157f1..b557cf1c820d 100644
--- a/Documentation/admin-guide/README.rst
+++ b/Documentation/admin-guide/README.rst
@@ -176,7 +176,7 @@ Configuring the kernel
values without prompting.
"make defconfig" Create a ./.config file by using the default
- symbol values from either arch/$ARCH/defconfig
+ symbol values from either arch/$ARCH/configs/defconfig
or arch/$ARCH/configs/${PLATFORM}_defconfig,
depending on the architecture.
@@ -262,9 +262,11 @@ Compiling the kernel
- Make sure you have at least gcc 5.1 available.
For more information, refer to :ref:`Documentation/process/changes.rst <changes>`.
- - Do a ``make`` to create a compressed kernel image. It is also
- possible to do ``make install`` if you have lilo installed to suit the
- kernel makefiles, but you may want to check your particular lilo setup first.
+ - Do a ``make`` to create a compressed kernel image. It is also possible to do
+ ``make install`` if you have lilo installed or if your distribution has an
+ install script recognized by the kernel's installer. Most popular
+ distributions will have a recognized install script. You may want to
+ check your distribution's setup first.
To do the actual install, you have to be root, but none of the normal
build should require that. Don't take the name of root in vain.
@@ -301,32 +303,51 @@ Compiling the kernel
image (e.g. .../linux/arch/x86/boot/bzImage after compilation)
to the place where your regular bootable kernel is found.
- - Booting a kernel directly from a floppy without the assistance of a
- bootloader such as LILO, is no longer supported.
-
- If you boot Linux from the hard drive, chances are you use LILO, which
- uses the kernel image as specified in the file /etc/lilo.conf. The
- kernel image file is usually /vmlinuz, /boot/vmlinuz, /bzImage or
- /boot/bzImage. To use the new kernel, save a copy of the old image
- and copy the new image over the old one. Then, you MUST RERUN LILO
- to update the loading map! If you don't, you won't be able to boot
- the new kernel image.
-
- Reinstalling LILO is usually a matter of running /sbin/lilo.
- You may wish to edit /etc/lilo.conf to specify an entry for your
- old kernel image (say, /vmlinux.old) in case the new one does not
- work. See the LILO docs for more information.
-
- After reinstalling LILO, you should be all set. Shutdown the system,
+ - Booting a kernel directly from a storage device without the assistance
+ of a bootloader such as LILO or GRUB is no longer supported on BIOS
+ (non-EFI) systems. On UEFI/EFI systems, however, you can use EFISTUB,
+ which allows the motherboard to boot directly to the kernel.
+ On modern workstations and desktops, it's generally recommended to use a
+ bootloader as difficulties can arise with multiple kernels and secure boot.
+ For more details on EFISTUB,
+ see "Documentation/admin-guide/efi-stub.rst".
+
+ - It's important to note that as of 2016 LILO (LInux LOader) is no longer in
+ active development, though as it was extremely popular, it often comes up
+ in documentation. Popular alternatives include GRUB2, rEFInd, Syslinux,
+ systemd-boot, or EFISTUB. For various reasons, it's not recommended to use
+ software that's no longer in active development.
+
+ - Chances are your distribution includes an install script and running
+ ``make install`` will be all that's needed. Should that not be the case
+ you'll have to identify your bootloader and reference its documentation or
+ configure your EFI.
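+
+As a rough sketch, assuming an already configured tree, a distribution that
+ships a recognized install script, and ``sudo`` as the way to become root,
+the build and install steps could look like this::
+
+   make -j"$(nproc)"            # build as a regular user
+   sudo make modules_install    # only needed if modules were configured
+   sudo make install            # hands the image to the install script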
+
+Legacy LILO Instructions
+------------------------
+
+
+ - If you use LILO, the kernel images are specified in the file /etc/lilo.conf.
+ The kernel image file is usually /vmlinuz, /boot/vmlinuz, /bzImage or
+ /boot/bzImage. To use the new kernel, save a copy of the old image and copy
+ the new image over the old one. Then, you MUST RERUN LILO to update the
+ loading map! If you don't, you won't be able to boot the new kernel image.
+
+ - Reinstalling LILO is usually a matter of running /sbin/lilo. You may wish
+ to edit /etc/lilo.conf to specify an entry for your old kernel image
+ (say, /vmlinux.old) in case the new one does not work. See the LILO docs
+ for more information.
+
+ - After reinstalling LILO, you should be all set. Shut down the system,
reboot, and enjoy!
- If you ever need to change the default root device, video mode,
- etc. in the kernel image, use your bootloader's boot options
- where appropriate. No need to recompile the kernel to change
- these parameters.
+ - If you ever need to change the default root device, video mode, etc. in the
+ kernel image, use your bootloader's boot options where appropriate. No need
+ to recompile the kernel to change these parameters.
- Reboot with the new kernel and enjoy.
+
If something goes wrong
-----------------------
@@ -335,5 +356,5 @@ instructions at 'Documentation/admin-guide/reporting-issues.rst'.
Hints on understanding kernel bug reports are in
'Documentation/admin-guide/bug-hunting.rst'. More on debugging the kernel
-with gdb is in 'Documentation/dev-tools/gdb-kernel-debugging.rst' and
-'Documentation/dev-tools/kgdb.rst'.
+with gdb is in 'Documentation/process/debugging/gdb-kernel-debugging.rst' and
+'Documentation/process/debugging/kgdb.rst'.
diff --git a/Documentation/admin-guide/blockdev/zram.rst b/Documentation/admin-guide/blockdev/zram.rst
index ee2b0030d416..1576fb93f06c 100644
--- a/Documentation/admin-guide/blockdev/zram.rst
+++ b/Documentation/admin-guide/blockdev/zram.rst
@@ -47,6 +47,8 @@ The list of possible return codes:
-ENOMEM zram was not able to allocate enough memory to fulfil your
needs.
-EINVAL invalid input has been provided.
+-EAGAIN re-try operation later (e.g. when attempting to run recompress
+ and writeback simultaneously).
======== =============================================================
If you use 'echo', the returned value is set by the 'echo' utility,
@@ -102,17 +104,41 @@ Examples::
#select lzo compression algorithm
echo lzo > /sys/block/zram0/comp_algorithm
-For the time being, the `comp_algorithm` content does not necessarily
-show every compression algorithm supported by the kernel. We keep this
-list primarily to simplify device configuration and one can configure
-a new device with a compression algorithm that is not listed in
-`comp_algorithm`. The thing is that, internally, ZRAM uses Crypto API
-and, if some of the algorithms were built as modules, it's impossible
-to list all of them using, for instance, /proc/crypto or any other
-method. This, however, has an advantage of permitting the usage of
-custom crypto compression modules (implementing S/W or H/W compression).
-
-4) Set Disksize
+For the time being, the `comp_algorithm` content shows only compression
+algorithms that are supported by zram.
+
+4) Set compression algorithm parameters: Optional
+=================================================
+
+Compression algorithms may support specific parameters which can be
+tweaked for a particular dataset. ZRAM has an `algorithm_params` device
+attribute which provides a per-algorithm params configuration.
+
+For example, several compression algorithms support a `level` parameter.
+In addition, certain compression algorithms support pre-trained dictionaries,
+which significantly change the algorithms' characteristics. In order to
+configure a compression algorithm to use an external pre-trained dictionary,
+pass the full path to the `dict` along with other parameters::
+
+ #pass path to pre-trained zstd dictionary
+ echo "algo=zstd dict=/etc/dictionary" > /sys/block/zram0/algorithm_params
+
+ #same, but using algorithm priority
+ echo "priority=1 dict=/etc/dictionary" > \
+ /sys/block/zram0/algorithm_params
+
+ #pass path to pre-trained zstd dictionary and compression level
+ echo "algo=zstd level=8 dict=/etc/dictionary" > \
+ /sys/block/zram0/algorithm_params
+
+Parameters are algorithm specific: not all algorithms support pre-trained
+dictionaries, not all algorithms support `level`. Furthermore, for certain
+algorithms `level` controls the compression level (the higher the value the
+better the compression ratio; it can even take negative values for some
+algorithms), for other algorithms `level` is an acceleration level (the higher
+the value the lower the compression ratio).
+
+5) Set Disksize
===============
Set disk size by writing the value to sysfs node 'disksize'.
@@ -132,7 +158,7 @@ There is little point creating a zram of greater than twice the size of memory
since we expect a 2:1 compression ratio. Note that zram uses about 0.1% of the
size of the disk when not in use so a huge zram is wasteful.
-5) Set memory limit: Optional
+6) Set memory limit: Optional
=============================
Set memory limit by writing the value to sysfs node 'mem_limit'.
@@ -151,7 +177,7 @@ Examples::
# To disable memory limit
echo 0 > /sys/block/zram0/mem_limit
-6) Activate
+7) Activate
===========
::
@@ -162,7 +188,7 @@ Examples::
mkfs.ext4 /dev/zram1
mount /dev/zram1 /tmp
-7) Add/remove zram devices
+8) Add/remove zram devices
==========================
zram provides a control interface, which enables dynamic (on-demand) device
@@ -182,7 +208,7 @@ execute::
echo X > /sys/class/zram-control/hot_remove
-8) Stats
+9) Stats
========
Per-device statistics are exported as various nodes under /sys/block/zram<id>/
@@ -205,6 +231,7 @@ writeback_limit_enable RW show and set writeback_limit feature
max_comp_streams RW the number of possible concurrent compress
operations
comp_algorithm RW show and change the compression algorithm
+algorithm_params WO setup compression algorithm parameters
compact WO trigger memory compaction
debug_stat RO this file is used for zram debugging purposes
backing_dev RW set up backend storage for zram to write out
@@ -283,15 +310,15 @@ a single line of text and contains the following stats separated by whitespace:
Unit: 4K bytes
============== =============================================================
-9) Deactivate
-=============
+10) Deactivate
+==============
::
swapoff /dev/zram0
umount /dev/zram1
-10) Reset
+11) Reset
=========
Write any positive value to 'reset' sysfs node::
@@ -466,6 +493,11 @@ of equal or greater size:::
#recompress idle pages larger than 2000 bytes
echo "type=idle threshold=2000" > /sys/block/zramX/recompress
+It is also possible to limit the number of pages zram will attempt to
+recompress:::
+
+ echo "type=huge_idle max_pages=42" > /sys/block/zramX/recompress
+
Recompression of idle pages requires memory tracking.
During re-compression for every page, that matches re-compression criteria,
@@ -482,11 +514,14 @@ registered compression algorithms, increases our chances of finding the
algorithm that successfully compresses a particular page. Sometimes, however,
it is convenient (and sometimes even necessary) to limit recompression to
only one particular algorithm so that it will not try any other algorithms.
-This can be achieved by providing a algo=NAME parameter:::
+This can be achieved by providing an `algo` or `priority` parameter:::
#use zstd algorithm only (if registered)
echo "type=huge algo=zstd" > /sys/block/zramX/recompress
+ #use zstd algorithm only (if zstd was registered under priority 1)
+ echo "type=huge priority=1" > /sys/block/zramX/recompress
+
memory tracking
===============
diff --git a/Documentation/admin-guide/braille-console.rst b/Documentation/admin-guide/braille-console.rst
index 18e79337dcfd..153472e93cae 100644
--- a/Documentation/admin-guide/braille-console.rst
+++ b/Documentation/admin-guide/braille-console.rst
@@ -21,8 +21,8 @@ override the baud rate to 115200, etc.
By default, the braille device will just show the last kernel message (console
mode). To review previous messages, press the Insert key to switch to the VT
review mode. In review mode, the arrow keys permit to browse in the VT content,
-:kbd:`PAGE-UP`/:kbd:`PAGE-DOWN` keys go at the top/bottom of the screen, and
-the :kbd:`HOME` key goes back
+`PAGE-UP`/`PAGE-DOWN` keys go at the top/bottom of the screen, and
+the `HOME` key goes back
to the cursor, hence providing very basic screen reviewing facility.
Sound feedback can be obtained by adding the ``braille_console.sound=1`` kernel
diff --git a/Documentation/admin-guide/bug-bisect.rst b/Documentation/admin-guide/bug-bisect.rst
index 325c5d0ed34a..f4f867cabb17 100644
--- a/Documentation/admin-guide/bug-bisect.rst
+++ b/Documentation/admin-guide/bug-bisect.rst
@@ -1,76 +1,165 @@
-Bisecting a bug
-+++++++++++++++
+.. SPDX-License-Identifier: (GPL-2.0+ OR CC-BY-4.0)
+.. [see the bottom of this file for redistribution information]
-Last updated: 28 October 2016
+======================
+Bisecting a regression
+======================
-Introduction
-============
+This document describes how to use ``git bisect`` to find the source code
+change that broke something -- for example when some functionality stopped
+working after upgrading from Linux 6.0 to 6.1.
-Always try the latest kernel from kernel.org and build from source. If you are
-not confident in doing that please report the bug to your distribution vendor
-instead of to a kernel developer.
+The text focuses on the gist of the process. If you are new to bisecting the
+kernel, better follow Documentation/admin-guide/verify-bugs-and-bisect-regressions.rst
+instead: it depicts everything from start to finish while covering multiple
+aspects even kernel developers occasionally forget. This includes detecting
+situations early where a bisection would be a waste of time, as nobody would
+care about the result -- for example, because the problem happens after the
+kernel marked itself as 'tainted', occurs in an abandoned version, was already
+fixed, or is caused by a .config change you or your Linux distributor performed.
-Finding bugs is not always easy. Have a go though. If you can't find it don't
-give up. Report as much as you have found to the relevant maintainer. See
-MAINTAINERS for who that is for the subsystem you have worked on.
+Finding the change causing a kernel issue using a bisection
+===========================================================
-Before you submit a bug report read
-'Documentation/admin-guide/reporting-issues.rst'.
+*Note: the following process assumes you prepared everything for a bisection.
+This includes having a Git clone with the appropriate sources, installing the
+software required to build and install kernels, as well as a .config file stored
+in a safe place (the following example assumes '~/prepared_kernel_.config') to
+use as pristine base at each bisection step; ideally, you have also worked out
+a fully reliable and straight-forward way to reproduce the regression, too.*
-Devices not appearing
-=====================
+* Preparation: start the bisection and tell Git about the points in the history
+ you consider to be working and broken, which Git calls 'good' and 'bad'::
-Often this is caused by udev/systemd. Check that first before blaming it
-on the kernel.
+ git bisect start
+ git bisect good v6.0
+ git bisect bad v6.1
-Finding patch that caused a bug
-===============================
+ Instead of Git tags like 'v6.0' and 'v6.1' you can specify commit-ids, too.
-Using the provided tools with ``git`` makes finding bugs easy provided the bug
-is reproducible.
+1. Copy your prepared .config into the build directory and adjust it to the
+ needs of the codebase Git checked out for testing::
-Steps to do it:
+ cp ~/prepared_kernel_.config .config
+ make olddefconfig
-- build the Kernel from its git source
-- start bisect with [#f1]_::
-
- $ git bisect start
-
-- mark the broken changeset with::
-
- $ git bisect bad [commit]
-
-- mark a changeset where the code is known to work with::
-
- $ git bisect good [commit]
-
-- rebuild the Kernel and test
-- interact with git bisect by using either::
-
- $ git bisect good
-
- or::
-
- $ git bisect bad
-
- depending if the bug happened on the changeset you're testing
-- After some interactions, git bisect will give you the changeset that
- likely caused the bug.
-
-- For example, if you know that the current version is bad, and version
- 4.8 is good, you could do::
-
- $ git bisect start
- $ git bisect bad # Current version is bad
- $ git bisect good v4.8
-
-
-.. [#f1] You can, optionally, provide both good and bad arguments at git
- start with ``git bisect start [BAD] [GOOD]``
-
-For further references, please read:
-
-- The man page for ``git-bisect``
-- `Fighting regressions with git bisect <https://www.kernel.org/pub/software/scm/git/docs/git-bisect-lk2009.html>`_
-- `Fully automated bisecting with "git bisect run" <https://lwn.net/Articles/317154>`_
-- `Using Git bisect to figure out when brokenness was introduced <http://webchick.net/node/99>`_
+2. Now build, install, and boot a kernel. This might fail for unrelated reasons,
+ for example, when the current bisection point contains a compile error that
+ a later change resolves. In such cases run ``git bisect skip`` and go back to
+ step 1.
+
+3. Check if the functionality that regressed works in the kernel you just built.
+
+ If it works, execute::
+
+ git bisect good
+
+ If it is broken, run::
+
+ git bisect bad
+
+ Note, getting this wrong just once will send the rest of the bisection
+ totally off course. To prevent having to start anew later you thus want to
+ ensure what you tell Git is correct; it is thus often wise to spend a few
+ minutes more on testing in case your reproducer is unreliable.
+
+ After issuing one of these two commands, Git will usually check out another
+ bisection point and print something like 'Bisecting: 675 revisions left to
+ test after this (roughly 10 steps)'. In that case go back to step 1.
+
+ If Git instead prints something like 'cafecaca0c0dacafecaca0c0dacafecaca0c0da
+ is the first bad commit', then you have finished the bisection. In that case
+ move to the next point below. Note, right after displaying that line Git will
+ show some details about the culprit including its patch description; this can
+ easily fill your terminal, so you might need to scroll up to see the message
+ mentioning the culprit's commit-id.
+
+ In case you missed Git's output, you can always run ``git bisect log`` to
+ print the status: it will show how many steps remain or mention the result of
+ the bisection.
+
+* Recommended complementary task: put the bisection log and the current .config
+ file aside for the bug report; furthermore tell Git to reset the sources to
+ the state before the bisection::
+
+ git bisect log > ~/bisection-log
+ cp .config ~/bisection-config-culprit
+ git bisect reset
+
+* Recommended optional task: try reverting the culprit on top of the latest
+ codebase and check if that fixes your bug; if that is the case, it validates
+ the bisection and enables developers to resolve the regression through a
+ revert.
+
+ To try this, update your clone and check out latest mainline. Then tell Git
+ to revert the change by specifying its commit-id::
+
+ git revert --no-edit cafec0cacaca0
+
+ Git might reject this, for example when the bisection landed on a merge
+ commit. In that case, abandon the attempt. Do the same, if Git fails to revert
+ the culprit on its own because later changes depend on it -- at least unless
+ you bisected a stable or longterm kernel series, in which case you want to
+ check out its latest codebase and try a revert there.
+
+ If a revert succeeds, build and test another kernel to check if reverting
+ resolved your regression.
+
+With that the process is complete. Now report the regression as described by
+Documentation/admin-guide/reporting-issues.rst.
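
The decision made in step 3 (go back to step 1 or stop) depends only on the line Git prints after ``git bisect good``, ``git bisect bad``, or ``git bisect skip``. A minimal sketch of that decision follows; the helper name ``bisect_state`` is hypothetical and not part of Git, and it only recognizes the two messages quoted in the text above.

```shell
#!/bin/sh
# Hypothetical helper (not part of Git): classify one line of output from
# `git bisect good|bad|skip`.
#   continue -> more steps remain, go back to step 1
#   done     -> the culprit was found
#   unknown  -> anything else; inspect `git bisect log` by hand
bisect_state() {
    case "$1" in
        "Bisecting: "*" left to test after this"*) echo continue ;;
        *" is the first bad commit") echo done ;;
        *) echo unknown ;;
    esac
}

# Example with the two messages quoted in the text:
bisect_state "Bisecting: 675 revisions left to test after this (roughly 10 steps)"
bisect_state "cafecaca0c0dacafecaca0c0dacafecaca0c0da is the first bad commit"
```

The sketch deliberately leaves building, booting, and testing to you; it only shows how the two outcomes can be told apart when scripting around the manual steps.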
+
+Bisecting linux-next
+--------------------
+
+If you face a problem only happening in linux-next, bisect between the
+linux-next branches 'stable' and 'master'. The following commands will start
+the process for a linux-next tree you added as a remote called 'next'::
+
+ git bisect start
+ git bisect good next/stable
+ git bisect bad next/master
+
+The 'stable' branch refers to the state of linux-mainline that the current
+linux-next release (found in the 'master' branch) is based on -- the former
+thus should be free of any problems that show up in -next, but not in Linus'
+tree.
+
+This will bisect across a wide range of changes, some of which you might have
+used in earlier linux-next releases without problems. Sadly there is no simple
+way to avoid checking them: bisecting from one linux-next release to a later
+one (say between 'next-20241020' and 'next-20241021') is impossible, as they
+share no common history.
+
+Additional reading material
+---------------------------
+
+* The `man page for 'git bisect' <https://git-scm.com/docs/git-bisect>`_ and
+ `fighting regressions with 'git bisect' <https://git-scm.com/docs/git-bisect-lk2009.html>`_
+ in the Git documentation.
+* `Working with git bisect <https://nathanchance.dev/posts/working-with-git-bisect/>`_
+ from kernel developer Nathan Chancellor.
+* `Using Git bisect to figure out when brokenness was introduced <http://webchick.net/node/99>`_.
+* `Fully automated bisecting with 'git bisect run' <https://lwn.net/Articles/317154>`_.
+
+..
+ end-of-content
+..
+ This document is maintained by Thorsten Leemhuis <linux@leemhuis.info>. If
+ you spot a typo or small mistake, feel free to let him know directly and
+ he'll fix it. You are free to do the same in a mostly informal way if you
+ want to contribute changes to the text -- but for copyright reasons please CC
+ linux-doc@vger.kernel.org and 'sign-off' your contribution as
+ Documentation/process/submitting-patches.rst explains in the section 'Sign
+ your work - the Developer's Certificate of Origin'.
+..
+ This text is available under GPL-2.0+ or CC-BY-4.0, as stated at the top
+ of the file. If you want to distribute this text under CC-BY-4.0 only,
+ please use 'The Linux kernel development community' for author attribution
+ and link this as source:
+ https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/plain/Documentation/admin-guide/bug-bisect.rst
+
+..
+ Note: Only the content of this RST file as found in the Linux kernel sources
+ is available under CC-BY-4.0, as versions of this text that were processed
+ (for example by the kernel's build system) might contain content taken from
+ files which use a more restrictive license.
diff --git a/Documentation/admin-guide/bug-hunting.rst b/Documentation/admin-guide/bug-hunting.rst
index 95299b08c405..ce6f4e8ca487 100644
--- a/Documentation/admin-guide/bug-hunting.rst
+++ b/Documentation/admin-guide/bug-hunting.rst
@@ -244,14 +244,14 @@ Reporting the bug
Once you find where the bug happened, by inspecting its location,
you could either try to fix it yourself or report it upstream.
-In order to report it upstream, you should identify the mailing list
-used for the development of the affected code. This can be done by using
-the ``get_maintainer.pl`` script.
+In order to report it upstream, you should identify the bug tracker, if any, or
+mailing list used for the development of the affected code. This can be done by
+using the ``get_maintainer.pl`` script.
For example, if you find a bug at the gspca's sonixj.c file, you can get
its maintainers with::
- $ ./scripts/get_maintainer.pl -f drivers/media/usb/gspca/sonixj.c
+ $ ./scripts/get_maintainer.pl --bug -f drivers/media/usb/gspca/sonixj.c
Hans Verkuil <hverkuil@xs4all.nl> (odd fixer:GSPCA USB WEBCAM DRIVER,commit_signer:1/1=100%)
Mauro Carvalho Chehab <mchehab@kernel.org> (maintainer:MEDIA INPUT INFRASTRUCTURE (V4L/DVB),commit_signer:1/1=100%)
Tejun Heo <tj@kernel.org> (commit_signer:1/1=100%)
@@ -267,11 +267,12 @@ Please notice that it will point to:
- The driver maintainer (Hans Verkuil);
- The subsystem maintainer (Mauro Carvalho Chehab);
- The driver and/or subsystem mailing list (linux-media@vger.kernel.org);
-- the Linux Kernel mailing list (linux-kernel@vger.kernel.org).
+- The Linux Kernel mailing list (linux-kernel@vger.kernel.org);
+- The bug reporting URIs for the driver/subsystem (none in the above example).
-Usually, the fastest way to have your bug fixed is to report it to mailing
-list used for the development of the code (linux-media ML) copying the
-driver maintainer (Hans).
+If the listing contains bug reporting URIs at the end, please prefer them over
+email. Otherwise, please report bugs to the mailing list used for the
+development of the code (linux-media ML) copying the driver maintainer (Hans).
If you are totally stumped as to whom to send the report, and
``get_maintainer.pl`` didn't provide you anything useful, send it to
@@ -367,12 +368,3 @@ processed by ``klogd``::
Aug 29 09:51:01 blizard kernel: Call Trace: [oops:_oops_ioctl+48/80] [_sys_ioctl+254/272] [_system_call+82/128]
Aug 29 09:51:01 blizard kernel: Code: c7 00 05 00 00 00 eb 08 90 90 90 90 90 90 90 90 89 ec 5d c3
----------------------------------------------------------------------------
-
-::
-
- Dr. G.W. Wettstein Oncology Research Div. Computing Facility
- Roger Maris Cancer Center INTERNET: greg@wind.rmcc.com
- 820 4th St. N.
- Fargo, ND 58122
- Phone: 701-234-7556
diff --git a/Documentation/admin-guide/cgroup-v1/cgroups.rst b/Documentation/admin-guide/cgroup-v1/cgroups.rst
index 9343148ee993..a3e2edb3d274 100644
--- a/Documentation/admin-guide/cgroup-v1/cgroups.rst
+++ b/Documentation/admin-guide/cgroup-v1/cgroups.rst
@@ -570,7 +570,7 @@ visible to cgroup_for_each_child/descendant_*() iterators. The
subsystem may choose to fail creation by returning -errno. This
callback can be used to implement reliable state sharing and
propagation along the hierarchy. See the comment on
-cgroup_for_each_descendant_pre() for details.
+cgroup_for_each_live_descendant_pre() for details.
``void css_offline(struct cgroup *cgrp);``
(cgroup_mutex held by caller)
diff --git a/Documentation/admin-guide/cgroup-v1/cpusets.rst b/Documentation/admin-guide/cgroup-v1/cpusets.rst
index ae646d621a8a..f401af5e2f09 100644
--- a/Documentation/admin-guide/cgroup-v1/cpusets.rst
+++ b/Documentation/admin-guide/cgroup-v1/cpusets.rst
@@ -179,7 +179,7 @@ files describing that cpuset:
- cpuset.mem_hardwall flag: is memory allocation hardwalled
- cpuset.memory_pressure: measure of how much paging pressure in cpuset
- cpuset.memory_spread_page flag: if set, spread page cache evenly on allowed nodes
- - cpuset.memory_spread_slab flag: if set, spread slab cache evenly on allowed nodes
+ - cpuset.memory_spread_slab flag: OBSOLETE. Doesn't have any function.
- cpuset.sched_load_balance flag: if set, load balance within CPUs on that cpuset
- cpuset.sched_relax_domain_level: the searching range when migrating tasks
@@ -568,7 +568,7 @@ on the next tick. For some applications in special situation, waiting
The 'cpuset.sched_relax_domain_level' file allows you to request changing
this searching range as you like. This file takes int value which
-indicates size of searching range in levels ideally as follows,
+indicates size of searching range in levels approximately as follows,
otherwise initial value -1 that indicates the cpuset has no request.
====== ===========================================================
@@ -581,6 +581,11 @@ otherwise initial value -1 that indicates the cpuset has no request.
5 search system wide [on NUMA system]
====== ===========================================================
+Not all levels are necessarily present, and the values can change depending on the
+system architecture and kernel configuration. Check
+/sys/kernel/debug/sched/domains/cpu*/domain*/ for system-specific
+details.
+
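
As an illustration of the check mentioned above, the following sketch lists the scheduler domains per CPU. It assumes that each ``domain*`` directory contains a readable ``name`` file and that debugfs is mounted at /sys/kernel/debug; the function name ``list_sched_domains`` is made up for this example.

```shell
#!/bin/sh
# Hypothetical helper: print "<cpu> <domain> <name>" for every scheduler
# domain found below a debugfs-style directory. The per-domain "name" file
# is an assumption about the debugfs layout; reading it usually needs root.
list_sched_domains() {
    base="${1:-/sys/kernel/debug/sched/domains}"
    for f in "$base"/cpu*/domain*/name; do
        [ -r "$f" ] || continue        # skip unmatched globs and unreadable files
        d="${f%/name}"                 # .../cpuN/domainM
        cpu_dir="${d%/*}"              # .../cpuN
        printf '%s %s %s\n' "${cpu_dir##*/}" "${d##*/}" "$(cat "$f")"
    done
}
```

Calling ``list_sched_domains`` without arguments reads the path named in the text; passing a directory as the first argument makes it easy to try the function on a copy of that tree.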
The system default is architecture dependent. The system default
can be changed using the relax_domain_level= boot parameter.
diff --git a/Documentation/admin-guide/cgroup-v1/hugetlb.rst b/Documentation/admin-guide/cgroup-v1/hugetlb.rst
index 0fa724d82abb..493a8e386700 100644
--- a/Documentation/admin-guide/cgroup-v1/hugetlb.rst
+++ b/Documentation/admin-guide/cgroup-v1/hugetlb.rst
@@ -65,10 +65,12 @@ files include::
1. Page fault accounting
-hugetlb.<hugepagesize>.limit_in_bytes
-hugetlb.<hugepagesize>.max_usage_in_bytes
-hugetlb.<hugepagesize>.usage_in_bytes
-hugetlb.<hugepagesize>.failcnt
+::
+
+ hugetlb.<hugepagesize>.limit_in_bytes
+ hugetlb.<hugepagesize>.max_usage_in_bytes
+ hugetlb.<hugepagesize>.usage_in_bytes
+ hugetlb.<hugepagesize>.failcnt
The HugeTLB controller allows users to limit the HugeTLB usage (page fault) per
control group and enforces the limit during page fault. Since HugeTLB
@@ -82,10 +84,12 @@ getting SIGBUS.
2. Reservation accounting
-hugetlb.<hugepagesize>.rsvd.limit_in_bytes
-hugetlb.<hugepagesize>.rsvd.max_usage_in_bytes
-hugetlb.<hugepagesize>.rsvd.usage_in_bytes
-hugetlb.<hugepagesize>.rsvd.failcnt
+::
+
+ hugetlb.<hugepagesize>.rsvd.limit_in_bytes
+ hugetlb.<hugepagesize>.rsvd.max_usage_in_bytes
+ hugetlb.<hugepagesize>.rsvd.usage_in_bytes
+ hugetlb.<hugepagesize>.rsvd.failcnt
The HugeTLB controller allows to limit the HugeTLB reservations per control
group and enforces the controller limit at reservation time and at the fault of
diff --git a/Documentation/admin-guide/cgroup-v1/memcg_test.rst b/Documentation/admin-guide/cgroup-v1/memcg_test.rst
index 1f128458ddea..9f8e27355cba 100644
--- a/Documentation/admin-guide/cgroup-v1/memcg_test.rst
+++ b/Documentation/admin-guide/cgroup-v1/memcg_test.rst
@@ -102,7 +102,7 @@ Under below explanation, we assume CONFIG_SWAP=y.
The logic is very clear. (About migration, see below)
Note:
- __remove_from_page_cache() is called by remove_from_page_cache()
+ __filemap_remove_folio() is called by filemap_remove_folio()
and __remove_mapping().
6. Shmem(tmpfs) Page Cache
diff --git a/Documentation/admin-guide/cgroup-v1/memory.rst b/Documentation/admin-guide/cgroup-v1/memory.rst
index ca7d9402f6be..286d16fc22eb 100644
--- a/Documentation/admin-guide/cgroup-v1/memory.rst
+++ b/Documentation/admin-guide/cgroup-v1/memory.rst
@@ -78,18 +78,22 @@ Brief summary of control files.
memory.memsw.max_usage_in_bytes show max memory+Swap usage recorded
memory.soft_limit_in_bytes set/show soft limit of memory usage
This knob is not available on CONFIG_PREEMPT_RT systems.
+ This knob is deprecated and shouldn't be
+ used.
memory.stat show various statistics
memory.use_hierarchy set/show hierarchical account enabled
This knob is deprecated and shouldn't be
used.
memory.force_empty trigger forced page reclaim
memory.pressure_level set memory pressure notifications
+ This knob is deprecated and shouldn't be
+ used.
memory.swappiness set/show swappiness parameter of vmscan
(See sysctl's vm.swappiness)
- memory.move_charge_at_immigrate set/show controls of moving charges
+ memory.move_charge_at_immigrate This knob is deprecated.
+ memory.oom_control set/show oom controls.
This knob is deprecated and shouldn't be
used.
- memory.oom_control set/show oom controls.
memory.numa_stat show the number of memory usage per numa
node
memory.kmem.limit_in_bytes Deprecated knob to set and read the kernel
@@ -105,10 +109,18 @@ Brief summary of control files.
memory.kmem.max_usage_in_bytes show max kernel memory usage recorded
memory.kmem.tcp.limit_in_bytes set/show hard limit for tcp buf memory
+ This knob is deprecated and shouldn't be
+ used.
memory.kmem.tcp.usage_in_bytes show current tcp buf memory allocation
+ This knob is deprecated and shouldn't be
+ used.
memory.kmem.tcp.failcnt show the number of tcp buf memory usage
hits limits
+ This knob is deprecated and shouldn't be
+ used.
memory.kmem.tcp.max_usage_in_bytes show max tcp buf memory usage recorded
+ This knob is deprecated and shouldn't be
+ used.
==================================== ==========================================
1. History
@@ -229,10 +241,6 @@ behind this approach is that a cgroup that aggressively uses a shared
page will eventually get charged for it (once it is uncharged from
the cgroup that brought it in -- this will happen on memory pressure).
-But see :ref:`section 8.2 <cgroup-v1-memory-movable-charges>` when moving a
-task to another cgroup, its pages may be recharged to the new cgroup, if
-move_charge_at_immigrate has been chosen.
-
2.4 Swap Extension
--------------------------------------
@@ -300,14 +308,14 @@ When oom event notifier is registered, event will be delivered.
Lock order is as follows::
- Page lock (PG_locked bit of page->flags)
+ folio_lock
mm->page_table_lock or split pte_lock
folio_memcg_lock (memcg->move_lock)
mapping->i_pages lock
lruvec->lru_lock.
Per-node-per-memcgroup LRU (cgroup's private LRU) is guarded by
-lruvec->lru_lock; PG_lru bit of page->flags is cleared before
+lruvec->lru_lock; the folio LRU flag is cleared before
isolating a page from its LRU under lruvec->lru_lock.
.. _cgroup-v1-memory-kernel-extension:
@@ -693,8 +701,10 @@ For compatibility reasons writing 1 to memory.use_hierarchy will always pass::
# echo 1 > memory.use_hierarchy
-7. Soft limits
-==============
+7. Soft limits (DEPRECATED)
+===========================
+
+THIS IS DEPRECATED!
Soft limits allow for greater sharing of memory. The idea behind soft limits
is to allow control groups to use as much of the memory as needed, provided
@@ -740,78 +750,8 @@ If we want to change this to 1G, we can at any time use::
THIS IS DEPRECATED!
-It's expensive and unreliable! It's better practice to launch workload
-tasks directly from inside their target cgroup. Use dedicated workload
-cgroups to allow fine-grained policy adjustments without having to
-move physical pages between control domains.
-
-Users can move charges associated with a task along with task migration, that
-is, uncharge task's pages from the old cgroup and charge them to the new cgroup.
-This feature is not supported in !CONFIG_MMU environments because of lack of
-page tables.
-
-8.1 Interface
--------------
-
-This feature is disabled by default. It can be enabled (and disabled again) by
-writing to memory.move_charge_at_immigrate of the destination cgroup.
-
-If you want to enable it::
-
- # echo (some positive value) > memory.move_charge_at_immigrate
-
-.. note::
- Each bits of move_charge_at_immigrate has its own meaning about what type
- of charges should be moved. See :ref:`section 8.2
- <cgroup-v1-memory-movable-charges>` for details.
-
-.. note::
- Charges are moved only when you move mm->owner, in other words,
- a leader of a thread group.
-
-.. note::
- If we cannot find enough space for the task in the destination cgroup, we
- try to make space by reclaiming memory. Task migration may fail if we
- cannot make enough space.
-
-.. note::
- It can take several seconds if you move charges much.
-
-And if you want disable it again::
-
- # echo 0 > memory.move_charge_at_immigrate
-
-.. _cgroup-v1-memory-movable-charges:
-
-8.2 Type of charges which can be moved
---------------------------------------
-
-Each bit in move_charge_at_immigrate has its own meaning about what type of
-charges should be moved. But in any case, it must be noted that an account of
-a page or a swap can be moved only when it is charged to the task's current
-(old) memory cgroup.
-
-+---+--------------------------------------------------------------------------+
-|bit| what type of charges would be moved ? |
-+===+==========================================================================+
-| 0 | A charge of an anonymous page (or swap of it) used by the target task. |
-| | You must enable Swap Extension (see 2.4) to enable move of swap charges. |
-+---+--------------------------------------------------------------------------+
-| 1 | A charge of file pages (normal file, tmpfs file (e.g. ipc shared memory) |
-| | and swaps of tmpfs file) mmapped by the target task. Unlike the case of |
-| | anonymous pages, file pages (and swaps) in the range mmapped by the task |
-| | will be moved even if the task hasn't done page fault, i.e. they might |
-| | not be the task's "RSS", but other task's "RSS" that maps the same file. |
-| | And mapcount of the page is ignored (the page can be moved even if |
-| | page_mapcount(page) > 1). You must enable Swap Extension (see 2.4) to |
-| | enable move of swap charges. |
-+---+--------------------------------------------------------------------------+
-
-8.3 TODO
---------
-
-- All of moving charge operations are done under cgroup_mutex. It's not good
- behavior to hold the mutex too long, so we may need some trick.
+Reading memory.move_charge_at_immigrate will always return 0 and writing
+to it will always return -EINVAL.
9. Memory thresholds
====================
@@ -834,8 +774,10 @@ It's applicable for root and non-root cgroup.
.. _cgroup-v1-memory-oom-control:
-10. OOM Control
-===============
+10. OOM Control (DEPRECATED)
+============================
+
+THIS IS DEPRECATED!
memory.oom_control file is for OOM notification and other controls.
@@ -882,8 +824,10 @@ At reading, current status of OOM is shown.
The number of processes belonging to this cgroup killed by any
kind of OOM killer.
-11. Memory Pressure
-===================
+11. Memory Pressure (DEPRECATED)
+================================
+
+THIS IS DEPRECATED!
The pressure level notifications can be used to monitor the memory
allocation cost; based on the pressure, applications can implement
diff --git a/Documentation/admin-guide/cgroup-v1/pids.rst b/Documentation/admin-guide/cgroup-v1/pids.rst
index 6acebd9e72c8..0f9f9a7b1f6c 100644
--- a/Documentation/admin-guide/cgroup-v1/pids.rst
+++ b/Documentation/admin-guide/cgroup-v1/pids.rst
@@ -36,7 +36,8 @@ superset of parent/child/pids.current.
The pids.events file contains event counters:
- - max: Number of times fork failed because limit was hit.
+ - max: Number of times fork failed in the cgroup because limit was hit in
+ self or ancestors.
Example
-------
diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
index 17e6e9565156..cb1b4e759b7e 100644
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -64,13 +64,14 @@ v1 is available under :ref:`Documentation/admin-guide/cgroup-v1/index.rst <cgrou
5-6. Device
5-7. RDMA
5-7-1. RDMA Interface Files
- 5-8. HugeTLB
- 5.8-1. HugeTLB Interface Files
- 5-9. Misc
- 5.9-1 Miscellaneous cgroup Interface Files
- 5.9-2 Migration and Ownership
- 5-10. Others
- 5-10-1. perf_event
+ 5-8. DMEM
+ 5-9. HugeTLB
+ 5.9-1. HugeTLB Interface Files
+ 5-10. Misc
+ 5.10-1 Miscellaneous cgroup Interface Files
+ 5.10-2 Migration and Ownership
+ 5-11. Others
+ 5-11-1. perf_event
5-N. Non-normative information
5-N-1. CPU controller root cgroup process behaviour
5-N-2. IO controller root cgroup process behaviour
@@ -239,6 +240,13 @@ cgroup v2 currently supports the following mount options.
will not be tracked by the memory controller (even if cgroup
v2 is remounted later on).
+ pids_localevents
+ The option restores v1-like behavior of pids.events:max, that is, only
+ local (inside cgroup proper) fork failures are counted. Without this
+ option pids.events:max represents any pids.max enforcement across
+ cgroup's subtree.
+
+
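
To see which behavior is in effect, it can help to compare the ``max`` counter of a cgroup with that of its children. The sketch below only extracts the counter; it assumes the usual flat-keyed layout of pids.events (one ``<key> <value>`` pair per line), and the function name ``pids_events_max`` is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: print the value of the "max" key from a pids.events
# file. Assumes the flat-keyed format "<key> <value>", one pair per line.
pids_events_max() {
    awk '$1 == "max" { print $2 }' "$1"
}

# Usage (the cgroup path is a placeholder):
#   pids_events_max /sys/fs/cgroup/<group>/pids.events
```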
Organizing Processes and Threads
--------------------------------
@@ -526,10 +534,12 @@ cgroup namespace on namespace creation.
Because the resource control interface files in a given directory
control the distribution of the parent's resources, the delegatee
shouldn't be allowed to write to them. For the first method, this is
-achieved by not granting access to these files. For the second, the
-kernel rejects writes to all files other than "cgroup.procs" and
-"cgroup.subtree_control" on a namespace root from inside the
-namespace.
+achieved by not granting access to these files. For the second, files
+outside the namespace should be hidden from the delegatee by the means
+of at least mount namespacing, and the kernel rejects writes to all
+files on a namespace root from inside the cgroup namespace, except for
+those files listed in "/sys/kernel/cgroup/delegate" (including
+"cgroup.procs", "cgroup.threads", "cgroup.subtree_control", etc.).
The end results are equivalent for both delegation types. Once
delegated, the user can build sub-hierarchy under the directory,
@@ -974,6 +984,14 @@ All cgroup core files are prefixed with "cgroup."
A dying cgroup can consume system resources not exceeding
limits, which were active at the moment of cgroup deletion.
+ nr_subsys_<cgroup_subsys>
+ Total number of live cgroup subsystems (e.g. memory
+ cgroup) at and beneath the current cgroup.
+
+ nr_dying_subsys_<cgroup_subsys>
+ Total number of dying cgroup subsystems (e.g. memory
+ cgroup) at and beneath the current cgroup.
+
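
The per-subsystem counters can be used to spot controllers that still hold dying cgroups. The following sketch filters a cgroup.stat file for non-zero ``nr_dying_subsys_*`` entries; it assumes the flat-keyed ``<key> <value>`` layout of that file, and the function name ``dying_subsystems`` is made up for this example.

```shell
#!/bin/sh
# Hypothetical helper: print "<subsystem> <count>" for every
# nr_dying_subsys_<subsystem> key in a cgroup.stat file whose count is
# greater than zero.
dying_subsystems() {
    awk '/^nr_dying_subsys_/ && $2 > 0 {
        sub(/^nr_dying_subsys_/, "", $1)   # keep only the subsystem name
        print $1, $2
    }' "$1"
}

# Usage (the cgroup path is a placeholder):
#   dying_subsystems /sys/fs/cgroup/<group>/cgroup.stat
```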
cgroup.freeze
A read-write single value file which exists on non-root cgroups.
Allowed values are "0" and "1". The default is "0".
@@ -1058,12 +1076,15 @@ cpufreq governor about the minimum desired frequency which should always be
provided by a CPU, as well as the maximum desired frequency, which should not
be exceeded by a CPU.
-WARNING: cgroup2 doesn't yet support control of realtime processes and
-the cpu controller can only be enabled when all RT processes are in
-the root cgroup. Be aware that system management software may already
-have placed RT processes into nonroot cgroups during the system boot
-process, and these processes may need to be moved to the root cgroup
-before the cpu controller can be enabled.
+WARNING: cgroup2 doesn't yet support control of realtime processes. For
+a kernel built with the CONFIG_RT_GROUP_SCHED option enabled for group
+scheduling of realtime processes, the cpu controller can only be enabled
+when all RT processes are in the root cgroup. This limitation does
+not apply if CONFIG_RT_GROUP_SCHED is disabled. Be aware that system
+management software may already have placed RT processes into nonroot
+cgroups during the system boot process, and these processes may need
+to be moved to the root cgroup before the cpu controller can be enabled
+with a CONFIG_RT_GROUP_SCHED enabled kernel.
CPU Interface Files
@@ -1296,17 +1317,10 @@ PAGE_SIZE multiple when read back.
This is a simple interface to trigger memory reclaim in the
target cgroup.
- This file accepts a single key, the number of bytes to reclaim.
- No nested keys are currently supported.
-
Example::
echo "1G" > memory.reclaim
- The interface can be later extended with nested keys to
- configure the reclaim behavior. For example, specify the
- type of memory to reclaim from (anon, file, ..).
-
Please note that the kernel can over or under reclaim from
the target cgroup. If less bytes are reclaimed than the
specified amount, -EAGAIN is returned.
@@ -1318,12 +1332,26 @@ PAGE_SIZE multiple when read back.
This means that the networking layer will not adapt based on
reclaim induced by memory.reclaim.
+The following nested keys are defined.
+
+ ========== ================================
+ swappiness Swappiness value to reclaim with
+ ========== ================================
+
+ Specifying a swappiness value instructs the kernel to perform
+ the reclaim with that swappiness value. Note that this has the
+ same semantics as vm.swappiness applied to memcg reclaim with
+ all the existing limitations and potential future extensions.
+
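
A nested key is passed in the same write as the amount to reclaim. The sketch below builds such a request string; it assumes the key is given as ``swappiness=<value>`` after the amount, separated by a space, which the text above does not spell out. The function name ``reclaim_request`` is hypothetical, and no range check on the swappiness value is attempted.

```shell
#!/bin/sh
# Hypothetical helper: build the string written to memory.reclaim.
#   $1: amount to reclaim (for example "1G")
#   $2: optional swappiness value; omitted means the key is left out
# Assumption: the nested key is written as "swappiness=<value>" after the
# amount, separated by a single space.
reclaim_request() {
    amount="$1"
    swappiness="${2:-}"
    if [ -z "$swappiness" ]; then
        printf '%s\n' "$amount"
    else
        printf '%s swappiness=%s\n' "$amount" "$swappiness"
    fi
}

# Usage (needs write access; the cgroup path is a placeholder):
#   reclaim_request 1G 0 > /sys/fs/cgroup/<group>/memory.reclaim
```

As noted above, the kernel can over or under reclaim, so a caller should be prepared for -EAGAIN when fewer bytes than requested were reclaimed.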
memory.peak
- A read-only single value file which exists on non-root
- cgroups.
+ A read-write single value file which exists on non-root cgroups.
- The max memory usage recorded for the cgroup and its
- descendants since the creation of the cgroup.
+ The max memory usage recorded for the cgroup and its descendants since
+ either the creation of the cgroup or the most recent reset for that FD.
+
+ A write of any non-empty string to this file resets it to the
+ current memory usage for subsequent reads through the same
+ file descriptor.
memory.oom.group
A read-write single value file which exists on non-root
@@ -1432,7 +1460,7 @@ PAGE_SIZE multiple when read back.
sec_pagetables
Amount of memory allocated for secondary page tables,
this currently includes KVM mmu allocations on x86
- and arm64.
+ and arm64 and IOMMU page tables.
percpu (npn)
Amount of memory used for storing per-cpu kernel
@@ -1572,6 +1600,24 @@ PAGE_SIZE multiple when read back.
pglazyfreed (npn)
Amount of reclaimed lazyfree pages
+ swpin_zero
+ Number of pages swapped into memory and filled with zero, where I/O
+ was optimized out because the page content was detected to be zero
+ during swapout.
+
+ swpout_zero
+ Number of zero-filled pages swapped out with I/O skipped due to the
+ content being detected as zero.
+
+ zswpin
+ Number of pages moved in to memory from zswap.
+
+ zswpout
+ Number of pages moved out of memory to zswap.
+
+ zswpwb
+ Number of pages written from zswap to swap.
+
thp_fault_alloc (npn)
Number of transparent hugepages which were allocated to satisfy
a page fault. This counter is not present when CONFIG_TRANSPARENT_HUGEPAGE
@@ -1591,6 +1637,30 @@ PAGE_SIZE multiple when read back.
Usually because failed to allocate some continuous swap space
for the huge page.
+ numa_pages_migrated (npn)
+ Number of pages migrated by NUMA balancing.
+
+ numa_pte_updates (npn)
+ Number of pages whose page table entries are modified by
+ NUMA balancing to produce NUMA hinting faults on access.
+
+ numa_hint_faults (npn)
+ Number of NUMA hinting faults.
+
+ pgdemote_kswapd
+ Number of pages demoted by kswapd.
+
+ pgdemote_direct
+ Number of pages demoted directly.
+
+ pgdemote_khugepaged
+ Number of pages demoted by khugepaged.
+
+ hugetlb
+ Amount of memory used by hugetlb pages. This metric only shows
+ up if hugetlb usage is accounted for in memory.current (i.e.
+ cgroup is mounted with the memory_hugetlb_accounting option).
+
memory.numa_stat
A read-only nested-keyed file which exists on non-root cgroups.
@@ -1640,11 +1710,14 @@ PAGE_SIZE multiple when read back.
Healthy workloads are not expected to reach this limit.
memory.swap.peak
- A read-only single value file which exists on non-root
- cgroups.
+ A read-write single value file which exists on non-root cgroups.
+
+ The max swap usage recorded for the cgroup and its descendants since
+ the creation of the cgroup or the most recent reset for that FD.
- The max swap usage recorded for the cgroup and its
- descendants since the creation of the cgroup.
+ A write of any non-empty string to this file resets it to the
+ current swap usage for subsequent reads through the same
+ file descriptor.
memory.swap.max
A read-write single value file which exists on non-root
@@ -1694,9 +1767,10 @@ PAGE_SIZE multiple when read back.
entries fault back in or are written out to disk.
memory.zswap.writeback
- A read-write single value file. The default value is "1". The
- initial value of the root cgroup is 1, and when a new cgroup is
- created, it inherits the current value of its parent.
+ A read-write single value file. The default value is "1".
+ Note that this setting is hierarchical, i.e. the writeback would be
+ implicitly disabled for child cgroups if the upper hierarchy
+ does so.
When this is set to 0, all swapping attempts to swapping devices
are disabled. This includes both zswap writebacks, and swapping due
@@ -1707,6 +1781,8 @@ PAGE_SIZE multiple when read back.
Note that this is subtly different from setting memory.swap.max to
0, as it still allows for pages to be written to the zswap pool.
+ This setting has no effect if zswap is disabled, and swapping
+ is allowed unless memory.swap.max is set to 0.
memory.pressure
A read-only nested-keyed file.
@@ -2181,11 +2257,31 @@ PID Interface Files
Hard limit of number of processes.
pids.current
- A read-only single value file which exists on all cgroups.
+ A read-only single value file which exists on non-root cgroups.
The number of processes currently in the cgroup and its
descendants.
+ pids.peak
+ A read-only single value file which exists on non-root cgroups.
+
+ The maximum value that the number of processes in the cgroup and its
+ descendants has ever reached.
+
+ pids.events
+ A read-only flat-keyed file which exists on non-root cgroups. Unless
+ specified otherwise, a value change in this file generates a file
+ modified event. The following entries are defined.
+
+ max
+ The number of times the cgroup's total number of processes hit the pids.max
+ limit (see also pids_localevents).
+
+ pids.events.local
+ Similar to pids.events but the fields in the file are local
+ to the cgroup i.e. not hierarchical. The file modified event
+ generated on this file reflects only the local events.
+
Organisational operations are not blocked by cgroup policies, so it is
possible to have pids.current > pids.max. This can be done by either
setting the limit to be smaller than pids.current, or attaching enough
@@ -2320,8 +2416,12 @@ Cpuset Interface Files
is always a subset of it.
Users can manually set it to a value that is different from
- "cpuset.cpus". The only constraint in setting it is that the
- list of CPUs must be exclusive with respect to its sibling.
+ "cpuset.cpus". One constraint in setting it is that the list of
+ CPUs must be exclusive with respect to "cpuset.cpus.exclusive"
+ of its sibling. If "cpuset.cpus.exclusive" of a sibling cgroup
+ isn't set, its "cpuset.cpus" value, if set, cannot be a subset
+ of it to leave at least one CPU available when the exclusive
+ CPUs are taken away.
For a parent cgroup, any one of its exclusive CPUs can only
be distributed to at most one of its child cgroups. Having an
@@ -2337,8 +2437,8 @@ Cpuset Interface Files
cpuset-enabled cgroups.
This file shows the effective set of exclusive CPUs that
- can be used to create a partition root. The content of this
- file will always be a subset of "cpuset.cpus" and its parent's
+ can be used to create a partition root. The content
+ of this file will always be a subset of its parent's
"cpuset.cpus.exclusive.effective" if its parent is not the root
cgroup. It will also be a subset of "cpuset.cpus.exclusive"
if it is set. If "cpuset.cpus.exclusive" is not set, it is
@@ -2527,6 +2627,49 @@ RDMA Interface Files
mlx4_0 hca_handle=1 hca_object=20
ocrdma1 hca_handle=1 hca_object=23
+DMEM
+----
+
+The "dmem" controller regulates the distribution and accounting of
+device memory regions. Because each memory region may have its own page size,
+which does not have to be equal to the system page size, the units are always bytes.
+
+DMEM Interface Files
+~~~~~~~~~~~~~~~~~~~~
+
+ dmem.max, dmem.min, dmem.low
+ A read-write nested-keyed file that exists for all cgroups
+ except root and describes the currently configured resource limit
+ for a region.
+
+ An example for xe follows::
+
+ drm/0000:03:00.0/vram0 1073741824
+ drm/0000:03:00.0/stolen max
+
+ The semantics are the same as for the memory cgroup controller, and are
+ calculated in the same way.
+
+ dmem.capacity
+ A read-only file that describes maximum region capacity.
+ It only exists on the root cgroup. Not all memory can be
+ allocated by cgroups, as the kernel reserves some for
+ internal use.
+
+ An example for xe follows::
+
+ drm/0000:03:00.0/vram0 8514437120
+ drm/0000:03:00.0/stolen 67108864
+
+ dmem.current
+ A read-only file that describes current resource usage.
+ It exists for all the cgroups except root.
+
+ An example for xe follows::
+
+ drm/0000:03:00.0/vram0 12550144
+ drm/0000:03:00.0/stolen 8650752
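The dmem examples above all use one "region value" pair per line. As a minimal sketch (not part of the kernel documentation), assuming that layout and that the literal string "max" means "no limit", a monitoring script could parse such a file like this. The function name ``parse_dmem_file`` is hypothetical.

```python
def parse_dmem_file(text):
    """Parse 'region value' lines; the literal 'max' is mapped to None."""
    regions = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        # The value is the last whitespace-separated field on the line.
        region, value = line.rsplit(None, 1)
        regions[region] = None if value == "max" else int(value)
    return regions


# Sample content copied from the dmem.max example above.
sample = "drm/0000:03:00.0/vram0 1073741824\ndrm/0000:03:00.0/stolen max\n"
limits = parse_dmem_file(sample)
```

In practice the text would come from reading the cgroup's dmem.max, dmem.current or the root dmem.capacity file.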
+
HugeTLB
-------
@@ -2599,6 +2742,15 @@ Miscellaneous controller provides 3 interface files. If two misc resources (res_
res_a 3
res_b 0
+ misc.peak
+ A read-only flat-keyed file shown in all cgroups. It shows the
+ historical maximum usage of the resources in the cgroup and its
+ children.::
+
+ $ cat misc.peak
+ res_a 10
+ res_b 8
+
misc.max
A read-write flat-keyed file shown in the non root cgroups. Allowed
maximum usage of the resources in the cgroup and its children.::
@@ -2628,6 +2780,11 @@ Miscellaneous controller provides 3 interface files. If two misc resources (res_
The number of times the cgroup's resource usage was
about to go over the max boundary.
+ misc.events.local
+ Similar to misc.events but the fields in the file are local to the
+ cgroup i.e. not hierarchical. The file modified event generated on
+ this file reflects only the local events.
+
Migration and Ownership
~~~~~~~~~~~~~~~~~~~~~~~
@@ -2846,7 +3003,7 @@ following two functions.
a queue (device) has been associated with the bio and
before submission.
- wbc_account_cgroup_owner(@wbc, @page, @bytes)
+ wbc_account_cgroup_owner(@wbc, @folio, @bytes)
Should be called for each data segment being written out.
While this function doesn't care exactly when it's called
during the writeback session, it's the easiest and most
@@ -2878,8 +3035,8 @@ Deprecated v1 Core Features
- "cgroup.clone_children" is removed.
-- /proc/cgroups is meaningless for v2. Use "cgroup.controllers" file
- at the root instead.
+- /proc/cgroups is meaningless for v2. Use "cgroup.controllers" or
+ "cgroup.stat" files at the root instead.
Issues with v1 and Rationales for v2
diff --git a/Documentation/admin-guide/cifs/introduction.rst b/Documentation/admin-guide/cifs/introduction.rst
index 53ea62906aa5..ffc6e2564dd5 100644
--- a/Documentation/admin-guide/cifs/introduction.rst
+++ b/Documentation/admin-guide/cifs/introduction.rst
@@ -28,7 +28,7 @@ Introduction
high performance safe distributed caching (leases/oplocks), optional packet
signing, large files, Unicode support and other internationalization
improvements. Since both Samba server and this filesystem client support the
- CIFS Unix extensions, and the Linux client also suppors SMB3 POSIX extensions,
+ CIFS Unix extensions, and the Linux client also supports SMB3 POSIX extensions,
the combination can provide a reasonable alternative to other network and
cluster file systems for fileserving in some Linux to Linux environments,
not just in Linux to Windows (or Linux to Mac) environments.
diff --git a/Documentation/admin-guide/cifs/usage.rst b/Documentation/admin-guide/cifs/usage.rst
index aa8290a29dc8..c09674a75a9e 100644
--- a/Documentation/admin-guide/cifs/usage.rst
+++ b/Documentation/admin-guide/cifs/usage.rst
@@ -723,40 +723,26 @@ Configuration pseudo-files:
======================= =======================================================
SecurityFlags Flags which control security negotiation and
also packet signing. Authentication (may/must)
- flags (e.g. for NTLM and/or NTLMv2) may be combined with
+ flags (e.g. for NTLMv2) may be combined with
the signing flags. Specifying two different password
hashing mechanisms (as "must use") on the other hand
does not make much sense. Default flags are::
- 0x07007
-
- (NTLM, NTLMv2 and packet signing allowed). The maximum
- allowable flags if you want to allow mounts to servers
- using weaker password hashes is 0x37037 (lanman,
- plaintext, ntlm, ntlmv2, signing allowed). Some
- SecurityFlags require the corresponding menuconfig
- options to be enabled. Enabling plaintext
- authentication currently requires also enabling
- lanman authentication in the security flags
- because the cifs module only supports sending
- laintext passwords using the older lanman dialect
- form of the session setup SMB. (e.g. for authentication
- using plain text passwords, set the SecurityFlags
- to 0x30030)::
+ 0x00C5
+
+ (NTLMv2 and packet signing allowed). Some SecurityFlags
+ may require enabling a corresponding menuconfig option.
may use packet signing 0x00001
must use packet signing 0x01001
- may use NTLM (most common password hash) 0x00002
- must use NTLM 0x02002
may use NTLMv2 0x00004
must use NTLMv2 0x04004
- may use Kerberos security 0x00008
- must use Kerberos 0x08008
- may use lanman (weak) password hash 0x00010
- must use lanman password hash 0x10010
- may use plaintext passwords 0x00020
- must use plaintext passwords 0x20020
- (reserved for future packet encryption) 0x00040
+ may use Kerberos security (krb5) 0x00008
+ must use Kerberos 0x08008
+ may use NTLMSSP 0x00080
+ must use NTLMSSP 0x80080
+ seal (packet encryption) 0x00040
+ must seal 0x40040
cifsFYI If set to non-zero value, additional debug information
will be logged to the system error log. This field
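The SecurityFlags values listed in the hunk above follow a visible pattern: each "must use" value is the matching "may use" bit plus the same bit shifted left by 12. The following is a minimal sketch that decodes a flag value under that reading of the table. It is not the cifs module's own parser, and the names ``MAY_FLAGS`` and ``decode_security_flags`` are made up for illustration.

```python
# "may use" bits taken from the table in the patch. A "must use" value sets
# the same bit shifted left by 12 as well (e.g. 0x01001 = 0x01000 | 0x00001).
MAY_FLAGS = {
    0x00001: "packet signing",
    0x00004: "NTLMv2",
    0x00008: "Kerberos (krb5)",
    0x00040: "seal (packet encryption)",
    0x00080: "NTLMSSP",
}


def decode_security_flags(value):
    """Return {mechanism: "may" | "must"} for every bit set in value."""
    result = {}
    for bit, name in MAY_FLAGS.items():
        if value & (bit << 12):
            result[name] = "must"
        elif value & bit:
            result[name] = "may"
    return result


default_flags = decode_security_flags(0x00C5)
```

Decoding the documented default 0x00C5 this way reports the signing, NTLMv2, seal and NTLMSSP bits, all at the "may" level.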
diff --git a/Documentation/admin-guide/device-mapper/delay.rst b/Documentation/admin-guide/device-mapper/delay.rst
index 917ba8c33359..4d667228e744 100644
--- a/Documentation/admin-guide/device-mapper/delay.rst
+++ b/Documentation/admin-guide/device-mapper/delay.rst
@@ -3,29 +3,52 @@ dm-delay
========
Device-Mapper's "delay" target delays reads and/or writes
-and maps them to different devices.
+and/or flushes and optionally maps them to different devices.
-Parameters::
+Arguments::
<device> <offset> <delay> [<write_device> <write_offset> <write_delay>
[<flush_device> <flush_offset> <flush_delay>]]
-With separate write parameters, the first set is only used for reads.
+Table line has to either have 3, 6 or 9 arguments:
+
+3: apply offset and delay to read, write and flush operations on device
+
+6: apply offset and delay to device, also apply write_offset and write_delay
+ to write and flush operations on optionally different write_device with
+ optionally different sector offset
+
+9: same as 6 arguments plus define flush_offset and flush_delay explicitly
+ on/with optionally different flush_device/flush_offset.
+
Offsets are specified in sectors.
+
Delays are specified in milliseconds.
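The 3/6/9 argument rule described above can be sketched as a small helper that shows which (device, offset, delay) triple ends up applying to reads, writes and flushes. This is only an illustration of the rule as the text states it, not the kernel's table parser. ``delay_classes`` and the device names are hypothetical.

```python
def delay_classes(args):
    """Map a dm-delay argument list to per-operation (device, offset, delay)."""
    if len(args) not in (3, 6, 9):
        raise ValueError("dm-delay expects 3, 6 or 9 arguments")
    # Group the flat argument list into (device, offset, delay) triples.
    triples = [
        (args[i], int(args[i + 1]), int(args[i + 2]))
        for i in range(0, len(args), 3)
    ]
    read = triples[0]
    # Without a write triple, writes share the read settings.
    write = triples[1] if len(triples) >= 2 else read
    # Without a flush triple, flushes share the write settings.
    flush = triples[2] if len(triples) == 3 else write
    return {"read": read, "write": write, "flush": flush}


six = delay_classes(["devA", "2048", "0", "devB", "4096", "400"])
```

With six arguments, reads go to devA with no delay while both writes and flushes go to devB at offset 4096 with a 400 ms delay.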
+
Example scripts
===============
::
-
#!/bin/sh
- # Create device delaying rw operation for 500ms
- echo "0 `blockdev --getsz $1` delay $1 0 500" | dmsetup create delayed
+ #
+ # Create mapped device named "delayed" delaying read, write and flush operations for 500ms.
+ #
+ dmsetup create delayed --table "0 `blockdev --getsz $1` delay $1 0 500"
::
+ #!/bin/sh
+ #
+ # Create mapped device delaying write and flush operations for 400ms and
+ # splitting reads to device $1 but writes and flushes to different device $2
+ # to different offsets of 2048 and 4096 sectors respectively.
+ #
+ dmsetup create delayed --table "0 `blockdev --getsz $1` delay $1 2048 0 $2 4096 400"
+::
#!/bin/sh
- # Create device delaying only write operation for 500ms and
- # splitting reads and writes to different devices $1 $2
- echo "0 `blockdev --getsz $1` delay $1 0 0 $2 0 500" | dmsetup create delayed
+ #
+ # Create mapped device delaying reads for 50ms, writes for 100ms and flushes for 333ms
+ # onto the same backing device at offset 0 sectors.
+ #
+ dmsetup create delayed --table "0 `blockdev --getsz $1` delay $1 0 50 $1 0 100 $1 0 333"
diff --git a/Documentation/admin-guide/device-mapper/dm-crypt.rst b/Documentation/admin-guide/device-mapper/dm-crypt.rst
index aa2d04d95df6..9f8139ff97d6 100644
--- a/Documentation/admin-guide/device-mapper/dm-crypt.rst
+++ b/Documentation/admin-guide/device-mapper/dm-crypt.rst
@@ -113,6 +113,11 @@ same_cpu_crypt
The default is to use an unbound workqueue so that encryption work
is automatically balanced between available CPUs.
+high_priority
+ Set dm-crypt workqueues and the writer thread to high priority. This
+ improves throughput and latency of dm-crypt while degrading general
+ responsiveness of the system.
+
submit_from_crypt_cpus
Disable offloading writes to a separate thread after encryption.
There are some situations where offloading write bios from the
@@ -155,6 +160,27 @@ iv_large_sectors
The <iv_offset> must be multiple of <sector_size> (in 512 bytes units)
if this flag is specified.
+integrity_key_size:<bytes>
+ Use an integrity key of <bytes> size instead of using an integrity key size
+ of the digest size of the used HMAC algorithm.
+
+
+Module parameters::
+ max_read_size
+ Maximum size of read requests. When a request larger than this size
+ is received, dm-crypt will split the request. The splitting improves
+ concurrency (the split requests could be encrypted in parallel by multiple
+ cores), but it also causes overhead. The user should tune this parameter to
+ fit the actual workload.
+
+ max_write_size
+ Maximum size of write requests. When a request larger than this size
+ is received, dm-crypt will split the request. The splitting improves
+ concurrency (the split requests could be encrypted in parallel by multiple
+ cores), but it also causes overhead. The user should tune this parameter to
+ fit the actual workload.
+
+
Example scripts
===============
LUKS (Linux Unified Key Setup) is now the preferred way to set up disk
diff --git a/Documentation/admin-guide/device-mapper/index.rst b/Documentation/admin-guide/device-mapper/index.rst
index cde52cc09645..cc5aec861576 100644
--- a/Documentation/admin-guide/device-mapper/index.rst
+++ b/Documentation/admin-guide/device-mapper/index.rst
@@ -34,6 +34,8 @@ Device Mapper
switch
thin-provisioning
unstriped
+ vdo-design
+ vdo
verity
writecache
zero
diff --git a/Documentation/admin-guide/device-mapper/vdo-design.rst b/Documentation/admin-guide/device-mapper/vdo-design.rst
new file mode 100644
index 000000000000..3cd59decbec0
--- /dev/null
+++ b/Documentation/admin-guide/device-mapper/vdo-design.rst
@@ -0,0 +1,633 @@
+.. SPDX-License-Identifier: GPL-2.0-only
+
+================
+Design of dm-vdo
+================
+
+The dm-vdo (virtual data optimizer) target provides inline deduplication,
+compression, zero-block elimination, and thin provisioning. A dm-vdo target
+can be backed by up to 256TB of storage, and can present a logical size of
+up to 4PB. This target was originally developed at Permabit Technology
+Corp. starting in 2009. It was first released in 2013 and has been used in
+production environments ever since. It was made open-source in 2017 after
+Permabit was acquired by Red Hat. This document describes the design of
+dm-vdo. For usage, see vdo.rst in the same directory as this file.
+
+Because deduplication rates fall drastically as the block size increases, a
+vdo target has a maximum block size of 4K. However, it can achieve
+deduplication rates of 254:1, i.e. up to 254 copies of a given 4K block can
+reference a single 4K of actual storage. It can achieve compression rates
+of 14:1. All zero blocks consume no storage at all.
+
+Theory of Operation
+===================
+
+The design of dm-vdo is based on the idea that deduplication is a two-part
+problem. The first is to recognize duplicate data. The second is to avoid
+storing multiple copies of those duplicates. Therefore, dm-vdo has two main
+parts: a deduplication index (called UDS) that is used to discover
+duplicate data, and a data store with a reference counted block map that
+maps from logical block addresses to the actual storage location of the
+data.
+
+Zones and Threading
+-------------------
+
+Due to the complexity of data optimization, the number of metadata
+structures involved in a single write operation to a vdo target is larger
+than most other targets. Furthermore, because vdo must operate on small
+block sizes in order to achieve good deduplication rates, acceptable
+performance can only be achieved through parallelism. Therefore, vdo's
+design attempts to be lock-free.
+
+Most of a vdo's main data structures are designed to be easily divided into
+"zones" such that any given bio must only access a single zone of any zoned
+structure. Safety with minimal locking is achieved by ensuring that during
+normal operation, each zone is assigned to a specific thread, and only that
+thread will access the portion of the data structure in that zone.
+Associated with each thread is a work queue. Each bio is associated with a
+request object (the "data_vio") which will be added to a work queue when
+the next phase of its operation requires access to the structures in the
+zone associated with that queue.
+
+Another way of thinking about this arrangement is that the work queue for
+each zone has an implicit lock on the structures it manages for all its
+operations, because vdo guarantees that no other thread will alter those
+structures.
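The zone-per-thread arrangement described above can be illustrated with a toy model: each zone owns its data and a work queue, and only the zone's own thread ever touches that data, so no lock is needed around it. This is a rough sketch in Python, not how dm-vdo is implemented. The ``Zone`` class, ``run_demo`` and the "key modulo zone count" routing are assumptions made for the example.

```python
import queue
import threading


class Zone:
    """A zone's private state plus the single thread allowed to touch it."""

    def __init__(self, zone_id):
        self.zone_id = zone_id
        self.counters = {}  # only modified by this zone's thread
        self.work = queue.Queue()
        self.thread = threading.Thread(target=self._run)
        self.thread.start()

    def _run(self):
        while True:
            item = self.work.get()
            if item is None:  # sentinel: no more work
                break
            self.counters[item] = self.counters.get(item, 0) + 1

    def stop(self):
        self.work.put(None)
        self.thread.join()


def run_demo(keys, zone_count):
    zones = [Zone(i) for i in range(zone_count)]
    for key in keys:
        # Route each request to the one zone that owns its key.
        zones[key % zone_count].work.put(key)
    for zone in zones:
        zone.stop()
    return [zone.counters for zone in zones]


per_zone = run_demo([0, 1, 2, 3, 4, 4], zone_count=2)
```

Because every key is routed to exactly one queue, the per-zone dictionaries never see concurrent writers, which is the "implicit lock" the text refers to.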
+
+Although each structure is divided into zones, this division is not
+reflected in the on-disk representation of each data structure. Therefore,
+the number of zones for each structure, and hence the number of threads,
+can be reconfigured each time a vdo target is started.
+
+The Deduplication Index
+-----------------------
+
+In order to identify duplicate data efficiently, vdo was designed to
+leverage some common characteristics of duplicate data. From empirical
+observations, we gathered two key insights. The first is that in most data
+sets with significant amounts of duplicate data, the duplicates tend to
+have temporal locality. When a duplicate appears, it is more likely that
+other duplicates will be detected, and that those duplicates will have been
+written at about the same time. This is why the index keeps records in
+temporal order. The second insight is that new data is more likely to
+duplicate recent data than it is to duplicate older data and in general,
+there are diminishing returns to looking further back in time. Therefore,
+when the index is full, it should cull its oldest records to make space for
+new ones. Another important idea behind the design of the index is that the
+ultimate goal of deduplication is to reduce storage costs. Since there is a
+trade-off between the storage saved and the resources expended to achieve
+those savings, vdo does not attempt to find every last duplicate block. It
+is sufficient to find and eliminate most of the redundancy.
+
+Each block of data is hashed to produce a 16-byte block name. An index
+record consists of this block name paired with the presumed location of
+that data on the underlying storage. However, it is not possible to
+guarantee that the index is accurate. In the most common case, this occurs
+because it is too costly to update the index when a block is over-written
+or discarded. Doing so would require either storing the block name along
+with the blocks, which is difficult to do efficiently in block-based
+storage, or reading and rehashing each block before overwriting it.
+Inaccuracy can also result from a hash collision where two different blocks
+have the same name. In practice, this is extremely unlikely, but because
+vdo does not use a cryptographic hash, a malicious workload could be
+constructed. Because of these inaccuracies, vdo treats the locations in the
+index as hints, and reads each indicated block to verify that it is indeed
+a duplicate before sharing the existing block with a new one.
+
+Records are collected into groups called chapters. New records are added to
+the newest chapter, called the open chapter. This chapter is stored in a
+format optimized for adding and modifying records, and the content of the
+open chapter is not finalized until it runs out of space for new records.
+When the open chapter fills up, it is closed and a new open chapter is
+created to collect new records.
+
+Closing a chapter converts it to a different format which is optimized for
+reading. The records are written to a series of record pages based on the
+order in which they were received. This means that records with temporal
+locality should be on a small number of pages, reducing the I/O required to
+retrieve them. The chapter also compiles an index that indicates which
+record page contains any given name. This index means that a request for a
+name can determine exactly which record page may contain that record,
+without having to load the entire chapter from storage. This index uses
+only a subset of the block name as its key, so it cannot guarantee that an
+index entry refers to the desired block name. It can only guarantee that if
+there is a record for this name, it will be on the indicated page. Closed
+chapters are read-only structures and their contents are never altered in
+any way.
+
+Once enough records have been written to fill up all the available index
+space, the oldest chapter is removed to make space for new chapters. Any
+time a request finds a matching record in the index, that record is copied
+into the open chapter. This ensures that useful block names remain available
+in the index, while unreferenced block names are forgotten over time.
+
+In order to find records in older chapters, the index also maintains a
+higher level structure called the volume index, which contains entries
+mapping each block name to the chapter containing its newest record. This
+mapping is updated as records for the block name are copied or updated,
+ensuring that only the newest record for a given block name can be found.
+An older record for a block name will no longer be found even though it has
+not been deleted from its chapter. Like the chapter index, the volume index
+uses only a subset of the block name as its key and cannot definitively
+say that a record exists for a name. It can only say which chapter would
+contain the record if a record exists. The volume index is stored entirely
+in memory and is saved to storage only when the vdo target is shut down.
+
+From the viewpoint of a request for a particular block name, it will first
+look up the name in the volume index. This search will either indicate that
+the name is new, or which chapter to search. If it returns a chapter, the
+request looks up its name in the chapter index. This will indicate either
+that the name is new, or which record page to search. Finally, if it is not
+new, the request will look for its name in the indicated record page.
+This process may require up to two page reads per request (one for the
+chapter index page and one for the record page). However, recently
+accessed pages are cached so that these page reads can be amortized across
+many block name requests.
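The three-step lookup described above (volume index, then chapter index, then record page) can be sketched with plain dictionaries. This is a toy model under stated assumptions: both index levels are keyed by the first four bytes of the name to mimic "a subset of the block name", and only the record page stores full names. None of the names below come from the UDS code.

```python
PREFIX_LEN = 4  # assumed size of the partial key used by both index levels


def lookup(name, volume_index, chapters):
    """Return the recorded location for name, or None if the name is new."""
    prefix = name[:PREFIX_LEN]
    chapter_id = volume_index.get(prefix)
    if chapter_id is None:
        return None  # volume index says the name is new
    chapter = chapters[chapter_id]
    page_no = chapter["index"].get(prefix)
    if page_no is None:
        return None  # chapter index says the name is new
    # Only the record page holds full names, so this step is definitive.
    return chapter["pages"][page_no].get(name)


known = b"AAAA" + b"\x01" * 12       # 16-byte block name with a record
lookalike = b"AAAA" + b"\x02" * 12   # same prefix, but never recorded
volume_index = {b"AAAA": 7}
chapters = {7: {"index": {b"AAAA": 0}, "pages": [{known: 1234}]}}

found = lookup(known, volume_index, chapters)
```

The lookalike name passes both index levels because it shares the partial key, and is only rejected at the record page, which is why the index levels can say where a record would be but not that it exists.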
+
+The volume index and the chapter indexes are implemented using a
+memory-efficient structure called a delta index. Instead of storing the
+entire block name (the key) for each entry, the entries are sorted by name
+and only the difference between adjacent keys (the delta) is stored.
+Because we expect the hashes to be randomly distributed, the size of the
+deltas follows an exponential distribution. Because of this distribution,
+the deltas are expressed using a Huffman code to take up even less space.
+The entire sorted list of keys is called a delta list. This structure
+allows the index to use many fewer bytes per entry than a traditional hash
+table, but it is slightly more expensive to look up entries, because a
+request must read every entry in a delta list to add up the deltas in order
+to find the record it needs. The delta index reduces this lookup cost by
+splitting its key space into many sub-lists, each starting at a fixed key
+value, so that each individual list is short.
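The delta list idea above can be sketched as follows: keys are sorted, only the differences between neighbours are stored, and a lookup adds the deltas back up until it reaches or passes the key. The sketch also splits the key space into fixed-size sub-lists. It is a simplified illustration that omits the Huffman coding entirely, and all function names are hypothetical.

```python
from itertools import accumulate


def build_delta_list(keys):
    """Store sorted keys as differences between adjacent keys."""
    ordered = sorted(keys)
    return [b - a for a, b in zip([0] + ordered[:-1], ordered)]


def contains(deltas, key):
    """Walk the list, summing deltas, until the key is reached or passed."""
    for value in accumulate(deltas):
        if value == key:
            return True
        if value > key:
            return False
    return False


def build_delta_index(keys, num_lists, key_space):
    """Split the key space into num_lists sub-lists of equal span."""
    span = key_space // num_lists
    buckets = [[] for _ in range(num_lists)]
    for key in keys:
        # Keys are stored relative to the start of their sub-list.
        buckets[key // span].append(key % span)
    return span, [build_delta_list(bucket) for bucket in buckets]


def index_contains(index, key):
    span, lists = index
    return contains(lists[key // span], key % span)


deltas = build_delta_list([40, 7, 19])
index = build_delta_index([5, 105, 150, 999], num_lists=10, key_space=1000)
```

A lookup in the split index only walks the one short sub-list that covers the key, rather than the whole sorted key set.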
+
+The default index size can hold 64 million records, corresponding to about
+256GB of data. This means that the index can identify duplicate data if the
+original data was written within the last 256GB of writes. This range is
+called the deduplication window. If new writes duplicate data that is older
+than that, the index will not be able to find it because the records of the
+older data have been removed. This means that if an application writes a
+200 GB file to a vdo target and then immediately writes it again, the two
+copies will deduplicate perfectly. Doing the same with a 500 GB file will
+result in no deduplication, because the beginning of the file will no
+longer be in the index by the time the second write begins (assuming there
+is no duplication within the file itself).
+
+If an application anticipates a data workload that will see useful
+deduplication beyond the 256GB threshold, vdo can be configured to use a
+larger index with a correspondingly larger deduplication window. (This
+configuration can only be set when the target is created, not altered
+later. It is important to consider the expected workload for a vdo target
+before configuring it.) There are two ways to do this.
+
+One way is to increase the memory size of the index, which also increases
+the amount of backing storage required. Doubling the size of the index will
+double the length of the deduplication window at the expense of doubling
+the storage size and the memory requirements.
+
+The other option is to enable sparse indexing. Sparse indexing increases
+the deduplication window by a factor of 10, at the expense of also
+increasing the storage size by a factor of 10. However with sparse
+indexing, the memory requirements do not increase. The trade-off is
+slightly more computation per request and a slight decrease in the amount
+of deduplication detected. For most workloads with significant amounts of
+duplicate data, sparse indexing will detect 97-99% of the deduplication
+that a standard index will detect.
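The window arithmetic in this section can be checked with a few lines. This is a back-of-the-envelope sketch that assumes "64 million" means 64 * 2^20 records, one record per 4K block, and that sparse indexing multiplies the window by exactly 10 as stated above. The helper names are made up.

```python
BLOCK_SIZE = 4 * 1024                # one index record per 4K block
DEFAULT_RECORDS = 64 * 1024 * 1024   # assumed reading of "64 million"
GB = 1024 ** 3


def dedup_window_bytes(records=DEFAULT_RECORDS, sparse=False):
    """Amount of recently written data the index can still recognize."""
    window = records * BLOCK_SIZE
    return window * 10 if sparse else window


def rewrite_deduplicates(file_bytes, window_bytes):
    """True if a file written twice in a row still fits in the window."""
    return file_bytes <= window_bytes


dense_window = dedup_window_bytes()
sparse_window = dedup_window_bytes(sparse=True)
```

Under these assumptions the default window is 256GB, so the 200 GB file from the example fits and the 500 GB file does not, while a sparse index of the same memory size would cover both.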
+
+The vio and data_vio Structures
+-------------------------------
+
+A vio (short for Vdo I/O) is conceptually similar to a bio, with additional
+fields and data to track vdo-specific information. A struct vio maintains a
+pointer to a bio but also tracks other fields specific to the operation of
+vdo. The vio is kept separate from its related bio because there are many
+circumstances where vdo completes the bio but must continue to do work
+related to deduplication or compression.
+
+Metadata reads and writes, and other writes that originate within vdo, use
+a struct vio directly. Application reads and writes use a larger structure
+called a data_vio to track information about their progress. A struct
+data_vio contains a struct vio and also includes several other fields
+related to deduplication and other vdo features. The data_vio is the
+primary unit of application work in vdo. Each data_vio proceeds through a
+set of steps to handle the application data, after which it is reset and
+returned to a pool of data_vios for reuse.
+
+There is a fixed pool of 2048 data_vios. This number was chosen to bound
+the amount of work that is required to recover from a crash. In addition,
+benchmarks have indicated that increasing the size of the pool does not
+significantly improve performance.
+
+The Data Store
+--------------
+
+The data store is implemented by three main data structures, all of which
+work in concert to reduce or amortize metadata updates across as many data
+writes as possible.
+
+*The Slab Depot*
+
+Most of the vdo volume belongs to the slab depot. The depot contains a
+collection of slabs. The slabs can be up to 32GB, and are divided into
+three sections. Most of a slab consists of a linear sequence of 4K blocks.
+These blocks are used either to store data, or to hold portions of the
+block map (see below). In addition to the data blocks, each slab has a set
+of reference counters, using 1 byte for each data block. Finally each slab
+has a journal.
+
+Reference updates are written to the slab journal. Slab journal blocks are
+written out either when they are full, or when the recovery journal
+requests they do so in order to allow the main recovery journal (see below)
+to free up space. The slab journal is used both to ensure that the main
+recovery journal can regularly free up space, and also to amortize the cost
+of updating individual reference blocks. The reference counters are kept in
+memory and are written out, a block at a time in oldest-dirtied-order, only
+when there is a need to reclaim slab journal space. The write operations
+are performed in the background as needed so they do not add latency to
+particular I/O operations.
+
+Each slab is independent of every other. They are assigned to "physical
+zones" in round-robin fashion. If there are P physical zones, then slab n
+is assigned to zone n mod P.
+
+The slab depot maintains an additional small data structure, the "slab
+summary," which is used to reduce the amount of work needed to come back
+online after a crash. The slab summary maintains an entry for each slab
+indicating whether or not the slab has ever been used, whether all of its
+reference count updates have been persisted to storage, and approximately
+how full it is. During recovery, each physical zone will attempt to recover
+at least one slab, stopping whenever it has recovered a slab which has some
+free blocks. Once each zone has some space, or has determined that none is
+available, the target can resume normal operation in a degraded mode. Read
+and write requests can be serviced, perhaps with degraded performance,
+while the remainder of the dirty slabs are recovered.
+
+*The Block Map*
+
+The block map contains the logical to physical mapping. It can be thought
+of as an array with one entry per logical address. Each entry is 5 bytes,
+36 bits of which contain the physical block number which holds the data for
+the given logical address. The other 4 bits are used to indicate the nature
+of the mapping. Of the 16 possible states, one represents a logical address
+which is unmapped (i.e. it has never been written, or has been discarded),
+one represents an uncompressed block, and the other 14 states are used to
+indicate that the mapped data is compressed, and which of the compression
+slots in the compressed block contains the data for this logical address.
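To make the "5 bytes = 36 bits of block number + 4 bits of state" split concrete, here is a minimal packing sketch. The bit placement is an assumption chosen for readability (state in the top four bits of a little-endian 40-bit value). The text above does not specify the on-disk layout, and the real dm-vdo encoding may differ.

```python
PBN_BITS = 36
STATE_BITS = 4


def pack_entry(pbn, state):
    """Pack a physical block number and mapping state into 5 bytes."""
    if not 0 <= pbn < (1 << PBN_BITS):
        raise ValueError("physical block number must fit in 36 bits")
    if not 0 <= state < (1 << STATE_BITS):
        raise ValueError("mapping state must fit in 4 bits")
    # Assumed layout: state occupies the 4 bits above the 36-bit PBN.
    return ((state << PBN_BITS) | pbn).to_bytes(5, "little")


def unpack_entry(raw):
    """Inverse of pack_entry: return (pbn, state)."""
    value = int.from_bytes(raw, "little")
    return value & ((1 << PBN_BITS) - 1), value >> PBN_BITS


entry = pack_entry(pbn=123456789, state=2)
```

Sixteen state values and a 36-bit block number together fill exactly 40 bits, which is why each entry fits in five bytes with nothing left over.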
+
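The entry arithmetic above (36 bits of physical block number plus 4 bits of state in 5 bytes) can be illustrated with a short Python sketch. The bit layout (state in the low 4 bits, little-endian byte order) and the state numbering (0 for unmapped, 1 for uncompressed, 2 through 15 for compression slots 0 through 13) are assumptions made for this example only; the text above does not specify them, and the real on-disk format may differ.

```python
PBN_BITS = 36      # bits holding the physical block number
STATE_BITS = 4     # bits holding the mapping state
ENTRY_BYTES = 5    # (36 + 4) / 8

# Hypothetical state numbering, chosen only for this sketch.
STATE_UNMAPPED = 0
STATE_UNCOMPRESSED = 1
FIRST_COMPRESSED_STATE = 2   # states 2..15 stand for slots 0..13


def pack_entry(pbn, state):
    """Pack a physical block number and a state into 5 bytes."""
    if not 0 <= pbn < (1 << PBN_BITS):
        raise ValueError("physical block number does not fit in 36 bits")
    if not 0 <= state < (1 << STATE_BITS):
        raise ValueError("state does not fit in 4 bits")
    value = (pbn << STATE_BITS) | state
    return value.to_bytes(ENTRY_BYTES, "little")


def unpack_entry(raw):
    """Return (pbn, state) from a 5-byte entry."""
    if len(raw) != ENTRY_BYTES:
        raise ValueError("a block map entry is exactly 5 bytes")
    value = int.from_bytes(raw, "little")
    return value >> STATE_BITS, value & ((1 << STATE_BITS) - 1)


def compression_slot(state):
    """Return the slot index (0..13) for a compressed state, else None."""
    if state < FIRST_COMPRESSED_STATE:
        return None
    return state - FIRST_COMPRESSED_STATE
```

The 14 compressed states line up with the 14 compressed blocks that the packer can place in a single data block.
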
+In practice, the array of mapping entries is divided into "block map
+pages," each of which fits in a single 4K block. Each block map page
+consists of a header and 812 mapping entries. Each mapping page is actually
+a leaf of a radix tree which consists of block map pages at each level.
+There are 60 radix trees which are assigned to "logical zones" in
+round-robin fashion. (If there are L logical zones, tree n will belong to
+zone n mod L.) At each level, the trees are interleaved, so logical
+addresses 0-811 belong to tree 0, logical addresses 812-1623 belong to tree
+1, and so on. The interleaving is maintained all the way up to the 60 root
+nodes.
+Choosing 60 trees results in an evenly distributed number of trees per zone
+for a large number of possible logical zone counts. The storage for the 60
+tree roots is allocated at format time. All other block map pages are
+allocated out of the slabs as needed. This flexible allocation avoids the
+need to pre-allocate space for the entire set of logical mappings and also
+makes growing the logical size of a vdo relatively easy.
+
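The leaf-level interleaving described above can be sketched in Python as follows. The function and field names are hypothetical. The sketch also assumes that leaf pages are dealt out to the 60 trees cyclically, so that the 61st page returns to tree 0; the text implies this but does not state it.

```python
ENTRIES_PER_PAGE = 812   # mapping entries in one block map page
TREE_COUNT = 60          # number of radix trees


def locate(logical_address, logical_zone_count):
    """Map a logical address to its leaf page, slot, tree, and zone."""
    page = logical_address // ENTRIES_PER_PAGE   # leaf page number
    slot = logical_address % ENTRIES_PER_PAGE    # entry within that page
    tree = page % TREE_COUNT                     # pages interleave across trees
    zone = tree % logical_zone_count             # tree n belongs to zone n mod L
    return {"page": page, "slot": slot, "tree": tree, "zone": zone}


def trees_per_zone(logical_zone_count):
    """Count how many of the 60 trees land in each logical zone."""
    counts = [0] * logical_zone_count
    for tree in range(TREE_COUNT):
        counts[tree % logical_zone_count] += 1
    return counts
```

For example, `trees_per_zone(4)` gives 15 trees in every zone, and the same even split holds for any zone count that divides 60.
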
+In operation, the block map maintains two caches. It is prohibitive to keep
+the entire leaf level of the trees in memory, so each logical zone
+maintains its own cache of leaf pages. The size of this cache is
+configurable at target start time. The second cache is allocated at start
+time, and is large enough to hold all the non-leaf pages of the entire
+block map. This cache is populated as pages are needed.
+
+*The Recovery Journal*
+
+The recovery journal is used to amortize updates across the block map and
+slab depot. Each write request causes an entry to be made in the journal.
+Entries are either "data remappings" or "block map remappings." For a data
+remapping, the journal records the logical address affected and its old and
+new physical mappings. For a block map remapping, the journal records the
+block map page number and the physical block allocated for it. Block map
+pages are never reclaimed or repurposed, so the old mapping is always 0.
+
+Each journal entry is an intent record summarizing the metadata updates
+that are required for a data_vio. The recovery journal issues a flush
+before each journal block write to ensure that the physical data for the
+new block mappings in that block are stable on storage, and journal block
+writes are all issued with the FUA bit set to ensure the recovery journal
+entries themselves are stable. The journal entry and the data write it
+represents must be stable on disk before the other metadata structures may
+be updated to reflect the operation. These entries allow the vdo device to
+reconstruct the logical to physical mappings after an unexpected
+interruption such as a loss of power.
+
+*Write Path*
+
+All write I/O to vdo is asynchronous. Each bio will be acknowledged as soon
+as vdo has done enough work to guarantee that it can complete the write
+eventually. Generally, the data for acknowledged but unflushed write I/O
+can be treated as though it is cached in memory. If an application
+requires data to be stable on storage, it must issue a flush or write the
+data with the FUA bit set like any other asynchronous I/O. Shutting down
+the vdo target will also flush any remaining I/O.
+
+Application write bios follow the steps outlined below.
+
+1. A data_vio is obtained from the data_vio pool and associated with the
+ application bio. If there are no data_vios available, the incoming bio
+ will block until a data_vio is available. This provides back pressure
+ to the application. The data_vio pool is protected by a spin lock.
+
+ The newly acquired data_vio is reset and the bio's data is copied into
+ the data_vio if it is a write and the data is not all zeroes. The data
+ must be copied because the application bio can be acknowledged before
+ the data_vio processing is complete, which means later processing steps
+ will no longer have access to the application bio. The application bio
+ may also be smaller than 4K, in which case the data_vio will have
+ already read the underlying block and the data is instead copied over
+ the relevant portion of the larger block.
+
+2. The data_vio places a claim (the "logical lock") on the logical address
+ of the bio. It is vital to prevent simultaneous modifications of the
+ same logical address, because deduplication involves sharing blocks.
+ This claim is implemented as an entry in a hashtable where the key is
+ the logical address and the value is a pointer to the data_vio
+ currently handling that address.
+
+ If a data_vio looks in the hashtable and finds that another data_vio is
+ already operating on that logical address, it waits until the previous
+ operation finishes. It also sends a message to inform the current
+ lock holder that it is waiting. Most notably, a new data_vio waiting
+ for a logical lock will flush the previous lock holder out of the
+ compression packer (step 8e) rather than allowing it to continue
+ waiting to be packed.
+
+ This stage requires the data_vio to get an implicit lock on the
+ appropriate logical zone to prevent concurrent modifications of the
+ hashtable. This implicit locking is handled by the zone divisions
+ described above.
+
+3. The data_vio traverses the block map tree to ensure that all the
+ necessary internal tree nodes have been allocated, by trying to find
+ the leaf page for its logical address. If any interior tree page is
+ missing, it is allocated at this time out of the same physical storage
+ pool used to store application data.
+
+ a. If any page-node in the tree has not yet been allocated, it must be
+ allocated before the write can continue. This step requires the
+ data_vio to lock the page-node that needs to be allocated. This
+ lock, like the logical block lock in step 2, is a hashtable entry
+ that causes other data_vios to wait for the allocation process to
+ complete.
+
+ The implicit logical zone lock is released while the allocation is
+ happening, in order to allow other operations in the same logical
+ zone to proceed. The details of allocation are the same as in
+ step 4. Once a new node has been allocated, that node is added to
+ the tree using a similar process to adding a new data block mapping.
+ The data_vio journals the intent to add the new node to the block
+ map tree (step 10), updates the reference count of the new block
+ (step 11), and reacquires the implicit logical zone lock to add the
+ new mapping to the parent tree node (step 12). Once the tree is
+ updated, the data_vio proceeds down the tree. Any other data_vios
+ waiting on this allocation also proceed.
+
+ b. In the steady-state case, the block map tree nodes will already be
+ allocated, so the data_vio just traverses the tree until it finds
+ the required leaf node. The location of the mapping (the "block map
+ slot") is recorded in the data_vio so that later steps do not need
+ to traverse the tree again. The data_vio then releases the implicit
+ logical zone lock.
+
+4. If the block is a zero block, skip to step 9. Otherwise, an attempt is
+ made to allocate a free data block. This allocation ensures that the
+ data_vio can write its data somewhere even if deduplication and
+ compression are not possible. This stage gets an implicit lock on a
+ physical zone to search for free space within that zone.
+
+ The data_vio will search each slab in a zone until it finds a free
+ block or decides there are none. If the first zone has no free space,
+ it will proceed to search the next physical zone by taking the implicit
+ lock for that zone and releasing the previous one until it finds a
+ free block or runs out of zones to search. The data_vio will acquire a
+ struct pbn_lock (the "physical block lock") on the free block. The
+ struct pbn_lock also has several fields to record the various kinds of
+ claims that data_vios can have on physical blocks. The pbn_lock is
+ added to a hashtable like the logical block locks in step 2. This
+ hashtable is also covered by the implicit physical zone lock. The
+ reference count of the free block is updated to prevent any other
+ data_vio from considering it free. The reference counters are a
+ sub-component of the slab and are thus also covered by the implicit
+ physical zone lock.
+
+5. If an allocation was obtained, the data_vio has all the resources it
+ needs to complete the write. The application bio can safely be
+ acknowledged at this point. The acknowledgment happens on a separate
+ thread to prevent the application callback from blocking other data_vio
+ operations.
+
+ If an allocation could not be obtained, the data_vio continues to
+ attempt to deduplicate or compress the data, but the bio is not
+ acknowledged because the vdo device may be out of space.
+
+6. At this point vdo must determine where to store the application data.
+ The data_vio's data is hashed and the hash (the "record name") is
+ recorded in the data_vio.
+
+7. The data_vio reserves or joins a struct hash_lock, which manages all of
+ the data_vios currently writing the same data. Active hash locks are
+ tracked in a hashtable similar to the way logical block locks are
+ tracked in step 2. This hashtable is covered by the implicit lock on
+ the hash zone.
+
+ If there is no existing hash lock for this data_vio's record_name, the
+ data_vio obtains a hash lock from the pool, adds it to the hashtable,
+ and sets itself as the new hash lock's "agent." The hash_lock pool is
+ also covered by the implicit hash zone lock. The hash lock agent will
+ do all the work to decide where the application data will be
+ written. If a hash lock for the data_vio's record_name already exists,
+ and the data_vio's data is the same as the agent's data, the new
+ data_vio will wait for the agent to complete its work and then share
+ its result.
+
+ In the rare case that a hash lock exists for the data_vio's hash but
+ the data does not match the hash lock's agent, the data_vio skips to
+ step 8h and attempts to write its data directly. This can happen if two
+ different data blocks produce the same hash, for example.
+
+8. The hash lock agent attempts to deduplicate or compress its data with
+ the following steps.
+
+ a. The agent initializes and sends its embedded deduplication request
+ (struct uds_request) to the deduplication index. This does not
+ require the data_vio to get any locks because the index components
+ manage their own locking. The data_vio waits until it either gets a
+ response from the index or times out.
+
+ b. If the deduplication index returns advice, the data_vio attempts to
+ obtain a physical block lock on the indicated physical address, in
+ order to read the data and verify that it is the same as the
+ data_vio's data, and that it can accept more references. If the
+ physical address is already locked by another data_vio, the data at
+ that address may soon be overwritten so it is not safe to use the
+ address for deduplication.
+
+ c. If the data matches and the physical block can add references, the
+ agent and any other data_vios waiting on it will record this
+ physical block as their new physical address and proceed to step 9
+ to record their new mapping. If there are more data_vios in the hash
+ lock than there are references available, one of the remaining
+ data_vios becomes the new agent and continues to step 8d as if no
+ valid advice was returned.
+
+ d. If no usable duplicate block was found, the agent first checks that
+ it has an allocated physical block (from step 4) that it can write
+ to. If the agent does not have an allocation, some other data_vio in
+ the hash lock that does have an allocation takes over as agent. If
+ none of the data_vios have an allocated physical block, these writes
+ are out of space, so they proceed to step 13 for cleanup.
+
+ e. The agent attempts to compress its data. If the data does not
+ compress, the data_vio will continue to step 8h to write its data
+ directly.
+
+ If the compressed size is small enough, the agent will release the
+ implicit hash zone lock and go to the packer (struct packer) where
+ it will be placed in a bin (struct packer_bin) along with other
+ data_vios. All compression operations require the implicit lock on
+ the packer zone.
+
+ The packer can combine up to 14 compressed blocks in a single 4k
+ data block. Compression is only helpful if vdo can pack at least 2
+ data_vios into a single data block. This means that a data_vio may
+ wait in the packer for an arbitrarily long time for other data_vios
+ to fill out the compressed block. There is a mechanism for vdo to
+ evict waiting data_vios when continuing to wait would cause
+ problems. Circumstances causing an eviction include an application
+ flush, device shutdown, or a subsequent data_vio trying to overwrite
+ the same logical block address. A data_vio may also be evicted from
+ the packer if it cannot be paired with any other compressed block
+ before more compressible blocks need to use its bin. An evicted
+ data_vio will proceed to step 8h to write its data directly.
+
+ f. If the agent fills a packer bin, either because all 14 of its slots
+ are used or because it has no remaining space, it is written out
+ using the allocated physical block from one of its data_vios. Step
+ 8d has already ensured that an allocation is available.
+
+ g. Each data_vio sets the compressed block as its new physical address.
+ The data_vio obtains an implicit lock on the physical zone and
+ acquires the struct pbn_lock for the compressed block, which is
+ modified to be a shared lock. Then it releases the implicit physical
+ zone lock and proceeds to step 8i.
+
+ h. Any data_vio evicted from the packer will have an allocation from
+ step 4. It will write its data to that allocated physical block.
+
+ i. After the data is written, if the data_vio is the agent of a hash
+ lock, it will reacquire the implicit hash zone lock and share its
+ physical address with as many other data_vios in the hash lock as
+ possible. Each data_vio will then proceed to step 9 to record its
+ new mapping.
+
+ j. If the agent actually wrote new data (whether compressed or not),
+ the deduplication index is updated to reflect the location of the
+ new data. The agent then releases the implicit hash zone lock.
+
+9. The data_vio determines the previous mapping of the logical address.
+ There is a cache for block map leaf pages (the "block map cache"),
+ because there are usually too many block map leaf nodes to store
+ entirely in memory. If the desired leaf page is not in the cache, the
+ data_vio will reserve a slot in the cache and load the desired page
+ into it, possibly evicting an older cached page. The data_vio then
+ finds the current physical address for this logical address (the "old
+ physical mapping"), if any, and records it. This step requires a lock
+ on the block map cache structures, covered by the implicit logical zone
+ lock.
+
+10. The data_vio makes an entry in the recovery journal containing the
+ logical block address, the old physical mapping, and the new physical
+ mapping. Making this journal entry requires holding the implicit
+ recovery journal lock. The data_vio will wait in the journal until all
+ recovery blocks up to the one containing its entry have been written
+ and flushed to ensure the transaction is stable on storage.
+
+11. Once the recovery journal entry is stable, the data_vio makes two slab
+ journal entries: an increment entry for the new mapping, and a
+ decrement entry for the old mapping. These two operations each require
+ holding a lock on the affected physical slab, covered by its implicit
+ physical zone lock. For correctness during recovery, the slab journal
+ entries in any given slab journal must be in the same order as the
+ corresponding recovery journal entries. Therefore, if the two entries
+ are in different zones, they are made concurrently, and if they are in
+ the same zone, the increment is always made before the decrement in
+ order to avoid underflow. After each slab journal entry is made in
+ memory, the associated reference count is also updated in memory.
+
+12. Once both of the reference count updates are done, the data_vio
+ acquires the implicit logical zone lock and updates the
+ logical-to-physical mapping in the block map to point to the new
+ physical block. At this point the write operation is complete.
+
+13. If the data_vio has a hash lock, it acquires the implicit hash zone
+ lock and releases its hash lock to the pool.
+
+ The data_vio then acquires the implicit physical zone lock and releases
+ the struct pbn_lock it holds for its allocated block. If it had an
+ allocation that it did not use, it also sets the reference count for
+ that block back to zero to free it for use by subsequent data_vios.
+
+ The data_vio then acquires the implicit logical zone lock and releases
+ the logical block lock acquired in step 2.
+
+ The application bio is then acknowledged if it has not previously been
+ acknowledged, and the data_vio is returned to the pool.
+
+*Read Path*
+
+An application read bio follows a much simpler set of steps. It does steps
+1 and 2 in the write path to obtain a data_vio and lock its logical
+address. If there is already a write data_vio in progress for that logical
+address that is guaranteed to complete, the read data_vio will copy the
+data from the write data_vio and return it. Otherwise, it will look up the
+logical-to-physical mapping by traversing the block map tree as in step 3,
+and then read and possibly decompress the indicated data at the indicated
+physical block address. A read data_vio will not allocate block map tree
+nodes if they are missing. If the interior block map nodes do not exist
+yet, the logical block map address must still be unmapped and the read
+data_vio will return all zeroes. A read data_vio handles cleanup and
+acknowledgment as in step 13, although it only needs to release the logical
+lock and return itself to the pool.
+
+*Small Writes*
+
+All storage within vdo is managed as 4KB blocks, but it can accept writes
+as small as 512 bytes. Processing a write that is smaller than 4K requires
+a read-modify-write operation that reads the relevant 4K block, copies the
+new data over the appropriate sectors of the block, and then launches a
+write operation for the modified data block. The read and write stages of
+this operation are nearly identical to the normal read and write
+operations, and a single data_vio is used throughout this operation.
+
+*Recovery*
+
+When a vdo is restarted after a crash, it will attempt to recover from the
+recovery journal. During the pre-resume phase of the next start, the
+recovery journal is read. The increment portion of valid entries is played
+into the block map. Next, valid entries are played, in order as required,
+into the slab journals. Finally, each physical zone attempts to replay at
+least one slab journal to reconstruct the reference counts of one slab.
+Once each zone has some free space (or has determined that it has none),
+the vdo comes back online, while the remainder of the slab journals are
+used to reconstruct the rest of the reference counts in the background.
+
+*Read-only Rebuild*
+
+If a vdo encounters an unrecoverable error, it will enter read-only mode.
+This mode indicates that some previously acknowledged data may have been
+lost. The vdo may be instructed to rebuild as best it can in order to
+return to a writable state. However, this is never done automatically due
+to the possibility that data has been lost. During a read-only rebuild, the
+block map is recovered from the recovery journal as before. However, the
+reference counts are not rebuilt from the slab journals. Instead, the
+reference counts are zeroed, the entire block map is traversed, and the
+reference counts are updated from the block mappings. While this may lose
+some data, it ensures that the block map and reference counts are
+consistent with each other. This allows vdo to resume normal operation and
+accept further writes.
diff --git a/Documentation/admin-guide/device-mapper/vdo.rst b/Documentation/admin-guide/device-mapper/vdo.rst
new file mode 100644
index 000000000000..a14e6d3e787c
--- /dev/null
+++ b/Documentation/admin-guide/device-mapper/vdo.rst
@@ -0,0 +1,412 @@
+.. SPDX-License-Identifier: GPL-2.0-only
+
+dm-vdo
+======
+
+The dm-vdo (virtual data optimizer) device mapper target provides
+block-level deduplication, compression, and thin provisioning. As a device
+mapper target, it can add these features to the storage stack, compatible
+with any file system. The vdo target does not protect against data
+corruption, relying instead on integrity protection of the storage below
+it. It is strongly recommended that lvm be used to manage vdo volumes. See
+lvmvdo(7).
+
+Userspace component
+===================
+
+Formatting a vdo volume requires the use of the 'vdoformat' tool, available
+at:
+
+https://github.com/dm-vdo/vdo/
+
+In most cases, a vdo target will recover from a crash automatically the
+next time it is started. In cases where it encountered an unrecoverable
+error (either during normal operation or crash recovery) the target will
+enter or come up in read-only mode. Because read-only mode is indicative of
+data loss, a positive action must be taken to bring vdo out of read-only
+mode. The 'vdoforcerebuild' tool, available from the same repo, is used to
+prepare a read-only vdo to exit read-only mode. After running this tool,
+the vdo target will rebuild its metadata the next time it is
+started. Although some data may be lost, the rebuilt vdo's metadata will be
+internally consistent and the target will be writable again.
+
+The repo also contains additional userspace tools which can be used to
+inspect a vdo target's on-disk metadata. Fortunately, these tools are
+rarely needed except by dm-vdo developers.
+
+Metadata requirements
+=====================
+
+Each vdo volume reserves 3GB of space for metadata, or more depending on
+its configuration. It is helpful to check that the space saved by
+deduplication and compression is not cancelled out by the metadata
+requirements. An estimation of the space saved for a specific dataset can
+be computed with the vdo estimator tool, which is available at:
+
+https://github.com/dm-vdo/vdoestimator/
+
+Target interface
+================
+
+Table line
+----------
+
+::
+
+ <offset> <logical device size> vdo V4 <storage device>
+ <storage device size> <minimum I/O size> <block map cache size>
+ <block map era length> [optional arguments]
+
+
+Required parameters:
+
+ offset:
+ The offset, in sectors, at which the vdo volume's logical
+ space begins.
+
+ logical device size:
+ The size of the device which the vdo volume will service,
+ in sectors. Must match the current logical size of the vdo
+ volume.
+
+ storage device:
+ The device holding the vdo volume's data and metadata.
+
+ storage device size:
+ The size of the device holding the vdo volume, as a number
+ of 4096-byte blocks. Must match the current size of the vdo
+ volume.
+
+ minimum I/O size:
+ The minimum I/O size for this vdo volume to accept, in
+ bytes. Valid values are 512 or 4096. The recommended value
+ is 4096.
+
+ block map cache size:
+ The size of the block map cache, as a number of 4096-byte
+ blocks. The minimum and recommended value is 32768 blocks.
+ If the logical thread count is non-zero, the cache size
+ must be at least 4096 blocks per logical thread.
+
+ block map era length:
+ The speed with which the block map cache writes out
+ modified block map pages. A smaller era length is likely to
+ reduce the amount of time spent rebuilding, at the cost of
+ increased block map writes during normal operation. The
+ maximum and recommended value is 16380; the minimum value
+ is 1.
+
+Optional parameters:
+--------------------
+Some or all of these parameters may be specified as <key> <value> pairs.
+
+Thread related parameters:
+
+Different categories of work are assigned to separate thread groups, and
+the number of threads in each group can be configured separately.
+
+If <hash>, <logical>, and <physical> are all set to 0, the work handled by
+all three thread types will be handled by a single thread. If any of these
+values are non-zero, all of them must be non-zero.
+
+ ack:
+ The number of threads used to complete bios. Since
+ completing a bio calls an arbitrary completion function
+ outside the vdo volume, threads of this type allow the vdo
+ volume to continue processing requests even when bio
+ completion is slow. The default is 1.
+
+ bio:
+ The number of threads used to issue bios to the underlying
+ storage. Threads of this type allow the vdo volume to
+ continue processing requests even when bio submission is
+ slow. The default is 4.
+
+ bioRotationInterval:
+ The number of bios to enqueue on each bio thread before
+ switching to the next thread. The value must be greater
+ than 0 and not more than 1024; the default is 64.
+
+ cpu:
+ The number of threads used to do CPU-intensive work, such
+ as hashing and compression. The default is 1.
+
+ hash:
+ The number of threads used to manage data comparisons for
+ deduplication based on the hash value of data blocks. The
+ default is 0.
+
+ logical:
+ The number of threads used to manage caching and locking
+ based on the logical address of incoming bios. The default
+ is 0; the maximum is 60.
+
+ physical:
+ The number of threads used to manage administration of the
+ underlying storage device. At format time, a slab size for
+ the vdo is chosen; the vdo storage device must be large
+ enough to have at least 1 slab per physical thread. The
+ default is 0; the maximum is 16.
+
+Miscellaneous parameters:
+
+ maxDiscard:
+ The maximum size of discard bio accepted, in 4096-byte
+ blocks. I/O requests to a vdo volume are normally split
+ into 4096-byte blocks, and processed up to 2048 at a time.
+ However, discard requests to a vdo volume can be
+ automatically split to a larger size, up to <maxDiscard>
+ 4096-byte blocks in a single bio, and are limited to 1500
+ at a time. Increasing this value may provide better overall
+ performance, at the cost of increased latency for the
+ individual discard requests. The default and minimum is 1;
+ the maximum is UINT_MAX / 4096.
+
+ deduplication:
+ Whether deduplication is enabled. The default is 'on'; the
+ acceptable values are 'on' and 'off'.
+
+ compression:
+ Whether compression is enabled. The default is 'off'; the
+ acceptable values are 'on' and 'off'.
+
+Device modification
+-------------------
+
+A modified table may be loaded into a running, non-suspended vdo volume.
+The modifications will take effect when the device is next resumed. The
+modifiable parameters are <logical device size>, <physical device size>,
+<maxDiscard>, <compression>, and <deduplication>.
+
+If the logical device size or physical device size are changed, upon
+successful resume vdo will store the new values and require them on future
+startups. These two parameters may not be decreased. The logical device
+size may not exceed 4 PB. The physical device size must increase by at
+least 32832 4096-byte blocks if at all, and must not exceed the size of the
+underlying storage device. Additionally, when formatting the vdo device, a
+slab size is chosen: the physical device size may never increase above the
+size which provides 8192 slabs, and each increase must be large enough to
+add at least one new slab.
+
+Examples:
+
+Start a previously-formatted vdo volume with 1 GB logical space and 1 GB
+physical space, storing to /dev/dm-1 which has more than 1 GB of space.
+
+::
+
+ dmsetup create vdo0 --table \
+ "0 2097152 vdo V4 /dev/dm-1 262144 4096 32768 16380"
+
+Grow the logical size to 4 GB.
+
+::
+
+ dmsetup reload vdo0 --table \
+ "0 8388608 vdo V4 /dev/dm-1 262144 4096 32768 16380"
+ dmsetup resume vdo0
+
+Grow the physical size to 2 GB.
+
+::
+
+ dmsetup reload vdo0 --table \
+ "0 8388608 vdo V4 /dev/dm-1 524288 4096 32768 16380"
+ dmsetup resume vdo0
+
+Grow the physical size by 1 GB more (to 3 GB), grow the logical size to
+5 GB, and increase the maximum discard size.
+
+::
+
+ dmsetup reload vdo0 --table \
+ "0 10485760 vdo V4 /dev/dm-1 786432 4096 32768 16380 maxDiscard 8"
+ dmsetup resume vdo0
+
+Stop the vdo volume.
+
+::
+
+ dmsetup remove vdo0
+
+Start the vdo volume again. Note that the logical and physical device sizes
+must still match, but other parameters can change.
+
+::
+
+ dmsetup create vdo1 --table \
+ "0 10485760 vdo V4 /dev/dm-1 786432 512 65550 5000 hash 1 logical 3 physical 2"
+
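The unit conversions in the examples above (logical size in 512-byte sectors, storage size in 4096-byte blocks) can be checked with a small Python helper. The function name and its defaults are hypothetical; the helper only formats a table string for `dmsetup` and does not run it. It assumes that "1 GB" in the examples means 2^30 bytes, which is what the example numbers imply, and it fixes the offset at 0 as the examples do.

```python
SECTOR_SIZE = 512    # unit for the logical device size
BLOCK_SIZE = 4096    # unit for the storage device size


def vdo_table(logical_bytes, physical_bytes, storage_device,
              min_io_size=4096, cache_blocks=32768, era_length=16380,
              optional=()):
    """Format a vdo table line from sizes given in bytes."""
    if logical_bytes % SECTOR_SIZE or physical_bytes % BLOCK_SIZE:
        raise ValueError("sizes must be whole sectors and whole blocks")
    if min_io_size not in (512, 4096):
        raise ValueError("minimum I/O size must be 512 or 4096")
    if cache_blocks < 32768:
        raise ValueError("block map cache size must be at least 32768")
    if not 1 <= era_length <= 16380:
        raise ValueError("block map era length must be in 1..16380")
    fields = [
        "0",                                  # offset, in sectors
        str(logical_bytes // SECTOR_SIZE),    # logical device size
        "vdo", "V4", storage_device,
        str(physical_bytes // BLOCK_SIZE),    # storage device size
        str(min_io_size), str(cache_blocks), str(era_length),
    ]
    fields.extend(str(item) for item in optional)
    return " ".join(fields)


GIB = 1 << 30
print(vdo_table(1 * GIB, 1 * GIB, "/dev/dm-1"))
```

Called with 1 GiB for both sizes, the helper reproduces the table string used in the first example.
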
+Messages
+--------
+All vdo devices accept messages in the form:
+
+::
+
+ dmsetup message <target-name> 0 <message-name> <message-parameters>
+
+The messages are:
+
+ stats:
+ Outputs the current view of the vdo statistics. Mostly used
+ by the vdostats userspace program to interpret the output
+ buffer.
+
+ config:
+ Outputs useful vdo configuration information. Mostly used
+ by users who want to recreate a similar VDO volume and
+ want to know the creation configuration used.
+
+ dump:
+ Dumps many internal structures to the system log. This is
+ not always safe to run, so it should only be used to debug
+ a hung vdo. Optional parameters to specify structures to
+ dump are:
+
+ viopool: The pool of I/O requests for incoming bios
+ pools: A synonym of 'viopool'
+ vdo: Most of the structures managing on-disk data
+ queues: Basic information about each vdo thread
+ threads: A synonym of 'queues'
+ default: Equivalent to 'queues vdo'
+ all: All of the above.
+
+ dump-on-shutdown:
+ Perform a default dump next time vdo shuts down.
+
+
+Status
+------
+
+::
+
+ <device> <operating mode> <in recovery> <index state>
+ <compression state> <used physical blocks> <total physical blocks>
+
+ device:
+ The name of the vdo volume.
+
+ operating mode:
+ The current operating mode of the vdo volume; values may be
+ 'normal', 'recovering' (the volume has detected an issue
+ with its metadata and is attempting to repair itself), and
+ 'read-only' (an error has occurred that forces the vdo
+ volume to only support read operations and not writes).
+
+ in recovery:
+ Whether the vdo volume is currently in recovery mode;
+ values may be 'recovering' or '-' which indicates not
+ recovering.
+
+ index state:
+ The current state of the deduplication index in the vdo
+ volume; values may be 'closed', 'closing', 'error',
+ 'offline', 'online', 'opening', and 'unknown'.
+
+ compression state:
+ The current state of compression in the vdo volume; values
+ may be 'offline' and 'online'.
+
+ used physical blocks:
+ The number of physical blocks in use by the vdo volume.
+
+ total physical blocks:
+ The total number of physical blocks the vdo volume may use;
+ the difference between this value and the
+ <used physical blocks> is the number of blocks the vdo
+ volume has left before being full.
+
+Memory Requirements
+===================
+
+A vdo target requires a fixed 38 MB of RAM along with the following amounts
+that scale with the target:
+
+- 1.15 MB of RAM for each 1 MB of configured block map cache size. The
+ block map cache requires a minimum of 150 MB.
+- 1.6 MB of RAM for each 1 TB of logical space.
+- 268 MB of RAM for each 1 TB of physical storage managed by the volume.
+
+The deduplication index requires additional memory which scales with the
+size of the deduplication window. For dense indexes, the index requires 1
+GB of RAM per 1 TB of window. For sparse indexes, the index requires 1 GB
+of RAM per 10 TB of window. The index configuration is set when the target
+is formatted and may not be modified.
+
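The figures above can be combined into a rough estimate, as in the Python sketch below. The function names are hypothetical. The sketch assumes that the 150 MB minimum acts as a floor on the block map cache term, which is one reading of the text, and it keeps the index estimate separate because that estimate depends on the deduplication window rather than on the volume size.

```python
FIXED_MB = 38.0              # fixed overhead per vdo target
CACHE_FACTOR = 1.15          # MB of RAM per MB of block map cache
CACHE_MIN_MB = 150.0         # assumed floor for the cache term
LOGICAL_MB_PER_TB = 1.6      # MB of RAM per TB of logical space
PHYSICAL_MB_PER_TB = 268.0   # MB of RAM per TB of physical storage


def estimate_ram_mb(cache_mb, logical_tb, physical_tb):
    """Estimate target RAM in MB, excluding the deduplication index."""
    cache_term = max(CACHE_FACTOR * cache_mb, CACHE_MIN_MB)
    return (FIXED_MB + cache_term
            + LOGICAL_MB_PER_TB * logical_tb
            + PHYSICAL_MB_PER_TB * physical_tb)


def index_ram_gb(window_tb, sparse=False):
    """Estimate index RAM in GB: 1 GB per TB dense, 1 GB per 10 TB sparse."""
    return window_tb / 10 if sparse else window_tb


# A 128 MB cache with 1 TB of logical space and 1 TB of physical storage.
print(round(estimate_ram_mb(128, 1, 1), 1))
```
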
+Module Parameters
+=================
+
+The vdo driver has a numeric parameter 'log_level' which controls the
+verbosity of logging from the driver. The default setting is 6
+(LOGLEVEL_INFO and more severe messages).
+
+Run-time Usage
+==============
+
+When using dm-vdo, it is important to be aware of the ways in which its
+behavior differs from other storage targets.
+
+- There is no guarantee that over-writes of existing blocks will succeed.
+ Because the underlying storage may be multiply referenced, over-writing
+ an existing block generally requires a vdo to have a free block
+ available.
+
+- When blocks are no longer in use, sending a discard request for those
+ blocks lets the vdo release references for those blocks. If the vdo is
+ thinly provisioned, discarding unused blocks is essential to prevent the
+ target from running out of space. However, due to the sharing of
+ duplicate blocks, no discard request for any given logical block is
+ guaranteed to reclaim space.
+
+- Assuming the underlying storage properly implements flush requests, vdo
+ is resilient against crashes; however, unflushed writes may or may not
+ persist after a crash.
+
+- Each write to a vdo target entails a significant amount of processing.
+ However, much of the work is parallelizable. Therefore, vdo targets
+ achieve better throughput at higher I/O depths, and can support up to 2048
+ requests in parallel.
+
+Tuning
+======
+
+The vdo device has many options, and it can be difficult to make optimal
+choices without perfect knowledge of the workload. Additionally, most
+configuration options must be set when a vdo target is started, and cannot
+be changed while the target is active; changing them requires shutting the
+target down completely. Ideally, tuning with simulated
+workloads should be performed before deploying vdo in production
+environments.
+
+The most important value to adjust is the block map cache size. In order to
+service a request for any logical address, a vdo must load the portion of
+the block map which holds the relevant mapping. These mappings are cached.
+Performance will suffer when the working set does not fit in the cache. By
+default, a vdo allocates 128 MB of metadata cache in RAM to support
+efficient access to 100 GB of logical space at a time. It should be scaled
+up proportionally for larger working sets.
+
+The logical and physical thread counts should also be adjusted. A logical
+thread controls a disjoint section of the block map, so additional logical
+threads increase parallelism and can increase throughput. Physical threads
+control a disjoint section of the data blocks, so additional physical
+threads can also increase throughput. However, excess threads can waste
+resources and increase contention.
+
+Bio submission threads control the parallelism involved in sending I/O to
+the underlying storage; fewer threads mean there is more opportunity to
+reorder I/O requests for performance benefit, but also that each I/O
+request has to wait longer before being submitted.
+
+Bio acknowledgment threads are used for finishing I/O requests. This is
+done on dedicated threads since the amount of work required to execute a
+bio's callback cannot be controlled by the vdo itself. Usually one thread
+is sufficient but additional threads may be beneficial, particularly when
+bios have CPU-heavy callbacks.
+
+CPU threads are used for hashing and for compression; in workloads with
+compression enabled, more threads may result in higher throughput.
+
+Hash threads are used to sort active requests by hash and determine whether
+they should deduplicate; the most CPU-intensive action performed by these
+threads is the comparison of 4096-byte data blocks. In most cases, a single
+hash thread is sufficient.
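The figures in the "Memory Requirements" section of vdo.rst above can be combined into a rough RAM estimate. The following is a minimal Python sketch, not part of the patch. The function name ``vdo_ram_estimate_mb``, its parameters, the 1 GB = 1024 MB conversion, and the reading of the 150 MB minimum as a floor on the RAM used by the block map cache are all assumptions made for illustration.

```python
def vdo_ram_estimate_mb(block_map_cache_mb, logical_tb, physical_tb,
                        dedup_window_tb, sparse_index=False):
    """Rough RAM estimate in MB for one vdo target (illustrative only)."""
    fixed = 38.0                                    # fixed per-target overhead
    # 1.15 MB of RAM per 1 MB of configured block map cache.
    # Assumption: the documented 150 MB minimum is a floor on this term.
    cache = max(1.15 * block_map_cache_mb, 150.0)
    logical = 1.6 * logical_tb                      # 1.6 MB per TB of logical space
    physical = 268.0 * physical_tb                  # 268 MB per TB of physical storage
    # Dense index: 1 GB per 1 TB of window; sparse index: 1 GB per 10 TB.
    index_gb = dedup_window_tb / 10.0 if sparse_index else dedup_window_tb
    index = index_gb * 1024.0                       # assumption: 1 GB = 1024 MB
    return fixed + cache + logical + physical + index


if __name__ == "__main__":
    # Hypothetical volume: 128 MB cache, 4 TB logical, 1 TB physical,
    # dense index with a 0.25 TB deduplication window.
    print(round(vdo_ram_estimate_mb(128, 4, 1, 0.25), 1))
```

Under these assumptions the physical-storage and index terms dominate, so the estimate is most sensitive to the size of the backing device and the chosen deduplication window.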
diff --git a/Documentation/admin-guide/dynamic-debug-howto.rst b/Documentation/admin-guide/dynamic-debug-howto.rst
index 0e9b48daf690..7c036590cd07 100644
--- a/Documentation/admin-guide/dynamic-debug-howto.rst
+++ b/Documentation/admin-guide/dynamic-debug-howto.rst
@@ -26,6 +26,11 @@ Dynamic debug provides:
- format string
- class name (as known/declared by each module)
+NOTE: To actually get the debug-print output on the console, you may
+need to adjust the kernel ``loglevel=``, or use ``ignore_loglevel``.
+Read about these kernel parameters in
+Documentation/admin-guide/kernel-parameters.rst.
+
Viewing Dynamic Debug Behaviour
===============================
diff --git a/Documentation/admin-guide/edid.rst b/Documentation/admin-guide/edid.rst
index 80deeb21a265..1a9b965aa486 100644
--- a/Documentation/admin-guide/edid.rst
+++ b/Documentation/admin-guide/edid.rst
@@ -24,37 +24,4 @@ restrictions later on.
As a remedy for such situations, the kernel configuration item
CONFIG_DRM_LOAD_EDID_FIRMWARE was introduced. It allows to provide an
individually prepared or corrected EDID data set in the /lib/firmware
-directory from where it is loaded via the firmware interface. The code
-(see drivers/gpu/drm/drm_edid_load.c) contains built-in data sets for
-commonly used screen resolutions (800x600, 1024x768, 1280x1024, 1600x1200,
-1680x1050, 1920x1080) as binary blobs, but the kernel source tree does
-not contain code to create these data. In order to elucidate the origin
-of the built-in binary EDID blobs and to facilitate the creation of
-individual data for a specific misbehaving monitor, commented sources
-and a Makefile environment are given here.
-
-To create binary EDID and C source code files from the existing data
-material, simply type "make" in tools/edid/.
-
-If you want to create your own EDID file, copy the file 1024x768.S,
-replace the settings with your own data and add a new target to the
-Makefile. Please note that the EDID data structure expects the timing
-values in a different way as compared to the standard X11 format.
-
-X11:
- HTimings:
- hdisp hsyncstart hsyncend htotal
- VTimings:
- vdisp vsyncstart vsyncend vtotal
-
-EDID::
-
- #define XPIX hdisp
- #define XBLANK htotal-hdisp
- #define XOFFSET hsyncstart-hdisp
- #define XPULSE hsyncend-hsyncstart
-
- #define YPIX vdisp
- #define YBLANK vtotal-vdisp
- #define YOFFSET vsyncstart-vdisp
- #define YPULSE vsyncend-vsyncstart
+directory from where it is loaded via the firmware interface.
diff --git a/Documentation/admin-guide/ext4.rst b/Documentation/admin-guide/ext4.rst
index 5740d85439ff..2418b0c2d3df 100644
--- a/Documentation/admin-guide/ext4.rst
+++ b/Documentation/admin-guide/ext4.rst
@@ -212,16 +212,6 @@ When mounting an ext4 filesystem, the following option are accepted:
that ext4's inode table readahead algorithm will pre-read into the
buffer cache. The default value is 32 blocks.
- nouser_xattr
- Disables Extended User Attributes. See the attr(5) manual page for
- more information about extended attributes.
-
- noacl
- This option disables POSIX Access Control List support. If ACL support
- is enabled in the kernel configuration (CONFIG_EXT4_FS_POSIX_ACL), ACL
- is enabled by default on mount. See the acl(5) manual page for more
- information about acl.
-
bsddf (*)
Make 'df' act like BSD.
diff --git a/Documentation/admin-guide/gpio/gpio-mockup.rst b/Documentation/admin-guide/gpio/gpio-mockup.rst
index 493071da1738..d6e7438a7550 100644
--- a/Documentation/admin-guide/gpio/gpio-mockup.rst
+++ b/Documentation/admin-guide/gpio/gpio-mockup.rst
@@ -3,6 +3,14 @@
GPIO Testing Driver
===================
+.. note::
+
+ This module has been obsoleted by the more flexible gpio-sim.rst.
+ New developments should use that API and existing developments are
+ encouraged to migrate as soon as possible.
+ This module will continue to be maintained but no new features will be
+ added.
+
The GPIO Testing Driver (gpio-mockup) provides a way to create simulated GPIO
chips for testing purposes. The lines exposed by these chips can be accessed
using the standard GPIO character device interface as well as manipulated
diff --git a/Documentation/admin-guide/gpio/gpio-virtuser.rst b/Documentation/admin-guide/gpio/gpio-virtuser.rst
new file mode 100644
index 000000000000..2aca70db9f3b
--- /dev/null
+++ b/Documentation/admin-guide/gpio/gpio-virtuser.rst
@@ -0,0 +1,177 @@
+.. SPDX-License-Identifier: GPL-2.0-only
+
+Virtual GPIO Consumer
+=====================
+
+The virtual GPIO Consumer module allows users to instantiate virtual devices
+that request GPIOs and then control their behavior over debugfs. Virtual
+consumer devices can be instantiated from device-tree or over configfs.
+
+A virtual consumer uses the driver-facing GPIO APIs, which allows them to be
+covered by automated tests driven from user space. The GPIOs are requested using
+``gpiod_get_array()``, so multiple GPIOs per connector ID are supported.
+
+Creating GPIO consumers
+-----------------------
+
+The gpio-virtuser module registers a configfs subsystem called
+``'gpio-virtuser'``. For details of the configfs filesystem, please refer to
+the configfs documentation.
+
+The user can create a hierarchy of configfs groups and items as well as modify
+values of exposed attributes. Once the consumer is instantiated, this hierarchy
+will be translated to appropriate device properties. The general structure is:
+
+**Group:** ``/config/gpio-virtuser``
+
+This is the top directory of the gpio-virtuser configfs tree.
+
+**Group:** ``/config/gpio-virtuser/example-name``
+
+**Attribute:** ``/config/gpio-virtuser/example-name/live``
+
+**Attribute:** ``/config/gpio-virtuser/example-name/dev_name``
+
+This is a directory representing a GPIO consumer device.
+
+The read-only ``dev_name`` attribute exposes the name of the device as it will
+appear in the system on the platform bus. This is useful for locating the
+associated debugfs directory under
+``/sys/kernel/debug/gpio-virtuser/$dev_name``.
+
+The ``'live'`` attribute triggers the actual creation of the device
+once it's fully configured. The accepted values are: ``'1'`` to enable the
+virtual device and ``'0'`` to disable and tear it down.
+
+Creating GPIO lookup tables
+---------------------------
+
+Users can create a number of configfs groups under the device group:
+
+**Group:** ``/config/gpio-virtuser/example-name/con_id``
+
+The ``'con_id'`` directory represents a single GPIO lookup and its value maps
+to the ``'con_id'`` argument of the ``gpiod_get()`` function. For example:
+``con_id`` == ``'reset'`` maps to the ``reset-gpios`` device property.
+
+Users can assign a number of GPIOs to each lookup. Each GPIO is a sub-directory
+with a user-defined name under the ``'con_id'`` group.
+
+**Attribute:** ``/config/gpio-virtuser/example-name/con_id/0/key``
+
+**Attribute:** ``/config/gpio-virtuser/example-name/con_id/0/offset``
+
+**Attribute:** ``/config/gpio-virtuser/example-name/con_id/0/drive``
+
+**Attribute:** ``/config/gpio-virtuser/example-name/con_id/0/pull``
+
+**Attribute:** ``/config/gpio-virtuser/example-name/con_id/0/active_low``
+
+**Attribute:** ``/config/gpio-virtuser/example-name/con_id/0/transitory``
+
+This is a group describing a single GPIO in the ``con_id-gpios`` property.
+
+For virtual consumers created using configfs we use machine lookup tables so
+this group can be considered as a mapping between the filesystem and the fields
+of a single entry in ``'struct gpiod_lookup'``.
+
+The ``'key'`` attribute represents either the name of the chip this GPIO
+belongs to or the GPIO line name. This depends on the value of the ``'offset'``
+attribute: if its value is >= 0, then ``'key'`` represents the label of the
+chip to lookup while ``'offset'`` represents the offset of the line in that
+chip. If ``'offset'`` is < 0, then ``'key'`` represents the name of the line.
+
+The remaining attributes map to the ``'flags'`` field of the GPIO lookup
+struct. The first two take string values as arguments:
+
+**``'drive'``:** ``'push-pull'``, ``'open-drain'``, ``'open-source'``
+**``'pull'``:** ``'pull-up'``, ``'pull-down'``, ``'pull-disabled'``, ``'as-is'``
+
+``'active_low'`` and ``'transitory'`` are boolean attributes.
+
+Activating GPIO consumers
+-------------------------
+
+Once the configuration is complete, the ``'live'`` attribute must be set to 1 in
+order to instantiate the consumer. It can be set back to 0 to destroy the
+virtual device. The module will synchronously wait for the new simulated device
+to be successfully probed and if this doesn't happen, writing to ``'live'`` will
+result in an error.
+
+Device-tree
+-----------
+
+Virtual GPIO consumers can also be defined in device-tree. The compatible string
+must be: ``"gpio-virtuser"`` with at least one property following the
+standardized GPIO pattern.
+
+An example device-tree code defining a virtual GPIO consumer:
+
+.. code-block :: none
+
+ gpio-virt-consumer {
+ compatible = "gpio-virtuser";
+
+ foo-gpios = <&gpio0 5 GPIO_ACTIVE_LOW>, <&gpio1 2 0>;
+ bar-gpios = <&gpio0 6 0>;
+ };
+
+Controlling virtual GPIO consumers
+----------------------------------
+
+Once active, the device will export debugfs attributes for controlling GPIO
+arrays as well as each requested GPIO line separately. Let's consider the
+following device property: ``foo-gpios = <&gpio0 0 0>, <&gpio0 4 0>;``.
+
+The following debugfs attribute groups will be created:
+
+**Group:** ``/sys/kernel/debug/gpio-virtuser/$dev_name/gpiod:foo/``
+
+This is the group that will contain the attributes for the entire GPIO array.
+
+**Attribute:** ``/sys/kernel/debug/gpio-virtuser/$dev_name/gpiod:foo/values``
+
+**Attribute:** ``/sys/kernel/debug/gpio-virtuser/$dev_name/gpiod:foo/values_atomic``
+
+Both attributes allow reading and setting arrays of GPIO values. The user must pass
+exactly the number of values that the array contains in the form of a string
+containing zeroes and ones representing inactive and active GPIO states
+respectively. In this example: ``echo 11 > values``.
+
+The ``values_atomic`` attribute works the same as ``values`` but the kernel
+will execute the GPIO driver callbacks in interrupt context.
+
+**Group:** ``/sys/kernel/debug/gpio-virtuser/$dev_name/gpiod:foo:$index/``
+
+This is a group that represents a single GPIO with ``$index`` being its offset
+in the array.
+
+**Attribute:** ``/sys/kernel/debug/gpio-virtuser/$dev_name/gpiod:foo:$index/consumer``
+
+Allows setting and reading the consumer label of the GPIO line.
+
+**Attribute:** ``/sys/kernel/debug/gpio-virtuser/$dev_name/gpiod:foo:$index/debounce``
+
+Allows setting and reading the debounce period of the GPIO line.
+
+**Attribute:** ``/sys/kernel/debug/gpio-virtuser/$dev_name/gpiod:foo:$index/direction``
+
+**Attribute:** ``/sys/kernel/debug/gpio-virtuser/$dev_name/gpiod:foo:$index/direction_atomic``
+
+These two attributes allow setting the direction of the GPIO line. They accept
+"input" and "output" as values. The atomic variant executes the driver callback
+in interrupt context.
+
+**Attribute:** ``/sys/kernel/debug/gpio-virtuser/$dev_name/gpiod:foo:$index/interrupts``
+
+If the line is requested in input mode, writing ``1`` to this attribute will
+make the module listen for edge interrupts on the GPIO. Writing ``0`` disables
+the monitoring. Reading this attribute returns the current number of registered
+interrupts (both edges).
+
+**Attribute:** ``/sys/kernel/debug/gpio-virtuser/$dev_name/gpiod:foo:$index/value``
+
+**Attribute:** ``/sys/kernel/debug/gpio-virtuser/$dev_name/gpiod:foo:$index/value_atomic``
+
+Both attributes allow reading and setting values of individual requested GPIO lines.
+They accept the following values: ``1`` and ``0``.
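The configfs steps described in gpio-virtuser.rst above (create the device group, add a lookup group with one sub-directory per GPIO, fill in ``key`` and ``offset``, then write ``1`` to ``live``) can be sketched in Python as follows. This is not part of the patch. The helper names ``write_attr``, ``create_virtuser`` and ``array_values``, the chip label ``example-chip``, and passing the configfs mount point as ``root`` are all assumptions for illustration; only the directory and attribute names come from the text.

```python
import os


def write_attr(path, value):
    """Write a single value to an attribute file."""
    with open(path, "w") as f:
        f.write(str(value))


def create_virtuser(root, dev, con_id, gpios):
    """Configure and activate one virtual consumer (illustrative sketch).

    root   -- configfs mount point (the document uses /config)
    dev    -- name of the device group, e.g. "example-name"
    con_id -- lookup name; "foo" corresponds to the foo-gpios property
    gpios  -- maps a user-chosen sub-directory name to a (key, offset) pair
    """
    dev_dir = os.path.join(root, "gpio-virtuser", dev)
    os.makedirs(dev_dir, exist_ok=True)
    for name, (key, offset) in gpios.items():
        line_dir = os.path.join(dev_dir, con_id, name)
        os.makedirs(line_dir, exist_ok=True)
        write_attr(os.path.join(line_dir, "key"), key)
        write_attr(os.path.join(line_dir, "offset"), offset)
    # Writing 1 to 'live' instantiates the device once it is configured.
    write_attr(os.path.join(dev_dir, "live"), 1)
    return dev_dir


def array_values(states):
    """Build the string of ones and zeroes expected by 'values'."""
    return "".join("1" if s else "0" for s in states)


if __name__ == "__main__":
    # Hypothetical usage against a real configfs mount (requires root):
    # create_virtuser("/config", "example-name", "foo",
    #                 {"0": ("example-chip", 0), "1": ("example-chip", 4)})
    print(array_values([True, True]))
```

For the two-line ``foo-gpios`` array used in the debugfs example, ``array_values([True, True])`` produces the same string as ``echo 11 > values``.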
diff --git a/Documentation/admin-guide/gpio/index.rst b/Documentation/admin-guide/gpio/index.rst
index f6861ca16ffe..712f379731cb 100644
--- a/Documentation/admin-guide/gpio/index.rst
+++ b/Documentation/admin-guide/gpio/index.rst
@@ -1,16 +1,17 @@
.. SPDX-License-Identifier: GPL-2.0
====
-gpio
+GPIO
====
.. toctree::
:maxdepth: 1
+ Character Device Userspace API <../../userspace-api/gpio/chardev>
gpio-aggregator
- sysfs
- gpio-mockup
gpio-sim
+ gpio-virtuser
+ Obsolete APIs <obsolete>
.. only:: subproject and html
diff --git a/Documentation/admin-guide/gpio/obsolete.rst b/Documentation/admin-guide/gpio/obsolete.rst
new file mode 100644
index 000000000000..5adbff02d61f
--- /dev/null
+++ b/Documentation/admin-guide/gpio/obsolete.rst
@@ -0,0 +1,13 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+==================
+Obsolete GPIO APIs
+==================
+
+.. toctree::
+ :maxdepth: 1
+
+ Character Device Userspace API (v1) <../../userspace-api/gpio/chardev_v1>
+ Sysfs Interface <../../userspace-api/gpio/sysfs>
+ Mockup Testing Module <gpio-mockup>
+
diff --git a/Documentation/admin-guide/gpio/sysfs.rst b/Documentation/admin-guide/gpio/sysfs.rst
deleted file mode 100644
index 35171d15f78d..000000000000
--- a/Documentation/admin-guide/gpio/sysfs.rst
+++ /dev/null
@@ -1,167 +0,0 @@
-GPIO Sysfs Interface for Userspace
-==================================
-
-.. warning::
-
- THIS ABI IS DEPRECATED, THE ABI DOCUMENTATION HAS BEEN MOVED TO
- Documentation/ABI/obsolete/sysfs-gpio AND NEW USERSPACE CONSUMERS
- ARE SUPPOSED TO USE THE CHARACTER DEVICE ABI. THIS OLD SYSFS ABI WILL
- NOT BE DEVELOPED (NO NEW FEATURES), IT WILL JUST BE MAINTAINED.
-
-Refer to the examples in tools/gpio/* for an introduction to the new
-character device ABI. Also see the userspace header in
-include/uapi/linux/gpio.h
-
-The deprecated sysfs ABI
-------------------------
-Platforms which use the "gpiolib" implementors framework may choose to
-configure a sysfs user interface to GPIOs. This is different from the
-debugfs interface, since it provides control over GPIO direction and
-value instead of just showing a gpio state summary. Plus, it could be
-present on production systems without debugging support.
-
-Given appropriate hardware documentation for the system, userspace could
-know for example that GPIO #23 controls the write protect line used to
-protect boot loader segments in flash memory. System upgrade procedures
-may need to temporarily remove that protection, first importing a GPIO,
-then changing its output state, then updating the code before re-enabling
-the write protection. In normal use, GPIO #23 would never be touched,
-and the kernel would have no need to know about it.
-
-Again depending on appropriate hardware documentation, on some systems
-userspace GPIO can be used to determine system configuration data that
-standard kernels won't know about. And for some tasks, simple userspace
-GPIO drivers could be all that the system really needs.
-
-DO NOT ABUSE SYSFS TO CONTROL HARDWARE THAT HAS PROPER KERNEL DRIVERS.
-PLEASE READ THE DOCUMENT AT Documentation/driver-api/gpio/drivers-on-gpio.rst
-TO AVOID REINVENTING KERNEL WHEELS IN USERSPACE. I MEAN IT. REALLY.
-
-Paths in Sysfs
---------------
-There are three kinds of entries in /sys/class/gpio:
-
- - Control interfaces used to get userspace control over GPIOs;
-
- - GPIOs themselves; and
-
- - GPIO controllers ("gpio_chip" instances).
-
-That's in addition to standard files including the "device" symlink.
-
-The control interfaces are write-only:
-
- /sys/class/gpio/
-
- "export" ...
- Userspace may ask the kernel to export control of
- a GPIO to userspace by writing its number to this file.
-
- Example: "echo 19 > export" will create a "gpio19" node
- for GPIO #19, if that's not requested by kernel code.
-
- "unexport" ...
- Reverses the effect of exporting to userspace.
-
- Example: "echo 19 > unexport" will remove a "gpio19"
- node exported using the "export" file.
-
-GPIO signals have paths like /sys/class/gpio/gpio42/ (for GPIO #42)
-and have the following read/write attributes:
-
- /sys/class/gpio/gpioN/
-
- "direction" ...
- reads as either "in" or "out". This value may
- normally be written. Writing as "out" defaults to
- initializing the value as low. To ensure glitch free
- operation, values "low" and "high" may be written to
- configure the GPIO as an output with that initial value.
-
- Note that this attribute *will not exist* if the kernel
- doesn't support changing the direction of a GPIO, or
- it was exported by kernel code that didn't explicitly
- allow userspace to reconfigure this GPIO's direction.
-
- "value" ...
- reads as either 0 (low) or 1 (high). If the GPIO
- is configured as an output, this value may be written;
- any nonzero value is treated as high.
-
- If the pin can be configured as interrupt-generating interrupt
- and if it has been configured to generate interrupts (see the
- description of "edge"), you can poll(2) on that file and
- poll(2) will return whenever the interrupt was triggered. If
- you use poll(2), set the events POLLPRI and POLLERR. If you
- use select(2), set the file descriptor in exceptfds. After
- poll(2) returns, either lseek(2) to the beginning of the sysfs
- file and read the new value or close the file and re-open it
- to read the value.
-
- "edge" ...
- reads as either "none", "rising", "falling", or
- "both". Write these strings to select the signal edge(s)
- that will make poll(2) on the "value" file return.
-
- This file exists only if the pin can be configured as an
- interrupt generating input pin.
-
- "active_low" ...
- reads as either 0 (false) or 1 (true). Write
- any nonzero value to invert the value attribute both
- for reading and writing. Existing and subsequent
- poll(2) support configuration via the edge attribute
- for "rising" and "falling" edges will follow this
- setting.
-
-GPIO controllers have paths like /sys/class/gpio/gpiochip42/ (for the
-controller implementing GPIOs starting at #42) and have the following
-read-only attributes:
-
- /sys/class/gpio/gpiochipN/
-
- "base" ...
- same as N, the first GPIO managed by this chip
-
- "label" ...
- provided for diagnostics (not always unique)
-
- "ngpio" ...
- how many GPIOs this manages (N to N + ngpio - 1)
-
-Board documentation should in most cases cover what GPIOs are used for
-what purposes. However, those numbers are not always stable; GPIOs on
-a daughtercard might be different depending on the base board being used,
-or other cards in the stack. In such cases, you may need to use the
-gpiochip nodes (possibly in conjunction with schematics) to determine
-the correct GPIO number to use for a given signal.
-
-
-Exporting from Kernel code
---------------------------
-Kernel code can explicitly manage exports of GPIOs which have already been
-requested using gpio_request()::
-
- /* export the GPIO to userspace */
- int gpiod_export(struct gpio_desc *desc, bool direction_may_change);
-
- /* reverse gpiod_export() */
- void gpiod_unexport(struct gpio_desc *desc);
-
- /* create a sysfs link to an exported GPIO node */
- int gpiod_export_link(struct device *dev, const char *name,
- struct gpio_desc *desc);
-
-After a kernel driver requests a GPIO, it may only be made available in
-the sysfs interface by gpiod_export(). The driver can control whether the
-signal direction may change. This helps drivers prevent userspace code
-from accidentally clobbering important system state.
-
-This explicit exporting can help with debugging (by making some kinds
-of experiments easier), or can provide an always-there interface that's
-suitable for documenting as part of a board support package.
-
-After the GPIO has been exported, gpiod_export_link() allows creating
-symlinks from elsewhere in sysfs to the GPIO sysfs node. Drivers can
-use this to provide the interface under their own device in sysfs with
-a descriptive name.
diff --git a/Documentation/admin-guide/hw-vuln/core-scheduling.rst b/Documentation/admin-guide/hw-vuln/core-scheduling.rst
index cf1eeefdfc32..a92e10ec402e 100644
--- a/Documentation/admin-guide/hw-vuln/core-scheduling.rst
+++ b/Documentation/admin-guide/hw-vuln/core-scheduling.rst
@@ -67,8 +67,8 @@ arg4:
will be performed for all tasks in the task group of ``pid``.
arg5:
- userspace pointer to an unsigned long for storing the cookie returned by
- ``PR_SCHED_CORE_GET`` command. Should be 0 for all other commands.
+ userspace pointer to an unsigned long long for storing the cookie returned
+ by ``PR_SCHED_CORE_GET`` command. Should be 0 for all other commands.
In order for a process to push a cookie to, or pull a cookie from a process, it
is required to have the ptrace access mode: `PTRACE_MODE_READ_REALCREDS` to the
diff --git a/Documentation/admin-guide/hw-vuln/index.rst b/Documentation/admin-guide/hw-vuln/index.rst
index de99caabf65a..ff0b440ef2dc 100644
--- a/Documentation/admin-guide/hw-vuln/index.rst
+++ b/Documentation/admin-guide/hw-vuln/index.rst
@@ -21,3 +21,4 @@ are configurable at compile, boot or run time.
cross-thread-rsb
srso
gather_data_sampling
+ reg-file-data-sampling
diff --git a/Documentation/admin-guide/hw-vuln/reg-file-data-sampling.rst b/Documentation/admin-guide/hw-vuln/reg-file-data-sampling.rst
new file mode 100644
index 000000000000..0585d02b9a6c
--- /dev/null
+++ b/Documentation/admin-guide/hw-vuln/reg-file-data-sampling.rst
@@ -0,0 +1,104 @@
+==================================
+Register File Data Sampling (RFDS)
+==================================
+
+Register File Data Sampling (RFDS) is a microarchitectural vulnerability that
+only affects Intel Atom parts (also branded as E-cores). RFDS may allow
+a malicious actor to infer data values previously used in floating point
+registers, vector registers, or integer registers. RFDS does not provide the
+ability to choose which data is inferred. CVE-2023-28746 is assigned to RFDS.
+
+Affected Processors
+===================
+Below is the list of affected Intel processors [#f1]_:
+
+ =================== ============
+ Common name Family_Model
+ =================== ============
+ ATOM_GOLDMONT 06_5CH
+ ATOM_GOLDMONT_D 06_5FH
+ ATOM_GOLDMONT_PLUS 06_7AH
+ ATOM_TREMONT_D 06_86H
+ ATOM_TREMONT 06_96H
+ ALDERLAKE 06_97H
+ ALDERLAKE_L 06_9AH
+ ATOM_TREMONT_L 06_9CH
+ RAPTORLAKE 06_B7H
+ RAPTORLAKE_P 06_BAH
+ ATOM_GRACEMONT 06_BEH
+ RAPTORLAKE_S 06_BFH
+ =================== ============
+
+As an exception to this table, Intel Xeon E family parts ALDERLAKE (06_97H) and
+RAPTORLAKE (06_B7H) codenamed Catlow are not affected. They are reported as
+vulnerable in Linux because they share the same family/model with an affected
+part. Unlike their affected counterparts, they do not enumerate RFDS_CLEAR or
+CPUID.HYBRID. This information could be used to distinguish between the
+affected and unaffected parts, but it is deemed not worth adding complexity as
+the reporting is fixed automatically when these parts enumerate RFDS_NO.
+
+Mitigation
+==========
+Intel released a microcode update that enables software to clear sensitive
+information using the VERW instruction. RFDS uses the same mitigation
+strategy as MDS: forcing the CPU to clear the affected buffers before an
+attacker can extract the secrets. This is achieved by using the otherwise
+unused and obsolete VERW instruction in combination with a microcode update.
+The microcode clears the affected CPU buffers when the VERW instruction is
+executed.
+
+Mitigation points
+-----------------
+VERW is executed by the kernel before returning to user space, and by KVM
+before VMentry. None of the affected cores support SMT, so VERW is not required
+at C-state transitions.
+
+New bits in IA32_ARCH_CAPABILITIES
+----------------------------------
+Newer processors, and microcode updates for existing affected processors, added
+new bits to the IA32_ARCH_CAPABILITIES MSR. These bits can be used to enumerate
+vulnerability and mitigation capability:
+
+- Bit 27 - RFDS_NO - When set, processor is not affected by RFDS.
+- Bit 28 - RFDS_CLEAR - When set, processor is affected by RFDS, and has the
+ microcode that clears the affected buffers on VERW execution.
+
+Mitigation control on the kernel command line
+---------------------------------------------
+The kernel command line allows controlling RFDS mitigation at boot time with the
+parameter "reg_file_data_sampling=". The valid arguments are:
+
+ ========== =================================================================
+ on If the CPU is vulnerable, enable mitigation; CPU buffer clearing
+ on exit to userspace and before entering a VM.
+ off Disables mitigation.
+ ========== =================================================================
+
+Mitigation default is selected by CONFIG_MITIGATION_RFDS.
+
+Mitigation status information
+-----------------------------
+The Linux kernel provides a sysfs interface to enumerate the current
+vulnerability status of the system: whether the system is vulnerable, and
+which mitigations are active. The relevant sysfs file is:
+
+ /sys/devices/system/cpu/vulnerabilities/reg_file_data_sampling
+
+The possible values in this file are:
+
+ .. list-table::
+
+ * - 'Not affected'
+ - The processor is not vulnerable
+ * - 'Vulnerable'
+ - The processor is vulnerable, but no mitigation is enabled
+ * - 'Vulnerable: No microcode'
+ - The processor is vulnerable but microcode is not updated.
+ * - 'Mitigation: Clear Register File'
+ - The processor is vulnerable and the CPU buffer clearing mitigation is
+ enabled.
+
+References
+----------
+.. [#f1] Affected Processors
+ https://www.intel.com/content/www/us/en/developer/topic-technology/software-security-guidance/processors-affected-consolidated-product-cpu-model.html
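The two IA32_ARCH_CAPABILITIES bits and the sysfs status file described in reg-file-data-sampling.rst above can be inspected from user space. The following is a minimal Python sketch, not part of the patch. The function names and the returned dictionary layout are assumptions for illustration; how the raw MSR value is obtained is left out and is not covered by the document.

```python
# Bit positions as listed in the document.
RFDS_NO = 1 << 27      # set: processor is not affected by RFDS
RFDS_CLEAR = 1 << 28   # set: affected, microcode clears buffers on VERW

SYSFS_PATH = "/sys/devices/system/cpu/vulnerabilities/reg_file_data_sampling"


def decode_rfds_bits(arch_capabilities):
    """Decode the RFDS-related bits from a raw IA32_ARCH_CAPABILITIES value."""
    return {
        "rfds_no": bool(arch_capabilities & RFDS_NO),
        "rfds_clear": bool(arch_capabilities & RFDS_CLEAR),
    }


def read_rfds_status(path=SYSFS_PATH):
    """Return the kernel's status string, or None if the file is unavailable."""
    try:
        with open(path) as f:
            return f.read().strip()
    except OSError:
        return None


if __name__ == "__main__":
    # Prints e.g. one of the status strings from the table, or None on
    # kernels that do not expose this file.
    print(read_rfds_status())
```

Reading the sysfs file is usually sufficient for administrators; decoding the MSR bits is only needed when the raw capability value is already at hand.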
diff --git a/Documentation/admin-guide/hw-vuln/spectre.rst b/Documentation/admin-guide/hw-vuln/spectre.rst
index 32a8893e5617..132e0bc6007e 100644
--- a/Documentation/admin-guide/hw-vuln/spectre.rst
+++ b/Documentation/admin-guide/hw-vuln/spectre.rst
@@ -138,11 +138,10 @@ associated with the source address of the indirect branch. Specifically,
the BHB might be shared across privilege levels even in the presence of
Enhanced IBRS.
-Currently the only known real-world BHB attack vector is via
-unprivileged eBPF. Therefore, it's highly recommended to not enable
-unprivileged eBPF, especially when eIBRS is used (without retpolines).
-For a full mitigation against BHB attacks, it's recommended to use
-retpolines (or eIBRS combined with retpolines).
+Previously the only known real-world BHB attack vector was via unprivileged
+eBPF. Further research has found attacks that don't require unprivileged eBPF.
+For a full mitigation against BHB attacks it is recommended to set BHI_DIS_S or
+use the BHB clearing sequence.
Attack scenarios
----------------
@@ -430,6 +429,23 @@ The possible values in this file are:
'PBRSB-eIBRS: Not affected' CPU is not affected by PBRSB
=========================== =======================================================
+ - Branch History Injection (BHI) protection status:
+
+.. list-table::
+
+ * - BHI: Not affected
+ - System is not affected
+ * - BHI: Retpoline
+ - System is protected by retpoline
+ * - BHI: BHI_DIS_S
+ - System is protected by BHI_DIS_S
+ * - BHI: SW loop, KVM SW loop
+ - System is protected by software clearing sequence
+ * - BHI: Vulnerable
+ - System is vulnerable to BHI
+ * - BHI: Vulnerable, KVM: SW loop
+ - System is vulnerable; KVM is protected by software clearing sequence
+
Full mitigation might require a microcode update from the CPU
vendor. When the necessary microcode is not available, the kernel will
report vulnerability.
@@ -473,8 +489,8 @@ Spectre variant 2
-mindirect-branch=thunk-extern -mindirect-branch-register options.
If the kernel is compiled with a Clang compiler, the compiler needs
to support -mretpoline-external-thunk option. The kernel config
- CONFIG_RETPOLINE needs to be turned on, and the CPU needs to run with
- the latest updated microcode.
+ CONFIG_MITIGATION_RETPOLINE needs to be turned on, and the CPU needs
+ to run with the latest updated microcode.
On Intel Skylake-era systems the mitigation covers most, but not all,
cases. See :ref:`[3] <spec_ref3>` for more details.
@@ -484,7 +500,11 @@ Spectre variant 2
Systems which support enhanced IBRS (eIBRS) enable IBRS protection once at
boot, by setting the IBRS bit, and they're automatically protected against
- Spectre v2 variant attacks.
+ some Spectre v2 variant attacks. The BHB can still influence the choice of
+ indirect branch predictor entry, and although branch predictor entries are
+ isolated between modes when eIBRS is enabled, the BHB itself is not isolated
+ between modes. Systems which support BHI_DIS_S will set it to protect against
+ BHI attacks.
On Intel's enhanced IBRS systems, this includes cross-thread branch target
injections on SMT systems (STIBP). In other words, Intel eIBRS enables
@@ -572,73 +592,19 @@ Spectre variant 2
Mitigation control on the kernel command line
---------------------------------------------
-Spectre variant 2 mitigation can be disabled or force enabled at the
-kernel command line.
-
- nospectre_v1
-
- [X86,PPC] Disable mitigations for Spectre Variant 1
- (bounds check bypass). With this option data leaks are
- possible in the system.
-
- nospectre_v2
-
- [X86] Disable all mitigations for the Spectre variant 2
- (indirect branch prediction) vulnerability. System may
- allow data leaks with this option, which is equivalent
- to spectre_v2=off.
-
-
- spectre_v2=
-
- [X86] Control mitigation of Spectre variant 2
- (indirect branch speculation) vulnerability.
- The default operation protects the kernel from
- user space attacks.
-
- on
- unconditionally enable, implies
- spectre_v2_user=on
- off
- unconditionally disable, implies
- spectre_v2_user=off
- auto
- kernel detects whether your CPU model is
- vulnerable
-
- Selecting 'on' will, and 'auto' may, choose a
- mitigation method at run time according to the
- CPU, the available microcode, the setting of the
- CONFIG_RETPOLINE configuration option, and the
- compiler with which the kernel was built.
-
- Selecting 'on' will also enable the mitigation
- against user space to user space task attacks.
-
- Selecting 'off' will disable both the kernel and
- the user space protections.
-
- Specific mitigations can also be selected manually:
-
- retpoline auto pick between generic,lfence
- retpoline,generic Retpolines
- retpoline,lfence LFENCE; indirect branch
- retpoline,amd alias for retpoline,lfence
- eibrs Enhanced/Auto IBRS
- eibrs,retpoline Enhanced/Auto IBRS + Retpolines
- eibrs,lfence Enhanced/Auto IBRS + LFENCE
- ibrs use IBRS to protect kernel
+In general the kernel selects reasonable default mitigations for the
+current CPU.
- Not specifying this option is equivalent to
- spectre_v2=auto.
+Spectre default mitigations can be disabled or changed at the kernel
+command line with the following options:
- In general the kernel by default selects
- reasonable mitigations for the current CPU. To
- disable Spectre variant 2 mitigations, boot with
- spectre_v2=off. Spectre variant 1 mitigations
- cannot be disabled.
+ - nospectre_v1
+ - nospectre_v2
+ - spectre_v2={option}
+ - spectre_v2_user={option}
+ - spectre_bhi={option}
-For spectre_v2_user see Documentation/admin-guide/kernel-parameters.txt
+For more details on the available options, refer to Documentation/admin-guide/kernel-parameters.txt
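To check which of the options listed above are present on a given command line, something like the following sketch may help. It assumes the command line text comes from /proc/cmdline, uses ``shlex`` as an approximation of the kernel's double-quote handling, and treats hyphens and underscores in parameter names as equivalent; the helper name ``spectre_options`` is hypothetical.

```python
import shlex

# Option names taken from the list above.
SPECTRE_OPTIONS = (
    "nospectre_v1",
    "nospectre_v2",
    "spectre_v2",
    "spectre_v2_user",
    "spectre_bhi",
)


def spectre_options(cmdline: str) -> dict:
    """Map each Spectre-related option found to its value (None for bare flags)."""
    found = {}
    # shlex.split only approximates the kernel's own quoting rules.
    for token in shlex.split(cmdline):
        name, sep, value = token.partition("=")
        name = name.replace("-", "_")  # hyphens and underscores are equivalent
        if name in SPECTRE_OPTIONS:
            found[name] = value if sep else None
    return found


example = "ro quiet spectre_v2=retpoline,lfence nospectre_v1 spectre-bhi=off"
print(spectre_options(example))
```

The sketch only reports what was requested on the command line; the mitigation the kernel actually selected is reported through sysfs.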
Mitigation selection guide
--------------------------
diff --git a/Documentation/admin-guide/hw-vuln/srso.rst b/Documentation/admin-guide/hw-vuln/srso.rst
index e715bfc09879..2ad1c05b8c88 100644
--- a/Documentation/admin-guide/hw-vuln/srso.rst
+++ b/Documentation/admin-guide/hw-vuln/srso.rst
@@ -135,7 +135,7 @@ and does not want to suffer the performance impact, one can always
disable the mitigation with spec_rstack_overflow=off.
Similarly, 'Mitigation: IBPB' is another full mitigation type employing
-an indrect branch prediction barrier after having applied the required
+an indirect branch prediction barrier after having applied the required
microcode patch for one's system. This mitigation comes also at
a performance cost.
@@ -158,3 +158,72 @@ poisoned BTB entry and using that safe one for all function returns.
In older Zen1 and Zen2, this is accomplished using a reinterpretation
technique similar to Retbleed one: srso_untrain_ret() and
srso_safe_ret().
+
+Checking the safe RET mitigation actually works
+-----------------------------------------------
+
+In case one wants to validate whether the SRSO safe RET mitigation works
+on a kernel, one could use two performance counters
+
+* PMC_0xc8 - Count of RET/RET lw retired
+* PMC_0xc9 - Count of RET/RET lw retired mispredicted
+
+and compare the number of RETs retired properly vs those retired
+mispredicted, in kernel mode. Another way of specifying those events
+is::
+
+ # perf list ex_ret_near_ret
+
+ List of pre-defined events (to be used in -e or -M):
+
+ core:
+ ex_ret_near_ret
+ [Retired Near Returns]
+ ex_ret_near_ret_mispred
+ [Retired Near Returns Mispredicted]
+
+Either the command using the event mnemonics::
+
+ # perf stat -e ex_ret_near_ret:k -e ex_ret_near_ret_mispred:k sleep 10s
+
+or using the raw PMC numbers::
+
+ # perf stat -e cpu/event=0xc8,umask=0/k -e cpu/event=0xc9,umask=0/k sleep 10s
+
+should give the same amount. I.e., every RET retired should be
+mispredicted::
+
+ [root@brent: ~/kernel/linux/tools/perf> ./perf stat -e cpu/event=0xc8,umask=0/k -e cpu/event=0xc9,umask=0/k sleep 10s
+
+ Performance counter stats for 'sleep 10s':
+
+ 137,167 cpu/event=0xc8,umask=0/k
+ 137,173 cpu/event=0xc9,umask=0/k
+
+ 10.004110303 seconds time elapsed
+
+ 0.000000000 seconds user
+ 0.004462000 seconds sys
+
+vs the case when the mitigation is disabled (spec_rstack_overflow=off)
+or not functioning properly, which usually shows a much smaller number
+of mispredicted retired RETs than the overall count of retired RETs
+during a workload::
+
+ [root@brent: ~/kernel/linux/tools/perf> ./perf stat -e cpu/event=0xc8,umask=0/k -e cpu/event=0xc9,umask=0/k sleep 10s
+
+ Performance counter stats for 'sleep 10s':
+
+ 201,627 cpu/event=0xc8,umask=0/k
+ 4,074 cpu/event=0xc9,umask=0/k
+
+ 10.003267252 seconds time elapsed
+
+ 0.002729000 seconds user
+ 0.000000000 seconds sys
+
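The comparison of the two counters above can be automated. The sketch below parses ``perf stat`` text in the layout shown in the two samples and returns the ratio of mispredicted to retired RETs. The helper name ``mispredict_ratio`` is hypothetical, and the parsing assumes the raw event names used in the examples above.

```python
import re

# A counter line looks like: "137,167 cpu/event=0xc8,umask=0/k"
COUNT_LINE = re.compile(r"^\s*([\d,]+)\s+(\S+)")

RETIRED = "cpu/event=0xc8,umask=0/k"
MISPREDICTED = "cpu/event=0xc9,umask=0/k"


def mispredict_ratio(perf_output: str) -> float:
    """Return mispredicted RETs divided by retired RETs."""
    counts = {}
    for line in perf_output.splitlines():
        match = COUNT_LINE.match(line)
        if match:
            counts[match.group(2)] = int(match.group(1).replace(",", ""))
    return counts[MISPREDICTED] / counts[RETIRED]


# Counter values copied from the two sample runs above.
mitigated = """
 137,167 cpu/event=0xc8,umask=0/k
 137,173 cpu/event=0xc9,umask=0/k
 10.004110303 seconds time elapsed
"""
unmitigated = """
 201,627 cpu/event=0xc8,umask=0/k
 4,074 cpu/event=0xc9,umask=0/k
 10.003267252 seconds time elapsed
"""
print(round(mispredict_ratio(mitigated), 3))
print(round(mispredict_ratio(unmitigated), 3))
```

A ratio close to 1 matches the expectation that every retired RET is mispredicted when safe RET is active; a ratio far below 1 suggests the mitigation is off or not working.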
+Also, there is a selftest which performs the above; go to
+tools/testing/selftests/x86/ and do::
+
+ make srso
+ ./srso
diff --git a/Documentation/admin-guide/index.rst b/Documentation/admin-guide/index.rst
index fb40a1f6f79e..c8af32a8f800 100644
--- a/Documentation/admin-guide/index.rst
+++ b/Documentation/admin-guide/index.rst
@@ -1,3 +1,4 @@
+=================================================
The Linux kernel user's and administrator's guide
=================================================
@@ -6,6 +7,9 @@ added to the kernel over time. There is, as yet, little overall order or
organization here — this material was not written to be a single, coherent
document! With luck things will improve quickly over time.
+General guides to kernel administration
+---------------------------------------
+
This initial section contains overall information, including the README
file describing the kernel as a whole, documentation on kernel parameters,
etc.
@@ -14,19 +18,44 @@ etc.
:maxdepth: 1
README
- kernel-parameters
devices
- sysctl/index
- abi
features
-This section describes CPU vulnerabilities and their mitigations.
+A big part of the kernel's administrative interface is the /proc and sysfs
+virtual filesystems; these documents describe how to interact with them.
+
+.. toctree::
+ :maxdepth: 1
+
+ sysfs-rules
+ sysctl/index
+ cputopology
+ abi
+
+Security-related documentation:
.. toctree::
:maxdepth: 1
hw-vuln/index
+ LSM/index
+ perf-security
+
+Booting the kernel
+------------------
+
+.. toctree::
+ :maxdepth: 1
+
+ bootconfig
+ kernel-parameters
+ efi-stub
+ initrd
+
+
+Tracking down and identifying problems
+--------------------------------------
Here is a set of documents aimed at users who are trying to track down
problems and bugs in particular.
@@ -37,6 +66,7 @@ problems and bugs in particular.
reporting-issues
reporting-regressions
quickly-build-trimmed-linux
+ verify-bugs-and-bisect-regressions
bug-hunting
bug-bisect
tainted-kernels
@@ -46,95 +76,120 @@ problems and bugs in particular.
kdump/index
perf/index
pstore-blk
+ clearing-warn-once
+ kernel-per-CPU-kthreads
+ lockup-watchdogs
+ RAS/index
+ sysrq
-This is the beginning of a section with information of interest to
-application developers. Documents covering various aspects of the kernel
-ABI will be found here.
+
+Core-kernel subsystems
+----------------------
+
+These documents describe core-kernel administration interfaces that are
+likely to be of interest on almost any system.
.. toctree::
:maxdepth: 1
- sysfs-rules
+ cgroup-v2
+ cgroup-v1/index
+ cpu-load
+ mm/index
+ module-signing
+ namespaces/index
+ numastat
+ pm/index
+ syscall-user-dispatch
-This is the beginning of a section with information of interest to
-application developers and system integrators doing analysis of the
-Linux kernel for safety critical applications. Documents supporting
-analysis of kernel interactions with applications, and key kernel
-subsystems expectations will be found here.
+Support for non-native binary formats. Note that some of these
+documents are ... old ...
.. toctree::
:maxdepth: 1
- workload-tracing
+ binfmt-misc
+ java
+ mono
-The rest of this manual consists of various unordered guides on how to
-configure specific aspects of kernel behavior to your liking.
+
+Block-layer and filesystem administration
+-----------------------------------------
.. toctree::
:maxdepth: 1
- acpi/index
- aoe/index
- auxdisplay/index
bcache
binderfs
- binfmt-misc
blockdev/index
- bootconfig
- braille-console
- btmrvl
- cgroup-v1/index
- cgroup-v2
cifs/index
- clearing-warn-once
- cpu-load
- cputopology
- dell_rbu
device-mapper/index
- edid
- efi-stub
ext4
filesystem-monitoring
nfs/index
- gpio/index
- highuid
- hw_random
- initrd
iostats
- java
jfs
- kernel-per-CPU-kthreads
+ md
+ ufs
+ xfs
+
+Device-specific guides
+----------------------
+
+How to configure your hardware within your Linux system.
+
+.. toctree::
+ :maxdepth: 1
+
+ acpi/index
+ aoe/index
+ auxdisplay/index
+ braille-console
+ btmrvl
+ dell_rbu
+ edid
+ gpio/index
+ hw_random
laptops/index
lcd-panel-cgram
- ldm
- lockup-watchdogs
- LSM/index
- md
media/index
- mm/index
- module-signing
- mono
- namespaces/index
- numastat
+ nvme-multipath
parport
- perf-security
- pm/index
- pmf
pnp
rapidio
- ras
rtc
serial-console
svga
- syscall-user-dispatch
- sysrq
thermal/index
thunderbolt
- ufs
- unicode
vga-softcursor
video-output
- xfs
+
+Workload analysis
+-----------------
+
+This is the beginning of a section with information of interest to
+application developers and system integrators doing analysis of the
+Linux kernel for safety critical applications. Documents supporting
+analysis of kernel interactions with applications, and key kernel
+subsystem expectations will be found here.
+
+.. toctree::
+ :maxdepth: 1
+
+ workload-tracing
+
+Everything else
+---------------
+
+A few hard-to-categorize and generally obsolete documents.
+
+.. toctree::
+ :maxdepth: 1
+
+ highuid
+ ldm
+ unicode
.. only:: subproject and html
diff --git a/Documentation/admin-guide/kdump/kdump.rst b/Documentation/admin-guide/kdump/kdump.rst
index 5762e7477a0c..5376890adbeb 100644
--- a/Documentation/admin-guide/kdump/kdump.rst
+++ b/Documentation/admin-guide/kdump/kdump.rst
@@ -136,10 +136,6 @@ System kernel config options
CONFIG_KEXEC_CORE=y
- Subsequently, CRASH_CORE is selected by KEXEC_CORE::
-
- CONFIG_CRASH_CORE=y
-
2) Enable "sysfs file system support" in "Filesystem" -> "Pseudo
filesystems." This is usually enabled by default::
@@ -168,6 +164,10 @@ Dump-capture kernel config options (Arch Independent)
CONFIG_CRASH_DUMP=y
+ And this will select VMCORE_INFO and CRASH_RESERVE::
+ CONFIG_VMCORE_INFO=y
+ CONFIG_CRASH_RESERVE=y
+
2) Enable "/proc/vmcore support" under "Filesystems" -> "Pseudo filesystems"::
CONFIG_PROC_VMCORE=y
@@ -191,9 +191,7 @@ Dump-capture kernel config options (Arch Dependent, i386 and x86_64)
CPU is enough for kdump kernel to dump vmcore on most of systems.
However, you can also specify nr_cpus=X to enable multiple processors
- in kdump kernel. In this case, "disable_cpu_apicid=" is needed to
- tell kdump kernel which cpu is 1st kernel's BSP. Please refer to
- admin-guide/kernel-parameters.txt for more details.
+ in kdump kernel.
With CONFIG_SMP=n, the above things are not related.
@@ -454,8 +452,7 @@ Notes on loading the dump-capture kernel:
to use multi-thread programs with it, such as parallel dump feature of
makedumpfile. Otherwise, the multi-thread program may have a great
performance degradation. To enable multi-cpu support, you should bring up an
- SMP dump-capture kernel and specify maxcpus/nr_cpus, disable_cpu_apicid=[X]
- options while loading it.
+ SMP dump-capture kernel and specify maxcpus/nr_cpus options while loading it.
* For s390x there are two kdump modes: If a ELF header is specified with
the elfcorehdr= kernel parameter, it is used by the kdump kernel as it
diff --git a/Documentation/admin-guide/kdump/vmcoreinfo.rst b/Documentation/admin-guide/kdump/vmcoreinfo.rst
index bced9e4b6e08..0f714fc945ac 100644
--- a/Documentation/admin-guide/kdump/vmcoreinfo.rst
+++ b/Documentation/admin-guide/kdump/vmcoreinfo.rst
@@ -65,11 +65,11 @@ Defines the beginning of the text section. In general, _stext indicates
the kernel start address. Used to convert a virtual address from the
direct kernel map to a physical address.
-vmap_area_list
---------------
+VMALLOC_START
+-------------
-Stores the virtual area list. makedumpfile gets the vmalloc start value
-from this variable and its value is necessary for vmalloc translation.
+Stores the base address of the vmalloc area. makedumpfile gets this value
+since it is necessary for vmalloc translation.
mem_map
-------
diff --git a/Documentation/admin-guide/kernel-parameters.rst b/Documentation/admin-guide/kernel-parameters.rst
index 4410384596a9..39d0e7ff0965 100644
--- a/Documentation/admin-guide/kernel-parameters.rst
+++ b/Documentation/admin-guide/kernel-parameters.rst
@@ -27,6 +27,16 @@ kernel command line (/proc/cmdline) and collects module parameters
when it loads a module, so the kernel command line can be used for
loadable modules too.
+This document may not be entirely up to date and comprehensive. The command
+"modinfo -p ${modulename}" shows a current list of all parameters of a loadable
+module. Loadable modules, after being loaded into the running kernel, also
+reveal their parameters in /sys/module/${modulename}/parameters/. Some of these
+parameters may be changed at runtime by the command
+``echo -n ${value} > /sys/module/${modulename}/parameters/${parm}``.
+
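The sysfs directory mentioned above can also be read programmatically. The following is a minimal sketch, not an official tool: the helper name ``module_parameters`` and the ``sysfs_root`` argument are hypothetical, the latter existing only so the function can be pointed at a test directory.

```python
from pathlib import Path


def module_parameters(module: str, sysfs_root: str = "/sys/module") -> dict:
    """Return {parameter name: current value} for a loaded module."""
    params_dir = Path(sysfs_root) / module / "parameters"
    result = {}
    if not params_dir.is_dir():
        # Module not loaded, or it exposes no parameters.
        return result
    for entry in sorted(params_dir.iterdir()):
        try:
            result[entry.name] = entry.read_text().strip()
        except (OSError, UnicodeDecodeError):
            # Some parameters are not readable by the current user.
            result[entry.name] = None
    return result


print(module_parameters("no_such_module"))
```

Values are returned as strings exactly as sysfs presents them; whether a parameter may be written at runtime depends on its file permissions.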
+Special handling
+----------------
+
Hyphens (dashes) and underscores are equivalent in parameter names, so::
log_buf_len=1M print-fatal-signals=1
@@ -39,8 +49,8 @@ Double-quotes can be used to protect spaces in values, e.g.::
param="spaces in here"
-cpu lists:
-----------
+cpu lists
+~~~~~~~~~
Some kernel parameters take a list of CPUs as a value, e.g. isolcpus,
nohz_full, irqaffinity, rcu_nocbs. The format of this list is:
@@ -82,12 +92,17 @@ so that "nohz_full=all" is the equivalent of "nohz_full=0-N".
The semantics of "N" and "all" is supported on a level of bitmaps and holds for
all users of bitmap_parselist().
-This document may not be entirely up to date and comprehensive. The command
-"modinfo -p ${modulename}" shows a current list of all parameters of a loadable
-module. Loadable modules, after being loaded into the running kernel, also
-reveal their parameters in /sys/module/${modulename}/parameters/. Some of these
-parameters may be changed at runtime by the command
-``echo -n ${value} > /sys/module/${modulename}/parameters/${parm}``.
+Metric suffixes
+~~~~~~~~~~~~~~~
+
+The [KMG] suffix is commonly described after a number of kernel
+parameter values. 'K', 'M', 'G', 'T', 'P', and 'E' suffixes are allowed.
+These letters represent the _binary_ multipliers 'Kilo', 'Mega', 'Giga',
+'Tera', 'Peta', and 'Exa', equaling 2^10, 2^20, 2^30, 2^40, 2^50, and
+2^60 bytes respectively. Such letter suffixes can also be entirely omitted.
+
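The suffix rules above can be expressed as a short conversion routine. This is a simplified sketch rather than the kernel's own parser: it accepts decimal numbers only, and accepting lower-case suffix letters is an assumption. The helper name ``parse_size`` is hypothetical.

```python
# Binary multipliers as powers of two, per the description above.
_SHIFT = {"K": 10, "M": 20, "G": 30, "T": 40, "P": 50, "E": 60}


def parse_size(text: str) -> int:
    """Convert a value such as '1M' or '512' to a number of bytes."""
    text = text.strip()
    if not text:
        raise ValueError("empty value")
    suffix = text[-1].upper()  # assumption: lower-case letters are accepted too
    if suffix in _SHIFT:
        digits, shift = text[:-1], _SHIFT[suffix]
    else:
        digits, shift = text, 0  # the suffix may be omitted entirely
    if not digits.isdigit():
        raise ValueError(f"not a decimal number: {text!r}")
    return int(digits) << shift


print(parse_size("1M"))
```

For example, a value such as log_buf_len=1M corresponds to 2^20 bytes under these rules.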
+Kernel Build Options
+--------------------
The parameters listed below are only valid if certain kernel build options
were enabled and if respective hardware is present. This list should be kept
@@ -108,6 +123,7 @@ is applicable::
CMA Contiguous Memory Area support is enabled.
DRM Direct Rendering Management support is enabled.
DYNAMIC_DEBUG Build in debug messages and enable them at runtime
+ EARLY Parameter processed too early to be embedded in initrd.
EDD BIOS Enhanced Disk Drive Services (EDD) is enabled
EFI EFI Partitioning (GPT) is enabled
EVM Extended Verification Module
@@ -117,7 +133,6 @@ is applicable::
HIBERNATION HIBERNATION is enabled.
HW Appropriate hardware is enabled.
HYPER_V HYPERV support is enabled.
- IA-64 IA-64 architecture is enabled.
IMA Integrity measurement architecture is enabled.
IP_PNP IP DHCP, BOOTP, or RARP is enabled.
IPV6 IPv6 support is enabled.
@@ -159,6 +174,7 @@ is applicable::
SCSI Appropriate SCSI support is enabled.
A lot of drivers have their options described inside
the Documentation/scsi/ sub-directory.
+ SDW SoundWire support is enabled.
SECURITY Different security models are enabled.
SELINUX SELinux support is enabled.
SERIAL Serial support is enabled.
@@ -178,8 +194,6 @@ is applicable::
WDT Watchdog support is enabled.
X86-32 X86-32, aka i386 architecture is enabled.
X86-64 X86-64 architecture is enabled.
- More X86-64 boot options can be found in
- Documentation/arch/x86/x86_64/boot-options.rst.
X86 Either 32-bit or 64-bit x86 (same as X86-32+X86-64)
X86_UV SGI UV support is enabled.
XEN Xen support is enabled
@@ -197,7 +211,6 @@ Do not modify the syntax of boot loader parameters without extreme
need or coordination with <Documentation/arch/x86/boot.rst>.
There are also arch-specific kernel-parameters not documented here.
-See for example <Documentation/arch/x86/x86_64/boot-options.rst>.
Note that ALL kernel parameters listed below are CASE SENSITIVE, and that
a trailing = on the name of any parameter states that that parameter will
@@ -211,10 +224,5 @@ a fixed number of characters. This limit depends on the architecture
and is between 256 and 4096 characters. It is defined in the file
./include/uapi/asm-generic/setup.h as COMMAND_LINE_SIZE.
-Finally, the [KMG] suffix is commonly described after a number of kernel
-parameter values. These 'K', 'M', and 'G' letters represent the _binary_
-multipliers 'Kilo', 'Mega', and 'Giga', equaling 2^10, 2^20, and 2^30
-bytes respectively. Such letter suffixes can also be entirely omitted:
-
.. include:: kernel-parameters.txt
:literal:
diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 31b3a25680d0..fb8752b42ec8 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -9,10 +9,10 @@
accept_memory=eager can be used to accept all memory
at once during boot.
- acpi= [HW,ACPI,X86,ARM64,RISCV64]
+ acpi= [HW,ACPI,X86,ARM64,RISCV64,EARLY]
Advanced Configuration and Power Interface
Format: { force | on | off | strict | noirq | rsdt |
- copy_dsdt }
+ copy_dsdt | nocmcff | nospcr }
force -- enable ACPI if default was off
on -- enable ACPI but allow fallback to DT [arm64,riscv64]
off -- disable ACPI if default was on
@@ -21,12 +21,20 @@
strictly ACPI specification compliant.
rsdt -- prefer RSDT over (default) XSDT
copy_dsdt -- copy DSDT to memory
- For ARM64 and RISCV64, ONLY "acpi=off", "acpi=on" or
- "acpi=force" are available
+ nocmcff -- Disable firmware first mode for corrected
+ errors. This disables parsing the HEST CMC error
+ source to check if firmware has set the FF flag. This
+ may result in duplicate corrected error reports.
+ nospcr -- disable console in ACPI SPCR table as
+ default _serial_ console on ARM64
+ For ARM64, ONLY "acpi=off", "acpi=on", "acpi=force" or
+ "acpi=nospcr" are available
+ For RISCV64, ONLY "acpi=off", "acpi=on" or "acpi=force"
+ are available
See also Documentation/power/runtime_pm.rst, pci=noacpi
- acpi_apic_instance= [ACPI, IOAPIC]
+ acpi_apic_instance= [ACPI,IOAPIC,EARLY]
Format: <int>
2: use 2nd APIC table, if available
1,0: use 1st APIC table
@@ -41,7 +49,7 @@
If set to native, use the device's native backlight mode.
If set to none, disable the ACPI backlight interface.
- acpi_force_32bit_fadt_addr
+ acpi_force_32bit_fadt_addr [ACPI,EARLY]
force FADT to use 32 bit addresses rather than the
64 bit X_* addresses. Some firmware have broken 64
bit addresses for force ACPI ignore these and use
@@ -97,7 +105,7 @@
no: ACPI OperationRegions are not marked as reserved,
no further checks are performed.
- acpi_force_table_verification [HW,ACPI]
+ acpi_force_table_verification [HW,ACPI,EARLY]
Enable table checksum verification during early stage.
By default, this is disabled due to x86 early mapping
size limitation.
@@ -137,7 +145,7 @@
acpi_no_memhotplug [ACPI] Disable memory hotplug. Useful for kdump
kernels.
- acpi_no_static_ssdt [HW,ACPI]
+ acpi_no_static_ssdt [HW,ACPI,EARLY]
Disable installation of static SSDTs at early boot time
By default, SSDTs contained in the RSDT/XSDT will be
installed automatically and they will appear under
@@ -151,7 +159,7 @@
Ignore the ACPI-based watchdog interface (WDAT) and let
a native driver control the watchdog device instead.
- acpi_rsdp= [ACPI,EFI,KEXEC]
+ acpi_rsdp= [ACPI,EFI,KEXEC,EARLY]
Pass the RSDP address to the kernel, mostly used
on machines running EFI runtime service to boot the
second kernel for kdump.
@@ -228,10 +236,10 @@
to assume that this machine's pmtimer latches its value
and always returns good values.
- acpi_sci= [HW,ACPI] ACPI System Control Interrupt trigger mode
+ acpi_sci= [HW,ACPI,EARLY] ACPI System Control Interrupt trigger mode
Format: { level | edge | high | low }
- acpi_skip_timer_override [HW,ACPI]
+ acpi_skip_timer_override [HW,ACPI,EARLY]
Recognize and ignore IRQ0/pin2 Interrupt Override.
For broken nForce2 BIOS resulting in XT-PIC timer.
@@ -266,11 +274,11 @@
behave incorrectly in some ways with respect to system
suspend and resume to be ignored (use wisely).
- acpi_use_timer_override [HW,ACPI]
+ acpi_use_timer_override [HW,ACPI,EARLY]
Use timer override. For some broken Nvidia NF5 boards
that require a timer override, but don't have HPET
- add_efi_memmap [EFI; X86] Include EFI memory map in
+ add_efi_memmap [EFI,X86,EARLY] Include EFI memory map in
kernel's map of available physical RAM.
agp= [AGP]
@@ -307,7 +315,7 @@
do not want to use tracing_snapshot_alloc() as it needs
to be done where GFP_KERNEL allocations are allowed.
- allow_mismatched_32bit_el0 [ARM64]
+ allow_mismatched_32bit_el0 [ARM64,EARLY]
Allow execve() of 32-bit applications and setting of the
PER_LINUX32 personality on systems where only a strict
subset of the CPUs support 32-bit EL0. When this
@@ -329,12 +337,17 @@
allowed anymore to lift isolation
requirements as needed. This option
does not override iommu=pt
- force_enable - Force enable the IOMMU on platforms known
- to be buggy with IOMMU enabled. Use this
- option with care.
- pgtbl_v1 - Use v1 page table for DMA-API (Default).
- pgtbl_v2 - Use v2 page table for DMA-API.
- irtcachedis - Disable Interrupt Remapping Table (IRT) caching.
+ force_enable - Force enable the IOMMU on platforms known
+ to be buggy with IOMMU enabled. Use this
+ option with care.
+ pgtbl_v1 - Use v1 page table for DMA-API (Default).
+ pgtbl_v2 - Use v2 page table for DMA-API.
+ irtcachedis - Disable Interrupt Remapping Table (IRT) caching.
+ nohugepages - Limit page-sizes used for v1 page-tables
+ to 4 KiB.
+ v2_pgsizes_only - Limit page-sizes used for v1 page-tables
+ to 4KiB/2MiB/1GiB.
+
amd_iommu_dump= [HW,X86-64]
Enable AMD IOMMU driver option to dump the ACPI table
@@ -351,7 +364,7 @@
This mode requires kvm-amd.avic=1.
(Default when IOMMU HW support is present.)
- amd_pstate= [X86]
+ amd_pstate= [X86,EARLY]
disable
Do not enable amd_pstate as the default
scaling driver for the supported processors
@@ -374,6 +387,11 @@
selects a performance level in this range and appropriate
to the current workload.
+ amd_prefcore=
+ [X86]
+ disable
+ Disable amd-pstate preferred core.
+
amijoy.map= [HW,JOY] Amiga joystick support
Map of devices attached to JOY0DAT and JOY1DAT
Format: <a>,<b>
@@ -391,7 +409,9 @@
not play well with APC CPU idle - disable it if you have
APC and your system crashes randomly.
- apic= [APIC,X86] Advanced Programmable Interrupt Controller
+ apic [APIC,X86-64] Use IO-APIC. Default.
+
+ apic= [APIC,X86,EARLY] Advanced Programmable Interrupt Controller
Change the output verbosity while booting
Format: { quiet (default) | verbose | debug }
Change the amount of debugging information output
@@ -401,7 +421,7 @@
Format: apic=driver_name
Examples: apic=bigsmp
- apic_extnmi= [APIC,X86] External NMI delivery setting
+ apic_extnmi= [APIC,X86,EARLY] External NMI delivery setting
Format: { bsp (default) | all | none }
bsp: External NMI is delivered only to CPU 0
all: External NMIs are broadcast to all CPUs as a
@@ -410,6 +430,10 @@
useful so that a dump capture kernel won't be
shot down by NMI
+ apicpmtimer Do APIC timer calibration using the pmtimer. Implies
+ apicmaintimer. Useful when your PIT timer is totally
+ broken.
+
autoconf= [IPV6]
See Documentation/networking/ipv6.rst.
@@ -426,9 +450,15 @@
arcrimi= [HW,NET] ARCnet - "RIM I" (entirely mem-mapped) cards
Format: <io>,<irq>,<nodeID>
+ arm64.no32bit_el0 [ARM64] Unconditionally disable the execution of
+ 32 bit applications.
+
arm64.nobti [ARM64] Unconditionally disable Branch Target
Identification support
+ arm64.nogcs [ARM64] Unconditionally disable Guarded Control Stack
+ support
+
arm64.nomops [ARM64] Unconditionally disable Memory Copy and Memory
Set instructions support
@@ -505,24 +535,37 @@
Format: <io>,<irq>,<mode>
See header of drivers/net/hamradio/baycom_ser_hdx.c.
+ bdev_allow_write_mounted=
+ Format: <bool>
+ Control the ability to open a mounted block device
+ for writing, i.e., allow / disallow writes that bypass
+ the FS. This was implemented as a means to prevent
+ fuzzers from crashing the kernel by overwriting the
+ metadata underneath a mounted FS without its awareness.
+ This also prevents destructive formatting of mounted
+ filesystems by naive storage tooling that doesn't use
+ O_EXCL. Default is Y and can be changed through the
+ Kconfig option CONFIG_BLK_DEV_WRITE_MOUNTED.
+
bert_disable [ACPI]
Disable BERT OS support on buggy BIOSes.
- bgrt_disable [ACPI][X86]
+ bgrt_disable [ACPI,X86,EARLY]
Disable BGRT to avoid flickering OEM logo.
blkdevparts= Manual partition parsing of block device(s) for
embedded devices based on command line input.
See Documentation/block/cmdline-partition.rst
- boot_delay= Milliseconds to delay each printk during boot.
+ boot_delay= [KNL,EARLY]
+ Milliseconds to delay each printk during boot.
Only works if CONFIG_BOOT_PRINTK_DELAY is enabled,
and you may also have to specify "lpj=". Boot_delay
values larger than 10 seconds (10000) are assumed
erroneous and ignored.
Format: integer
- bootconfig [KNL]
+ bootconfig [KNL,EARLY]
Extended command line options can be added to an initrd
and this will cause the kernel to look for it.
@@ -557,7 +600,7 @@
trust validation.
format: { id:<keyid> | builtin }
- cca= [MIPS] Override the kernel pages' cache coherency
+ cca= [MIPS,EARLY] Override the kernel pages' cache coherency
algorithm. Accepted values range from 0 to 7
inclusive. See arch/mips/include/asm/pgtable-bits.h
for platform specific values (SB1, Loongson3 and
@@ -672,19 +715,13 @@
[X86-64] hpet,tsc
clocksource.arm_arch_timer.evtstrm=
- [ARM,ARM64]
+ [ARM,ARM64,EARLY]
Format: <bool>
Enable/disable the eventstream feature of the ARM
architected timer so that code using WFE-based polling
loops can be debugged more effectively on production
systems.
- clocksource.max_cswd_read_retries= [KNL]
- Number of clocksource_watchdog() retries due to
- external delays before the clock will be marked
- unstable. Defaults to two retries, that is,
- three attempts to read the clock under test.
-
clocksource.verify_n_cpus= [KNL]
Limit the number of CPUs checked for clocksources
marked with CLOCK_SOURCE_VERIFY_PERCPU that
@@ -702,7 +739,7 @@
10 seconds when built into the kernel.
cma=nn[MG]@[start[MG][-end[MG]]]
- [KNL,CMA]
+ [KNL,CMA,EARLY]
Sets the size of kernel global memory area for
contiguous memory allocations and optionally the
placement constraint by the physical address range of
@@ -711,7 +748,7 @@
kernel/dma/contiguous.c
cma_pernuma=nn[MG]
- [KNL,CMA]
+ [KNL,CMA,EARLY]
Sets the size of kernel per-numa memory area for
contiguous memory allocations. A value of 0 disables
per-numa CMA altogether. And If this option is not
@@ -722,7 +759,7 @@
they will fallback to the global default memory area.
numa_cma=<node>:nn[MG][,<node>:nn[MG]]
- [KNL,CMA]
+ [KNL,CMA,EARLY]
Sets the size of kernel numa memory area for
contiguous memory allocations. It will reserve CMA
area for the specified node.
@@ -739,7 +776,7 @@
a hypervisor.
Default: yes
- coherent_pool=nn[KMG] [ARM,KNL]
+ coherent_pool=nn[KMG] [ARM,KNL,EARLY]
Sets the size of memory pool for coherent, atomic dma
allocations, by default set to 256K.
@@ -757,7 +794,7 @@
condev= [HW,S390] console device
conmode=
- con3215_drop= [S390] 3215 console drop mode.
+ con3215_drop= [S390,EARLY] 3215 console drop mode.
Format: y|n|Y|N|1|0
When set to true, drop data on the 3215 console when
the console buffer is full. In this case the
@@ -785,6 +822,25 @@
Documentation/networking/netconsole.rst for an
alternative.
+ <DEVNAME>:<n>.<n>[,options]
+ Use the specified serial port on the serial core bus.
+ The addressing uses DEVNAME of the physical serial port
+ device, followed by the serial core controller instance,
+ and the serial port instance. The options are the same
+ as documented for the ttyS addressing above.
+
+ The mapping of the serial ports to the tty instances
+ can be viewed with:
+
+ $ ls -d /sys/bus/serial-base/devices/*:*.*/tty/*
+ /sys/bus/serial-base/devices/00:04:0.0/tty/ttyS0
+
+ In the above example, the console can be addressed with
+ console=00:04:0.0. Note that a console addressed this
+ way will only get added when the related device driver
+ is ready. The use of an earlycon parameter in addition to
+ the console may be desired for console output early on.
+
uart[8250],io,<addr>[,options]
uart[8250],mmio,<addr>[,options]
uart[8250],mmio16,<addr>[,options]
@@ -863,7 +919,7 @@
kernel before the cpufreq driver probes.
cpu_init_udelay=N
- [X86] Delay for N microsec between assert and de-assert
+ [X86,EARLY] Delay for N microsec between assert and de-assert
of APIC INIT to start processors. This delay occurs
on every CPU online, such as boot, and resume from suspend.
Default: 10000
@@ -875,15 +931,19 @@
the parameter has no effect.
crash_kexec_post_notifiers
- Run kdump after running panic-notifiers and dumping
- kmsg. This only for the users who doubt kdump always
- succeeds in any situation.
- Note that this also increases risks of kdump failure,
- because some panic notifiers can make the crashed
- kernel more unstable.
+ Only jump to kdump kernel after running the panic
+ notifiers and dumping kmsg. This option increases
+ the risks of a kdump failure, since some panic
+ notifiers can make the crashed kernel more unstable.
+ In configurations where kdump may not be reliable,
+ running the panic notifiers could allow collecting
+ more data on dmesg, like stack traces from other CPUs
+ or extra data dumped by panic_print. Note that some
+ configurations enable this option unconditionally,
+ like Hyper-V, PowerPC (fadump) and AMD SEV-SNP.
crashkernel=size[KMG][@offset[KMG]]
- [KNL] Using kexec, Linux can switch to a 'crash kernel'
+ [KNL,EARLY] Using kexec, Linux can switch to a 'crash kernel'
upon panic. This parameter reserves the physical
memory region [offset, offset + size] for that kernel
image. If '@offset' is omitted, then a suitable offset
@@ -954,10 +1014,10 @@
Format: <port#>,<type>
See also Documentation/input/devices/joystick-parport.rst
- debug [KNL] Enable kernel debugging (events log level).
+ debug [KNL,EARLY] Enable kernel debugging (events log level).
debug_boot_weak_hash
- [KNL] Enable printing [hashed] pointers early in the
+ [KNL,EARLY] Enable printing [hashed] pointers early in the
boot sequence. If enabled, we use a weak hash instead
of siphash to hash pointers. Use this option if you are
seeing instances of '(___ptrval___)') and need to see a
@@ -974,10 +1034,10 @@
will print _a_lot_ more information - normally only
useful to lockdep developers.
- debug_objects [KNL] Enable object debugging
+ debug_objects [KNL,EARLY] Enable object debugging
debug_guardpage_minorder=
- [KNL] When CONFIG_DEBUG_PAGEALLOC is set, this
+ [KNL,EARLY] When CONFIG_DEBUG_PAGEALLOC is set, this
parameter allows control of the order of pages that will
be intentionally kept free (and hence protected) by the
buddy allocator. Bigger values increase the probability
@@ -996,7 +1056,7 @@
help tracking down these problems.
debug_pagealloc=
- [KNL] When CONFIG_DEBUG_PAGEALLOC is set, this parameter
+ [KNL,EARLY] When CONFIG_DEBUG_PAGEALLOC is set, this parameter
enables the feature at boot time. By default, it is
disabled and the system will work mostly the same as a
kernel built without CONFIG_DEBUG_PAGEALLOC.
@@ -1004,8 +1064,8 @@
useful to also enable the page_owner functionality.
on: enable the feature
- debugfs= [KNL] This parameter enables what is exposed to userspace
- and debugfs internal clients.
+ debugfs= [KNL,EARLY] This parameter enables what is exposed to
+ userspace and debugfs internal clients.
Format: { on, no-mount, off }
on: All functions are enabled.
no-mount:
@@ -1084,7 +1144,7 @@
dhash_entries= [KNL]
Set number of hash buckets for dentry cache.
- disable_1tb_segments [PPC]
+ disable_1tb_segments [PPC,EARLY]
Disables the use of 1TB hash page table segments. This
causes the kernel to fall back to 256MB segments which
can be useful when debugging issues that require an SLB
@@ -1093,41 +1153,32 @@
disable= [IPV6]
See Documentation/networking/ipv6.rst.
- disable_radix [PPC]
+ disable_radix [PPC,EARLY]
Disable RADIX MMU mode on POWER9
disable_tlbie [PPC]
Disable TLBIE instruction. Currently does not work
with KVM, with HASH MMU, or with coherent accelerators.
- disable_cpu_apicid= [X86,APIC,SMP]
- Format: <int>
- The number of initial APIC ID for the
- corresponding CPU to be disabled at boot,
- mostly used for the kdump 2nd kernel to
- disable BSP to wake up multiple CPUs without
- causing system reset or hang due to sending
- INIT from AP to BSP.
-
- disable_ddw [PPC/PSERIES]
+ disable_ddw [PPC/PSERIES,EARLY]
Disable Dynamic DMA Window support. Use this
to work around buggy firmware.
disable_ipv6= [IPV6]
See Documentation/networking/ipv6.rst.
- disable_mtrr_cleanup [X86]
+ disable_mtrr_cleanup [X86,EARLY]
The kernel tries to adjust MTRR layout from continuous
to discrete, to make X server driver able to add WB
entry later. This parameter disables that.
- disable_mtrr_trim [X86, Intel and AMD only]
+ disable_mtrr_trim [X86, Intel and AMD only,EARLY]
By default the kernel will trim any uncacheable
memory out of your available memory pool based on
MTRR settings. This parameter disables that behavior,
possibly causing your machine to run very slowly.
- disable_timer_pin_1 [X86]
+ disable_timer_pin_1 [X86,EARLY]
Disable PIN 1 of APIC timer
Can be useful to work around chipset bugs.
@@ -1150,6 +1201,26 @@
The filter can be disabled or changed to another
driver later using sysfs.
+ reg_file_data_sampling=
+ [X86] Controls mitigation for Register File Data
+ Sampling (RFDS) vulnerability. RFDS is a CPU
+ vulnerability which may allow userspace to infer
+ kernel data values previously stored in floating point
+ registers, vector registers, or integer registers.
+ RFDS only affects Intel Atom processors.
+
+ on: Turns ON the mitigation.
+ off: Turns OFF the mitigation.
+
+ This parameter overrides the compile time default set
+ by CONFIG_MITIGATION_RFDS. Mitigation cannot be
+ disabled when other VERW based mitigations (like MDS)
+ are enabled. In order to disable RFDS mitigation all
+ VERW based mitigations need to be disabled.
+
+ For details see:
+ Documentation/admin-guide/hw-vuln/reg-file-data-sampling.rst
+
driver_async_probe= [KNL]
List of driver names to be probed asynchronously. *
matches with all driver names. If * is specified, the
@@ -1162,22 +1233,16 @@
panels may send no or incorrect EDID data sets.
This parameter allows specifying EDID data sets
in the /lib/firmware directory that are used instead.
- Generic built-in EDID data sets are used, if one of
- edid/1024x768.bin, edid/1280x1024.bin,
- edid/1680x1050.bin, or edid/1920x1080.bin is given
- and no file with the same name exists. Details and
- instructions how to build your own EDID data are
- available in Documentation/admin-guide/edid.rst. An EDID
- data set will only be used for a particular connector,
- if its name and a colon are prepended to the EDID
- name. Each connector may use a unique EDID data
- set by separating the files with a comma. An EDID
+ An EDID data set will only be used for a particular
+ connector, if its name and a colon are prepended to
+ the EDID name. Each connector may use a unique EDID
+ data set by separating the files with a comma. An EDID
data set with no connector name will be used for
any connectors not explicitly specified.
dscc4.setup= [NET]
- dt_cpu_ftrs= [PPC]
+ dt_cpu_ftrs= [PPC,EARLY]
Format: {"off" | "known"}
Control how the dt_cpu_ftrs device-tree binding is
used for CPU feature discovery and setup (if it
@@ -1197,12 +1262,12 @@
Documentation/admin-guide/dynamic-debug-howto.rst
for details.
- early_ioremap_debug [KNL]
+ early_ioremap_debug [KNL,EARLY]
Enable debug messages in early_ioremap support. This
is useful for tracking down temporary early mappings
which are not unmapped.
- earlycon= [KNL] Output early console device and options.
+ earlycon= [KNL,EARLY] Output early console device and options.
When used with no options, the early console is
determined by stdout-path property in device tree's
@@ -1338,7 +1403,7 @@
address must be provided, and the serial port must
already be setup and configured.
- earlyprintk= [X86,SH,ARM,M68k,S390]
+ earlyprintk= [X86,SH,ARM,M68k,S390,UM,EARLY]
earlyprintk=vga
earlyprintk=sclp
earlyprintk=xen
@@ -1396,7 +1461,7 @@
edd= [EDD]
Format: {"off" | "on" | "skip[mbr]"}
- efi= [EFI]
+ efi= [EFI,EARLY]
Format: { "debug", "disable_early_pci_dma",
"nochunk", "noruntime", "nosoftreserve",
"novamap", "no_disable_early_pci_dma" }
@@ -1417,33 +1482,12 @@
no_disable_early_pci_dma: Leave the busmaster bit set
on all PCI bridges while in the EFI boot stub
- efi_no_storage_paranoia [EFI; X86]
+ efi_no_storage_paranoia [EFI,X86,EARLY]
Using this parameter you can use more than 50% of
your efi variable storage. Use this parameter only if
you are really sure that your UEFI does sane gc and
fulfills the spec otherwise your board may brick.
- efi_fake_mem= nn[KMG]@ss[KMG]:aa[,nn[KMG]@ss[KMG]:aa,..] [EFI; X86]
- Add arbitrary attribute to specific memory range by
- updating original EFI memory map.
- Region of memory which aa attribute is added to is
- from ss to ss+nn.
-
- If efi_fake_mem=2G@4G:0x10000,2G@0x10a0000000:0x10000
- is specified, EFI_MEMORY_MORE_RELIABLE(0x10000)
- attribute is added to range 0x100000000-0x180000000 and
- 0x10a0000000-0x1120000000.
-
- If efi_fake_mem=8G@9G:0x40000 is specified, the
- EFI_MEMORY_SP(0x40000) attribute is added to
- range 0x240000000-0x43fffffff.
-
- Using this parameter you can do debugging of EFI memmap
- related features. For example, you can do debugging of
- Address Range Mirroring feature even if your box
- doesn't support it, or mark specific memory as
- "soft reserved".
-
efivar_ssdt= [EFI; X86] Name of an EFI variable that contains an SSDT
that is to be dynamically loaded by Linux. If there are
multiple variables with the same name but with different
@@ -1454,7 +1498,7 @@
eisa_irq_edge= [PARISC,HW]
See header of drivers/parisc/eisa.c.
- ekgdboc= [X86,KGDB] Allow early kernel console debugging
+ ekgdboc= [X86,KGDB,EARLY] Allow early kernel console debugging
Format: ekgdboc=kbd
This is designed to be used in conjunction with
@@ -1469,13 +1513,13 @@
See comment before function elanfreq_setup() in
arch/x86/kernel/cpu/cpufreq/elanfreq.c.
- elfcorehdr=[size[KMG]@]offset[KMG] [PPC,SH,X86,S390]
+ elfcorehdr=[size[KMG]@]offset[KMG] [PPC,SH,X86,S390,EARLY]
Specifies physical address of start of kernel core
image elf header and optionally the size. Generally
kexec loader will pass this option to capture kernel.
See Documentation/admin-guide/kdump/kdump.rst for details.
- enable_mtrr_cleanup [X86]
+ enable_mtrr_cleanup [X86,EARLY]
The kernel tries to adjust MTRR layout from continuous
to discrete, to make X server driver able to add WB
entry later. This parameter enables that.
@@ -1508,7 +1552,7 @@
Permit 'security.evm' to be updated regardless of
current integrity status.
- early_page_ext [KNL] Enforces page_ext initialization to earlier
+ early_page_ext [KNL,EARLY] Enforces page_ext initialization to earlier
stages to cover more early boot allocations.
Please note that as a side effect some optimizations
might be disabled to achieve that (e.g. parallelized
@@ -1519,6 +1563,7 @@
failslab=
fail_usercopy=
fail_page_alloc=
+ fail_skb_realloc=
fail_make_request=[KNL]
General fault injection mechanism.
Format: <interval>,<probability>,<space>,<times>
@@ -1539,6 +1584,12 @@
Warning: use of this parameter will taint the kernel
and may cause unknown problems.
+ fred= [X86-64]
+ Enable/disable Flexible Return and Event Delivery.
+ Format: { on | off }
+ on: enable FRED when it's present.
+ off: disable FRED, the default setting.
+
ftrace=[tracer]
[FTRACE] will set and start the specified tracer
as early as possible in order to facilitate early
@@ -1561,12 +1612,28 @@
The above will cause the "foo" tracing instance to trigger
a snapshot at the end of boot up.
- ftrace_dump_on_oops[=orig_cpu]
+ ftrace_dump_on_oops[=2(orig_cpu) | =<instance>][,<instance> |
+ ,<instance>=2(orig_cpu)]
[FTRACE] will dump the trace buffers on oops.
- If no parameter is passed, ftrace will dump
- buffers of all CPUs, but if you pass orig_cpu, it will
- dump only the buffer of the CPU that triggered the
- oops.
+ If no parameter is passed, ftrace will dump the global
+ buffers of all CPUs. If you pass 2 or orig_cpu, it
+ will dump only the buffer of the CPU that triggered
+ the oops. If an instance name is passed, that
+ instance will be dumped. Multiple instances can be
+ dumped by separating them with commas. For each
+ instance, passing 2 or orig_cpu restricts the dump to
+ the CPU that triggered the oops.
+
+ ftrace_dump_on_oops=foo=orig_cpu
+
+ The above will dump only the buffer of the "foo" instance
+ on the CPU that triggered the oops.
+
+ ftrace_dump_on_oops,foo,bar=orig_cpu
+
+ The above will dump the global buffer on all CPUs, the
+ buffer of the "foo" instance on all CPUs, and the buffer
+ of the "bar" instance on the CPU that triggered the oops.
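The ftrace_dump_on_oops syntax described above can be illustrated with a small parser. This is only a sketch of the documented forms, not the kernel's implementation: the function name, the "<global>" label, and the "all_cpus"/"orig_cpu" scope strings are hypothetical, and any value not mentioned in the text above (for example a bare "1") is simply treated as an instance name.

```python
PARAM = "ftrace_dump_on_oops"
ORIG_CPU = {"2", "orig_cpu"}  # both spellings select the CPU that triggered the oops


def parse_ftrace_dump_on_oops(arg):
    """Return a list of (buffer, scope) pairs for one command-line token."""
    if not arg.startswith(PARAM):
        raise ValueError("not an ftrace_dump_on_oops argument")
    rest = arg[len(PARAM):]
    if rest == "":
        # Bare parameter: global buffer, all CPUs.
        return [("<global>", "all_cpus")]
    targets = []
    if rest[0] == ",":
        # Leading comma: global buffer on all CPUs, then a list of instances.
        targets.append(("<global>", "all_cpus"))
    elif rest[0] != "=":
        raise ValueError("unexpected text after parameter name")
    for index, item in enumerate(rest[1:].split(",")):
        if rest[0] == "=" and index == 0 and item in ORIG_CPU:
            # "=2" or "=orig_cpu": global buffer, triggering CPU only.
            targets.append(("<global>", "orig_cpu"))
            continue
        name, sep, mode = item.partition("=")
        if not name or (sep and mode not in ORIG_CPU):
            raise ValueError("malformed instance entry: %r" % item)
        targets.append((name, "orig_cpu" if sep else "all_cpus"))
    return targets


print(parse_ftrace_dump_on_oops("ftrace_dump_on_oops,foo,bar=orig_cpu"))
```

The two examples from the text map to a single ("foo", "orig_cpu") entry and to a three-entry list covering the global buffer, "foo", and "bar".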
ftrace_filter=[function-list]
[FTRACE] Limit the functions traced by the function
@@ -1600,7 +1667,7 @@
can be changed at run time by the max_graph_depth file
in the tracefs tracing directory. default: 0 (no limit)
- fw_devlink= [KNL] Create device links between consumer and supplier
+ fw_devlink= [KNL,EARLY] Create device links between consumer and supplier
devices by scanning the firmware to infer the
consumer/supplier relationships. This feature is
especially useful when drivers are loaded as modules as
@@ -1619,12 +1686,12 @@
rpm -- Like "on", but also used to order runtime PM.
fw_devlink.strict=<bool>
- [KNL] Treat all inferred dependencies as mandatory
+ [KNL,EARLY] Treat all inferred dependencies as mandatory
dependencies. This only applies for fw_devlink=on|rpm.
Format: <bool>
fw_devlink.sync_state =
- [KNL] When all devices that could probe have finished
+ [KNL,EARLY] When all devices that could probe have finished
probing, this parameter controls what to do with
devices that haven't yet received their sync_state()
calls.
@@ -1645,12 +1712,12 @@
gamma= [HW,DRM]
- gart_fix_e820= [X86-64] disable the fix e820 for K8 GART
+ gart_fix_e820= [X86-64,EARLY] disable the fix e820 for K8 GART
Format: off | on
default: on
gather_data_sampling=
- [X86,INTEL] Control the Gather Data Sampling (GDS)
+ [X86,INTEL,EARLY] Control the Gather Data Sampling (GDS)
mitigation.
Gather Data Sampling is a hardware vulnerability which
@@ -1669,6 +1736,8 @@
off: Disable GDS mitigation.
+ gbpages [X86] Use GB pages for kernel direct mappings.
+
gcov_persist= [GCOV] When non-zero (default), profiling data for
kernel modules is saved and remains accessible via
debugfs, even when the module is unloaded/reloaded.
@@ -1729,8 +1798,6 @@
for 64-bit NUMA, off otherwise.
Format: 0 | 1 (for off | on)
- hcl= [IA-64] SGI's Hardware Graph compatibility layer
-
hd= [EIDE] (E)IDE hard drive subsystem geometry
Format: <cyl>,<head>,<sect>
@@ -1748,7 +1815,18 @@
(that will set all pages holding image data
during restoration read-only).
- highmem=nn[KMG] [KNL,BOOT] forces the highmem zone to have an exact
+ hibernate.compressor= [HIBERNATION] Compression algorithm to be
+ used with hibernation.
+ Format: { lzo | lz4 }
+ Default: lzo
+
+ lzo: Select LZO compression algorithm to
+ compress/decompress hibernation image.
+
+ lz4: Select LZ4 compression algorithm to
+ compress/decompress hibernation image.
+
+ highmem=nn[KMG] [KNL,BOOT,EARLY] forces the highmem zone to have an exact
size of <nn>. This works even on boxes that have no
highmem otherwise. This also works to reduce highmem
size on bigger boxes.
@@ -1759,7 +1837,7 @@
hlt [BUGS=ARM,SH]
- hostname= [KNL] Set the hostname (aka UTS nodename).
+ hostname= [KNL,EARLY] Set the hostname (aka UTS nodename).
Format: <string>
This allows setting the system's hostname during early
startup. This sets the name returned by gethostname.
@@ -1804,7 +1882,7 @@
Documentation/admin-guide/mm/hugetlbpage.rst.
Format: size[KMG]
- hugetlb_cma= [HW,CMA] The size of a CMA area used for allocation
+ hugetlb_cma= [HW,CMA,EARLY] The size of a CMA area used for allocation
of gigantic hugepages. Or using node format, the size
of a CMA area per node can be specified.
Format: nn[KMGTPE] or (node format)
@@ -1850,9 +1928,10 @@
If specified, z/VM IUCV HVC accepts connections
from listed z/VM user IDs only.
- hv_nopvspin [X86,HYPER_V] Disables the paravirt spinlock optimizations
- which allow the hypervisor to 'idle' the
- guest on lock contention.
+ hv_nopvspin [X86,HYPER_V,EARLY]
+ Disables the paravirt spinlock optimizations
+ which allow the hypervisor to 'idle' the guest
+ on lock contention.
i2c_bus= [HW] Override the default board specific I2C bus speed
or register an additional I2C bus that is not
@@ -1860,6 +1939,28 @@
Format:
<bus_id>,<clkrate>
+ i2c_touchscreen_props= [HW,ACPI,X86]
+ Set device-properties for ACPI-enumerated I2C-attached
+ touchscreen, to e.g. fix coordinates of upside-down
+ mounted touchscreens. If you need this option please
+ submit a drivers/platform/x86/touchscreen_dmi.c patch
+ adding a DMI quirk for this.
+
+ Format:
+ <ACPI_HW_ID>:<prop_name>=<val>[:prop_name=val][:...]
+ Where <val> is one of:
+ Omit "=<val>" entirely Set a boolean device-property
+ Unsigned number Set a u32 device-property
+ Anything else Set a string device-property
+
+ Examples (split over multiple lines):
+ i2c_touchscreen_props=GDIX1001:touchscreen-inverted-x:
+ touchscreen-inverted-y
+
+ i2c_touchscreen_props=MSSL1680:touchscreen-size-x=1920:
+ touchscreen-size-y=1080:touchscreen-inverted-y:
+ firmware-name=gsl1680-vendor-model.fw:silead,home-button
+
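The i2c_touchscreen_props value format above can be sketched as a small parser. This is an illustration under assumptions, not the kernel's code: the function name is hypothetical, "unsigned number" is taken to mean a decimal value that fits in 32 bits, and anything else with "=<val>" is kept as a string.

```python
U32_MAX = 0xFFFFFFFF


def parse_i2c_touchscreen_props(value):
    """Split '<ACPI_HW_ID>:<prop>[=<val>][:...]' into (hw_id, properties)."""
    hw_id, *props = value.split(":")
    if not hw_id or not props:
        raise ValueError("expected <ACPI_HW_ID>:<prop_name>[=<val>][:...]")
    parsed = {}
    for prop in props:
        name, sep, val = prop.partition("=")
        if not name:
            raise ValueError("empty property name")
        if not sep:
            # "=<val>" omitted entirely: boolean device-property.
            parsed[name] = True
        elif val.isascii() and val.isdigit() and int(val) <= U32_MAX:
            # Unsigned number: u32 device-property (decimal only, an assumption).
            parsed[name] = int(val)
        else:
            # Anything else: string device-property.
            parsed[name] = val
    return hw_id, parsed


hw_id, props = parse_i2c_touchscreen_props(
    "GDIX1001:touchscreen-inverted-x:touchscreen-inverted-y"
)
print(hw_id, props)
```

Note that a property name may itself contain a comma (as in "silead,home-button"), so only ":" is used as the separator here.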
i8042.debug [HW] Toggle i8042 debug mode
i8042.unmask_kbd_data
[HW] Enable printing of interrupt data from the KBD port
@@ -1917,14 +2018,23 @@
Format: <io>[,<membase>[,<icn_id>[,<icn_id2>]]]
- idle= [X86]
+ idle= [X86,EARLY]
Format: idle=poll, idle=halt, idle=nomwait
- Poll forces a polling idle loop that can slightly
- improve the performance of waking up a idle CPU, but
- will use a lot of power and make the system run hot.
- Not recommended.
+
+ idle=poll: Don't do power saving in the idle loop
+ using HLT, but poll for rescheduling event. This will
+ make the CPUs eat a lot more power, but may be useful
+ to get slightly better performance in multiprocessor
+ benchmarks. It also makes some profiling using
+ performance counters more accurate. Please note that
+ on systems with MONITOR/MWAIT support (like Intel
+ EM64T CPUs) this option has no performance advantage
+ over the normal idle loop. It may also interact badly
+ with hyperthreading.
+
idle=halt: Halt is forced to be used for CPU idle.
In such case C2/C3 won't be used again.
+
idle=nomwait: Disable mwait for CPU C-states
idxd.sva= [HW]
@@ -1939,7 +2049,7 @@
for the device. By default it is set to false (0).
ieee754= [MIPS] Select IEEE Std 754 conformance mode
- Format: { strict | legacy | 2008 | relaxed }
+ Format: { strict | legacy | 2008 | relaxed | emulated }
Default: strict
Choose which programs will be accepted for execution
@@ -1959,6 +2069,8 @@
by the FPU
relaxed accept any binaries regardless of whether
supported by the FPU
+ emulated accept any binaries but enable FPU emulator
+ if binary mode is unsupported by the FPU.
The FPU emulator is always able to support both NaN
encodings, so if no FPU hardware is present or it has
@@ -1973,7 +2085,7 @@
mode generally follows that for the NaN encoding,
except where unsupported by hardware.
- ignore_loglevel [KNL]
+ ignore_loglevel [KNL,EARLY]
Ignore loglevel setting - this will print /all/
kernel messages to the console. Useful for debugging.
We also add it as printk module parameter, so users
@@ -2091,21 +2203,21 @@
unpacking being completed before device_ and
late_ initcalls.
- initrd= [BOOT] Specify the location of the initial ramdisk
+ initrd= [BOOT,EARLY] Specify the location of the initial ramdisk
- initrdmem= [KNL] Specify a physical address and size from which to
+ initrdmem= [KNL,EARLY] Specify a physical address and size from which to
load the initrd. If an initrd is compiled in or
specified in the bootparams, it takes priority over this
setting.
Format: ss[KMG],nn[KMG]
Default is 0, 0
- init_on_alloc= [MM] Fill newly allocated pages and heap objects with
+ init_on_alloc= [MM,EARLY] Fill newly allocated pages and heap objects with
zeroes.
Format: 0 | 1
Default set by CONFIG_INIT_ON_ALLOC_DEFAULT_ON.
- init_on_free= [MM] Fill freed pages and heap objects with zeroes.
+ init_on_free= [MM,EARLY] Fill freed pages and heap objects with zeroes.
Format: 0 | 1
Default set by CONFIG_INIT_ON_FREE_DEFAULT_ON.
@@ -2161,7 +2273,7 @@
0 disables intel_idle and fall back on acpi_idle.
1 to 9 specify maximum depth of C-state.
- intel_pstate= [X86]
+ intel_pstate= [X86,EARLY]
disable
Do not enable intel_pstate as the default
scaling driver for the supported processors
@@ -2205,34 +2317,89 @@
Allow per-logical-CPU P-State performance control limits using
cpufreq sysfs interface
- intremap= [X86-64, Intel-IOMMU]
+ intremap= [X86-64,Intel-IOMMU,EARLY]
on enable Interrupt Remapping (default)
off disable Interrupt Remapping
nosid disable Source ID checking
no_x2apic_optout
BIOS x2APIC opt-out request will be ignored
nopost disable Interrupt Posting
+ posted_msi
+ enable MSIs delivered as posted interrupts
iomem= Disable strict checking of access to MMIO memory
strict regions from userspace.
relaxed
- iommu= [X86]
+ iommu= [X86,EARLY]
+
off
+ Don't initialize and use any kind of IOMMU.
+
force
+ Force the use of the hardware IOMMU even when
+ it is not actually needed (e.g. because < 3 GB
+ memory).
+
noforce
+ Don't force hardware IOMMU usage when it is not
+ needed. (default).
+
biomerge
panic
nopanic
merge
nomerge
+
soft
- pt [X86]
- nopt [X86]
- nobypass [PPC/POWERNV]
+ Use software bounce buffering (SWIOTLB) (default for
+ Intel machines). This can be used to prevent the usage
+ of an available hardware IOMMU.
+
+ [X86]
+ pt
+ [X86]
+ nopt
+ [PPC/POWERNV]
+ nobypass
Disable IOMMU bypass, using IOMMU for PCI devices.
- iommu.forcedac= [ARM64, X86] Control IOVA allocation for PCI devices.
+ [X86]
+ AMD Gart HW IOMMU-specific options:
+
+ <size>
+ Set the size of the remapping area in bytes.
+
+ allowed
+ Overwrite iommu off workarounds for specific chipsets
+
+ fullflush
+ Flush IOMMU on each allocation (default).
+
+ nofullflush
+ Don't use IOMMU fullflush.
+
+ memaper[=<order>]
+ Allocate an own aperture over RAM with size
+ 32MB<<order. (default: order=1, i.e. 64MB)
+
+ merge
+ Do scatter-gather (SG) merging. Implies "force"
+ (experimental).
+
+ nomerge
+ Don't do scatter-gather (SG) merging.
+
+ noaperture
+ Ask the IOMMU not to touch the aperture for AGP.
+
+ noagp
+ Don't initialize the AGP driver and use full aperture.
+
+ panic
+ Always panic when IOMMU overflows.
+
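As a worked example of the memaper[=<order>] arithmetic above (aperture size is 32MB shifted left by the order), the following sketch computes the resulting size. The function name is hypothetical and the snippet only restates the formula given in the text.

```python
MIB = 1024 * 1024


def memaper_size_bytes(order=1):
    """Aperture size for iommu=memaper=<order>: 32MB << order."""
    if order < 0:
        raise ValueError("order must be non-negative")
    return (32 * MIB) << order


for order in range(4):
    print(order, memaper_size_bytes(order) // MIB, "MB")
```

With the default order of 1 this gives 64MB, matching the default stated above.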
+ iommu.forcedac= [ARM64,X86,EARLY] Control IOVA allocation for PCI devices.
Format: { "0" | "1" }
0 - Try to allocate a 32-bit DMA address first, before
falling back to the full range if needed.
@@ -2240,7 +2407,7 @@
forcing Dual Address Cycle for PCI cards supporting
greater than 32-bit addressing.
- iommu.strict= [ARM64, X86, S390] Configure TLB invalidation behaviour
+ iommu.strict= [ARM64,X86,S390,EARLY] Configure TLB invalidation behaviour
Format: { "0" | "1" }
0 - Lazy mode.
Request that DMA unmap operations use deferred
@@ -2256,7 +2423,7 @@
legacy driver-specific options takes precedence.
iommu.passthrough=
- [ARM64, X86] Configure DMA to bypass the IOMMU by default.
+ [ARM64,X86,EARLY] Configure DMA to bypass the IOMMU by default.
Format: { "0" | "1" }
0 - Use IOMMU translation for DMA.
1 - Bypass the IOMMU for DMA.
@@ -2266,7 +2433,7 @@
See comment before marvel_specify_io7 in
arch/alpha/kernel/core_marvel.c.
- io_delay= [X86] I/O delay method
+ io_delay= [X86,EARLY] I/O delay method
0x80
Standard port 0x80 based delay
0xed
@@ -2279,28 +2446,40 @@
ip= [IP_PNP]
See Documentation/admin-guide/nfs/nfsroot.rst.
- ipcmni_extend [KNL] Extend the maximum number of unique System V
+ ipcmni_extend [KNL,EARLY] Extend the maximum number of unique System V
IPC identifiers from 32,768 to 16,777,216.
+ ipe.enforce= [IPE]
+ Format: <bool>
+ Determine whether IPE starts in permissive (0) or
+ enforce (1) mode. The default is enforce.
+
+ ipe.success_audit=
+ [IPE]
+ Format: <bool>
+ Start IPE with success auditing enabled, emitting
+ an audit event when a binary is allowed. The default
+ is 0.
+
irqaffinity= [SMP] Set the default irq affinity mask
The argument is a cpu list, as described above.
irqchip.gicv2_force_probe=
- [ARM, ARM64]
+ [ARM,ARM64,EARLY]
Format: <bool>
Force the kernel to look for the second 4kB page
of a GICv2 controller even if the memory range
exposed by the device tree is too small.
irqchip.gicv3_nolpi=
- [ARM, ARM64]
+ [ARM,ARM64,EARLY]
Force the kernel to ignore the availability of
LPIs (and by consequence ITSs). Intended for systems
that use the kernel as a bootloader, and thus want
to leave secondary kernels in charge of setting up
LPIs.
- irqchip.gicv3_pseudo_nmi= [ARM64]
+ irqchip.gicv3_pseudo_nmi= [ARM64,EARLY]
Enables support for pseudo-NMIs in the kernel. This
requires the kernel to be built with
CONFIG_ARM64_PSEUDO_NMI.
@@ -2327,7 +2506,9 @@
specified in the flag list (default: domain):
nohz
- Disable the tick when a single task runs.
+ Disable the tick when a single task runs as well as
+ disabling other kernel noises like having RCU callbacks
+ offloaded. This is equivalent to the nohz_full parameter.
A residual 1Hz tick is offloaded to workqueues, which you
need to affine to housekeeping through the global
@@ -2445,7 +2626,7 @@
parameter KASAN will print report only for the first
invalid access.
- keep_bootcon [KNL]
+ keep_bootcon [KNL,EARLY]
Do not unregister boot console at start. This is only
useful for debugging when something happens in the window
between unregistering the boot console and initializing
@@ -2453,7 +2634,7 @@
keepinitrd [HW,ARM] See retain_initrd.
- kernelcore= [KNL,X86,IA-64,PPC]
+ kernelcore= [KNL,X86,PPC,EARLY]
Format: nn[KMGTPE] | nn% | "mirror"
This parameter specifies the amount of memory usable by
the kernel for non-movable allocations. The requested
@@ -2478,7 +2659,7 @@
for Movable pages. "nn[KMGTPE]", "nn%", and "mirror"
are exclusive, so you cannot specify multiple forms.
- kgdbdbgp= [KGDB,HW] kgdb over EHCI usb debug port.
+ kgdbdbgp= [KGDB,HW,EARLY] kgdb over EHCI usb debug port.
Format: <Controller#>[,poll interval]
The controller # is the number of the ehci usb debug
port as it is probed via PCI. The poll interval is
@@ -2499,7 +2680,7 @@
kms, kbd format: kms,kbd
kms, kbd and serial format: kms,kbd,<ser_dev>[,baud]
- kgdboc_earlycon= [KGDB,HW]
+ kgdboc_earlycon= [KGDB,HW,EARLY]
If the boot console provides the ability to read
characters and can work in polling mode, you can use
this parameter to tell kgdb to use it as a backend
@@ -2514,14 +2695,14 @@
blank and the first boot console that implements
read() will be picked.
- kgdbwait [KGDB] Stop kernel execution and enter the
+ kgdbwait [KGDB,EARLY] Stop kernel execution and enter the
kernel debugger at the earliest opportunity.
kmac= [MIPS] Korina ethernet MAC address.
Configure the RouterBoard 532 series on-chip
Ethernet adapter MAC address.
- kmemleak= [KNL] Boot-time kmemleak enable/disable
+ kmemleak= [KNL,EARLY] Boot-time kmemleak enable/disable
Valid arguments: on, off
Default: on
Built with CONFIG_DEBUG_KMEMLEAK_DEFAULT_OFF=y,
@@ -2540,8 +2721,8 @@
See also Documentation/trace/kprobetrace.rst "Kernel
Boot Parameter" section.
- kpti= [ARM64] Control page table isolation of user
- and kernel address spaces.
+ kpti= [ARM64,EARLY] Control page table isolation of
+ user and kernel address spaces.
Default: enabled on cores which need mitigation.
0: force disabled
1: force enabled
@@ -2580,6 +2761,23 @@
Default is Y (on).
+ kvm.enable_virt_at_load=[KVM,ARM64,LOONGARCH,MIPS,RISCV,X86]
+ If enabled, KVM will enable virtualization in hardware
+ when KVM is loaded, and disable virtualization when KVM
+ is unloaded (if KVM is built as a module).
+
+ If disabled, KVM will dynamically enable and disable
+ virtualization on-demand when creating and destroying
+ VMs, i.e. on the 0=>1 and 1=>0 transitions of the
+ number of VMs.
+
+ Enabling virtualization at module load avoids potential
+ latency for creation of the 0=>1 VM, as KVM serializes
+ virtualization enabling across all online CPUs. The
+ "cost" of enabling virtualization when KVM is loaded,
+ is that doing so may interfere with using out-of-tree
+ hypervisors that want to "own" virtualization hardware.
+
kvm.enable_vmware_backdoor=[KVM] Support VMware backdoor PV interface.
Default is false (don't support).
@@ -2618,42 +2816,65 @@
for NPT.
kvm-arm.mode=
- [KVM,ARM] Select one of KVM/arm64's modes of operation.
+ [KVM,ARM,EARLY] Select one of KVM/arm64's modes of
+ operation.
none: Forcefully disable KVM.
nvhe: Standard nVHE-based mode, without support for
protected guests.
- protected: nVHE-based mode with support for guests whose
- state is kept private from the host.
+ protected: Mode with support for guests whose state is
+ kept private from the host, using VHE or
+ nVHE depending on HW support.
nested: VHE-based mode with support for nested
- virtualization. Requires at least ARMv8.3
- hardware.
+ virtualization. Requires at least ARMv8.4
+ hardware (with FEAT_NV2).
Defaults to VHE/nVHE based on hardware support. Setting
mode to "protected" will disable kexec and hibernation
- for the host. "nested" is experimental and should be
- used with extreme caution.
+ for the host. To force nVHE on VHE hardware, add
+ "arm64_sw.hvhe=0 id_aa64mmfr1.vh=0" to the
+ command-line.
+ "nested" is experimental and should be used with
+ extreme caution.
kvm-arm.vgic_v3_group0_trap=
- [KVM,ARM] Trap guest accesses to GICv3 group-0
+ [KVM,ARM,EARLY] Trap guest accesses to GICv3 group-0
system registers
kvm-arm.vgic_v3_group1_trap=
- [KVM,ARM] Trap guest accesses to GICv3 group-1
+ [KVM,ARM,EARLY] Trap guest accesses to GICv3 group-1
system registers
kvm-arm.vgic_v3_common_trap=
- [KVM,ARM] Trap guest accesses to GICv3 common
+ [KVM,ARM,EARLY] Trap guest accesses to GICv3 common
system registers
kvm-arm.vgic_v4_enable=
- [KVM,ARM] Allow use of GICv4 for direct injection of
- LPIs.
+ [KVM,ARM,EARLY] Allow use of GICv4 for direct
+ injection of LPIs.
+
+ kvm-arm.wfe_trap_policy=
+ [KVM,ARM] Control when to set WFE instruction trap for
+ KVM VMs. Traps are allowed but not guaranteed by the
+ CPU architecture.
- kvm_cma_resv_ratio=n [PPC]
+ trap: set WFE instruction trap
+
+ notrap: clear WFE instruction trap
+
+ kvm-arm.wfi_trap_policy=
+ [KVM,ARM] Control when to set WFI instruction trap for
+ KVM VMs. Traps are allowed but not guaranteed by the
+ CPU architecture.
+
+ trap: set WFI instruction trap
+
+ notrap: clear WFI instruction trap
+
+ kvm_cma_resv_ratio=n [PPC,EARLY]
Reserves given percentage from system memory area for
contiguous memory allocation for KVM hash pagetable
allocation.
@@ -2706,7 +2927,7 @@
(enabled). Disable by KVM if hardware lacks support
for it.
- l1d_flush= [X86,INTEL]
+ l1d_flush= [X86,INTEL,EARLY]
Control mitigation for L1D based snooping vulnerability.
Certain CPUs are vulnerable to an exploit against CPU
@@ -2723,7 +2944,7 @@
on - enable the interface for the mitigation
- l1tf= [X86] Control mitigation of the L1TF vulnerability on
+ l1tf= [X86,EARLY] Control mitigation of the L1TF vulnerability on
affected CPUs
The kernel PTE inversion protection is unconditionally
@@ -2792,7 +3013,7 @@
l3cr= [PPC]
- lapic [X86-32,APIC] Enable the local APIC even if BIOS
+ lapic [X86-32,APIC,EARLY] Enable the local APIC even if BIOS
disabled it.
lapic= [X86,APIC] Do not use TSC deadline
@@ -2800,7 +3021,7 @@
back to the programmable timer unit in the LAPIC.
Format: notscdeadline
- lapic_timer_c2_ok [X86,APIC] trust the local apic timer
+ lapic_timer_c2_ok [X86,APIC,EARLY] trust the local apic timer
in C2 power state.
libata.dma= [LIBATA] DMA control
@@ -2924,7 +3145,7 @@
lockd.nlm_udpport=M [NFS] Assign UDP port.
Format: <integer>
- lockdown= [SECURITY]
+ lockdown= [SECURITY,EARLY]
{ integrity | confidentiality }
Enable the kernel lockdown feature. If set to
integrity, kernel features that allow userland to
@@ -3031,7 +3252,8 @@
logibm.irq= [HW,MOUSE] Logitech Bus Mouse Driver
Format: <irq>
- loglevel= All Kernel Messages with a loglevel smaller than the
+ loglevel= [KNL,EARLY]
+ All Kernel Messages with a loglevel smaller than the
console loglevel will be printed to the console. It can
also be changed with klogd or other programs. The
loglevels are defined as follows:
@@ -3045,13 +3267,15 @@
6 (KERN_INFO) informational
7 (KERN_DEBUG) debug-level messages
- log_buf_len=n[KMG] Sets the size of the printk ring buffer,
- in bytes. n must be a power of two and greater
- than the minimal size. The minimal size is defined
- by LOG_BUF_SHIFT kernel config parameter. There is
- also CONFIG_LOG_CPU_MAX_BUF_SHIFT config parameter
- that allows to increase the default size depending on
- the number of CPUs. See init/Kconfig for more details.
+ log_buf_len=n[KMG] [KNL,EARLY]
+ Sets the size of the printk ring buffer, in bytes.
+ n must be a power of two and greater than the
+ minimal size. The minimal size is defined by
+ LOG_BUF_SHIFT kernel config parameter. There
+ is also CONFIG_LOG_CPU_MAX_BUF_SHIFT config
+ parameter that allows increasing the default size
+ depending on the number of CPUs. See init/Kconfig
+ for more details.
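The constraints on log_buf_len can be checked with a short sketch. It assumes that the minimal size is 1 << LOG_BUF_SHIFT bytes and that sizes are decimal with an optional K, M, or G suffix; both function names are hypothetical and this is not the kernel's own parsing code.

```python
UNITS = {"K": 1 << 10, "M": 1 << 20, "G": 1 << 30}


def parse_size(text):
    """Parse a decimal 'n[KMG]' size into bytes."""
    suffix = text[-1:].upper()
    if suffix in UNITS:
        return int(text[:-1]) * UNITS[suffix]
    return int(text)


def is_valid_log_buf_len(text, log_buf_shift):
    """True if the size is a power of two and greater than the minimal size."""
    size = parse_size(text)
    is_power_of_two = size > 0 and (size & (size - 1)) == 0
    # Assumption: the minimal size is 1 << LOG_BUF_SHIFT bytes.
    return is_power_of_two and size > (1 << log_buf_shift)


print(is_valid_log_buf_len("4M", 17))
```

For example, with LOG_BUF_SHIFT=17 a request of 4M passes both checks, while 3M fails the power-of-two check.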
logo.nologo [FB] Disables display of the built-in Linux logo.
This may be used to provide more screen space for
@@ -3089,27 +3313,17 @@
unlikely, in the extreme case this might damage your
hardware.
- ltpc= [NET]
- Format: <io>,<irq>,<dma>
-
lsm.debug [SECURITY] Enable LSM initialization debugging output.
lsm=lsm1,...,lsmN
[SECURITY] Choose order of LSM initialization. This
overrides CONFIG_LSM, and the "security=" parameter.
- machvec= [IA-64] Force the use of a particular machine-vector
- (machvec) in a generic kernel.
- Example: machvec=hpzx1
-
machtype= [Loongson] Share the same kernel image file between
different yeeloong laptops.
Example: machtype=lemote-yeeloong-2f-7inch
- max_addr=nn[KMG] [KNL,BOOT,IA-64] All physical memory greater
- than or equal to this physical address is ignored.
-
- maxcpus= [SMP] Maximum number of processors that an SMP kernel
+ maxcpus= [SMP,EARLY] Maximum number of processors that an SMP kernel
will bring up during bootup. maxcpus=n : n >= 0 limits
the kernel to bring up 'n' processors. Surely after
bootup you can bring up the other plugged cpu by executing
@@ -3125,9 +3339,77 @@
devices can be requested on-demand with the
/dev/loop-control interface.
- mce [X86-32] Machine Check Exception
+ mce= [X86-{32,64}]
+
+ Please see Documentation/arch/x86/x86_64/machinecheck.rst for sysfs runtime tunables.
+
+ off
+ disable machine check
+
+ no_cmci
+ disable CMCI (Corrected Machine Check Interrupt), which
+ Intel processors support. Disabling it is usually
+ not recommended, but it might be handy if your
+ hardware is misbehaving.
+
+ Note that you'll get more problems without CMCI than
+ with due to the shared banks, i.e. you might get
+ duplicated error logs.
+
+ dont_log_ce
+ don't make logs for corrected errors. All events
+ reported as corrected are silently cleared by OS. This
+ option will be useful if you have no interest in any
+ of corrected errors.
+
+ ignore_ce
+ disable features for corrected errors, e.g.
+ polling timer and CMCI. All events reported as
+ corrected are left uncleared by the OS and remain in the
+ error banks.
+
+ Disabling these features is usually not recommended.
+ However, if an agent that checks/clears corrected
+ errors (e.g. BIOS or hardware monitoring
+ applications) conflicts with the OS's error handling,
+ and you cannot deactivate the agent, then this option
+ will help.
+
+ no_lmce
+ do not opt-in to Local MCE delivery. Use legacy method
+ to broadcast MCEs.
+
+ bootlog
+ enable logging of machine checks left over from
+ booting. Disabled by default on AMD Fam10h and older
+ because some BIOSes leave bogus ones.
+
+ If your BIOS doesn't do that, it's a good idea to
+ enable it to make sure you log even machine check
+ events that result in a reboot. On Intel systems it is
+ enabled by default.
+
+ nobootlog
+ disable boot machine check logging.
+
+ monarchtimeout (number)
+ sets the time, in microseconds, to wait for other CPUs
+ on machine checks. 0 to disable.
+
+ bios_cmci_threshold
+ don't overwrite the BIOS-set CMCI threshold. This boot
+ option prevents Linux from overwriting the CMCI
+ threshold set by the BIOS. Without this option, Linux
+ always sets the CMCI threshold to 1. Enabling this may
+ make memory predictive failure analysis less effective
+ if the BIOS sets thresholds for memory errors, since we
+ will not see details for all errors.
+
+ recovery
+ force-enable recoverable machine check code paths
+
+ Everything else is in sysfs now.
- mce=option [X86-64] See Documentation/arch/x86/x86_64/boot-options.rst
md= [HW] RAID subsystems devices and level
See Documentation/admin-guide/md.rst.
@@ -3136,7 +3418,7 @@
Format: <first>,<last>
Specifies range of consoles to be captured by the MDA.
- mds= [X86,INTEL]
+ mds= [X86,INTEL,EARLY]
Control mitigation for the Micro-architectural Data
Sampling (MDS) vulnerability.
@@ -3168,11 +3450,12 @@
For details see: Documentation/admin-guide/hw-vuln/mds.rst
- mem=nn[KMG] [HEXAGON] Set the memory size.
+ mem=nn[KMG] [HEXAGON,EARLY] Set the memory size.
Must be specified, otherwise memory size will be 0.
- mem=nn[KMG] [KNL,BOOT] Force usage of a specific amount of memory
- Amount of memory to be used in cases as follows:
+ mem=nn[KMG] [KNL,BOOT,EARLY] Force usage of a specific amount
+ of memory. Amount of memory to be used in cases
+ as follows:
1 for test;
2 when the kernel is not able to see the whole system memory;
@@ -3196,8 +3479,8 @@
if system memory of hypervisor is not sufficient.
mem=nn[KMG]@ss[KMG]
- [ARM,MIPS] - override the memory layout reported by
- firmware.
+ [ARM,MIPS,EARLY] - override the memory layout
+ reported by firmware.
Define a memory region of size nn[KMG] starting at
ss[KMG].
Multiple different regions can be specified with
@@ -3206,7 +3489,7 @@
mem=nopentium [BUGS=X86-32] Disable usage of 4MB pages for kernel
memory.
- memblock=debug [KNL] Enable memblock debug messages.
+ memblock=debug [KNL,EARLY] Enable memblock debug messages.
memchunk=nn[KMG]
[KNL,SH] Allow user to override the default size for
@@ -3216,18 +3499,18 @@
[KNL] Set the initial state for the memory hotplug
onlining policy. If not specified, the default value is
set according to the
- CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE kernel config
- option.
+ CONFIG_MHP_DEFAULT_ONLINE_TYPE kernel config
+ options.
See Documentation/admin-guide/mm/memory-hotplug.rst.
- memmap=exactmap [KNL,X86] Enable setting of an exact
+ memmap=exactmap [KNL,X86,EARLY] Enable setting of an exact
E820 memory map, as specified by the user.
Such memmap=exactmap lines can be constructed based on
BIOS output or other requirements. See the memmap=nn@ss
option description.
memmap=nn[KMG]@ss[KMG]
- [KNL, X86, MIPS, XTENSA] Force usage of a specific region of memory.
+ [KNL,X86,MIPS,XTENSA,EARLY] Force usage of a specific region of memory.
Region of memory to be used is from ss to ss+nn.
If @ss[KMG] is omitted, it is equivalent to mem=nn[KMG],
which limits max address to nn[KMG].
@@ -3237,11 +3520,11 @@
memmap=100M@2G,100M#3G,1G!1024G
memmap=nn[KMG]#ss[KMG]
- [KNL,ACPI] Mark specific memory as ACPI data.
+ [KNL,ACPI,EARLY] Mark specific memory as ACPI data.
Region of memory to be marked is from ss to ss+nn.
memmap=nn[KMG]$ss[KMG]
- [KNL,ACPI] Mark specific memory as reserved.
+ [KNL,ACPI,EARLY] Mark specific memory as reserved.
Region of memory to be reserved is from ss to ss+nn.
Example: Exclude memory from 0x18690000-0x1869ffff
memmap=64K$0x18690000
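
The address arithmetic behind the example above (size 64K at start 0x18690000 covering 0x18690000-0x1869ffff) can be reproduced with a short sketch. This is an illustration only: parse_size and memmap_region are hypothetical helper names, and the parsing covers just the nn[KMG] and plain/hex number forms shown in this file.

```python
# Hypothetical helpers (not kernel code): compute the region named by a
# memmap=nn[KMG]$ss[KMG] argument. The region runs from ss up to, but not
# including, ss+nn, so the last byte is at ss+nn-1.
SUFFIXES = {"K": 1 << 10, "M": 1 << 20, "G": 1 << 30}


def parse_size(text):
    """Convert 'n[KMG]' or a plain/hex number into an integer."""
    text = text.strip()
    unit = text[-1:].upper()
    if unit in SUFFIXES:
        return int(text[:-1], 0) * SUFFIXES[unit]
    return int(text, 0)


def memmap_region(arg, sep="$"):
    """Split 'nn[KMG]<sep>ss[KMG]' and return (first, last) byte addresses."""
    size_text, start_text = arg.split(sep, 1)
    size = parse_size(size_text)
    start = parse_size(start_text)
    return start, start + size - 1


first, last = memmap_region("64K$0x18690000")
print(hex(first), hex(last))  # prints: 0x18690000 0x1869ffff
```

Passing sep="@", sep="#" or sep="!" gives the same arithmetic for the other memmap= forms, since they all describe a region from ss to ss+nn.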
@@ -3251,14 +3534,14 @@
like Grub2, otherwise '$' and the following number
will be eaten.
memmap=nn[KMG]!ss[KMG]
- [KNL,X86] Mark specific memory as protected.
+ [KNL,X86,EARLY] Mark specific memory as protected.
Region of memory to be used, from ss to ss+nn.
The memory region may be marked as e820 type 12 (0xc)
and is NVDIMM or ADR memory.
memmap=<size>%<offset>-<oldtype>+<newtype>
- [KNL,ACPI] Convert memory within the specified region
+ [KNL,ACPI,EARLY] Convert memory within the specified region
from <oldtype> to <newtype>. If "-<oldtype>" is left
out, the whole region will be marked as <newtype>,
even if previously unavailable. If "+<newtype>" is left
@@ -3266,7 +3549,7 @@
specified as e820 types, e.g., 1 = RAM, 2 = reserved,
3 = ACPI, 12 = PRAM.
- memory_corruption_check=0/1 [X86]
+ memory_corruption_check=0/1 [X86,EARLY]
Some BIOSes seem to corrupt the first 64k of
memory when doing things like suspend/resume.
Setting this option will scan the memory
@@ -3278,13 +3561,13 @@
affects the same memory, you can use memmap=
to prevent the kernel from using that memory.
- memory_corruption_check_size=size [X86]
+ memory_corruption_check_size=size [X86,EARLY]
By default it checks for corruption in the low
64k, making this memory unavailable for normal
use. Use this parameter to scan for
corruption in more or less memory.
- memory_corruption_check_period=seconds [X86]
+ memory_corruption_check_period=seconds [X86,EARLY]
By default it checks for corruption every 60
seconds. Use this parameter to check at some
other rate. 0 disables periodic checking.
@@ -3308,7 +3591,7 @@
Note that even when enabled, there are a few cases where
the feature is not effective.
- memtest= [KNL,X86,ARM,M68K,PPC,RISCV] Enable memtest
+ memtest= [KNL,X86,ARM,M68K,PPC,RISCV,EARLY] Enable memtest
Format: <integer>
default : 0 <disable>
Specifies the number of memtest passes to be
@@ -3320,9 +3603,7 @@
mem_encrypt= [X86-64] AMD Secure Memory Encryption (SME) control
Valid arguments: on, off
- Default (depends on kernel configuration option):
- on (CONFIG_AMD_MEM_ENCRYPT_ACTIVE_BY_DEFAULT=y)
- off (CONFIG_AMD_MEM_ENCRYPT_ACTIVE_BY_DEFAULT=n)
+ Default: off
mem_encrypt=on: Activate SME
mem_encrypt=off: Do not activate SME
@@ -3335,10 +3616,6 @@
deep - Suspend-To-RAM or equivalent (if supported)
See Documentation/admin-guide/pm/sleep-states.rst.
- mfgpt_irq= [IA-32] Specify the IRQ to use for the
- Multi-Function General Purpose Timers on AMD Geode
- platforms.
-
mfgptfix [X86-32] Fix MFGPT timers on AMD Geode platforms when
the BIOS has incorrectly applied a workaround. TinyBIOS
version 0.98 is known to be affected, 0.99 fixes the
@@ -3351,9 +3628,6 @@
Enable or disable the microcode minimal revision
enforcement for the runtime microcode loader.
- min_addr=nn[KMG] [KNL,BOOT,IA-64] All physical memory below this
- physical address is ignored.
-
mini2440= [ARM,HW,KNL]
Format:[0..2][b][c][t]
Default: "0tb"
@@ -3376,11 +3650,14 @@
https://repo.or.cz/w/linux-2.6/mini2440.git
mitigations=
- [X86,PPC,S390,ARM64] Control optional mitigations for
+ [X86,PPC,S390,ARM64,EARLY] Control optional mitigations for
CPU vulnerabilities. This is a set of curated,
arch-independent options, each of which is an
aggregation of existing arch-specific options.
+ Note, "mitigations" is supported if and only if the
+ kernel was built with CPU_MITIGATIONS=y.
+
off
Disable all optional CPU mitigations. This
improves system performance, but it may also
@@ -3398,8 +3675,11 @@
nospectre_bhb [ARM64]
nospectre_v1 [X86,PPC]
nospectre_v2 [X86,PPC,S390,ARM64]
+ reg_file_data_sampling=off [X86]
retbleed=off [X86]
+ spec_rstack_overflow=off [X86]
spec_store_bypass_disable=off [X86,PPC]
+ spectre_bhi=off [X86]
spectre_v2_user=off [X86]
srbds=off [X86,INTEL]
ssbd=force-off [ARM64]
@@ -3429,7 +3709,7 @@
retbleed=auto,nosmt [X86]
mminit_loglevel=
- [KNL] When CONFIG_DEBUG_MEMORY_INIT is set, this
+ [KNL,EARLY] When CONFIG_DEBUG_MEMORY_INIT is set, this
parameter allows control of the logging verbosity for
the additional memory initialisation checks. A value
of 0 disables mminit logging and a level of 4 will
@@ -3437,7 +3717,7 @@
so loglevel=8 may also need to be specified.
mmio_stale_data=
- [X86,INTEL] Control mitigation for the Processor
+ [X86,INTEL,EARLY] Control mitigation for the Processor
MMIO Stale Data vulnerabilities.
Processor MMIO Stale Data is a class of
@@ -3512,7 +3792,7 @@
mousedev.yres= [MOUSE] Vertical screen resolution, used for devices
reporting absolute coordinates, such as tablets
- movablecore= [KNL,X86,IA-64,PPC]
+ movablecore= [KNL,X86,PPC,EARLY]
Format: nn[KMGTPE] | nn%
This parameter is the complement to kernelcore=, it
specifies the amount of memory used for migratable
@@ -3523,7 +3803,7 @@
that the amount of memory usable for all allocations
is not too small.
- movable_node [KNL] Boot-time switch to make hotpluggable memory
+ movable_node [KNL,EARLY] Boot-time switch to make hotpluggable memory
NUMA nodes to be movable. This means that the memory
of such nodes will be usable only for movable
allocations which rules out almost all kernel
@@ -3538,30 +3818,25 @@
mtdparts= [MTD]
See drivers/mtd/parsers/cmdlinepart.c
- mtdset= [ARM]
- ARM/S3C2412 JIVE boot control
-
- See arch/arm/mach-s3c/mach-jive.c
-
mtouchusb.raw_coordinates=
[HW] Make the MicroTouch USB driver use raw coordinates
('y', default) or cooked coordinates ('n')
- mtrr=debug [X86]
+ mtrr=debug [X86,EARLY]
Enable printing debug information related to MTRR
registers at boot time.
- mtrr_chunk_size=nn[KMG] [X86]
+ mtrr_chunk_size=nn[KMG] [X86,EARLY]
Used for MTRR cleanup. It is the largest contiguous chunk
that could hold holes, aka UC entries.
- mtrr_gran_size=nn[KMG] [X86]
+ mtrr_gran_size=nn[KMG] [X86,EARLY]
Used for MTRR cleanup. It is the granularity of an MTRR block.
Default is 1.
Large value could prevent small alignment from
using up MTRRs.
- mtrr_spare_reg_nr=n [X86]
+ mtrr_spare_reg_nr=n [X86,EARLY]
Format: <integer>
Range: 0,7 : spare reg number
Default : 1
@@ -3728,10 +4003,12 @@
Format: [state][,regs][,debounce][,die]
nmi_watchdog= [KNL,BUGS=X86] Debugging features for SMP kernels
- Format: [panic,][nopanic,][num]
+ Format: [panic,][nopanic,][rNNN,][num]
Valid num: 0 or 1
0 - turn hardlockup detector in nmi_watchdog off
1 - turn hardlockup detector in nmi_watchdog on
+ rNNN - configure the watchdog with raw perf event 0xNNN
+
When panic is specified, panic when an NMI watchdog
timeout occurs (or 'nopanic' to not panic on an NMI
watchdog, if CONFIG_BOOTPARAM_HARDLOCKUP_PANIC is set)
@@ -3747,27 +4024,22 @@
emulation library even if a 387 maths coprocessor
is present.
- no4lvl [RISCV] Disable 4-level and 5-level paging modes. Forces
- kernel to use 3-level paging instead.
+ no4lvl [RISCV,EARLY] Disable 4-level and 5-level paging modes.
+ Forces kernel to use 3-level paging instead.
- no5lvl [X86-64,RISCV] Disable 5-level paging mode. Forces
+ no5lvl [X86-64,RISCV,EARLY] Disable 5-level paging mode. Forces
kernel to use 4-level paging instead.
- noaliencache [MM, NUMA, SLAB] Disables the allocation of alien
- caches in the slab allocator. Saves per-node memory,
- but will impact performance.
-
noalign [KNL,ARM]
- noaltinstr [S390] Disables alternative instructions patching
- (CPU alternatives feature).
-
- noapic [SMP,APIC] Tells the kernel to not make use of any
+ noapic [SMP,APIC,EARLY] Tells the kernel to not make use of any
IOAPICs that may be present in the system.
+ noapictimer [APIC,X86] Don't set up the APIC timer
+
noautogroup Disable scheduler automatic task group creation.
- nocache [ARM]
+ nocache [ARM,EARLY]
no_console_suspend
[HW] Never suspend the console
@@ -3785,15 +4057,13 @@
turn on/off it dynamically.
no_debug_objects
- [KNL] Disable object debugging
+ [KNL,EARLY] Disable object debugging
nodsp [SH] Disable hardware DSP at boot time.
- noefi Disable EFI runtime services support.
+ noefi [EFI,EARLY] Disable EFI runtime services support.
- no_entry_flush [PPC] Don't flush the L1-D cache when entering the kernel.
-
- noexec [IA-64]
+ no_entry_flush [PPC,EARLY] Don't flush the L1-D cache when entering the kernel.
noexec32 [X86-64]
This affects only 32-bit executables.
@@ -3814,14 +4084,10 @@
register save and restore. The kernel will only save
legacy floating-point registers on task switch.
- nohalt [IA-64] Tells the kernel not to use the power saving
- function PAL_HALT_LIGHT when idle. This increases
- power-consumption. On the positive side, it reduces
- interrupt wake-up latency, which may improve performance
- in certain environments such as networked servers or
- real-time systems.
+ nogbpages [X86] Do not use GB pages for kernel direct mappings.
no_hash_pointers
+ [KNL,EARLY]
Force pointers printed to the console or buffers to be
unhashed. By default, when a pointer is printed via %p
format string, that pointer is "hashed", i.e. obscured
@@ -3837,7 +4103,7 @@
nohibernate [HIBERNATION] Disable hibernation and resume.
- nohlt [ARM,ARM64,MICROBLAZE,MIPS,PPC,SH] Forces the kernel to
+ nohlt [ARM,ARM64,MICROBLAZE,MIPS,PPC,RISCV,SH] Forces the kernel to
busy wait in do_idle() and not use the arch_cpu_idle()
implementation; requires CONFIG_GENERIC_IDLE_POLL_SETUP
to be effective. This is useful on platforms where the
@@ -3846,9 +4112,11 @@
the impact of the sleep instructions. This is also
useful when using JTAG debugger.
- nohugeiomap [KNL,X86,PPC,ARM64] Disable kernel huge I/O mappings.
+ nohpet [X86] Don't use the HPET timer.
+
+ nohugeiomap [KNL,X86,PPC,ARM64,EARLY] Disable kernel huge I/O mappings.
- nohugevmalloc [KNL,X86,PPC,ARM64] Disable kernel huge vmalloc mappings.
+ nohugevmalloc [KNL,X86,PPC,ARM64,EARLY] Disable kernel huge vmalloc mappings.
nohz= [KNL] Boottime enable/disable dynamic ticks
Valid arguments: on, off
@@ -3870,13 +4138,11 @@
noinitrd [RAM] Tells the kernel not to load any configured
initial RAM disk.
- nointremap [X86-64, Intel-IOMMU] Do not enable interrupt
+ nointremap [X86-64,Intel-IOMMU,EARLY] Do not enable interrupt
remapping.
[Deprecated - use intremap=off]
- nointroute [IA-64]
-
- noinvpcid [X86] Disable the INVPCID cpu feature.
+ noinvpcid [X86,EARLY] Disable the INVPCID cpu feature.
noiotrap [SH] Disables trapped I/O port accesses.
@@ -3885,23 +4151,19 @@
noisapnp [ISAPNP] Disables ISA PnP code.
- nojitter [IA-64] Disables jitter checking for ITC timers.
-
- nokaslr [KNL]
+ nokaslr [KNL,EARLY]
When CONFIG_RANDOMIZE_BASE is set, this disables
kernel and module base offset ASLR (Address Space
Layout Randomization).
- no-kvmapf [X86,KVM] Disable paravirtualized asynchronous page
+ no-kvmapf [X86,KVM,EARLY] Disable paravirtualized asynchronous page
fault handling.
- no-kvmclock [X86,KVM] Disable paravirtualized KVM clock driver
+ no-kvmclock [X86,KVM,EARLY] Disable paravirtualized KVM clock driver
- nolapic [X86-32,APIC] Do not enable or use the local APIC.
+ nolapic [X86-32,APIC,EARLY] Do not enable or use the local APIC.
- nolapic_timer [X86-32,APIC] Do not use the local APIC timer.
-
- nomca [IA-64] Disable machine check abort handling
+ nolapic_timer [X86-32,APIC,EARLY] Do not use the local APIC timer.
nomce [X86-32] Disable Machine Check Exception
@@ -3924,23 +4186,23 @@
shutdown the other cpus. Instead use the REBOOT_VECTOR
irq.
- nopat [X86] Disable PAT (page attribute table extension of
+ nopat [X86,EARLY] Disable PAT (page attribute table extension of
pagetables) support.
- nopcid [X86-64] Disable the PCID cpu feature.
+ nopcid [X86-64,EARLY] Disable the PCID cpu feature.
nopku [X86] Disable Memory Protection Keys CPU feature found
in some Intel CPUs.
- nopti [X86-64]
+ nopti [X86-64,EARLY]
Equivalent to pti=off
- nopv= [X86,XEN,KVM,HYPER_V,VMWARE]
+ nopv= [X86,XEN,KVM,HYPER_V,VMWARE,EARLY]
Disables the PV optimizations forcing the guest to run
as generic guest with no PV drivers. Currently support
XEN HVM, KVM, HYPER_V and VMWARE guest.
- nopvspin [X86,XEN,KVM]
+ nopvspin [X86,XEN,KVM,EARLY]
Disables the qspinlock slow path using PV optimizations
which allow the hypervisor to 'idle' the guest on lock
contention.
@@ -3954,26 +4216,24 @@
noresume [SWSUSP] Disables resume and restores original swap
space.
- nosbagart [IA-64]
-
no-scroll [VGA] Disables scrollback.
This is required for the Braillex ib80-piezo Braille
reader made by F.H. Papenmeier (Germany).
- nosgx [X86-64,SGX] Disables Intel SGX kernel support.
+ nosgx [X86-64,SGX,EARLY] Disables Intel SGX kernel support.
- nosmap [PPC]
+ nosmap [PPC,EARLY]
Disable SMAP (Supervisor Mode Access Prevention)
even if it is supported by processor.
- nosmep [PPC64s]
+ nosmep [PPC64s,EARLY]
Disable SMEP (Supervisor Mode Execution Prevention)
even if it is supported by processor.
- nosmp [SMP] Tells an SMP kernel to act as a UP kernel,
+ nosmp [SMP,EARLY] Tells an SMP kernel to act as a UP kernel,
and disable the IO APIC. legacy for "maxcpus=0".
- nosmt [KNL,MIPS,PPC,S390] Disable symmetric multithreading (SMT).
+ nosmt [KNL,MIPS,PPC,S390,EARLY] Disable symmetric multithreading (SMT).
Equivalent to smt=1.
[KNL,X86,PPC] Disable symmetric multithreading (SMT).
@@ -3983,32 +4243,35 @@
nosoftlockup [KNL] Disable the soft-lockup detector.
nospec_store_bypass_disable
- [HW] Disable all mitigations for the Speculative Store Bypass vulnerability
+ [HW,EARLY] Disable all mitigations for the Speculative
+ Store Bypass vulnerability
- nospectre_bhb [ARM64] Disable all mitigations for Spectre-BHB (branch
+ nospectre_bhb [ARM64,EARLY] Disable all mitigations for Spectre-BHB (branch
history injection) vulnerability. System may allow data leaks
with this option.
- nospectre_v1 [X86,PPC] Disable mitigations for Spectre Variant 1
+ nospectre_v1 [X86,PPC,EARLY] Disable mitigations for Spectre Variant 1
(bounds check bypass). With this option data leaks are
possible in the system.
- nospectre_v2 [X86,PPC_E500,ARM64] Disable all mitigations for
- the Spectre variant 2 (indirect branch prediction)
- vulnerability. System may allow data leaks with this
- option.
+ nospectre_v2 [X86,PPC_E500,ARM64,EARLY] Disable all mitigations
+ for the Spectre variant 2 (indirect branch
+ prediction) vulnerability. System may allow data
+ leaks with this option.
- no-steal-acc [X86,PV_OPS,ARM64,PPC/PSERIES,RISCV] Disable
- paravirtualized steal time accounting. steal time is
- computed, but won't influence scheduler behaviour
+ no-steal-acc [X86,PV_OPS,ARM64,PPC/PSERIES,RISCV,LOONGARCH,EARLY]
+ Disable paravirtualized steal time accounting. steal time
+ is computed, but won't influence scheduler behaviour
nosync [HW,M68K] Disables sync negotiation for all devices.
- no_timer_check [X86,APIC] Disables the code which tests for
- broken timer IRQ sources.
+ no_timer_check [X86,APIC] Disables the code which tests for broken
+ timer IRQ sources, i.e., the IO-APIC timer. This can
+ work around problems with incorrect timer
+ initialization on some boards.
no_uaccess_flush
- [PPC] Don't flush the L1-D cache after accessing user data.
+ [PPC,EARLY] Don't flush the L1-D cache after accessing user data.
novmcoredd [KNL,KDUMP]
Disable device dump. Device dump allows drivers to
@@ -4022,15 +4285,15 @@
is set.
no-vmw-sched-clock
- [X86,PV_OPS] Disable paravirtualized VMware scheduler
- clock and use the default one.
+ [X86,PV_OPS,EARLY] Disable paravirtualized VMware
+ scheduler clock and use the default one.
nowatchdog [KNL] Disable both lockup detectors, i.e.
soft-lockup and NMI watchdog (hard-lockup).
- nowb [ARM]
+ nowb [ARM,EARLY]
- nox2apic [X86-64,APIC] Do not enable x2APIC mode.
+ nox2apic [X86-64,APIC,EARLY] Do not enable x2APIC mode.
NOTE: this parameter will be ignored on systems with the
LEGACY_XAPIC_DISABLED bit set in the
@@ -4055,20 +4318,7 @@
parameter, xsave area per process might occupy more
memory on xsaves enabled systems.
- nps_mtm_hs_ctr= [KNL,ARC]
- This parameter sets the maximum duration, in
- cycles, each HW thread of the CTOP can run
- without interruptions, before HW switches it.
- The actual maximum duration is 16 times this
- parameter's value.
- Format: integer between 1 and 255
- Default: 255
-
- nptcg= [IA-64] Override max number of concurrent global TLB
- purges which is reported from either PAL_VM_SUMMARY or
- SAL PALO.
-
- nr_cpus= [SMP] Maximum number of processors that an SMP kernel
+ nr_cpus= [SMP,EARLY] Maximum number of processors that an SMP kernel
could support. nr_cpus=n : n >= 1 limits the kernel to
support 'n' processors. It could be larger than the
number of already plugged CPU during bootup, later in
@@ -4079,8 +4329,29 @@
nr_uarts= [SERIAL] maximum number of UARTs to be registered.
- numa=off [KNL, ARM64, PPC, RISCV, SPARC, X86] Disable NUMA. Only
- set up a single NUMA node spanning all memory.
+ numa=off [KNL, ARM64, PPC, RISCV, SPARC, X86, EARLY]
+ Disable NUMA. Only set up a single NUMA node
+ spanning all memory.
+
+ numa=fake=<size>[MG]
+ [KNL, ARM64, RISCV, X86, EARLY]
+ If given as a memory unit, fills all system RAM with
+ nodes of size interleaved over physical nodes.
+
+ numa=fake=<N>
+ [KNL, ARM64, RISCV, X86, EARLY]
+ If given as an integer, fills all system RAM with N
+ fake nodes interleaved over physical nodes.
+
+ numa=fake=<N>U
+ [KNL, ARM64, RISCV, X86, EARLY]
+ If given as an integer followed by 'U', it will
+ divide each physical node into N emulated nodes.
+
+ numa=noacpi [X86] Don't parse the SRAT table for NUMA setup
+
+ numa=nohmat [X86] Don't parse the HMAT table for NUMA setup, or
+ soft-reserved memory partitioning.
numa_balancing= [KNL,ARM64,PPC,RISCV,S390,X86] Enable or disable automatic
NUMA balancing.
@@ -4091,7 +4362,7 @@
This can be set from sysctl after boot.
See Documentation/admin-guide/sysctl/vm.rst for details.
- ohci1394_dma=early [HW] enable debugging via the ohci1394 driver.
+ ohci1394_dma=early [HW,EARLY] enable debugging via the ohci1394 driver.
See Documentation/core-api/debugging-via-ohci1394.rst for more
info.
@@ -4117,7 +4388,8 @@
Once locked, the boundary cannot be changed.
1 indicates lock status, 0 indicates unlock status.
- oops=panic Always panic on oopses. Default is to just kill the
+ oops=panic [KNL,EARLY]
+ Always panic on oopses. Default is to just kill the
process, but there is a small probability of
deadlocking the machine.
This will also cause panics on machine check exceptions.
@@ -4125,21 +4397,19 @@
page_alloc.shuffle=
[KNL] Boolean flag to control whether the page allocator
- should randomize its free lists. The randomization may
- be automatically enabled if the kernel detects it is
- running on a platform with a direct-mapped memory-side
- cache, and this parameter can be used to
- override/disable that behavior. The state of the flag
- can be read from sysfs at:
+ should randomize its free lists. This parameter can be
+ used to enable/disable page randomization. The state of
+ the flag can be read from sysfs at:
/sys/module/page_alloc/parameters/shuffle.
+ This parameter is only available if CONFIG_SHUFFLE_PAGE_ALLOCATOR=y.
- page_owner= [KNL] Boot-time page_owner enabling option.
+ page_owner= [KNL,EARLY] Boot-time page_owner enabling option.
Storage of the information about who allocated
each page is disabled in default. With this switch,
we can turn it on.
on: enable the feature
- page_poison= [KNL] Boot-time parameter changing the state of
+ page_poison= [KNL,EARLY] Boot-time parameter changing the state of
poisoning on the buddy allocator, available with
CONFIG_PAGE_POISONING=y.
off: turn off poisoning (default)
@@ -4157,7 +4427,8 @@
timeout < 0: reboot immediately
Format: <timeout>
- panic_on_taint= Bitmask for conditionally calling panic() in add_taint()
+ panic_on_taint= [KNL,EARLY]
+ Bitmask for conditionally calling panic() in add_taint()
Format: <hex>[,nousertaint]
Hexadecimal bitmask representing the set of TAINT flags
that will cause the kernel to panic when add_taint() is
@@ -4182,6 +4453,7 @@
bit 4: print ftrace buffer
bit 5: print all printk messages in buffer
bit 6: print all CPUs backtrace (if available in the arch)
+ bit 7: print only tasks in uninterruptible (blocked) state
*Be aware* that this option may print a _lot_ of lines,
so there are risks of losing older messages in the log.
Use this option carefully, maybe worth to setup a
@@ -4313,7 +4585,7 @@
pcbit= [HW,ISDN]
- pci=option[,option...] [PCI] various PCI subsystem options.
+ pci=option[,option...] [PCI,EARLY] various PCI subsystem options.
Some options herein operate on a specific device
or a set of devices (<pci_dev>). These are
@@ -4539,14 +4811,51 @@
bridges without forcing it upstream. Note:
this removes isolation between devices and
may put more devices in an IOMMU group.
+ config_acs=
+ Format:
+ <ACS flags>@<pci_dev>[; ...]
+ Specify one or more PCI devices (in the format
+ specified above) optionally prepended with flags
+ and separated by semicolons. The respective
+ capabilities will be enabled, disabled or
+ unchanged based on what is specified in
+ flags.
+
+ ACS flags are defined as follows:
+ bit-0 : ACS Source Validation
+ bit-1 : ACS Translation Blocking
+ bit-2 : ACS P2P Request Redirect
+ bit-3 : ACS P2P Completion Redirect
+ bit-4 : ACS Upstream Forwarding
+ bit-5 : ACS P2P Egress Control
+ bit-6 : ACS Direct Translated P2P
+ Each bit can be marked as:
+ '0' – force disabled
+ '1' – force enabled
+ 'x' – unchanged
+ For example,
+ pci=config_acs=10x@pci:0:0
+ would configure all devices that support
+ ACS to enable P2P Request Redirect, disable
+ Translation Blocking, and leave Source
+ Validation unchanged from whatever power-up
+ or firmware set it to.
+
+ Note: this may remove isolation between devices
+ and may put more devices in an IOMMU group.
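
The flag string in the config_acs example above is easiest to read from right to left: in "10x" the rightmost character applies to bit-0. A minimal sketch of that reading follows. It is not the kernel's parser; the function names are hypothetical, and the right-to-left ordering is inferred from the documented example (enable P2P Request Redirect, disable Translation Blocking, leave Source Validation unchanged).

```python
# Hypothetical decoder (not kernel code) for the <ACS flags> part of
# pci=config_acs=<ACS flags>@<pci_dev>. Index in the list == bit number.
ACS_BITS = [
    "Source Validation",        # bit-0
    "Translation Blocking",     # bit-1
    "P2P Request Redirect",     # bit-2
    "P2P Completion Redirect",  # bit-3
    "Upstream Forwarding",      # bit-4
    "P2P Egress Control",       # bit-5
    "Direct Translated P2P",    # bit-6
]
ACTIONS = {"0": "force disabled", "1": "force enabled", "x": "unchanged"}


def decode_acs_flags(flags):
    """Map a flag string such as '10x' to {capability: action}.

    Assumption: the rightmost character is bit-0, as in the documented
    example. Bits beyond the length of the string are not reported.
    """
    if len(flags) > len(ACS_BITS):
        raise ValueError("more flag characters than defined ACS bits")
    result = {}
    for bit, char in enumerate(reversed(flags)):
        result[ACS_BITS[bit]] = ACTIONS[char]
    return result


def split_config_acs(value):
    """Split '<ACS flags>@<pci_dev>' into (decoded flags, device string)."""
    flags, _, device = value.partition("@")
    return decode_acs_flags(flags), device
```

For "10x@pci:0:0" this reports P2P Request Redirect as force enabled, Translation Blocking as force disabled and Source Validation as unchanged, matching the description in the text.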
force_floating [S390] Force usage of floating interrupts.
nomio [S390] Do not use MIO instructions.
norid [S390] ignore the RID field and force use of
one PCI domain per PCI function
+ notph [PCIE] If the PCIE_TPH kernel config parameter
+ is enabled, this kernel boot option can be used
+ to disable PCIe TLP Processing Hints support
+ system-wide.
- pcie_aspm= [PCIE] Forcibly enable or disable PCIe Active State Power
+ pcie_aspm= [PCIE] Forcibly enable or ignore PCIe Active State Power
Management.
- off Disable ASPM.
+ off Don't touch ASPM configuration at all. Leave any
+ configuration done by firmware unchanged.
force Enable ASPM even on devices that claim not to support it.
WARNING: Forcing ASPM on may cause system lockups.
@@ -4582,7 +4891,8 @@
Format: { 0 | 1 }
See arch/parisc/kernel/pdc_chassis.c
- percpu_alloc= Select which percpu first chunk allocator to use.
+ percpu_alloc= [MM,EARLY]
+ Select which percpu first chunk allocator to use.
Currently supported values are "embed" and "page".
Archs may support subset or none of the selections.
See comments in mm/percpu.c for details on each
@@ -4644,6 +4954,11 @@
may be specified.
Format: <port>,<port>....
+ possible_cpus= [SMP,S390,X86]
+ Format: <unsigned int>
+ Set the number of possible CPUs, overriding the
+ regular discovery mechanisms (such as ACPI/FW, etc).
+
powersave=off [PPC] This option disables power saving features.
It specifically disables cpuidle and sets the
platform machine description specific power_save
@@ -4651,12 +4966,12 @@
execution priority.
ppc_strict_facility_enable
- [PPC] This option catches any kernel floating point,
+ [PPC,EARLY] This option catches any kernel floating point,
Altivec, VSX and SPE outside of regions specifically
allowed (eg kernel_enable_fpu()/kernel_disable_fpu()).
There is some performance impact when enabling this.
- ppc_tm= [PPC]
+ ppc_tm= [PPC,EARLY]
Format: {"off"}
Disable Hardware Transactional Memory
@@ -4665,7 +4980,14 @@
none - Limited to cond_resched() calls
voluntary - Limited to cond_resched() and might_sleep() calls
full - Any section that isn't explicitly preempt disabled
- can be preempted anytime.
+ can be preempted anytime. Tasks will also yield
+ contended spinlocks (if the critical section isn't
+ explicitly preempt disabled beyond the lock itself).
+ lazy - Scheduler controlled. Similar to full but instead
+ of preempting the task immediately, the task gets
+ one HZ tick time to yield itself before the
+ preemption will be forced. One preemption is when the
+ task returns to user space.
print-fatal-signals=
[KNL] debug: print fatal signals
@@ -4705,6 +5027,16 @@
printk.time= Show timing data prefixed to each printk message line
Format: <bool> (1/Y/y=enable, 0/N/n=disable)
+ proc_mem.force_override= [KNL]
+ Format: {always | ptrace | never}
+ Traditionally /proc/pid/mem allows memory permissions to be
+ overridden without restrictions. This option may be set to
+ restrict that. Can be one of:
+ - 'always': traditional behavior always allows mem overrides.
+ - 'ptrace': only allow mem overrides for active ptracers.
+ - 'never': never allow mem overrides.
+ If not specified, default is the CONFIG_PROC_MEM_* choice.
+
processor.max_cstate= [HW,ACPI]
Limit processor to maximum C-state
max_cstate=9 overrides any DMI blacklist limit.
@@ -4715,11 +5047,9 @@
profile= [KNL] Enable kernel profiling via /proc/profile
Format: [<profiletype>,]<number>
- Param: <profiletype>: "schedule", "sleep", or "kvm"
+ Param: <profiletype>: "schedule" or "kvm"
[defaults to kernel profiling]
Param: "schedule" - profile schedule points.
- Param: "sleep" - profile D-state sleeping (millisecs).
- Requires CONFIG_SCHEDSTATS
Param: "kvm" - profile VM exits.
Param: <number> - step/bucket size as a power of 2 for
statistical time based profiling.
@@ -4728,7 +5058,9 @@
prot_virt= [S390] enable hosting protected virtual machines
isolated from the hypervisor (if hardware supports
- that).
+ that). If enabled, the default kernel base address
+ might be overridden even when Kernel Address Space
+ Layout Randomization is disabled.
Format: <bool>
psi= [KNL] Enable or disable pressure stall information
@@ -4766,7 +5098,7 @@
[KNL] Number of legacy pty's. Overwrites compiled-in
default number.
- quiet [KNL] Disable most log messages
+ quiet [KNL,EARLY] Disable most log messages
r128= [HW,DRM]
@@ -4783,17 +5115,17 @@
ramdisk_start= [RAM] RAM disk image start address
random.trust_cpu=off
- [KNL] Disable trusting the use of the CPU's
+ [KNL,EARLY] Disable trusting the use of the CPU's
random number generator (if available) to
initialize the kernel's RNG.
random.trust_bootloader=off
- [KNL] Disable trusting the use of a seed
+ [KNL,EARLY] Disable trusting the use of a seed
passed by the bootloader (if available) to
initialize the kernel's RNG.
randomize_kstack_offset=
- [KNL] Enable or disable kernel stack offset
+ [KNL,EARLY] Enable or disable kernel stack offset
randomization, which provides roughly 5 bits of
entropy, frustrating memory corruption attacks
that depend on stack address determinism or
@@ -4852,6 +5184,10 @@
Set maximum number of finished RCU callbacks to
process in one batch.
+ rcutree.csd_lock_suppress_rcu_stall= [KNL]
+ Do only a one-line RCU CPU stall warning when
+ there is an ongoing too-long CSD-lock wait.
+
rcutree.do_rcu_barrier= [KNL]
Request a call to rcu_barrier(). This is
throttled so that userspace tests can safely
@@ -4929,6 +5265,14 @@
the ->nocb_bypass queue. The definition of "too
many" is supplied by this kernel boot parameter.
+ rcutree.nohz_full_patience_delay= [KNL]
+ On callback-offloaded (rcu_nocbs) CPUs, avoid
+ disturbing RCU unless the grace period has
+ reached the specified age in milliseconds.
+ Defaults to zero. Large values will be capped
+ at five seconds. All values will be rounded down
+ to the nearest value representable by jiffies.
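
The capping and rounding described for rcutree.nohz_full_patience_delay can be illustrated with a small calculation. This is only a sketch of the arithmetic under stated assumptions, not the kernel's code: the helper name and the HZ values are hypothetical.

```python
def effective_patience_ms(requested_ms: int, hz: int) -> int:
    """Hypothetical helper: cap at five seconds, round down to whole jiffies."""
    capped_ms = min(requested_ms, 5000)   # large values are capped at 5 s
    jiffies = capped_ms * hz // 1000      # whole jiffies, rounded down
    return jiffies * 1000 // hz           # back to milliseconds


for hz in (100, 250, 1000):
    print(hz, effective_patience_ms(7, hz), effective_patience_ms(60000, hz))
```

With HZ=250 a jiffy is 4 ms, so a request of 7 ms would round down to 4 ms in this model; with HZ=100 it would round down to zero.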
+
rcutree.qhimark= [KNL]
Set threshold of queued RCU callbacks beyond which
batch limiting is disabled.
@@ -5034,6 +5378,25 @@
this kernel boot parameter, forcibly setting it
to zero.
+ rcutree.enable_rcu_lazy= [KNL]
+ To save power, batch RCU callbacks and flush after
+ delay, memory pressure or callback list growing too
+ big.
+
+ rcutree.rcu_normal_wake_from_gp= [KNL]
+ Reduces the latency of synchronize_rcu() calls. This approach
+ maintains its own track of synchronize_rcu() callers, so it
+ does not interact with regular callbacks because it does not
+ use a call_rcu[_hurry]() path. Please note, this is for a
+ normal grace period.
+
+ How to enable it:
+
+ echo 1 > /sys/module/rcutree/parameters/rcu_normal_wake_from_gp
+ or pass a boot parameter "rcutree.rcu_normal_wake_from_gp=1"
+
+ Default is 0.
+
rcuscale.gp_async= [KNL]
Measure performance of asynchronous
grace-period primitives such as call_rcu().
@@ -5165,7 +5528,42 @@
rcutorture.gp_cond= [KNL]
Use conditional/asynchronous update-side
- primitives, if available.
+ normal-grace-period primitives, if available.
+
+ rcutorture.gp_cond_exp= [KNL]
+ Use conditional/asynchronous update-side
+ expedited-grace-period primitives, if available.
+
+ rcutorture.gp_cond_full= [KNL]
+ Use conditional/asynchronous update-side
+ normal-grace-period primitives that also take
+ concurrent expedited grace periods into account,
+ if available.
+
+ rcutorture.gp_cond_exp_full= [KNL]
+ Use conditional/asynchronous update-side
+ expedited-grace-period primitives that also take
+ concurrent normal grace periods into account,
+ if available.
+
+ rcutorture.gp_cond_wi= [KNL]
+ Nominal wait interval for normal conditional
+ grace periods (specified by rcutorture's
+ gp_cond and gp_cond_full module parameters),
+ in microseconds. The actual wait interval will
+ be randomly selected to nanosecond granularity up
+ to this wait interval. Defaults to 16 jiffies,
+ for example, 16,000 microseconds on a system
+ with HZ=1000.
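
Because the rcutorture.gp_cond_wi default is expressed in jiffies, its value in microseconds depends on HZ. A minimal sketch of the conversion follows; the helper name is hypothetical and the HZ values are just common examples.

```python
def jiffies_to_us(jiffies: int, hz: int) -> int:
    """Convert a jiffies count to microseconds for a given HZ (illustrative)."""
    return jiffies * 1_000_000 // hz


for hz in (100, 250, 1000):
    print(f"HZ={hz}: 16 jiffies = {jiffies_to_us(16, hz)} us")
```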
+
+ rcutorture.gp_cond_wi_exp= [KNL]
+ Nominal wait interval for expedited conditional
+ grace periods (specified by rcutorture's
+ gp_cond_exp and gp_cond_exp_full module
+ parameters), in microseconds. The actual wait
+ interval will be randomly selected to nanosecond
+ granularity up to this wait interval. Defaults to
+ 128 microseconds.
rcutorture.gp_exp= [KNL]
Use expedited update-side primitives, if available.
@@ -5174,6 +5572,43 @@
Use normal (non-expedited) asynchronous
update-side primitives, if available.
+ rcutorture.gp_poll= [KNL]
+ Use polled update-side normal-grace-period
+ primitives, if available.
+
+ rcutorture.gp_poll_exp= [KNL]
+ Use polled update-side expedited-grace-period
+ primitives, if available.
+
+ rcutorture.gp_poll_full= [KNL]
+ Use polled update-side normal-grace-period
+ primitives that also take concurrent expedited
+ grace periods into account, if available.
+
+ rcutorture.gp_poll_exp_full= [KNL]
+ Use polled update-side expedited-grace-period
+ primitives that also take concurrent normal
+ grace periods into account, if available.
+
+ rcutorture.gp_poll_wi= [KNL]
+ Nominal wait interval for normal polled
+ grace periods (specified by rcutorture's
+ gp_poll and gp_poll_full module parameters),
+ in microseconds. The actual wait interval will
+ be randomly selected to nanosecond granularity up
+ to this wait interval. Defaults to 16 jiffies,
+ for example, 16,000 microseconds on a system
+ with HZ=1000.
+
+ rcutorture.gp_poll_wi_exp= [KNL]
+ Nominal wait interval for expedited polled
+ grace periods (specified by rcutorture's
+ gp_poll_exp and gp_poll_exp_full module
+ parameters), in microseconds. The actual wait
+ interval will be randomly selected to nanosecond
+ granularity up to this wait interval. Defaults to
+ 128 microseconds.
+
rcutorture.gp_sync= [KNL]
Use normal (non-expedited) synchronous
update-side primitives, if available. If all
@@ -5227,10 +5662,21 @@
Set time (jiffies) between CPU-hotplug operations,
or zero to disable CPU-hotplug testing.
- rcutorture.read_exit= [KNL]
- Set the number of read-then-exit kthreads used
- to test the interaction of RCU updaters and
- task-exit processing.
+ rcutorture.preempt_duration= [KNL]
+ Set duration (in milliseconds) of preemptions
+ by a high-priority FIFO real-time task. Set to
+ zero (the default) to disable. The CPUs to
+ preempt are selected randomly from the set that
+ are online at a given point in time. Races with
+ CPUs going offline are ignored, with that attempt
+ at preemption skipped.
+
+ rcutorture.preempt_interval= [KNL]
+ Set interval (in milliseconds, defaulting to one
+ second) between preemptions by a high-priority
+ FIFO real-time task. This delay is mediated
+ by an hrtimer and is further fuzzed to avoid
+ inadvertent synchronizations.
rcutorture.read_exit_burst= [KNL]
The number of times in a given read-then-exit
@@ -5241,6 +5687,14 @@
The delay, in seconds, between successive
read-then-exit testing episodes.
+ rcutorture.reader_flavor= [KNL]
+ A bit mask indicating which readers to use.
+ If there is more than one bit set, the readers
+ are entered from low-order bit up, and are
+ exited in the opposite order. For SRCU, the
+ 0x1 bit is normal readers, 0x2 NMI-safe readers,
+ and 0x4 light-weight readers.
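
The entry and exit ordering described for rcutorture.reader_flavor can be sketched as follows. Only the three SRCU bit meanings come from the text above; the table and function names are hypothetical.

```python
# Bit meanings for SRCU as listed in the parameter description.
SRCU_READER_BITS = {0x1: "normal", 0x2: "NMI-safe", 0x4: "light-weight"}


def reader_order(mask: int):
    """Return (entry order, exit order) for the readers selected by mask."""
    entered = [name for bit, name in sorted(SRCU_READER_BITS.items())
               if mask & bit]
    return entered, entered[::-1]   # exit in the opposite order


print(reader_order(0x5))
```

For a mask of 0x5 this enters the normal reader first and the light-weight reader second, then exits them in reverse.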
+
rcutorture.shuffle_interval= [KNL]
Set task-shuffle interval (s). Shuffling tasks
allows some CPUs to go into dyntick-idle mode
@@ -5272,7 +5726,13 @@
Time to wait (s) after boot before inducing stall.
rcutorture.stall_cpu_irqsoff= [KNL]
- Disable interrupts while stalling if set.
+ Disable interrupts while stalling if set, but only
+ on the first stall in the set.
+
+ rcutorture.stall_cpu_repeat= [KNL]
+ Number of times to repeat the stall sequence,
+ so that rcutorture.stall_cpu_repeat=3 will result
+ in four stall sequences.
rcutorture.stall_gp_kthread= [KNL]
Duration (s) of forced sleep within RCU
@@ -5460,14 +5920,6 @@
of zero will disable batching. Batching is
always disabled for synchronize_rcu_tasks().
- rcupdate.rcu_tasks_rude_lazy_ms= [KNL]
- Set timeout in milliseconds RCU Tasks
- Rude asynchronous callback batching for
- call_rcu_tasks_rude(). A negative value
- will take the default. A value of zero will
- disable batching. Batching is always disabled
- for synchronize_rcu_tasks_rude().
-
rcupdate.rcu_tasks_trace_lazy_ms= [KNL]
Set timeout in milliseconds RCU Tasks
Trace asynchronous callback batching for
@@ -5484,7 +5936,7 @@
Run specified binary instead of /init from the ramdisk,
used for early userspace startup. See initrd.
- rdrand= [X86]
+ rdrand= [X86,EARLY]
force - Override the decision by the kernel to hide the
advertisement of RDRAND support (this affects
certain AMD processors because of buggy BIOS
@@ -5512,6 +5964,55 @@
reboot_cpu is s[mp]#### with #### being the processor
to be used for rebooting.
+ acpi
+ Use the ACPI RESET_REG in the FADT. If ACPI is not
+ configured or the ACPI reset does not work, the reboot
+ path attempts the reset using the keyboard controller.
+
+ bios
+ Use the CPU reboot vector for warm reset
+
+ cold
+ Set the cold reboot flag
+
+ default
+ There are some built-in platform specific "quirks"
+ - you may see: "reboot: <name> series board detected.
+ Selecting <type> for reboots." In the case where you
+ think the quirk is in error (e.g. you have newer BIOS,
+ or newer board) using this option will ignore the
+ built-in quirk table, and use the generic default
+ reboot actions.
+
+ efi
+ Use efi reset_system runtime service. If EFI is not
+ configured or the EFI reset does not work, the reboot
+ path attempts the reset using the keyboard controller.
+
+ force
+ Don't stop other CPUs on reboot. This can make reboot
+ more reliable in some cases.
+
+ kbd
+ Use the keyboard controller. cold reset (default)
+
+ pci
+ Use a write to the PCI config space register 0xcf9 to
+ trigger reboot.
+
+ triple
+ Force a triple fault (init)
+
+ warm
+ Don't set the cold reboot flag
+
+ Using warm reset will be much faster especially on big
+ memory systems because the BIOS will not go through
+ the memory check. Disadvantage is that not all
+ hardware will be completely reinitialized on reboot so
+ there may be boot problems on some systems.
+
+
refscale.holdoff= [KNL]
Set test-start holdoff period. The purpose of
this parameter is to delay the start of the
@@ -5580,7 +6081,29 @@
them. If <base> is less than 0x10000, the region
is assumed to be I/O ports; otherwise it is memory.
- reservetop= [X86-32]
+ reserve_mem= [RAM]
+ Format: nn[KMG]:<align>:<label>
+ Reserve physical memory and label it with a name that
+ other subsystems can use to access it. This is typically
+ used for systems that do not wipe the RAM, and this command
+ line will try to reserve the same physical memory on
+ soft reboots. Note, it is not guaranteed to be the same
+ location, for example, if anything about the system changes
+ or if a different kernel is booted. It can also fail if KASLR
+ places the kernel at the location of the RAM reservation
+ from a previous boot; in that case the new reservation will
+ be at a different location.
+ Any subsystem using this feature must add a way to verify
+ that the contents of the physical memory are from a previous
+ boot, as there may be cases where the memory will not be
+ located at the same location.
+
+ The format is size:align:label. For example, to request
+ 12 megabytes with 4096-byte alignment for ramoops:
+
+ reserve_mem=12M:4096:oops ramoops.mem_name=oops
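
To make the size:align:label layout concrete, here is a minimal parsing sketch. It assumes the usual K/M/G size suffixes and also accepts a suffix on the alignment, which is an assumption; the function names are hypothetical and this is not the kernel's parser.

```python
SUFFIX = {"K": 1 << 10, "M": 1 << 20, "G": 1 << 30}


def parse_size(text: str) -> int:
    """Parse a decimal number with an optional K, M, or G suffix."""
    if text and text[-1] in SUFFIX:
        return int(text[:-1]) * SUFFIX[text[-1]]
    return int(text)


def parse_reserve_mem(value: str):
    """Split 'size:align:label' into (size in bytes, alignment, label)."""
    size, align, label = value.split(":", 2)
    return parse_size(size), parse_size(align), label


print(parse_reserve_mem("12M:4096:oops"))  # (12582912, 4096, 'oops')
```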
+
+ reservetop= [X86-32,EARLY]
Format: nn[KMG]
Reserves a hole at the top of the kernel virtual
address space.
@@ -5658,14 +6181,11 @@
2 The "airplane mode" button toggles between everything
blocked and everything unblocked.
- rhash_entries= [KNL,NET]
- Set number of hash buckets for route cache
-
ring3mwait=disable
[KNL] Disable ring 3 MONITOR/MWAIT feature on supported
CPUs.
- riscv_isa_fallback [RISCV]
+ riscv_isa_fallback [RISCV,EARLY]
When CONFIG_RISCV_ISA_FALLBACK is not enabled, permit
falling back to detecting extension support by parsing
"riscv,isa" property on devicetree systems when the
@@ -5674,13 +6194,14 @@
ro [KNL] Mount root device read-only on boot
- rodata= [KNL]
+ rodata= [KNL,EARLY]
on Mark read-only kernel memory as read-only (default).
off Leave read-only kernel memory writable for debugging.
full Mark read-only kernel memory and aliases as read-only
[arm64]
rockchip.usb_uart
+ [EARLY]
Enable the uart passthrough on the designated usb port
on Rockchip SoCs. When active, the signals of the
debug-uart get routed to the D+ and D- pins of the usb
@@ -5741,7 +6262,7 @@
sa1100ir [NET]
See drivers/net/irda/sa1100_ir.c.
- sched_verbose [KNL] Enables verbose scheduler debug messages.
+ sched_verbose [KNL,EARLY] Enables verbose scheduler debug messages.
schedstats= [KNL,X86] Enable or disable scheduled statistics.
Allowed values are enable and disable. This feature
@@ -5749,6 +6270,7 @@
but is useful for debugging and performance tuning.
sched_thermal_decay_shift=
+ [Deprecated]
[KNL, SMP] Set a decay shift for scheduler thermal
pressure signal. Thermal pressure signal follows the
default decay period of other scheduler pelt
@@ -5856,7 +6378,11 @@
non-zero "wait" parameter. See weight_single
and weight_many.
- skew_tick= [KNL] Offset the periodic timer tick per cpu to mitigate
+ sdw_mclk_divider=[SDW]
+ Specify the MCLK divider for Intel SoundWire buses in
+ case the BIOS does not provide the clock rate properly.
+
+ skew_tick= [KNL,EARLY] Offset the periodic timer tick per cpu to mitigate
xtime_lock contention on larger systems, and/or RCU lock
contention on all systems with CONFIG_MAXSMP set.
Format: { "0" | "1" }
@@ -5878,7 +6404,16 @@
serialnumber [BUGS=X86-32]
- sev=option[,option...] [X86-64] See Documentation/arch/x86/x86_64/boot-options.rst
+ sev=option[,option...] [X86-64]
+
+ debug
+ Enable debug messages.
+
+ nosnp
+ Do not enable SEV-SNP (applies to host/hypervisor
+ only). Setting 'nosnp' avoids the RMP check overhead
+ in memory accesses when users do not want to run
+ SEV-SNP guests.
shapers= [NET]
Maximal number of shapers.
@@ -5892,68 +6427,68 @@
apic=verbose is specified.
Example: apic=debug show_lapic=all
- simeth= [IA-64]
- simscsi=
-
- slram= [HW,MTD]
-
- slab_merge [MM]
- Enable merging of slabs with similar size when the
- kernel is built without CONFIG_SLAB_MERGE_DEFAULT.
-
- slab_nomerge [MM]
- Disable merging of slabs with similar size. May be
- necessary if there is some reason to distinguish
- allocs to different slabs, especially in hardened
- environments where the risk of heap overflows and
- layout control by attackers can usually be
- frustrated by disabling merging. This will reduce
- most of the exposure of a heap attack to a single
- cache (risks via metadata attacks are mostly
- unchanged). Debug options disable merging on their
- own.
- For more information see Documentation/mm/slub.rst.
-
- slab_max_order= [MM, SLAB]
- Determines the maximum allowed order for slabs.
- A high setting may cause OOMs due to memory
- fragmentation. Defaults to 1 for systems with
- more than 32MB of RAM, 0 otherwise.
-
- slub_debug[=options[,slabs][;[options[,slabs]]...] [MM, SLUB]
- Enabling slub_debug allows one to determine the
+ slab_debug[=options[,slabs][;[options[,slabs]]...] [MM]
+ Enabling slab_debug allows one to determine the
culprit if slab objects become corrupted. Enabling
- slub_debug can create guard zones around objects and
+ slab_debug can create guard zones around objects and
may poison objects when not in use. Also tracks the
last alloc / free. For more information see
Documentation/mm/slub.rst.
+ (slub_debug legacy name also accepted for now)
- slub_max_order= [MM, SLUB]
+ slab_max_order= [MM]
Determines the maximum allowed order for slabs.
A high setting may cause OOMs due to memory
fragmentation. For more information see
Documentation/mm/slub.rst.
+ (slub_max_order legacy name also accepted for now)
- slub_min_objects= [MM, SLUB]
+ slab_merge [MM]
+ Enable merging of slabs with similar size when the
+ kernel is built without CONFIG_SLAB_MERGE_DEFAULT.
+ (slub_merge legacy name also accepted for now)
+
+ slab_min_objects= [MM]
The minimum number of objects per slab. SLUB will
- increase the slab order up to slub_max_order to
+ increase the slab order up to slab_max_order to
generate a sufficiently large slab able to contain
the number of objects indicated. The higher the number
of objects the smaller the overhead of tracking slabs
and the less frequently locks need to be acquired.
For more information see Documentation/mm/slub.rst.
+ (slub_min_objects legacy name also accepted for now)
- slub_min_order= [MM, SLUB]
+ slab_min_order= [MM]
Determines the minimum page order for slabs. Must be
- lower than slub_max_order.
- For more information see Documentation/mm/slub.rst.
+ lower than or equal to slab_max_order. For more information see
+ Documentation/mm/slub.rst.
+ (slub_min_order legacy name also accepted for now)
- slub_merge [MM, SLUB]
- Same with slab_merge.
+ slab_nomerge [MM]
+ Disable merging of slabs with similar size. May be
+ necessary if there is some reason to distinguish
+ allocs to different slabs, especially in hardened
+ environments where the risk of heap overflows and
+ layout control by attackers can usually be
+ frustrated by disabling merging. This will reduce
+ most of the exposure of a heap attack to a single
+ cache (risks via metadata attacks are mostly
+ unchanged). Debug options disable merging on their
+ own.
+ For more information see Documentation/mm/slub.rst.
+ (slub_nomerge legacy name also accepted for now)
+
+ slab_strict_numa [MM]
+ Support memory policies on a per object level
+ in the slab allocator. The default is for memory
+ policies to be applied at the folio level when
+ a new folio is needed or a partial folio is
+ retrieved from the lists. Increases overhead
+ in the slab fastpaths but gains more accurate
+ NUMA kernel object placement which helps with slow
+ interconnects in NUMA systems.
- slub_nomerge [MM, SLUB]
- Same with slab_nomerge. This is supported for legacy.
- See slab_nomerge for more information.
+ slram= [HW,MTD]
smart2= [HW]
Format: <io1>[,<io2>[,...,<io8>]]
@@ -5987,10 +6522,10 @@
1: Fast pin select (default)
2: ATC IRMode
- smt= [KNL,MIPS,S390] Set the maximum number of threads (logical
- CPUs) to use per physical CPU on systems capable of
- symmetric multithreading (SMT). Will be capped to the
- actual hardware limit.
+ smt= [KNL,MIPS,S390,EARLY] Set the maximum number of threads
+ (logical CPUs) to use per physical CPU on systems
+ capable of symmetric multithreading (SMT). Will
+ be capped to the actual hardware limit.
Format: <integer>
Default: -1 (no limit)
@@ -6012,7 +6547,22 @@
sonypi.*= [HW] Sony Programmable I/O Control Device driver
See Documentation/admin-guide/laptops/sonypi.rst
- spectre_v2= [X86] Control mitigation of Spectre variant 2
+ spectre_bhi= [X86] Control mitigation of Branch History Injection
+ (BHI) vulnerability. This setting affects the
+ deployment of the HW BHI control and the SW BHB
+ clearing sequence.
+
+ on - (default) Enable the HW or SW mitigation as
+ needed. This protects the kernel from
+ both syscalls and VMs.
+ vmexit - On systems which don't have the HW mitigation
+ available, enable the SW mitigation on vmexit
+ ONLY. On such systems, the host kernel is
+ protected from VM-originated BHI attacks, but
+ may still be vulnerable to syscall attacks.
+ off - Disable the mitigation.
+
+ spectre_v2= [X86,EARLY] Control mitigation of Spectre variant 2
(indirect branch speculation) vulnerability.
The default operation protects the kernel from
user space attacks.
@@ -6027,8 +6577,8 @@
Selecting 'on' will, and 'auto' may, choose a
mitigation method at run time according to the
CPU, the available microcode, the setting of the
- CONFIG_RETPOLINE configuration option, and the
- compiler with which the kernel was built.
+ CONFIG_MITIGATION_RETPOLINE configuration option,
+ and the compiler with which the kernel was built.
Selecting 'on' will also enable the mitigation
against user space to user space task attacks.
@@ -6092,7 +6642,7 @@
spectre_v2_user=auto.
spec_rstack_overflow=
- [X86] Control RAS overflow mitigation on AMD Zen CPUs
+ [X86,EARLY] Control RAS overflow mitigation on AMD Zen CPUs
off - Disable mitigation
microcode - Enable microcode mitigation only
@@ -6103,7 +6653,7 @@
(cloud-specific mitigation)
spec_store_bypass_disable=
- [HW] Control Speculative Store Bypass (SSB) Disable mitigation
+ [HW,EARLY] Control Speculative Store Bypass (SSB) Disable mitigation
(Speculative Store Bypass vulnerability)
Certain CPUs are vulnerable to an exploit against a
@@ -6154,11 +6704,6 @@
Not specifying this option is equivalent to
spec_store_bypass_disable=auto.
- spia_io_base= [HW,MTD]
- spia_fio_base=
- spia_pedr=
- spia_peddr=
-
split_lock_detect=
[X86] Enable split lock detection or bus lock detection
@@ -6199,7 +6744,7 @@
#DB exception for bus lock is triggered only when
CPL > 0.
- srbds= [X86,INTEL]
+ srbds= [X86,INTEL,EARLY]
Control the Special Register Buffer Data Sampling
(SRBDS) mitigation.
@@ -6286,7 +6831,7 @@
srcutree.convert_to_big must have the 0x10 bit
set for contention-based conversions to occur.
- ssbd= [ARM64,HW]
+ ssbd= [ARM64,HW,EARLY]
Speculative Store Bypass Disable control
On CPUs that are vulnerable to the Speculative
@@ -6310,7 +6855,7 @@
growing up) the main stack are reserved for no other
mapping. Default value is 256 pages.
- stack_depot_disable= [KNL]
+ stack_depot_disable= [KNL,EARLY]
Setting this to true through kernel command line will
disable the stack depot thereby saving the static memory
consumed by the stack hash table. By default this is set
@@ -6349,12 +6894,12 @@
be used to filter out binaries which have
not yet been made aware of AT_MINSIGSTKSZ.
- stress_hpt [PPC]
+ stress_hpt [PPC,EARLY]
Limits the number of kernel HPT entries in the hash
page table to increase the rate of hash page table
faults on kernel addresses.
- stress_slb [PPC]
+ stress_slb [PPC,EARLY]
Limits the number of kernel SLB entries, and flushes
them frequently to increase the rate of SLB faults
on kernel addresses.
@@ -6414,7 +6959,7 @@
This parameter controls use of the Protected
Execution Facility on pSeries.
- swiotlb= [ARM,IA-64,PPC,MIPS,X86]
+ swiotlb= [ARM,PPC,MIPS,X86,S390,EARLY]
Format: { <int> [,<int>] | force | noforce }
<int> -- Number of I/O TLB slabs
<int> -- Second integer after comma. Number of swiotlb
@@ -6424,7 +6969,7 @@
wouldn't be automatically used by the kernel
noforce -- Never use bounce buffers (for debugging)
- switches= [HW,M68k]
+ switches= [HW,M68k,EARLY]
sysctl.*= [KNL]
Set a sysctl parameter, right before loading the init
@@ -6483,11 +7028,30 @@
<deci-seconds>: poll all this frequency
0: no polling (default)
- threadirqs [KNL]
+ thp_anon= [KNL]
+ Format: <size>[KMG],<size>[KMG]:<state>;<size>[KMG]-<size>[KMG]:<state>
+ state is one of "always", "madvise", "never" or "inherit".
+ Control the default behavior of the system with respect
+ to anonymous transparent hugepages.
+ Can be used multiple times for multiple anon THP sizes.
+ See Documentation/admin-guide/mm/transhuge.rst for more
+ details.
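
The thp_anon format combines comma-separated sizes, size ranges, and a state per group. A rough sketch of how such a string decomposes is shown below. The sample value and function names are hypothetical, and the sketch does not check which sizes a given kernel actually supports.

```python
UNITS = {"K": 1 << 10, "M": 1 << 20, "G": 1 << 30}
STATES = {"always", "madvise", "never", "inherit"}


def to_bytes(text: str) -> int:
    """Parse a decimal size with an optional K, M, or G suffix."""
    if text[-1] in UNITS:
        return int(text[:-1]) * UNITS[text[-1]]
    return int(text)


def parse_thp_anon(value: str):
    """Return a list of (low bytes, high bytes, state) tuples."""
    rules = []
    for group in value.split(";"):
        sizes, state = group.rsplit(":", 1)
        if state not in STATES:
            raise ValueError(f"unknown state: {state}")
        for item in sizes.split(","):
            low, _, high = item.partition("-")
            rules.append((to_bytes(low), to_bytes(high or low), state))
    return rules


print(parse_thp_anon("16K,32K:always;64K-1M:madvise"))
```

A single size is reported as a range whose low and high ends are equal.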
+
+ threadirqs [KNL,EARLY]
Force threading of all interrupt handlers except those
marked explicitly IRQF_NO_THREAD.
- topology= [S390]
+ thp_shmem= [KNL]
+ Format: <size>[KMG],<size>[KMG]:<policy>;<size>[KMG]-<size>[KMG]:<policy>
+ Control the default policy of each hugepage size for the
+ internal shmem mount. <policy> is one of policies available
+ for the shmem mount ("always", "inherit", "never", "within_size",
+ and "advise").
+ It can be used multiple times for multiple shmem THP sizes.
+ See Documentation/admin-guide/mm/transhuge.rst for more
+ details.
+
+ topology= [S390,EARLY]
Format: {off | on}
Specify if the kernel should make use of the cpu
topology information if the hardware supports this.
@@ -6495,12 +7059,6 @@
e.g. base its process migration decisions on it.
Default is on.
- topology_updates= [KNL, PPC, NUMA]
- Format: {off}
- Specify if the kernel should ignore (off)
- topology updates sent by the hypervisor to this
- LPAR.
-
torture.disable_onoff_at_boot= [KNL]
Prevent the CPU-hotplug component of torturing
until after init has spawned.
@@ -6520,7 +7078,14 @@
torture.verbose_sleep_duration= [KNL]
Duration of each verbose-printk() sleep in jiffies.
- tp720= [HW,PS2]
+ tpm.disable_pcr_integrity= [HW,TPM]
+ Do not protect PCR registers from unintended physical
+ access or from interposers on the bus by wrapping an
+ integrity-protected session around the TPM2_PCR_Extend
+ command. Consider this when the TPM is heavily utilized
+ by IMA, so that the protection causes a major performance
+ hit, and the space where the machines are deployed is
+ guarded by other means.
tpm_suspend_pcr=[HW,TPM]
Format: integer pcr id
@@ -6548,7 +7113,7 @@
To turn off having tracepoints sent to printk,
echo 0 > /proc/sys/kernel/tracepoint_printk
Note, echoing 1 into this file without the
- tracepoint_printk kernel cmdline option has no effect.
+ tp_printk kernel cmdline option has no effect.
The tp_printk_stop_on_boot (see below) can also be used
to stop the printing of events to console at
@@ -6600,6 +7165,14 @@
comma-separated list of trace events to enable. See
also Documentation/trace/events.rst
+ To enable modules, use :mod: keyword:
+
+ trace_event=:mod:<module>
+
+ The value before :mod: will only enable specific events
+ that are part of the module. See the above mentioned
+ document for more information.
+
trace_instance=[instance-info]
[FTRACE] Create a ring buffer instance early in boot up.
This will be listed in:
@@ -6620,6 +7193,57 @@
the same thing would happen if it was left off). The irq_handler_entry
event, and all events under the "initcall" system.
+ Flags can be added to the instance to modify its behavior when it is
+ created. The flags are separated by '^'.
+
+ The available flags are:
+
+ traceoff - Disable tracing in the instance after it is created.
+ traceprintk - Have trace_printk() write into this trace instance
+ (note, "printk" and "trace_printk" can also be used)
+
+ trace_instance=foo^traceoff^traceprintk,sched,irq
+
+ The flags must come before the defined events.
+
+ If memory has been reserved (see memmap for x86), the instance
+ can use that memory:
+
+ memmap=12M$0x284500000 trace_instance=boot_map@0x284500000:12M
+
+ The above will create a "boot_map" instance that uses the 12 megabytes
+ of physical memory at 0x284500000. The per CPU buffers of that
+ instance will be split up accordingly.
+
+ Alternatively, the memory can be reserved by the reserve_mem option:
+
+ reserve_mem=12M:4096:trace trace_instance=boot_map@trace
+
+ This will reserve 12 megabytes at boot up with a 4096 byte alignment
+ and place the ring buffer in this memory. Note that due to KASLR, the
+ memory may not be the same location each time, which will not preserve
+ the buffer content.
+
+ Also note that the layout of the ring buffer data may change between
+ kernel versions. If the layout is not the same as in the previous kernel,
+ the validator will fail and reset the ring buffer.
+
+ If the ring buffer is used for persistent bootups and has events enabled,
+ it is recommended to disable tracing so that events from a previous boot do not
+ mix with events of the current boot (unless you are debugging a random crash
+ at boot up).
+
+ reserve_mem=12M:4096:trace trace_instance=boot_map^traceoff^traceprintk@trace,sched,irq
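
The pieces of a trace_instance value (name, '^'-separated flags, optional '@' memory reference, then events) can be pulled apart as in the sketch below. This only illustrates the syntax shown in the examples above; the function name and the returned dictionary layout are hypothetical.

```python
def parse_trace_instance(value: str):
    """Split a trace_instance value into name, flags, memory, and events."""
    head, *events = value.split(",")       # events follow the first comma
    spec, _, memory = head.partition("@")  # optional memory reference
    name, *flags = spec.split("^")         # flags come before the events
    return {
        "name": name,
        "flags": flags,
        "memory": memory or None,
        "events": events,
    }


print(parse_trace_instance("boot_map^traceoff^traceprintk@trace,sched,irq"))
print(parse_trace_instance("boot_map@0x284500000:12M"))
```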
+
+ Note, saving the trace buffer across reboots does require that the system
+ is set up to not wipe memory. For instance, CONFIG_RESET_ATTACK_MITIGATION
+ can force a memory reset on boot which will clear any trace that was stored.
+ This is just one of many ways that can clear memory. Make sure your system
+ keeps the content of memory across reboots before relying on this option.
+
+ See also Documentation/trace/debugging.rst
+
+
trace_options=[option-list]
[FTRACE] Enable or disable tracer options at boot.
The option-list is a comma delimited list of options
@@ -6676,6 +7300,20 @@
See Documentation/admin-guide/mm/transhuge.rst
for more details.
+ transparent_hugepage_shmem= [KNL]
+ Format: [always|within_size|advise|never|deny|force]
+ Can be used to control the hugepage allocation policy for
+ the internal shmem mount.
+ See Documentation/admin-guide/mm/transhuge.rst
+ for more details.
+
+ transparent_hugepage_tmpfs= [KNL]
+ Format: [always|within_size|advise|never]
+ Can be used to control the default hugepage allocation policy
+ for the tmpfs mount.
+ See Documentation/admin-guide/mm/transhuge.rst
+ for more details.
+
trusted.source= [KEYS]
Format: <string>
This parameter identifies the trust source as a backend
@@ -6684,6 +7322,7 @@
- "tpm"
- "tee"
- "caam"
+ - "dcp"
If not specified then it defaults to iterating through
the trust source list starting with TPM and assigns the
first trust source as a backend which is initialized
@@ -6699,6 +7338,18 @@
If not specified, "default" is used. In this case,
the RNG's choice is left to each individual trust source.
+ trusted.dcp_use_otp_key
+ This is intended to be used in combination with
+ trusted.source=dcp and will select the DCP OTP key
+ instead of the DCP UNIQUE key for blob encryption.
+
+ trusted.dcp_skip_zk_test
+ This is intended to be used in combination with
+ trusted.source=dcp and will disable the check for whether
+ the blob key is all zeros. This is helpful for situations
+ where having this key zeroed is acceptable, e.g. in testing
+ scenarios.
+
tsc= Disable clocksource stability checks for TSC.
Format: <string>
[x86] reliable: mark tsc clocksource as reliable, this
@@ -6728,7 +7379,7 @@
can be overridden by a later tsc=nowatchdog. A console
message will flag any such suppression or overriding.
- tsc_early_khz= [X86] Skip early TSC calibration and use the given
+ tsc_early_khz= [X86,EARLY] Skip early TSC calibration and use the given
value instead. Useful when the early TSC frequency discovery
procedure is not reliable, such as on overclocked systems
with CPUID.16h support and partial CPUID.15h support.
@@ -6763,7 +7414,7 @@
See Documentation/admin-guide/hw-vuln/tsx_async_abort.rst
for more details.
- tsx_async_abort= [X86,INTEL] Control mitigation for the TSX Async
+ tsx_async_abort= [X86,INTEL,EARLY] Control mitigation for the TSX Async
Abort (TAA) vulnerability.
Similar to Micro-architectural Data Sampling (MDS)
@@ -6829,7 +7480,7 @@
unknown_nmi_panic
[X86] Cause panic on unknown NMI.
- unwind_debug [X86-64]
+ unwind_debug [X86-64,EARLY]
Enable unwinder debug output. This can be
useful for debugging certain unwinder error
conditions, including corrupt stacks and
@@ -6952,6 +7603,9 @@
usb-storage.delay_use=
[UMS] The delay in seconds before a new device is
scanned for Logical Units (default 1).
+ Optionally, the delay is given in milliseconds if the value
+ has the suffix "ms".
+ Example: delay_use=2567ms
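
The two accepted forms of delay_use (plain seconds, or milliseconds with an "ms" suffix) can be normalized as in this small sketch. The helper name is hypothetical and this is not the driver's code.

```python
def delay_use_ms(value: str) -> int:
    """Return the delay in milliseconds for a delay_use value."""
    value = value.strip()
    if value.endswith("ms"):
        return int(value[:-2])     # already in milliseconds
    return int(value) * 1000       # plain number means seconds


print(delay_use_ms("2567ms"))  # 2567
print(delay_use_ms("1"))       # 1000
```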
usb-storage.quirks=
[UMS] A list of quirks entries to supplement or
@@ -7019,7 +7673,7 @@
Example: user_debug=31
userpte=
- [X86] Flags controlling user PTE allocations.
+ [X86,EARLY] Flags controlling user PTE allocations.
nohigh = do not allocate PTE pages in
HIGHMEM regardless of setting
@@ -7045,10 +7699,7 @@
Try vdso32=0 if you encounter an error that says:
dl_main: Assertion `(void *) ph->p_vaddr == _rtld_local._dl_sysinfo_dso' failed!
- vector= [IA-64,SMP]
- vector=percpu: enable percpu vector domain
-
- video= [FB] Frame buffer configuration
+ video= [FB,EARLY] Frame buffer configuration
See Documentation/fb/modedb.rst.
video.brightness_switch_enabled= [ACPI]
@@ -7096,13 +7747,16 @@
P Enable page structure init time poisoning
- Disable all of the above options
- vmalloc=nn[KMG] [KNL,BOOT] Forces the vmalloc area to have an exact
- size of <nn>. This can be used to increase the
- minimum size (128MB on x86). It can also be used to
- decrease the size and leave more room for directly
- mapped kernel RAM.
+ vmalloc=nn[KMG] [KNL,BOOT,EARLY] Forces the vmalloc area to have an
+ exact size of <nn>. This can be used to increase
+ the minimum size (128MB on x86, arm32 platforms).
+ It can also be used to decrease the size and leave more room
+ for directly mapped kernel RAM. Note that this parameter does
+ not exist on many other platforms (including arm64, alpha,
+ loongarch, arc, csky, hexagon, microblaze, mips, nios2, openrisc,
+ parisc, m68k, powerpc, riscv, sh, um, xtensa, s390, sparc).
- vmcp_cma=nn[MG] [KNL,S390]
+ vmcp_cma=nn[MG] [KNL,S390,EARLY]
Sets the memory size reserved for contiguous memory
allocations for the vmcp device driver.
@@ -7115,7 +7769,7 @@
vmpoff= [KNL,S390] Perform z/VM CP command after power off.
Format: <command>
- vsyscall= [X86-64]
+ vsyscall= [X86-64,EARLY]
Controls the behavior of vsyscalls (i.e. calls to
fixed addresses of 0xffffffffff600x00 from legacy
code). Most statically-linked binaries and older
@@ -7142,7 +7796,7 @@
vt.cur_default= [VT] Default cursor shape.
Format: 0xCCBBAA, where AA, BB, and CC are the same as
the parameters of the <Esc>[?A;B;Cc escape sequence;
- see VGA-softcursor.txt. Default: 2 = underline.
+ see vga-softcursor.rst. Default: 2 = underline.
vt.default_blu= [VT]
Format: <blue0>,<blue1>,<blue2>,...,<blue15>
@@ -7213,6 +7867,13 @@
it can be updated at runtime by writing to the
corresponding sysfs file.
+ workqueue.panic_on_stall=<uint>
+ Panic when a workqueue stall is detected by
+ CONFIG_WQ_WATCHDOG. The value sets the number of
+ stalls that triggers the panic.
+
+ The default is 0, which disables the panic on stall.
+
workqueue.cpu_intensive_thresh_us=
Per-cpu work items which run for longer than this
threshold are automatically considered CPU intensive
@@ -7225,6 +7886,15 @@
threshold repeatedly. They are likely good
candidates for using WQ_UNBOUND workqueues instead.
+ workqueue.cpu_intensive_warning_thresh=<uint>
+ If CONFIG_WQ_CPU_INTENSIVE_REPORT is set, the kernel
+ will report the work functions which repeatedly violate
+ workqueue.cpu_intensive_thresh_us. To prevent
+ spurious warnings, printing starts only after a work
+ function has violated the threshold this many times.
+
+ The default is 4 times. 0 disables the warning.
+
workqueue.power_efficient
Per-cpu workqueues are generally preferred because
they show better performance thanks to cache
@@ -7250,7 +7920,7 @@
This can be changed after boot by writing to the
matching /sys/module/workqueue/parameters file. All
workqueues with the "default" affinity scope will be
- updated accordignly.
+ updated accordingly.
workqueue.debug_force_rr_cpu
Workqueue used to implicitly guarantee that work
@@ -7263,13 +7933,13 @@
When enabled, memory and cache locality will be
impacted.
- writecombine= [LOONGARCH] Control the MAT (Memory Access Type) of
- ioremap_wc().
+ writecombine= [LOONGARCH,EARLY] Control the MAT (Memory Access
+ Type) of ioremap_wc().
on - Enable writecombine, use WUC for ioremap_wc()
off - Disable writecombine, use SUC for ioremap_wc()
- x2apic_phys [X86-64,APIC] Use x2apic physical mode instead of
+ x2apic_phys [X86-64,APIC,EARLY] Use x2apic physical mode instead of
default x2apic cluster mode on platforms
supporting x2apic.
@@ -7280,7 +7950,7 @@
save/restore/migration must be enabled to handle larger
domains.
- xen_emul_unplug= [HW,X86,XEN]
+ xen_emul_unplug= [HW,X86,XEN,EARLY]
Unplug Xen emulated devices
Format: [unplug0,][unplug1]
ide-disks -- unplug primary master IDE devices
@@ -7292,21 +7962,22 @@
the unplug protocol
never -- do not unplug even if version check succeeds
- xen_legacy_crash [X86,XEN]
+ xen_legacy_crash [X86,XEN,EARLY]
Crash from Xen panic notifier, without executing late
panic() code such as dumping handler.
- xen_msr_safe= [X86,XEN]
+ xen_mc_debug [X86,XEN,EARLY]
+ Enable multicall debugging when running as a Xen PV guest.
+ Enabling this feature will reduce performance a little
+ bit, so it should only be enabled for obtaining extended
+ debug data in case of multicall errors.
+
+ xen_msr_safe= [X86,XEN,EARLY]
Format: <bool>
Select whether to always use non-faulting (safe) MSR
access functions when running as Xen PV guest. The
default value is controlled by CONFIG_XEN_PV_MSR_SAFE.
- xen_nopvspin [X86,XEN]
- Disables the qspinlock slowpath using Xen PV optimizations.
- This parameter is obsoleted by "nopvspin" parameter, which
- has equivalent effect for XEN platform.
-
xen_nopv [X86]
Disables the PV optimizations forcing the HVM guest to
run as generic HVM guest with no PV drivers.
@@ -7314,7 +7985,7 @@
has equivalent effect for XEN platform.
xen_no_vector_callback
- [KNL,X86,XEN] Disable the vector callback for Xen
+ [KNL,X86,XEN,EARLY] Disable the vector callback for Xen
event channel interrupts.
xen_scrub_pages= [XEN]
@@ -7323,7 +7994,7 @@
with /sys/devices/system/xen_memory/xen_memory0/scrub_pages.
Default value controlled with CONFIG_XEN_SCRUB_PAGES_DEFAULT.
- xen_timer_slop= [X86-64,XEN]
+ xen_timer_slop= [X86-64,XEN,EARLY]
Set the timer slop (in nanoseconds) for the virtual Xen
timers (default is 100000). This adjusts the minimum
delta of virtualized Xen timers, where lower values
@@ -7376,7 +8047,7 @@
host controller quirks. Meaning of each bit can be
consulted in header drivers/usb/host/xhci.h.
- xmon [PPC]
+ xmon [PPC,EARLY]
Format: { early | on | rw | ro | off }
Controls if xmon debugger is enabled. Default is off.
Passing only "xmon" is equivalent to "xmon=early".
@@ -7394,4 +8065,3 @@
memory, and other data can't be written using
xmon commands.
off xmon is disabled.
-
diff --git a/Documentation/admin-guide/kernel-per-CPU-kthreads.rst b/Documentation/admin-guide/kernel-per-CPU-kthreads.rst
index b6aeae3327ce..ea7fa2a8bbf0 100644
--- a/Documentation/admin-guide/kernel-per-CPU-kthreads.rst
+++ b/Documentation/admin-guide/kernel-per-CPU-kthreads.rst
@@ -315,7 +315,7 @@ To reduce its OS jitter, do at least one of the following:
to do.
Name:
- rcuop/%d and rcuos/%d
+ rcuop/%d, rcuos/%d, and rcuog/%d
Purpose:
Offload RCU callbacks from the corresponding CPU.
diff --git a/Documentation/admin-guide/laptops/thinkpad-acpi.rst b/Documentation/admin-guide/laptops/thinkpad-acpi.rst
index 98d304010170..4ab0fef7d440 100644
--- a/Documentation/admin-guide/laptops/thinkpad-acpi.rst
+++ b/Documentation/admin-guide/laptops/thinkpad-acpi.rst
@@ -444,7 +444,11 @@ event code Key Notes
0x1008 0x07 FN+F8 IBM: toggle screen expand
Lenovo: configure UltraNav,
- or toggle screen expand
+ or toggle screen expand.
+ On 2024 platforms replaced by
+ 0x131f (see below) and on newer
+ platforms (2025 +) keycode is
+ replaced by 0x1401 (see below).
0x1009 0x08 FN+F9 -
@@ -504,6 +508,11 @@ event code Key Notes
0x1019 0x18 unknown
+0x131f ... FN+F8 Platform Mode change (2024 systems).
+ Implemented in driver.
+
+0x1401 ... FN+F8 Platform Mode change (2025 + systems).
+ Implemented in driver.
... ... ...
0x1020 0x1F unknown
diff --git a/Documentation/admin-guide/media/building.rst b/Documentation/admin-guide/media/building.rst
index a06473429916..7a413ba07f93 100644
--- a/Documentation/admin-guide/media/building.rst
+++ b/Documentation/admin-guide/media/building.rst
@@ -15,7 +15,7 @@ Please notice, however, that, if:
you should use the main media development tree ``master`` branch:
- https://git.linuxtv.org/media_tree.git/
+ https://git.linuxtv.org/media.git/
In this case, you may find some useful information at the
`LinuxTv wiki pages <https://linuxtv.org/wiki>`_:
diff --git a/Documentation/admin-guide/media/cec.rst b/Documentation/admin-guide/media/cec.rst
index 6b30e355cf23..92690e1f2183 100644
--- a/Documentation/admin-guide/media/cec.rst
+++ b/Documentation/admin-guide/media/cec.rst
@@ -42,10 +42,14 @@ dongles):
``persistent_config``: by default this is off, but when set to 1 the driver
will store the current settings to the device's internal eeprom and restore
it the next time the device is connected to the USB port.
+
- RainShadow Tech. Note: this driver does not support the persistent_config
module option of the Pulse-Eight driver. The hardware supports it, but I
have no plans to add this feature. But I accept patches :-)
+- Extron DA HD 4K PLUS HDMI Distribution Amplifier. See
+ :ref:`extron_da_hd_4k_plus` for more information.
+
Miscellaneous:
- vivid: emulates a CEC receiver and CEC transmitter.
@@ -378,3 +382,86 @@ it later using ``--analyze-pin``.
You can also use this as a full-fledged CEC device by configuring it
using ``cec-ctl --tv -p0.0.0.0`` or ``cec-ctl --playback -p1.0.0.0``.
+
+.. _extron_da_hd_4k_plus:
+
+Extron DA HD 4K PLUS CEC Adapter driver
+=======================================
+
+This driver is for the Extron DA HD 4K PLUS series of HDMI Distribution
+Amplifiers: https://www.extron.com/product/dahd4kplusseries
+
+The 2, 4 and 6 port models are supported.
+
+Firmware version 1.02.0001 or higher is required.
+
+Note that older Extron hardware revisions have a problem with the CEC voltage,
+which may mean that CEC will not work. This is fixed in hardware revisions
+E34814 and up.
+
+The CEC support has two modes: the first is a manual mode where userspace has
+to manually control CEC for the HDMI Input and all HDMI Outputs. While this gives
+full control, it is also complicated.
+
+The second mode is an automatic mode, which is selected if the module option
+``vendor_id`` is set. In that case the driver controls CEC and CEC messages
+received on the input will be distributed to the outputs. It is still possible
+to use the /dev/cecX devices to talk to the connected devices directly, but it is
+the driver that configures everything and deals with things like Hotplug Detect
+changes.
+
+The driver also takes care of the EDIDs: /dev/videoX devices are created to
+read the EDIDs and (for the HDMI Input port) to set the EDID.
+
+By default userspace is responsible for setting the EDID for the HDMI Input
+according to the EDIDs of the connected displays. But if the ``manufacturer_name``
+module option is set, then the driver will take care of setting the EDID
+of the HDMI Input based on the supported resolutions of the connected displays.
+Currently the driver only supports resolutions 1080p60 and 4kp60: if all connected
+displays support 4kp60, then it will advertise 4kp60 on the HDMI input, otherwise
+it will fall back to an EDID that just reports 1080p60.
+
+The status of the Extron is reported in ``/sys/kernel/debug/cec/cecX/status``.
+
+The extron-da-hd-4k-plus driver implements the following module options:
+
+``debug``
+---------
+
+If set to 1, then all serial port traffic is shown.
+
+``vendor_id``
+-------------
+
+The CEC Vendor ID to report to connected displays.
+
+If set, then the driver will take care of distributing CEC messages received
+on the input to the HDMI outputs. This is done for the following CEC messages:
+
+- <Standby>
+- <Image View On> and <Text View On>
+- <Give Device Power Status>
+- <Set System Audio Mode>
+- <Request Current Latency>
+
+If not set, then userspace is responsible for this, and it will have to
+configure the CEC devices for HDMI Input and the HDMI Outputs manually.
+
+``manufacturer_name``
+---------------------
+
+A three character manufacturer name that is used in the EDID for the HDMI
+Input. If not set, then userspace is responsible for configuring an EDID.
+If set, then the driver will update the EDID automatically based on the
+resolutions supported by the connected displays, and it will not be possible
+anymore to manually set the EDID for the HDMI Input.
+
+``hpd_never_low``
+-----------------
+
+If set, then the Hotplug Detect pin of the HDMI Input will always be high,
+even if nothing is connected to the HDMI Outputs. If not set (the default)
+then the Hotplug Detect pin of the HDMI input will go low if all the detected
+Hotplug Detect pins of the HDMI Outputs are also low.
+
+This option may be changed dynamically.
diff --git a/Documentation/admin-guide/media/em28xx-cardlist.rst b/Documentation/admin-guide/media/em28xx-cardlist.rst
index ace65718ea22..7dac07986d91 100644
--- a/Documentation/admin-guide/media/em28xx-cardlist.rst
+++ b/Documentation/admin-guide/media/em28xx-cardlist.rst
@@ -438,3 +438,11 @@ EM28xx cards list
- MyGica iGrabber
- em2860
- 1f4d:1abe
+ * - 106
+ - Hauppauge USB QuadHD ATSC
+ - em28274
+ - 2040:846d
+ * - 107
+ - MyGica UTV3 Analog USB2.0 TV Box
+ - em2860
+ - eb1a:2860
diff --git a/Documentation/admin-guide/media/index.rst b/Documentation/admin-guide/media/index.rst
index be7e0e4482ca..b11737ae6c04 100644
--- a/Documentation/admin-guide/media/index.rst
+++ b/Documentation/admin-guide/media/index.rst
@@ -20,6 +20,11 @@ Documentation/driver-api/media/index.rst
- for driver development information and Kernel APIs used by
media devices;
+Documentation/process/debugging/media_specific_debugging_guide.rst
+
+ - for advice about essential tools and techniques to debug drivers on this
+ subsystem
+
.. toctree::
:caption: Table of Contents
:maxdepth: 2
diff --git a/Documentation/admin-guide/media/ipu3.rst b/Documentation/admin-guide/media/ipu3.rst
index 83b3cd03b35c..9c190942932e 100644
--- a/Documentation/admin-guide/media/ipu3.rst
+++ b/Documentation/admin-guide/media/ipu3.rst
@@ -98,7 +98,7 @@ frames in packed raw Bayer format to IPU3 CSI2 receiver.
# and that ov5670 sensor is connected to i2c bus 10 with address 0x36
export SDEV=$(media-ctl -d $MDEV -e "ov5670 10-0036")
- # Establish the link for the media devices using media-ctl [#f3]_
+ # Establish the link for the media devices using media-ctl
media-ctl -d $MDEV -l "ov5670:0 -> ipu3-csi2 0:0[1]"
# Set the format for the media devices
@@ -589,12 +589,8 @@ preserved.
References
==========
-.. [#f5] drivers/staging/media/ipu3/include/uapi/intel-ipu3.h
-
.. [#f1] https://github.com/intel/nvt
.. [#f2] http://git.ideasonboard.org/yavta.git
-.. [#f3] http://git.ideasonboard.org/?p=media-ctl.git;a=summary
-
.. [#f4] ImgU limitation requires an additional 16x16 for all input resolutions
diff --git a/Documentation/admin-guide/media/ipu6-isys.rst b/Documentation/admin-guide/media/ipu6-isys.rst
new file mode 100644
index 000000000000..d05086824a74
--- /dev/null
+++ b/Documentation/admin-guide/media/ipu6-isys.rst
@@ -0,0 +1,161 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+.. include:: <isonum.txt>
+
+========================================================
+Intel Image Processing Unit 6 (IPU6) Input System driver
+========================================================
+
+Copyright |copy| 2023--2024 Intel Corporation
+
+Introduction
+============
+
+This file documents the Intel IPU6 (6th generation Image Processing Unit)
+Input System (MIPI CSI2 receiver) drivers located under
+drivers/media/pci/intel/ipu6.
+
+The Intel IPU6 can be found in certain Intel SoCs but not in all SKUs:
+
+* Tiger Lake
+* Jasper Lake
+* Alder Lake
+* Raptor Lake
+* Meteor Lake
+
+Intel IPU6 is made up of two components - Input System (ISYS) and Processing
+System (PSYS).
+
+The Input System mainly works as MIPI CSI-2 receiver which receives and
+processes the image data from the sensors and outputs the frames to memory.
+
+There are 2 driver modules - intel-ipu6 and intel-ipu6-isys. intel-ipu6 is an
+IPU6 common driver which does PCI configuration, firmware loading and parsing,
+firmware authentication, DMA mapping and IPU-MMU (internal Memory mapping Unit)
+configuration. intel-ipu6-isys implements V4L2, Media Controller and V4L2
+sub-device interfaces. The IPU6 ISYS driver supports camera sensors connected
+to the IPU6 ISYS through V4L2 sub-device sensor drivers.
+
+.. Note:: See Documentation/driver-api/media/drivers/ipu6.rst for more
+ information about the IPU6 hardware.
+
+Input system driver
+===================
+
+The Input System driver mainly configures CSI-2 D-PHY, constructs the firmware
+stream configuration, sends commands to firmware, gets response from hardware
+and firmware and then returns buffers to user. The ISYS is represented as
+several V4L2 sub-devices as well as video nodes.
+
+.. kernel-figure:: ipu6_isys_graph.svg
+ :alt: ipu6 isys media graph with multiple streams support
+
+ IPU6 ISYS media graph with multiple streams support
+
+The graph has been produced using the following command:
+
+.. code-block:: none
+
+ fdp -Gsplines=true -Tsvg < dot > dot.svg
+
+Capturing frames with IPU6 ISYS
+-------------------------------
+
+IPU6 ISYS is used to capture frames from the camera sensors connected to the
+CSI2 ports. The supported input formats of ISYS are listed in the table below:
+
+.. tabularcolumns:: |p{0.8cm}|p{4.0cm}|p{4.0cm}|
+
+.. flat-table::
+ :header-rows: 1
+
+ * - IPU6 ISYS supported input formats
+
+ * - RGB565, RGB888
+
+ * - UYVY8, YUYV8
+
+ * - RAW8, RAW10, RAW12
+
+.. _ipu6_isys_capture_examples:
+
+Examples
+~~~~~~~~
+
+Here is an example of IPU6 ISYS raw capture on a Dell XPS 9315 laptop. On this
+machine, the ov01a10 sensor is connected to IPU ISYS CSI-2 port 2. The sensor
+can generate images in sBGGR10 format at a resolution of 1280x800.
+
+Using the media controller APIs, we can configure ov01a10 sensor by
+media-ctl [#f1]_ and yavta [#f2]_ to transmit frames to IPU6 ISYS.
+
+.. code-block:: none
+
+ # Example 1 capture frame from ov01a10 camera sensor
+ # This example assumes /dev/media0 as the IPU ISYS media device
+ export MDEV=/dev/media0
+
+ # Establish the link for the media devices using media-ctl
+ media-ctl -d $MDEV -l "\"ov01a10 3-0036\":0 -> \"Intel IPU6 CSI2 2\":0[1]"
+
+ # Set the format for the media devices
+ media-ctl -d $MDEV -V "\"ov01a10 3-0036\":0 [fmt:SBGGR10/1280x800]"
+ media-ctl -d $MDEV -V "\"Intel IPU6 CSI2 2\":0 [fmt:SBGGR10/1280x800]"
+ media-ctl -d $MDEV -V "\"Intel IPU6 CSI2 2\":1 [fmt:SBGGR10/1280x800]"
+
+Once the media pipeline is configured, desired sensor specific settings
+(such as exposure and gain settings) can be set, using the yavta tool.
+
+For example:
+
+.. code-block:: none
+
+ # This example assumes that the ov01a10 sensor is connected to i2c bus 3 with address 0x36
+ export SDEV=$(media-ctl -d $MDEV -e "ov01a10 3-0036")
+
+ yavta -w 0x009e0903 400 $SDEV
+ yavta -w 0x009e0913 1000 $SDEV
+ yavta -w 0x009e0911 2000 $SDEV
+
+Once the desired sensor settings are set, frame captures can be done as below.
+
+For example:
+
+.. code-block:: none
+
+ yavta --data-prefix -u -c10 -n5 -I -s 1280x800 --file=/tmp/frame-#.bin \
+ -f SBGGR10 $(media-ctl -d $MDEV -e "Intel IPU6 ISYS Capture 0")
+
+With the above command, 10 frames are captured at 1280x800 resolution with
+sBGGR10 format. The captured frames are available as /tmp/frame-#.bin files.
+
+Here is another example of IPU6 ISYS RAW and metadata capture from camera
+sensor ov2740 on Lenovo X1 Yoga laptop.
+
+.. code-block:: none
+
+ media-ctl -l "\"ov2740 14-0036\":0 -> \"Intel IPU6 CSI2 1\":0[1]"
+ media-ctl -l "\"Intel IPU6 CSI2 1\":1 -> \"Intel IPU6 ISYS Capture 0\":0[1]"
+ media-ctl -l "\"Intel IPU6 CSI2 1\":2 -> \"Intel IPU6 ISYS Capture 1\":0[1]"
+
+ # set routing
+ media-ctl -R "\"Intel IPU6 CSI2 1\" [0/0->1/0[1],0/1->2/1[1]]"
+
+ media-ctl -V "\"Intel IPU6 CSI2 1\":0/0 [fmt:SGRBG10/1932x1092]"
+ media-ctl -V "\"Intel IPU6 CSI2 1\":0/1 [fmt:GENERIC_8/97x1]"
+ media-ctl -V "\"Intel IPU6 CSI2 1\":1/0 [fmt:SGRBG10/1932x1092]"
+ media-ctl -V "\"Intel IPU6 CSI2 1\":2/1 [fmt:GENERIC_8/97x1]"
+
+ CAPTURE_DEV=$(media-ctl -e "Intel IPU6 ISYS Capture 0")
+ ./yavta --data-prefix -c100 -n5 -I -s1932x1092 --file=/tmp/frame-#.bin \
+ -f SGRBG10 ${CAPTURE_DEV}
+
+ CAPTURE_META=$(media-ctl -e "Intel IPU6 ISYS Capture 1")
+ ./yavta --data-prefix -c100 -n5 -I -s97x1 -B meta-capture \
+ --file=/tmp/meta-#.bin -f GENERIC_8 ${CAPTURE_META}
+
+References
+==========
+
+.. [#f1] https://git.ideasonboard.org/media-ctl.git
+.. [#f2] https://git.ideasonboard.org/yavta.git
diff --git a/Documentation/admin-guide/media/ipu6_isys_graph.svg b/Documentation/admin-guide/media/ipu6_isys_graph.svg
new file mode 100644
index 000000000000..c8539ef320d2
--- /dev/null
+++ b/Documentation/admin-guide/media/ipu6_isys_graph.svg
@@ -0,0 +1,548 @@
+<?xml version="1.0" encoding="UTF-8" standalone="no"?>
+<!DOCTYPE svg PUBLIC "-//W3C//DTD SVG 1.1//EN"
+ "http://www.w3.org/Graphics/SVG/1.1/DTD/svg11.dtd">
+<!-- Generated by graphviz version 2.43.0 (0)
+ -->
+<!-- Title: board Pages: 1 -->
+<svg width="1703pt" height="1473pt"
+ viewBox="0.00 0.00 1703.00 1473.00" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink">
+<g id="graph0" class="graph" transform="scale(1 1) rotate(0) translate(4 1469)">
+<title>board</title>
+<polygon fill="white" stroke="transparent" points="-4,4 -4,-1469 1699,-1469 1699,4 -4,4"/>
+<!-- n00000001 -->
+<g id="node1" class="node">
+<title>n00000001</title>
+<polygon fill="yellow" stroke="black" points="832.99,-750.08 629.99,-750.08 629.99,-712.08 832.99,-712.08 832.99,-750.08"/>
+<text text-anchor="middle" x="731.49" y="-734.88" font-family="Times,serif" font-size="14.00">Intel IPU6 ISYS Capture 0</text>
+<text text-anchor="middle" x="731.49" y="-719.88" font-family="Times,serif" font-size="14.00">/dev/video0</text>
+</g>
+<!-- n00000005 -->
+<g id="node2" class="node">
+<title>n00000005</title>
+<polygon fill="yellow" stroke="black" points="1396.59,-771.88 1193.59,-771.88 1193.59,-733.88 1396.59,-733.88 1396.59,-771.88"/>
+<text text-anchor="middle" x="1295.09" y="-756.68" font-family="Times,serif" font-size="14.00">Intel IPU6 ISYS Capture 1</text>
+<text text-anchor="middle" x="1295.09" y="-741.68" font-family="Times,serif" font-size="14.00">/dev/video1</text>
+</g>
+<!-- n00000009 -->
+<g id="node3" class="node">
+<title>n00000009</title>
+<polygon fill="yellow" stroke="black" points="1118.52,-690.47 915.52,-690.47 915.52,-652.47 1118.52,-652.47 1118.52,-690.47"/>
+<text text-anchor="middle" x="1017.02" y="-675.27" font-family="Times,serif" font-size="14.00">Intel IPU6 ISYS Capture 2</text>
+<text text-anchor="middle" x="1017.02" y="-660.27" font-family="Times,serif" font-size="14.00">/dev/video2</text>
+</g>
+<!-- n0000000d -->
+<g id="node4" class="node">
+<title>n0000000d</title>
+<polygon fill="yellow" stroke="black" points="1105.89,-838.84 902.89,-838.84 902.89,-800.84 1105.89,-800.84 1105.89,-838.84"/>
+<text text-anchor="middle" x="1004.39" y="-823.64" font-family="Times,serif" font-size="14.00">Intel IPU6 ISYS Capture 3</text>
+<text text-anchor="middle" x="1004.39" y="-808.64" font-family="Times,serif" font-size="14.00">/dev/video3</text>
+</g>
+<!-- n00000011 -->
+<g id="node5" class="node">
+<title>n00000011</title>
+<polygon fill="yellow" stroke="black" points="1279.22,-992.95 1076.22,-992.95 1076.22,-954.95 1279.22,-954.95 1279.22,-992.95"/>
+<text text-anchor="middle" x="1177.72" y="-977.75" font-family="Times,serif" font-size="14.00">Intel IPU6 ISYS Capture 4</text>
+<text text-anchor="middle" x="1177.72" y="-962.75" font-family="Times,serif" font-size="14.00">/dev/video4</text>
+</g>
+<!-- n00000015 -->
+<g id="node6" class="node">
+<title>n00000015</title>
+<polygon fill="yellow" stroke="black" points="939.18,-984.91 736.18,-984.91 736.18,-946.91 939.18,-946.91 939.18,-984.91"/>
+<text text-anchor="middle" x="837.68" y="-969.71" font-family="Times,serif" font-size="14.00">Intel IPU6 ISYS Capture 5</text>
+<text text-anchor="middle" x="837.68" y="-954.71" font-family="Times,serif" font-size="14.00">/dev/video5</text>
+</g>
+<!-- n00000019 -->
+<g id="node7" class="node">
+<title>n00000019</title>
+<polygon fill="yellow" stroke="black" points="957.87,-527.99 754.87,-527.99 754.87,-489.99 957.87,-489.99 957.87,-527.99"/>
+<text text-anchor="middle" x="856.37" y="-512.79" font-family="Times,serif" font-size="14.00">Intel IPU6 ISYS Capture 6</text>
+<text text-anchor="middle" x="856.37" y="-497.79" font-family="Times,serif" font-size="14.00">/dev/video6</text>
+</g>
+<!-- n0000001d -->
+<g id="node8" class="node">
+<title>n0000001d</title>
+<polygon fill="yellow" stroke="black" points="1291.02,-542.15 1088.02,-542.15 1088.02,-504.15 1291.02,-504.15 1291.02,-542.15"/>
+<text text-anchor="middle" x="1189.52" y="-526.95" font-family="Times,serif" font-size="14.00">Intel IPU6 ISYS Capture 7</text>
+<text text-anchor="middle" x="1189.52" y="-511.95" font-family="Times,serif" font-size="14.00">/dev/video7</text>
+</g>
+<!-- n00000021 -->
+<g id="node9" class="node">
+<title>n00000021</title>
+<polygon fill="yellow" stroke="black" points="202.74,-611.46 -0.26,-611.46 -0.26,-573.46 202.74,-573.46 202.74,-611.46"/>
+<text text-anchor="middle" x="101.24" y="-596.26" font-family="Times,serif" font-size="14.00">Intel IPU6 ISYS Capture 8</text>
+<text text-anchor="middle" x="101.24" y="-581.26" font-family="Times,serif" font-size="14.00">/dev/video8</text>
+</g>
+<!-- n00000025 -->
+<g id="node10" class="node">
+<title>n00000025</title>
+<polygon fill="yellow" stroke="black" points="764.86,-637.89 561.86,-637.89 561.86,-599.89 764.86,-599.89 764.86,-637.89"/>
+<text text-anchor="middle" x="663.36" y="-622.69" font-family="Times,serif" font-size="14.00">Intel IPU6 ISYS Capture 9</text>
+<text text-anchor="middle" x="663.36" y="-607.69" font-family="Times,serif" font-size="14.00">/dev/video9</text>
+</g>
+<!-- n00000029 -->
+<g id="node11" class="node">
+<title>n00000029</title>
+<polygon fill="yellow" stroke="black" points="358.62,-519.5 146.62,-519.5 146.62,-481.5 358.62,-481.5 358.62,-519.5"/>
+<text text-anchor="middle" x="252.62" y="-504.3" font-family="Times,serif" font-size="14.00">Intel IPU6 ISYS Capture 10</text>
+<text text-anchor="middle" x="252.62" y="-489.3" font-family="Times,serif" font-size="14.00">/dev/video10</text>
+</g>
+<!-- n0000002d -->
+<g id="node12" class="node">
+<title>n0000002d</title>
+<polygon fill="yellow" stroke="black" points="481.4,-662.59 269.4,-662.59 269.4,-624.59 481.4,-624.59 481.4,-662.59"/>
+<text text-anchor="middle" x="375.4" y="-647.39" font-family="Times,serif" font-size="14.00">Intel IPU6 ISYS Capture 11</text>
+<text text-anchor="middle" x="375.4" y="-632.39" font-family="Times,serif" font-size="14.00">/dev/video11</text>
+</g>
+<!-- n00000031 -->
+<g id="node13" class="node">
+<title>n00000031</title>
+<polygon fill="yellow" stroke="black" points="637.17,-837.47 425.17,-837.47 425.17,-799.47 637.17,-799.47 637.17,-837.47"/>
+<text text-anchor="middle" x="531.17" y="-822.27" font-family="Times,serif" font-size="14.00">Intel IPU6 ISYS Capture 12</text>
+<text text-anchor="middle" x="531.17" y="-807.27" font-family="Times,serif" font-size="14.00">/dev/video12</text>
+</g>
+<!-- n00000035 -->
+<g id="node14" class="node">
+<title>n00000035</title>
+<polygon fill="yellow" stroke="black" points="337.75,-833.67 125.75,-833.67 125.75,-795.67 337.75,-795.67 337.75,-833.67"/>
+<text text-anchor="middle" x="231.75" y="-818.47" font-family="Times,serif" font-size="14.00">Intel IPU6 ISYS Capture 13</text>
+<text text-anchor="middle" x="231.75" y="-803.47" font-family="Times,serif" font-size="14.00">/dev/video13</text>
+</g>
+<!-- n00000039 -->
+<g id="node15" class="node">
+<title>n00000039</title>
+<polygon fill="yellow" stroke="black" points="393.07,-317.96 181.07,-317.96 181.07,-279.96 393.07,-279.96 393.07,-317.96"/>
+<text text-anchor="middle" x="287.07" y="-302.76" font-family="Times,serif" font-size="14.00">Intel IPU6 ISYS Capture 14</text>
+<text text-anchor="middle" x="287.07" y="-287.76" font-family="Times,serif" font-size="14.00">/dev/video14</text>
+</g>
+<!-- n0000003d -->
+<g id="node16" class="node">
+<title>n0000003d</title>
+<polygon fill="yellow" stroke="black" points="701.46,-391.04 489.46,-391.04 489.46,-353.04 701.46,-353.04 701.46,-391.04"/>
+<text text-anchor="middle" x="595.46" y="-375.84" font-family="Times,serif" font-size="14.00">Intel IPU6 ISYS Capture 15</text>
+<text text-anchor="middle" x="595.46" y="-360.84" font-family="Times,serif" font-size="14.00">/dev/video15</text>
+</g>
+<!-- n00000041 -->
+<g id="node17" class="node">
+<title>n00000041</title>
+<polygon fill="yellow" stroke="black" points="212.45,-1228.8 0.45,-1228.8 0.45,-1190.8 212.45,-1190.8 212.45,-1228.8"/>
+<text text-anchor="middle" x="106.45" y="-1213.6" font-family="Times,serif" font-size="14.00">Intel IPU6 ISYS Capture 16</text>
+<text text-anchor="middle" x="106.45" y="-1198.6" font-family="Times,serif" font-size="14.00">/dev/video16</text>
+</g>
+<!-- n00000045 -->
+<g id="node18" class="node">
+<title>n00000045</title>
+<polygon fill="yellow" stroke="black" points="784.86,-1252.38 572.86,-1252.38 572.86,-1214.38 784.86,-1214.38 784.86,-1252.38"/>
+<text text-anchor="middle" x="678.86" y="-1237.18" font-family="Times,serif" font-size="14.00">Intel IPU6 ISYS Capture 17</text>
+<text text-anchor="middle" x="678.86" y="-1222.18" font-family="Times,serif" font-size="14.00">/dev/video17</text>
+</g>
+<!-- n00000049 -->
+<g id="node19" class="node">
+<title>n00000049</title>
+<polygon fill="yellow" stroke="black" points="503.14,-1169.96 291.14,-1169.96 291.14,-1131.96 503.14,-1131.96 503.14,-1169.96"/>
+<text text-anchor="middle" x="397.14" y="-1154.76" font-family="Times,serif" font-size="14.00">Intel IPU6 ISYS Capture 18</text>
+<text text-anchor="middle" x="397.14" y="-1139.76" font-family="Times,serif" font-size="14.00">/dev/video18</text>
+</g>
+<!-- n0000004d -->
+<g id="node20" class="node">
+<title>n0000004d</title>
+<polygon fill="yellow" stroke="black" points="492.62,-1319.4 280.62,-1319.4 280.62,-1281.4 492.62,-1281.4 492.62,-1319.4"/>
+<text text-anchor="middle" x="386.62" y="-1304.2" font-family="Times,serif" font-size="14.00">Intel IPU6 ISYS Capture 19</text>
+<text text-anchor="middle" x="386.62" y="-1289.2" font-family="Times,serif" font-size="14.00">/dev/video19</text>
+</g>
+<!-- n00000051 -->
+<g id="node21" class="node">
+<title>n00000051</title>
+<polygon fill="yellow" stroke="black" points="680.74,-1464.66 468.74,-1464.66 468.74,-1426.66 680.74,-1426.66 680.74,-1464.66"/>
+<text text-anchor="middle" x="574.74" y="-1449.46" font-family="Times,serif" font-size="14.00">Intel IPU6 ISYS Capture 20</text>
+<text text-anchor="middle" x="574.74" y="-1434.46" font-family="Times,serif" font-size="14.00">/dev/video20</text>
+</g>
+<!-- n00000055 -->
+<g id="node22" class="node">
+<title>n00000055</title>
+<polygon fill="yellow" stroke="black" points="302.42,-1452.56 90.42,-1452.56 90.42,-1414.56 302.42,-1414.56 302.42,-1452.56"/>
+<text text-anchor="middle" x="196.42" y="-1437.36" font-family="Times,serif" font-size="14.00">Intel IPU6 ISYS Capture 21</text>
+<text text-anchor="middle" x="196.42" y="-1422.36" font-family="Times,serif" font-size="14.00">/dev/video21</text>
+</g>
+<!-- n00000059 -->
+<g id="node23" class="node">
+<title>n00000059</title>
+<polygon fill="yellow" stroke="black" points="319.89,-1018.32 107.89,-1018.32 107.89,-980.32 319.89,-980.32 319.89,-1018.32"/>
+<text text-anchor="middle" x="213.89" y="-1003.12" font-family="Times,serif" font-size="14.00">Intel IPU6 ISYS Capture 22</text>
+<text text-anchor="middle" x="213.89" y="-988.12" font-family="Times,serif" font-size="14.00">/dev/video22</text>
+</g>
+<!-- n0000005d -->
+<g id="node24" class="node">
+<title>n0000005d</title>
+<polygon fill="yellow" stroke="black" points="692.62,-1031.39 480.62,-1031.39 480.62,-993.39 692.62,-993.39 692.62,-1031.39"/>
+<text text-anchor="middle" x="586.62" y="-1016.19" font-family="Times,serif" font-size="14.00">Intel IPU6 ISYS Capture 23</text>
+<text text-anchor="middle" x="586.62" y="-1001.19" font-family="Times,serif" font-size="14.00">/dev/video23</text>
+</g>
+<!-- n00000061 -->
+<g id="node25" class="node">
+<title>n00000061</title>
+<polygon fill="yellow" stroke="black" points="1122.45,-248.8 910.45,-248.8 910.45,-210.8 1122.45,-210.8 1122.45,-248.8"/>
+<text text-anchor="middle" x="1016.45" y="-233.6" font-family="Times,serif" font-size="14.00">Intel IPU6 ISYS Capture 24</text>
+<text text-anchor="middle" x="1016.45" y="-218.6" font-family="Times,serif" font-size="14.00">/dev/video24</text>
+</g>
+<!-- n00000065 -->
+<g id="node26" class="node">
+<title>n00000065</title>
+<polygon fill="yellow" stroke="black" points="1694.86,-272.38 1482.86,-272.38 1482.86,-234.38 1694.86,-234.38 1694.86,-272.38"/>
+<text text-anchor="middle" x="1588.86" y="-257.18" font-family="Times,serif" font-size="14.00">Intel IPU6 ISYS Capture 25</text>
+<text text-anchor="middle" x="1588.86" y="-242.18" font-family="Times,serif" font-size="14.00">/dev/video25</text>
+</g>
+<!-- n00000069 -->
+<g id="node27" class="node">
+<title>n00000069</title>
+<polygon fill="yellow" stroke="black" points="1413.14,-189.96 1201.14,-189.96 1201.14,-151.96 1413.14,-151.96 1413.14,-189.96"/>
+<text text-anchor="middle" x="1307.14" y="-174.76" font-family="Times,serif" font-size="14.00">Intel IPU6 ISYS Capture 26</text>
+<text text-anchor="middle" x="1307.14" y="-159.76" font-family="Times,serif" font-size="14.00">/dev/video26</text>
+</g>
+<!-- n0000006d -->
+<g id="node28" class="node">
+<title>n0000006d</title>
+<polygon fill="yellow" stroke="black" points="1402.62,-339.4 1190.62,-339.4 1190.62,-301.4 1402.62,-301.4 1402.62,-339.4"/>
+<text text-anchor="middle" x="1296.62" y="-324.2" font-family="Times,serif" font-size="14.00">Intel IPU6 ISYS Capture 27</text>
+<text text-anchor="middle" x="1296.62" y="-309.2" font-family="Times,serif" font-size="14.00">/dev/video27</text>
+</g>
+<!-- n00000071 -->
+<g id="node29" class="node">
+<title>n00000071</title>
+<polygon fill="yellow" stroke="black" points="1590.74,-484.66 1378.74,-484.66 1378.74,-446.66 1590.74,-446.66 1590.74,-484.66"/>
+<text text-anchor="middle" x="1484.74" y="-469.46" font-family="Times,serif" font-size="14.00">Intel IPU6 ISYS Capture 28</text>
+<text text-anchor="middle" x="1484.74" y="-454.46" font-family="Times,serif" font-size="14.00">/dev/video28</text>
+</g>
+<!-- n00000075 -->
+<g id="node30" class="node">
+<title>n00000075</title>
+<polygon fill="yellow" stroke="black" points="1212.42,-472.56 1000.42,-472.56 1000.42,-434.56 1212.42,-434.56 1212.42,-472.56"/>
+<text text-anchor="middle" x="1106.42" y="-457.36" font-family="Times,serif" font-size="14.00">Intel IPU6 ISYS Capture 29</text>
+<text text-anchor="middle" x="1106.42" y="-442.36" font-family="Times,serif" font-size="14.00">/dev/video29</text>
+</g>
+<!-- n00000079 -->
+<g id="node31" class="node">
+<title>n00000079</title>
+<polygon fill="yellow" stroke="black" points="1229.89,-38.32 1017.89,-38.32 1017.89,-0.32 1229.89,-0.32 1229.89,-38.32"/>
+<text text-anchor="middle" x="1123.89" y="-23.12" font-family="Times,serif" font-size="14.00">Intel IPU6 ISYS Capture 30</text>
+<text text-anchor="middle" x="1123.89" y="-8.12" font-family="Times,serif" font-size="14.00">/dev/video30</text>
+</g>
+<!-- n0000007d -->
+<g id="node32" class="node">
+<title>n0000007d</title>
+<polygon fill="yellow" stroke="black" points="1602.62,-51.39 1390.62,-51.39 1390.62,-13.39 1602.62,-13.39 1602.62,-51.39"/>
+<text text-anchor="middle" x="1496.62" y="-36.19" font-family="Times,serif" font-size="14.00">Intel IPU6 ISYS Capture 31</text>
+<text text-anchor="middle" x="1496.62" y="-21.19" font-family="Times,serif" font-size="14.00">/dev/video31</text>
+</g>
+<!-- n00000081 -->
+<g id="node33" class="node">
+<title>n00000081</title>
+<path fill="green" stroke="black" d="M924.28,-700.28C924.28,-700.28 1108.28,-700.28 1108.28,-700.28 1114.28,-700.28 1120.28,-706.28 1120.28,-712.28 1120.28,-712.28 1120.28,-772.28 1120.28,-772.28 1120.28,-778.28 1114.28,-784.28 1108.28,-784.28 1108.28,-784.28 924.28,-784.28 924.28,-784.28 918.28,-784.28 912.28,-778.28 912.28,-772.28 912.28,-772.28 912.28,-712.28 912.28,-712.28 912.28,-706.28 918.28,-700.28 924.28,-700.28"/>
+<text text-anchor="middle" x="1016.28" y="-769.08" font-family="Times,serif" font-size="14.00">0</text>
+<polyline fill="none" stroke="black" points="912.28,-761.28 1120.28,-761.28 "/>
+<text text-anchor="middle" x="1016.28" y="-746.08" font-family="Times,serif" font-size="14.00">Intel IPU6 CSI2 0</text>
+<text text-anchor="middle" x="1016.28" y="-731.08" font-family="Times,serif" font-size="14.00">/dev/v4l&#45;subdev0</text>
+<polyline fill="none" stroke="black" points="912.28,-723.28 1120.28,-723.28 "/>
+<text text-anchor="middle" x="925.28" y="-708.08" font-family="Times,serif" font-size="14.00">1</text>
+<polyline fill="none" stroke="black" points="938.28,-700.28 938.28,-723.28 "/>
+<text text-anchor="middle" x="951.28" y="-708.08" font-family="Times,serif" font-size="14.00">2</text>
+<polyline fill="none" stroke="black" points="964.28,-700.28 964.28,-723.28 "/>
+<text text-anchor="middle" x="977.28" y="-708.08" font-family="Times,serif" font-size="14.00">3</text>
+<polyline fill="none" stroke="black" points="990.28,-700.28 990.28,-723.28 "/>
+<text text-anchor="middle" x="1003.28" y="-708.08" font-family="Times,serif" font-size="14.00">4</text>
+<polyline fill="none" stroke="black" points="1016.28,-700.28 1016.28,-723.28 "/>
+<text text-anchor="middle" x="1029.28" y="-708.08" font-family="Times,serif" font-size="14.00">5</text>
+<polyline fill="none" stroke="black" points="1042.28,-700.28 1042.28,-723.28 "/>
+<text text-anchor="middle" x="1055.28" y="-708.08" font-family="Times,serif" font-size="14.00">6</text>
+<polyline fill="none" stroke="black" points="1068.28,-700.28 1068.28,-723.28 "/>
+<text text-anchor="middle" x="1081.28" y="-708.08" font-family="Times,serif" font-size="14.00">7</text>
+<polyline fill="none" stroke="black" points="1094.28,-700.28 1094.28,-723.28 "/>
+<text text-anchor="middle" x="1107.28" y="-708.08" font-family="Times,serif" font-size="14.00">8</text>
+</g>
+<!-- n00000081&#45;&gt;n00000001 -->
+<g id="edge1" class="edge">
+<title>n00000081:port1&#45;&gt;n00000001</title>
+<path fill="none" stroke="black" stroke-dasharray="5,2" d="M912.28,-711.28C912.28,-711.28 880.33,-714.78 843.28,-718.84"/>
+<polygon fill="black" stroke="black" points="842.81,-715.37 833.25,-719.94 843.57,-722.33 842.81,-715.37"/>
+</g>
+<!-- n00000081&#45;&gt;n00000005 -->
+<g id="edge2" class="edge">
+<title>n00000081:port2&#45;&gt;n00000005</title>
+<path fill="none" stroke="black" stroke-dasharray="5,2" d="M951.38,-700.28C951.38,-700.28 1086.18,-688.61 1123.48,-697.08 1155.93,-704.45 1158.99,-719.67 1190.39,-730.68 1190.49,-730.71 1190.59,-730.75 1190.69,-730.78"/>
+<polygon fill="black" stroke="black" points="1189.45,-734.06 1200.05,-733.86 1191.64,-727.41 1189.45,-734.06"/>
+</g>
+<!-- n00000081&#45;&gt;n00000009 -->
+<g id="edge3" class="edge">
+<title>n00000081:port3&#45;&gt;n00000009</title>
+<path fill="none" stroke="black" stroke-dasharray="5,2" d="M977.28,-700.28C977.28,-700.28 979.31,-698.81 982.45,-696.54"/>
+<polygon fill="black" stroke="black" points="984.7,-699.23 990.74,-690.53 980.59,-693.56 984.7,-699.23"/>
+</g>
+<!-- n00000081&#45;&gt;n0000000d -->
+<g id="edge4" class="edge">
+<title>n00000081:port4&#45;&gt;n0000000d</title>
+<path fill="none" stroke="black" stroke-dasharray="5,2" d="M1003.38,-700.26C1003.38,-700.26 916.62,-689.8 909.08,-697.08 880.2,-725.01 885.68,-754.82 909.08,-787.48 910.88,-789.99 918.96,-793.59 929.7,-797.47"/>
+<polygon fill="black" stroke="black" points="928.69,-800.82 939.28,-800.79 930.98,-794.21 928.69,-800.82"/>
+</g>
+<!-- n00000081&#45;&gt;n00000011 -->
+<g id="edge5" class="edge">
+<title>n00000081:port5&#45;&gt;n00000011</title>
+<path fill="none" stroke="black" stroke-dasharray="5,2" d="M1029.19,-700.26C1029.19,-700.26 1115.28,-690.56 1123.48,-697.08 1198.37,-756.64 1190.55,-886.51 1182.64,-944.71"/>
+<polygon fill="black" stroke="black" points="1179.16,-944.31 1181.18,-954.71 1186.09,-945.32 1179.16,-944.31"/>
+</g>
+<!-- n00000081&#45;&gt;n00000015 -->
+<g id="edge6" class="edge">
+<title>n00000081:port6&#45;&gt;n00000015</title>
+<path fill="none" stroke="black" stroke-dasharray="5,2" d="M1055.18,-700.28C1055.18,-700.28 915.57,-692.2 909.08,-697.08 834.02,-753.51 831.79,-879.34 835.06,-936.56"/>
+<polygon fill="black" stroke="black" points="831.58,-936.99 835.74,-946.73 838.56,-936.52 831.58,-936.99"/>
+</g>
+<!-- n00000081&#45;&gt;n00000019 -->
+<g id="edge7" class="edge">
+<title>n00000081:port7&#45;&gt;n00000019</title>
+<path fill="none" stroke="black" stroke-dasharray="5,2" d="M1081.28,-700.28C1081.28,-700.28 916.04,-696.54 912.32,-693.67 864.52,-656.73 856.3,-580.22 855.62,-538.2"/>
+<polygon fill="black" stroke="black" points="859.11,-538.05 855.59,-528.06 852.11,-538.07 859.11,-538.05"/>
+</g>
+<!-- n00000081&#45;&gt;n0000001d -->
+<g id="edge8" class="edge">
+<title>n00000081:port8&#45;&gt;n0000001d</title>
+<path fill="none" stroke="black" stroke-dasharray="5,2" d="M1107.28,-700.28C1107.28,-700.28 1119.29,-696.23 1121.72,-693.67 1159.76,-653.62 1177.38,-589.6 1184.78,-552.46"/>
+<polygon fill="black" stroke="black" points="1188.29,-552.76 1186.69,-542.29 1181.41,-551.47 1188.29,-552.76"/>
+</g>
+<!-- n0000008b -->
+<g id="node34" class="node">
+<title>n0000008b</title>
+<path fill="green" stroke="black" d="M293.1,-532.08C293.1,-532.08 477.1,-532.08 477.1,-532.08 483.1,-532.08 489.1,-538.08 489.1,-544.08 489.1,-544.08 489.1,-604.08 489.1,-604.08 489.1,-610.08 483.1,-616.08 477.1,-616.08 477.1,-616.08 293.1,-616.08 293.1,-616.08 287.1,-616.08 281.1,-610.08 281.1,-604.08 281.1,-604.08 281.1,-544.08 281.1,-544.08 281.1,-538.08 287.1,-532.08 293.1,-532.08"/>
+<text text-anchor="middle" x="385.1" y="-600.88" font-family="Times,serif" font-size="14.00">0</text>
+<polyline fill="none" stroke="black" points="281.1,-593.08 489.1,-593.08 "/>
+<text text-anchor="middle" x="385.1" y="-577.88" font-family="Times,serif" font-size="14.00">Intel IPU6 CSI2 1</text>
+<text text-anchor="middle" x="385.1" y="-562.88" font-family="Times,serif" font-size="14.00">/dev/v4l&#45;subdev1</text>
+<polyline fill="none" stroke="black" points="281.1,-555.08 489.1,-555.08 "/>
+<text text-anchor="middle" x="294.1" y="-539.88" font-family="Times,serif" font-size="14.00">1</text>
+<polyline fill="none" stroke="black" points="307.1,-532.08 307.1,-555.08 "/>
+<text text-anchor="middle" x="320.1" y="-539.88" font-family="Times,serif" font-size="14.00">2</text>
+<polyline fill="none" stroke="black" points="333.1,-532.08 333.1,-555.08 "/>
+<text text-anchor="middle" x="346.1" y="-539.88" font-family="Times,serif" font-size="14.00">3</text>
+<polyline fill="none" stroke="black" points="359.1,-532.08 359.1,-555.08 "/>
+<text text-anchor="middle" x="372.1" y="-539.88" font-family="Times,serif" font-size="14.00">4</text>
+<polyline fill="none" stroke="black" points="385.1,-532.08 385.1,-555.08 "/>
+<text text-anchor="middle" x="398.1" y="-539.88" font-family="Times,serif" font-size="14.00">5</text>
+<polyline fill="none" stroke="black" points="411.1,-532.08 411.1,-555.08 "/>
+<text text-anchor="middle" x="424.1" y="-539.88" font-family="Times,serif" font-size="14.00">6</text>
+<polyline fill="none" stroke="black" points="437.1,-532.08 437.1,-555.08 "/>
+<text text-anchor="middle" x="450.1" y="-539.88" font-family="Times,serif" font-size="14.00">7</text>
+<polyline fill="none" stroke="black" points="463.1,-532.08 463.1,-555.08 "/>
+<text text-anchor="middle" x="476.1" y="-539.88" font-family="Times,serif" font-size="14.00">8</text>
+</g>
+<!-- n0000008b&#45;&gt;n00000021 -->
+<g id="edge9" class="edge">
+<title>n0000008b:port1&#45;&gt;n00000021</title>
+<path fill="none" stroke="black" d="M281.1,-543.08C281.1,-543.08 240.1,-560.51 205.94,-570.26 205.35,-570.43 204.77,-570.59 204.18,-570.76"/>
+<polygon fill="black" stroke="black" points="203.2,-567.39 194.47,-573.39 205.03,-574.15 203.2,-567.39"/>
+</g>
+<!-- n0000008b&#45;&gt;n00000025 -->
+<g id="edge10" class="edge">
+<title>n0000008b:port2&#45;&gt;n00000025</title>
+<path fill="none" stroke="black" d="M320.2,-532.07C320.2,-532.07 456.9,-514.37 492.3,-528.88 528.42,-543.68 522.86,-571.78 556.11,-594.53"/>
+<polygon fill="black" stroke="black" points="554.54,-597.67 564.9,-599.88 558.18,-591.69 554.54,-597.67"/>
+</g>
+<!-- n0000008b&#45;&gt;n00000029 -->
+<g id="edge11" class="edge">
+<title>n0000008b:port3&#45;&gt;n00000029</title>
+<path fill="none" stroke="black" stroke-dasharray="5,2" d="M346.1,-532.08C346.1,-532.08 333.93,-527.96 318.37,-522.71"/>
+<polygon fill="black" stroke="black" points="319.48,-519.39 308.88,-519.5 317.24,-526.02 319.48,-519.39"/>
+</g>
+<!-- n0000008b&#45;&gt;n0000002d -->
+<g id="edge12" class="edge">
+<title>n0000008b:port4&#45;&gt;n0000002d</title>
+<path fill="none" stroke="black" stroke-dasharray="5,2" d="M372.19,-532.05C372.19,-532.05 292.97,-514.3 277.9,-528.88 249.01,-556.8 253.16,-587.62 277.9,-619.28 278.34,-619.85 280.33,-620.69 283.45,-621.71"/>
+<polygon fill="black" stroke="black" points="282.71,-625.14 293.29,-624.58 284.67,-618.42 282.71,-625.14"/>
+</g>
+<!-- n0000008b&#45;&gt;n00000031 -->
+<g id="edge13" class="edge">
+<title>n0000008b:port5&#45;&gt;n00000031</title>
+<path fill="none" stroke="black" stroke-dasharray="5,2" d="M398,-532.05C398,-532.05 476.28,-515.34 492.3,-528.88 568.49,-593.29 550.55,-729.67 538.14,-789.41"/>
+<polygon fill="black" stroke="black" points="534.69,-788.79 535.99,-799.31 541.53,-790.28 534.69,-788.79"/>
+</g>
+<!-- n0000008b&#45;&gt;n00000035 -->
+<g id="edge14" class="edge">
+<title>n0000008b:port6&#45;&gt;n00000035</title>
+<path fill="none" stroke="black" stroke-dasharray="5,2" d="M424,-532.07C424,-532.07 290.37,-518.48 277.9,-528.88 202.27,-591.86 215.34,-725.69 225.66,-785.15"/>
+<polygon fill="black" stroke="black" points="222.29,-786.14 227.54,-795.35 229.17,-784.88 222.29,-786.14"/>
+</g>
+<!-- n0000008b&#45;&gt;n00000039 -->
+<g id="edge15" class="edge">
+<title>n0000008b:port7&#45;&gt;n00000039</title>
+<path fill="none" stroke="black" stroke-dasharray="5,2" d="M450.1,-532.08C450.1,-532.08 395.22,-528.13 383.45,-518.65 375.46,-512.21 322.64,-385.46 298.76,-327.47"/>
+<polygon fill="black" stroke="black" points="301.96,-326.05 294.92,-318.14 295.49,-328.72 301.96,-326.05"/>
+</g>
+<!-- n0000008b&#45;&gt;n0000003d -->
+<g id="edge16" class="edge">
+<title>n0000008b:port8&#45;&gt;n0000003d</title>
+<path fill="none" stroke="black" stroke-dasharray="5,2" d="M476.1,-532.08C476.1,-532.08 522.37,-522.39 526.85,-518.65 563.15,-488.33 581.38,-434.52 589.6,-401.2"/>
+<polygon fill="black" stroke="black" points="593.08,-401.69 591.93,-391.16 586.26,-400.11 593.08,-401.69"/>
+</g>
+<!-- n00000095 -->
+<g id="node35" class="node">
+<title>n00000095</title>
+<path fill="green" stroke="black" d="M301.38,-1180.11C301.38,-1180.11 485.38,-1180.11 485.38,-1180.11 491.38,-1180.11 497.38,-1186.11 497.38,-1192.11 497.38,-1192.11 497.38,-1252.11 497.38,-1252.11 497.38,-1258.11 491.38,-1264.11 485.38,-1264.11 485.38,-1264.11 301.38,-1264.11 301.38,-1264.11 295.38,-1264.11 289.38,-1258.11 289.38,-1252.11 289.38,-1252.11 289.38,-1192.11 289.38,-1192.11 289.38,-1186.11 295.38,-1180.11 301.38,-1180.11"/>
+<text text-anchor="middle" x="393.38" y="-1248.91" font-family="Times,serif" font-size="14.00">0</text>
+<polyline fill="none" stroke="black" points="289.38,-1241.11 497.38,-1241.11 "/>
+<text text-anchor="middle" x="393.38" y="-1225.91" font-family="Times,serif" font-size="14.00">Intel IPU6 CSI2 2</text>
+<text text-anchor="middle" x="393.38" y="-1210.91" font-family="Times,serif" font-size="14.00">/dev/v4l&#45;subdev2</text>
+<polyline fill="none" stroke="black" points="289.38,-1203.11 497.38,-1203.11 "/>
+<text text-anchor="middle" x="302.38" y="-1187.91" font-family="Times,serif" font-size="14.00">1</text>
+<polyline fill="none" stroke="black" points="315.38,-1180.11 315.38,-1203.11 "/>
+<text text-anchor="middle" x="328.38" y="-1187.91" font-family="Times,serif" font-size="14.00">2</text>
+<polyline fill="none" stroke="black" points="341.38,-1180.11 341.38,-1203.11 "/>
+<text text-anchor="middle" x="354.38" y="-1187.91" font-family="Times,serif" font-size="14.00">3</text>
+<polyline fill="none" stroke="black" points="367.38,-1180.11 367.38,-1203.11 "/>
+<text text-anchor="middle" x="380.38" y="-1187.91" font-family="Times,serif" font-size="14.00">4</text>
+<polyline fill="none" stroke="black" points="393.38,-1180.11 393.38,-1203.11 "/>
+<text text-anchor="middle" x="406.38" y="-1187.91" font-family="Times,serif" font-size="14.00">5</text>
+<polyline fill="none" stroke="black" points="419.38,-1180.11 419.38,-1203.11 "/>
+<text text-anchor="middle" x="432.38" y="-1187.91" font-family="Times,serif" font-size="14.00">6</text>
+<polyline fill="none" stroke="black" points="445.38,-1180.11 445.38,-1203.11 "/>
+<text text-anchor="middle" x="458.38" y="-1187.91" font-family="Times,serif" font-size="14.00">7</text>
+<polyline fill="none" stroke="black" points="471.38,-1180.11 471.38,-1203.11 "/>
+<text text-anchor="middle" x="484.38" y="-1187.91" font-family="Times,serif" font-size="14.00">8</text>
+</g>
+<!-- n00000095&#45;&gt;n00000041 -->
+<g id="edge17" class="edge">
+<title>n00000095:port1&#45;&gt;n00000041</title>
+<path fill="none" stroke="black" stroke-dasharray="5,2" d="M289.38,-1191.11C289.38,-1191.11 258.94,-1194.22 222.89,-1197.91"/>
+<polygon fill="black" stroke="black" points="222.19,-1194.46 212.6,-1198.96 222.9,-1201.42 222.19,-1194.46"/>
+</g>
+<!-- n00000095&#45;&gt;n00000045 -->
+<g id="edge18" class="edge">
+<title>n00000095:port2&#45;&gt;n00000045</title>
+<path fill="none" stroke="black" stroke-dasharray="5,2" d="M328.48,-1180.11C328.48,-1180.11 463.26,-1168.53 500.58,-1176.91 534.02,-1184.43 537.24,-1200.06 569.66,-1211.18 569.76,-1211.22 569.86,-1211.25 569.96,-1211.29"/>
+<polygon fill="black" stroke="black" points="568.86,-1214.61 579.45,-1214.34 571,-1207.95 568.86,-1214.61"/>
+</g>
+<!-- n00000095&#45;&gt;n00000049 -->
+<g id="edge19" class="edge">
+<title>n00000095:port3&#45;&gt;n00000049</title>
+<path fill="none" stroke="black" stroke-dasharray="5,2" d="M354.38,-1180.11C354.38,-1180.11 356.8,-1178.46 360.49,-1175.94"/>
+<polygon fill="black" stroke="black" points="362.56,-1178.77 368.86,-1170.24 358.62,-1172.98 362.56,-1178.77"/>
+</g>
+<!-- n00000095&#45;&gt;n0000004d -->
+<g id="edge20" class="edge">
+<title>n00000095:port4&#45;&gt;n0000004d</title>
+<path fill="none" stroke="black" stroke-dasharray="5,2" d="M380.47,-1180.09C380.47,-1180.09 293.71,-1169.63 286.18,-1176.91 257.29,-1204.84 262.63,-1234.76 286.18,-1267.31 288.16,-1270.05 297.33,-1273.96 309.38,-1278.13"/>
+<polygon fill="black" stroke="black" points="308.49,-1281.53 319.09,-1281.36 310.7,-1274.88 308.49,-1281.53"/>
+</g>
+<!-- n00000095&#45;&gt;n00000051 -->
+<g id="edge21" class="edge">
+<title>n00000095:port5&#45;&gt;n00000051</title>
+<path fill="none" stroke="black" stroke-dasharray="5,2" d="M406.28,-1180.09C406.28,-1180.09 492.13,-1170.7 500.58,-1176.91 576.41,-1232.66 579.83,-1358.79 577.09,-1416.2"/>
+<polygon fill="black" stroke="black" points="573.59,-1416.23 576.51,-1426.41 580.58,-1416.63 573.59,-1416.23"/>
+</g>
+<!-- n00000095&#45;&gt;n00000055 -->
+<g id="edge22" class="edge">
+<title>n00000095:port6&#45;&gt;n00000055</title>
+<path fill="none" stroke="black" stroke-dasharray="5,2" d="M432.28,-1180.11C432.28,-1180.11 292.85,-1172.29 286.18,-1176.91 211.26,-1228.86 198.3,-1348.49 196.45,-1404.12"/>
+<polygon fill="black" stroke="black" points="192.94,-1404.28 196.21,-1414.36 199.94,-1404.44 192.94,-1404.28"/>
+</g>
+<!-- n00000095&#45;&gt;n00000059 -->
+<g id="edge23" class="edge">
+<title>n00000095:port7&#45;&gt;n00000059</title>
+<path fill="none" stroke="black" stroke-dasharray="5,2" d="M458.38,-1180.11C458.38,-1180.11 291.84,-1175.85 287.94,-1173.16 239.87,-1139.96 222.85,-1068.83 216.94,-1028.6"/>
+<polygon fill="black" stroke="black" points="220.39,-1028.06 215.6,-1018.61 213.46,-1028.98 220.39,-1028.06"/>
+</g>
+<!-- n00000095&#45;&gt;n0000005d -->
+<g id="edge24" class="edge">
+<title>n00000095:port8&#45;&gt;n0000005d</title>
+<path fill="none" stroke="black" stroke-dasharray="5,2" d="M484.38,-1180.11C484.38,-1180.11 502.45,-1176.49 506.34,-1173.16 547.25,-1138.2 569.47,-1077.38 579.62,-1041.41"/>
+<polygon fill="black" stroke="black" points="583.06,-1042.09 582.28,-1031.53 576.3,-1040.27 583.06,-1042.09"/>
+</g>
+<!-- n0000009f -->
+<g id="node36" class="node">
+<title>n0000009f</title>
+<path fill="green" stroke="black" d="M1211.38,-200.11C1211.38,-200.11 1395.38,-200.11 1395.38,-200.11 1401.38,-200.11 1407.38,-206.11 1407.38,-212.11 1407.38,-212.11 1407.38,-272.11 1407.38,-272.11 1407.38,-278.11 1401.38,-284.11 1395.38,-284.11 1395.38,-284.11 1211.38,-284.11 1211.38,-284.11 1205.38,-284.11 1199.38,-278.11 1199.38,-272.11 1199.38,-272.11 1199.38,-212.11 1199.38,-212.11 1199.38,-206.11 1205.38,-200.11 1211.38,-200.11"/>
+<text text-anchor="middle" x="1303.38" y="-268.91" font-family="Times,serif" font-size="14.00">0</text>
+<polyline fill="none" stroke="black" points="1199.38,-261.11 1407.38,-261.11 "/>
+<text text-anchor="middle" x="1303.38" y="-245.91" font-family="Times,serif" font-size="14.00">Intel IPU6 CSI2 3</text>
+<text text-anchor="middle" x="1303.38" y="-230.91" font-family="Times,serif" font-size="14.00">/dev/v4l&#45;subdev3</text>
+<polyline fill="none" stroke="black" points="1199.38,-223.11 1407.38,-223.11 "/>
+<text text-anchor="middle" x="1212.38" y="-207.91" font-family="Times,serif" font-size="14.00">1</text>
+<polyline fill="none" stroke="black" points="1225.38,-200.11 1225.38,-223.11 "/>
+<text text-anchor="middle" x="1238.38" y="-207.91" font-family="Times,serif" font-size="14.00">2</text>
+<polyline fill="none" stroke="black" points="1251.38,-200.11 1251.38,-223.11 "/>
+<text text-anchor="middle" x="1264.38" y="-207.91" font-family="Times,serif" font-size="14.00">3</text>
+<polyline fill="none" stroke="black" points="1277.38,-200.11 1277.38,-223.11 "/>
+<text text-anchor="middle" x="1290.38" y="-207.91" font-family="Times,serif" font-size="14.00">4</text>
+<polyline fill="none" stroke="black" points="1303.38,-200.11 1303.38,-223.11 "/>
+<text text-anchor="middle" x="1316.38" y="-207.91" font-family="Times,serif" font-size="14.00">5</text>
+<polyline fill="none" stroke="black" points="1329.38,-200.11 1329.38,-223.11 "/>
+<text text-anchor="middle" x="1342.38" y="-207.91" font-family="Times,serif" font-size="14.00">6</text>
+<polyline fill="none" stroke="black" points="1355.38,-200.11 1355.38,-223.11 "/>
+<text text-anchor="middle" x="1368.38" y="-207.91" font-family="Times,serif" font-size="14.00">7</text>
+<polyline fill="none" stroke="black" points="1381.38,-200.11 1381.38,-223.11 "/>
+<text text-anchor="middle" x="1394.38" y="-207.91" font-family="Times,serif" font-size="14.00">8</text>
+</g>
+<!-- n0000009f&#45;&gt;n00000061 -->
+<g id="edge25" class="edge">
+<title>n0000009f:port1&#45;&gt;n00000061</title>
+<path fill="none" stroke="black" stroke-dasharray="5,2" d="M1199.38,-211.11C1199.38,-211.11 1168.94,-214.22 1132.89,-217.91"/>
+<polygon fill="black" stroke="black" points="1132.19,-214.46 1122.6,-218.96 1132.9,-221.42 1132.19,-214.46"/>
+</g>
+<!-- n0000009f&#45;&gt;n00000065 -->
+<g id="edge26" class="edge">
+<title>n0000009f:port2&#45;&gt;n00000065</title>
+<path fill="none" stroke="black" stroke-dasharray="5,2" d="M1238.48,-200.11C1238.48,-200.11 1373.26,-188.53 1410.58,-196.91 1444.02,-204.43 1447.24,-220.06 1479.66,-231.18 1479.76,-231.22 1479.86,-231.25 1479.96,-231.29"/>
+<polygon fill="black" stroke="black" points="1478.86,-234.61 1489.45,-234.34 1481,-227.95 1478.86,-234.61"/>
+</g>
+<!-- n0000009f&#45;&gt;n00000069 -->
+<g id="edge27" class="edge">
+<title>n0000009f:port3&#45;&gt;n00000069</title>
+<path fill="none" stroke="black" stroke-dasharray="5,2" d="M1264.38,-200.11C1264.38,-200.11 1266.8,-198.46 1270.49,-195.94"/>
+<polygon fill="black" stroke="black" points="1272.56,-198.77 1278.86,-190.24 1268.62,-192.98 1272.56,-198.77"/>
+</g>
+<!-- n0000009f&#45;&gt;n0000006d -->
+<g id="edge28" class="edge">
+<title>n0000009f:port4&#45;&gt;n0000006d</title>
+<path fill="none" stroke="black" stroke-dasharray="5,2" d="M1290.47,-200.09C1290.47,-200.09 1203.71,-189.63 1196.18,-196.91 1167.29,-224.84 1172.63,-254.76 1196.18,-287.31 1198.16,-290.05 1207.33,-293.96 1219.38,-298.13"/>
+<polygon fill="black" stroke="black" points="1218.49,-301.53 1229.09,-301.36 1220.7,-294.88 1218.49,-301.53"/>
+</g>
+<!-- n0000009f&#45;&gt;n00000071 -->
+<g id="edge29" class="edge">
+<title>n0000009f:port5&#45;&gt;n00000071</title>
+<path fill="none" stroke="black" stroke-dasharray="5,2" d="M1316.28,-200.09C1316.28,-200.09 1402.13,-190.7 1410.58,-196.91 1486.41,-252.66 1489.83,-378.79 1487.09,-436.2"/>
+<polygon fill="black" stroke="black" points="1483.59,-436.23 1486.51,-446.41 1490.58,-436.63 1483.59,-436.23"/>
+</g>
+<!-- n0000009f&#45;&gt;n00000075 -->
+<g id="edge30" class="edge">
+<title>n0000009f:port6&#45;&gt;n00000075</title>
+<path fill="none" stroke="black" stroke-dasharray="5,2" d="M1342.28,-200.11C1342.28,-200.11 1202.85,-192.29 1196.18,-196.91 1121.26,-248.86 1108.3,-368.49 1106.45,-424.12"/>
+<polygon fill="black" stroke="black" points="1102.94,-424.28 1106.21,-434.36 1109.94,-424.44 1102.94,-424.28"/>
+</g>
+<!-- n0000009f&#45;&gt;n00000079 -->
+<g id="edge31" class="edge">
+<title>n0000009f:port7&#45;&gt;n00000079</title>
+<path fill="none" stroke="black" stroke-dasharray="5,2" d="M1368.38,-200.11C1368.38,-200.11 1201.84,-195.85 1197.94,-193.16 1149.87,-159.96 1132.85,-88.83 1126.94,-48.6"/>
+<polygon fill="black" stroke="black" points="1130.39,-48.06 1125.6,-38.61 1123.46,-48.98 1130.39,-48.06"/>
+</g>
+<!-- n0000009f&#45;&gt;n0000007d -->
+<g id="edge32" class="edge">
+<title>n0000009f:port8&#45;&gt;n0000007d</title>
+<path fill="none" stroke="black" stroke-dasharray="5,2" d="M1394.38,-200.11C1394.38,-200.11 1412.45,-196.49 1416.34,-193.16 1457.25,-158.2 1479.47,-97.38 1489.62,-61.41"/>
+<polygon fill="black" stroke="black" points="1493.06,-62.09 1492.28,-51.53 1486.3,-60.27 1493.06,-62.09"/>
+</g>
+<!-- n000000e9 -->
+<g id="node37" class="node">
+<title>n000000e9</title>
+<path fill="green" stroke="black" d="M398.65,-431.45C398.65,-431.45 511.65,-431.45 511.65,-431.45 517.65,-431.45 523.65,-437.45 523.65,-443.45 523.65,-443.45 523.65,-503.45 523.65,-503.45 523.65,-509.45 517.65,-515.45 511.65,-515.45 511.65,-515.45 398.65,-515.45 398.65,-515.45 392.65,-515.45 386.65,-509.45 386.65,-503.45 386.65,-503.45 386.65,-443.45 386.65,-443.45 386.65,-437.45 392.65,-431.45 398.65,-431.45"/>
+<text text-anchor="middle" x="420.65" y="-500.25" font-family="Times,serif" font-size="14.00">1</text>
+<polyline fill="none" stroke="black" points="454.65,-492.45 454.65,-515.45 "/>
+<text text-anchor="middle" x="489.15" y="-500.25" font-family="Times,serif" font-size="14.00">2</text>
+<polyline fill="none" stroke="black" points="386.65,-492.45 523.65,-492.45 "/>
+<text text-anchor="middle" x="455.15" y="-477.25" font-family="Times,serif" font-size="14.00">ov2740 4&#45;0036</text>
+<text text-anchor="middle" x="455.15" y="-462.25" font-family="Times,serif" font-size="14.00">/dev/v4l&#45;subdev4</text>
+<polyline fill="none" stroke="black" points="386.65,-454.45 523.65,-454.45 "/>
+<text text-anchor="middle" x="455.15" y="-439.25" font-family="Times,serif" font-size="14.00">0</text>
+</g>
+<!-- n000000e9&#45;&gt;n0000008b -->
+<g id="edge33" class="edge">
+<title>n000000e9:port0&#45;&gt;n0000008b:port0</title>
+<path fill="none" stroke="black" stroke-width="2" d="M386.14,-442.55C386.14,-442.55 361.11,-493.23 383.45,-518.65 391.47,-527.78 484.31,-519.72 492.3,-528.88 508.64,-547.6 499.26,-579.87 493.12,-595.68"/>
+<polygon fill="black" stroke="black" stroke-width="2" points="489.86,-594.41 489.11,-604.98 496.29,-597.19 489.86,-594.41"/>
+</g>
+</g>
+</svg>
diff --git a/Documentation/admin-guide/media/mgb4.rst b/Documentation/admin-guide/media/mgb4.rst
index 2977f74d7e26..b9da127c074d 100644
--- a/Documentation/admin-guide/media/mgb4.rst
+++ b/Documentation/admin-guide/media/mgb4.rst
@@ -1,8 +1,10 @@
.. SPDX-License-Identifier: GPL-2.0
-====================
-mgb4 sysfs interface
-====================
+The mgb4 driver
+===============
+
+sysfs interface
+---------------
The mgb4 driver provides a sysfs interface that is used to configure video
stream related parameters (some of them must be set properly before the v4l2
@@ -12,9 +14,8 @@ There are two types of parameters - global / PCI card related, found under
``/sys/class/video4linux/videoX/device`` and module specific found under
``/sys/class/video4linux/videoX``.
-
Global (PCI card) parameters
-============================
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~
**module_type** (R):
Module type.
@@ -42,9 +43,8 @@ Global (PCI card) parameters
where each component is an 8-bit number.
-
Common FPDL3/GMSL input parameters
-==================================
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
**input_id** (R):
Input number ID, zero based.
@@ -190,9 +190,8 @@ Common FPDL3/GMSL input parameters
*Note: This parameter can not be changed while the input v4l2 device is
open.*
-
Common FPDL3/GMSL output parameters
-===================================
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
**output_id** (R):
Output number ID, zero based.
@@ -228,8 +227,13 @@ Common FPDL3/GMSL output parameters
open.*
**frame_rate** (RW):
- Output video frame rate in frames per second. The default frame rate is
- 60Hz.
+ Output video signal frame rate limit in frames per second. Because the
+ output pixel clock can only be set in discrete steps, the card cannot always
+ generate a frame rate that exactly matches the value required by the connected
+ display. This parameter limits the frame rate by "crippling" the signal so
+ that the lines are not all equal (the porches of the last line differ), while
+ the signal still appears to the connected display to have the exact frame
+ rate. The default frame rate limit is 60Hz.
**hsync_polarity** (RW):
HSYNC signal polarity.
@@ -254,37 +258,36 @@ Common FPDL3/GMSL output parameters
and there is a non-linear stepping between two consecutive allowed
frequencies. The driver finds the nearest allowed frequency to the given
value and sets it. When reading this property, you get the exact
- frequency set by the driver. The default frequency is 70000kHz.
+ frequency set by the driver. The default frequency is 61150kHz.
*Note: This parameter can not be changed while the output v4l2 device is
open.*
**hsync_width** (RW):
- Width of the HSYNC signal in pixels. The default value is 16.
+ Width of the HSYNC signal in pixels. The default value is 40.
**vsync_width** (RW):
- Width of the VSYNC signal in video lines. The default value is 2.
+ Width of the VSYNC signal in video lines. The default value is 20.
**hback_porch** (RW):
Number of PCLK pulses between deassertion of the HSYNC signal and the first
- valid pixel in the video line (marked by DE=1). The default value is 32.
+ valid pixel in the video line (marked by DE=1). The default value is 50.
**hfront_porch** (RW):
Number of PCLK pulses between the end of the last valid pixel in the video
line (marked by DE=1) and assertion of the HSYNC signal. The default value
- is 32.
+ is 50.
**vback_porch** (RW):
Number of video lines between deassertion of the VSYNC signal and the video
- line with the first valid pixel (marked by DE=1). The default value is 2.
+ line with the first valid pixel (marked by DE=1). The default value is 31.
**vfront_porch** (RW):
Number of video lines between the end of the last valid pixel line (marked
- by DE=1) and assertion of the VSYNC signal. The default value is 2.
-
+ by DE=1) and assertion of the VSYNC signal. The default value is 30.
FPDL3 specific input parameters
-===============================
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
**fpdl3_input_width** (RW):
Number of deserializer input lines.
@@ -294,7 +297,7 @@ FPDL3 specific input parameters
| 2 - dual
FPDL3 specific output parameters
-================================
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
**fpdl3_output_width** (RW):
Number of serializer output lines.
@@ -304,7 +307,7 @@ FPDL3 specific output parameters
| 2 - dual
GMSL specific input parameters
-==============================
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
**gmsl_mode** (RW):
GMSL speed mode.
@@ -328,10 +331,8 @@ GMSL specific input parameters
| 0 - disabled
| 1 - enabled (default)
-
-====================
-mgb4 mtd partitions
-====================
+MTD partitions
+--------------
The mgb4 driver creates a MTD device with two partitions:
- mgb4-fw.X - FPGA firmware.
@@ -344,9 +345,8 @@ also have a third partition named *mgb4-flash* available in the system. This
partition represents the whole, unpartitioned, card's FLASH memory and one should
not fiddle with it...
-====================
-mgb4 iio (triggers)
-====================
+IIO (triggers)
+--------------
The mgb4 driver creates an Industrial I/O (IIO) device that provides trigger and
signal level status capability. The following scan elements are available:
diff --git a/Documentation/admin-guide/media/omap4_camera.rst b/Documentation/admin-guide/media/omap4_camera.rst
deleted file mode 100644
index 2ada9b1e6897..000000000000
--- a/Documentation/admin-guide/media/omap4_camera.rst
+++ /dev/null
@@ -1,62 +0,0 @@
-.. SPDX-License-Identifier: GPL-2.0
-
-OMAP4 ISS Driver
-================
-
-Author: Sergio Aguirre <sergio.a.aguirre@gmail.com>
-
-Copyright (C) 2012, Texas Instruments
-
-Introduction
-------------
-
-The OMAP44XX family of chips contains the Imaging SubSystem (a.k.a. ISS),
-Which contains several components that can be categorized in 3 big groups:
-
-- Interfaces (2 Interfaces: CSI2-A & CSI2-B/CCP2)
-- ISP (Image Signal Processor)
-- SIMCOP (Still Image Coprocessor)
-
-For more information, please look in [#f1]_ for latest version of:
-"OMAP4430 Multimedia Device Silicon Revision 2.x"
-
-As of Revision AB, the ISS is described in detail in section 8.
-
-This driver is supporting **only** the CSI2-A/B interfaces for now.
-
-It makes use of the Media Controller framework [#f2]_, and inherited most of the
-code from OMAP3 ISP driver (found under drivers/media/platform/ti/omap3isp/\*),
-except that it doesn't need an IOMMU now for ISS buffers memory mapping.
-
-Supports usage of MMAP buffers only (for now).
-
-Tested platforms
-----------------
-
-- OMAP4430SDP, w/ ES2.1 GP & SEVM4430-CAM-V1-0 (Contains IMX060 & OV5640, in
- which only the last one is supported, outputting YUV422 frames).
-
-- TI Blaze MDP, w/ OMAP4430 ES2.2 EMU (Contains 1 IMX060 & 2 OV5650 sensors, in
- which only the OV5650 are supported, outputting RAW10 frames).
-
-- PandaBoard, Rev. A2, w/ OMAP4430 ES2.1 GP & OV adapter board, tested with
- following sensors:
- * OV5640
- * OV5650
-
-- Tested on mainline kernel:
-
- http://git.kernel.org/?p=linux/kernel/git/torvalds/linux.git;a=summary
-
- Tag: v3.3 (commit c16fa4f2ad19908a47c63d8fa436a1178438c7e7)
-
-File list
----------
-drivers/staging/media/omap4iss/
-include/linux/platform_data/media/omap4iss.h
-
-References
-----------
-
-.. [#f1] http://focus.ti.com/general/docs/wtbu/wtbudocumentcenter.tsp?navigationId=12037&templateId=6123#62
-.. [#f2] http://lwn.net/Articles/420485/
diff --git a/Documentation/admin-guide/media/raspberrypi-pisp-be.dot b/Documentation/admin-guide/media/raspberrypi-pisp-be.dot
new file mode 100644
index 000000000000..55671dc1d443
--- /dev/null
+++ b/Documentation/admin-guide/media/raspberrypi-pisp-be.dot
@@ -0,0 +1,20 @@
+digraph board {
+ rankdir=TB
+ n00000001 [label="{{<port0> 0 | <port1> 1 | <port2> 2 | <port7> 7} | pispbe\n | {<port3> 3 | <port4> 4 | <port5> 5 | <port6> 6}}", shape=Mrecord, style=filled, fillcolor=green]
+ n00000001:port3 -> n0000001c [style=bold]
+ n00000001:port4 -> n00000022 [style=bold]
+ n00000001:port5 -> n00000028 [style=bold]
+ n00000001:port6 -> n0000002e [style=bold]
+ n0000000a [label="pispbe-input\n/dev/video0", shape=box, style=filled, fillcolor=yellow]
+ n0000000a -> n00000001:port0 [style=bold]
+ n00000010 [label="pispbe-tdn_input\n/dev/video1", shape=box, style=filled, fillcolor=yellow]
+ n00000010 -> n00000001:port1 [style=bold]
+ n00000016 [label="pispbe-stitch_input\n/dev/video2", shape=box, style=filled, fillcolor=yellow]
+ n00000016 -> n00000001:port2 [style=bold]
+ n0000001c [label="pispbe-output0\n/dev/video3", shape=box, style=filled, fillcolor=yellow]
+ n00000022 [label="pispbe-output1\n/dev/video4", shape=box, style=filled, fillcolor=yellow]
+ n00000028 [label="pispbe-tdn_output\n/dev/video5", shape=box, style=filled, fillcolor=yellow]
+ n0000002e [label="pispbe-stitch_output\n/dev/video6", shape=box, style=filled, fillcolor=yellow]
+ n00000034 [label="pispbe-config\n/dev/video7", shape=box, style=filled, fillcolor=yellow]
+ n00000034 -> n00000001:port7 [style=bold]
+}
diff --git a/Documentation/admin-guide/media/raspberrypi-pisp-be.rst b/Documentation/admin-guide/media/raspberrypi-pisp-be.rst
new file mode 100644
index 000000000000..0fcf46f26276
--- /dev/null
+++ b/Documentation/admin-guide/media/raspberrypi-pisp-be.rst
@@ -0,0 +1,109 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=========================================================
+Raspberry Pi PiSP Back End Memory-to-Memory ISP (pisp-be)
+=========================================================
+
+The PiSP Back End
+=================
+
+The PiSP Back End is a memory-to-memory Image Signal Processor (ISP) which reads
+image data from DRAM memory and performs image processing as specified by the
+application through the parameters in a configuration buffer, before writing
+pixel data back to memory through two distinct output channels.
+
+The ISP registers and programming model are documented in the `Raspberry Pi
+Image Signal Processor (PiSP) Specification document`_
+
+The PiSP Back End ISP processes images in tiles. The handling of image
+tessellation and the computation of low-level configuration parameters is
+realized by a free software library called `libpisp
+<https://github.com/raspberrypi/libpisp>`_.
+
+The full image processing pipeline, which involves capturing RAW Bayer data from
+an image sensor through a MIPI CSI-2 compatible capture interface, storing them
+in DRAM memory and processing them in the PiSP Back End to obtain images usable
+by an application is implemented in `libcamera <https://libcamera.org>`_ as
+part of the Raspberry Pi platform support.
+
+The pisp-be driver
+==================
+
+The Raspberry Pi PiSP Back End (pisp-be) driver is located under
+drivers/media/platform/raspberrypi/pisp-be. It uses the `V4L2 API` to register
+a number of video capture and output devices, the `V4L2 subdev API` to register
+a subdevice for the ISP that connects the video devices in a single media graph
+realized using the `Media Controller (MC) API`.
+
+The media topology registered by the `pisp-be` driver is represented below:
+
+.. _pips-be-topology:
+
+.. kernel-figure:: raspberrypi-pisp-be.dot
+ :alt: Diagram of the default media pipeline topology
+ :align: center
+
+
+The media graph registers the following video device nodes:
+
+- pispbe-input: output device for images to be submitted to the ISP for
+ processing.
+- pispbe-tdn_input: output device for temporal denoise.
+- pispbe-stitch_input: output device for image stitching (HDR).
+- pispbe-output0: first capture device for processed images.
+- pispbe-output1: second capture device for processed images.
+- pispbe-tdn_output: capture device for temporal denoise.
+- pispbe-stitch_output: capture device for image stitching (HDR).
+- pispbe-config: output device for ISP configuration parameters.
+
+pispbe-input
+------------
+
+Images to be processed by the ISP are queued to the `pispbe-input` output device
+node. For a list of image formats supported as input to the ISP refer to the
+`Raspberry Pi Image Signal Processor (PiSP) Specification document`_.
+
+pispbe-tdn_input, pispbe-tdn_output
+-----------------------------------
+
+The `pispbe-tdn_input` output video device receives images to be processed by
+the temporal denoise block which are captured from the `pispbe-tdn_output`
+capture video device. Userspace is responsible for maintaining queues on both
+devices, and ensuring that buffers completed on the output are queued to the
+input.
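
The buffer hand-off described above can be illustrated with a small, pure-Python simulation. This is only a sketch: it makes no V4L2 calls, the function name ``recycle_tdn_buffers`` and the integer buffer ids are hypothetical, and the choice to queue nothing on the input for the very first frame is an assumption, not something the driver documentation states.

```python
from collections import deque


def recycle_tdn_buffers(num_frames, num_buffers=2):
    """Simulate userspace cycling buffers between tdn_output and tdn_input.

    Returns a list of (frame, input_buffer, capture_buffer) tuples, where
    input_buffer is the buffer completed on pispbe-tdn_output by the
    previous frame (None for the first frame; this is an assumption).
    """
    free = deque(range(num_buffers))  # buffers available for capture
    previous = None                   # buffer completed by the previous frame
    schedule = []
    for frame in range(num_frames):
        capture_buf = free.popleft()  # would be queued on pispbe-tdn_output
        input_buf = previous          # would be queued on pispbe-tdn_input
        schedule.append((frame, input_buf, capture_buf))
        if input_buf is not None:
            free.append(input_buf)    # consumed by the ISP, reusable again
        previous = capture_buf
    return schedule
```

With two buffers the simulation simply ping-pongs between them, which is the minimum needed for the completed output of one frame to serve as the input of the next.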
+
+pispbe-stitch_input, pispbe-stitch_output
+-----------------------------------------
+
+To realize HDR (high dynamic range) image processing the image stitching and
+tonemapping blocks are used. The `pispbe-stitch_output` writes images to memory
+and the `pispbe-stitch_input` receives the previously written frame to process
+it along with the current input image. Userspace is responsible for maintaining
+queues on both devices, and ensuring that buffers completed on the output are
+queued to the input.
+
+pispbe-output0, pispbe-output1
+------------------------------
+
+The two capture devices write to memory the pixel data as processed by the ISP.
+
+pispbe-config
+-------------
+
+The `pispbe-config` output video device receives a buffer of configuration
+parameters that define the desired image processing to be performed by the ISP.
+
+The format of the ISP configuration parameters is defined by the
+:c:type:`pisp_be_tiles_config` C structure and the meaning of each parameter is
+described in the `Raspberry Pi Image Signal Processor (PiSP) Specification
+document`_.
+
+ISP configuration
+=================
+
+The ISP configuration is described solely by the content of the parameters
+buffer. The only parameter that userspace needs to configure using the V4L2 API
+is the image format on the output and capture video devices for validation of
+the content of the parameters buffer.
+
+.. _Raspberry Pi Image Signal Processor (PiSP) Specification document: https://datasheets.raspberrypi.com/camera/raspberry-pi-image-signal-processor-specification.pdf
diff --git a/Documentation/admin-guide/media/raspberrypi-rp1-cfe.dot b/Documentation/admin-guide/media/raspberrypi-rp1-cfe.dot
new file mode 100644
index 000000000000..7717f2291049
--- /dev/null
+++ b/Documentation/admin-guide/media/raspberrypi-rp1-cfe.dot
@@ -0,0 +1,27 @@
+digraph board {
+ rankdir=TB
+ n00000001 [label="{{<port0> 0} | csi2\n/dev/v4l-subdev0 | {<port1> 1 | <port2> 2 | <port3> 3 | <port4> 4}}", shape=Mrecord, style=filled, fillcolor=green]
+ n00000001:port1 -> n00000011 [style=dashed]
+ n00000001:port1 -> n00000007:port0
+ n00000001:port2 -> n00000015
+ n00000001:port2 -> n00000007:port0 [style=dashed]
+ n00000001:port3 -> n00000019 [style=dashed]
+ n00000001:port3 -> n00000007:port0 [style=dashed]
+ n00000001:port4 -> n0000001d [style=dashed]
+ n00000001:port4 -> n00000007:port0 [style=dashed]
+ n00000007 [label="{{<port0> 0 | <port1> 1} | pisp-fe\n/dev/v4l-subdev1 | {<port2> 2 | <port3> 3 | <port4> 4}}", shape=Mrecord, style=filled, fillcolor=green]
+ n00000007:port2 -> n00000021
+ n00000007:port3 -> n00000025 [style=dashed]
+ n00000007:port4 -> n00000029
+ n0000000d [label="{imx219 6-0010\n/dev/v4l-subdev2 | {<port0> 0}}", shape=Mrecord, style=filled, fillcolor=green]
+ n0000000d:port0 -> n00000001:port0 [style=bold]
+ n00000011 [label="rp1-cfe-csi2-ch0\n/dev/video0", shape=box, style=filled, fillcolor=yellow]
+ n00000015 [label="rp1-cfe-csi2-ch1\n/dev/video1", shape=box, style=filled, fillcolor=yellow]
+ n00000019 [label="rp1-cfe-csi2-ch2\n/dev/video2", shape=box, style=filled, fillcolor=yellow]
+ n0000001d [label="rp1-cfe-csi2-ch3\n/dev/video3", shape=box, style=filled, fillcolor=yellow]
+ n00000021 [label="rp1-cfe-fe-image0\n/dev/video4", shape=box, style=filled, fillcolor=yellow]
+ n00000025 [label="rp1-cfe-fe-image1\n/dev/video5", shape=box, style=filled, fillcolor=yellow]
+ n00000029 [label="rp1-cfe-fe-stats\n/dev/video6", shape=box, style=filled, fillcolor=yellow]
+ n0000002d [label="rp1-cfe-fe-config\n/dev/video7", shape=box, style=filled, fillcolor=yellow]
+ n0000002d -> n00000007:port1
+}
diff --git a/Documentation/admin-guide/media/raspberrypi-rp1-cfe.rst b/Documentation/admin-guide/media/raspberrypi-rp1-cfe.rst
new file mode 100644
index 000000000000..668d978a9875
--- /dev/null
+++ b/Documentation/admin-guide/media/raspberrypi-rp1-cfe.rst
@@ -0,0 +1,78 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+============================================
+Raspberry Pi PiSP Camera Front End (rp1-cfe)
+============================================
+
+The PiSP Camera Front End
+=========================
+
+The PiSP Camera Front End (CFE) is a module which combines a CSI-2 receiver with
+a simple ISP, called the Front End (FE).
+
+The CFE has four DMA engines and can write frames from four separate streams
+received from the CSI-2 to the memory. One of those streams can also be routed
+directly to the FE, which can do minimal image processing, write two versions
+(e.g. non-scaled and downscaled versions) of the received frames to memory and
+provide statistics of the received frames.
+
+The FE registers are documented in the `Raspberry Pi Image Signal Processor
+(ISP) Specification document
+<https://datasheets.raspberrypi.com/camera/raspberry-pi-image-signal-processor-specification.pdf>`_,
+and example code for FE can be found in `libpisp
+<https://github.com/raspberrypi/libpisp>`_.
+
+The rp1-cfe driver
+==================
+
+The Raspberry Pi PiSP Camera Front End (rp1-cfe) driver is located under
+drivers/media/platform/raspberrypi/rp1-cfe. It uses the `V4L2 API` to register
+a number of video capture and output devices, the `V4L2 subdev API` to register
+subdevices for the CSI-2 receiver and the FE that connect the video devices in
+a single media graph realized using the `Media Controller (MC) API`.
+
+The media topology registered by the `rp1-cfe` driver, in this particular
+example connected to an imx219 sensor, is the following one:
+
+.. _rp1-cfe-topology:
+
+.. kernel-figure:: raspberrypi-rp1-cfe.dot
+ :alt: Diagram of an example media pipeline topology
+ :align: center
+
+The media graph contains the following video device nodes:
+
+- rp1-cfe-csi2-ch0: capture device for the first CSI-2 stream
+- rp1-cfe-csi2-ch1: capture device for the second CSI-2 stream
+- rp1-cfe-csi2-ch2: capture device for the third CSI-2 stream
+- rp1-cfe-csi2-ch3: capture device for the fourth CSI-2 stream
+- rp1-cfe-fe-image0: capture device for the first FE output
+- rp1-cfe-fe-image1: capture device for the second FE output
+- rp1-cfe-fe-stats: capture device for the FE statistics
+- rp1-cfe-fe-config: output device for FE configuration
+
+rp1-cfe-csi2-chX
+----------------
+
+The rp1-cfe-csi2-chX capture devices are normal V4L2 capture devices which
+can be used to capture video frames or metadata received from the CSI-2.
+
+rp1-cfe-fe-image0, rp1-cfe-fe-image1
+------------------------------------
+
+The rp1-cfe-fe-image0 and rp1-cfe-fe-image1 capture devices are used to write
+the processed frames to memory.
+
+rp1-cfe-fe-stats
+----------------
+
+The format of the FE statistics buffer is defined by the
+:c:type:`pisp_statistics` C structure and the meaning of each parameter is
+described in the `PiSP specification` document.
+
+rp1-cfe-fe-config
+-----------------
+
+The format of the FE configuration buffer is defined by the
+:c:type:`pisp_fe_config` C structure and the meaning of each parameter is
+described in the `PiSP specification` document.
diff --git a/Documentation/admin-guide/media/rkisp1.rst b/Documentation/admin-guide/media/rkisp1.rst
index 6f14d9561fa5..6c878c71442f 100644
--- a/Documentation/admin-guide/media/rkisp1.rst
+++ b/Documentation/admin-guide/media/rkisp1.rst
@@ -114,11 +114,18 @@ to be applied to the hardware during a video stream, allowing userspace
to dynamically modify values such as black level, cross talk corrections
and others.
-The buffer format is defined by struct :c:type:`rkisp1_params_cfg`, and
-userspace should set
+The ISP driver supports two different parameter configuration methods: the
+`fixed parameters format` and the `extensible parameters format`.
+
+When using the `fixed parameters` method the buffer format is defined by struct
+:c:type:`rkisp1_params_cfg`, and userspace should set
:ref:`V4L2_META_FMT_RK_ISP1_PARAMS <v4l2-meta-fmt-rk-isp1-params>` as the
dataformat.
+When using the `extensible parameters` method the buffer format is defined by
+struct :c:type:`rkisp1_ext_params_cfg`, and userspace should set
+:ref:`V4L2_META_FMT_RK_ISP1_EXT_PARAMS <v4l2-meta-fmt-rk-isp1-ext-params>` as
+the dataformat.
Capturing Video Frames Example
==============================
diff --git a/Documentation/admin-guide/media/saa7134.rst b/Documentation/admin-guide/media/saa7134.rst
index 51eae7eb5ab7..18d7cbc897db 100644
--- a/Documentation/admin-guide/media/saa7134.rst
+++ b/Documentation/admin-guide/media/saa7134.rst
@@ -67,7 +67,7 @@ Changes / Fixes
Please mail to linux-media AT vger.kernel.org unified diffs against
the linux media git tree:
- https://git.linuxtv.org/media_tree.git/
+ https://git.linuxtv.org/media.git/
This is done by committing a patch at a clone of the git tree and
submitting the patch using ``git send-email``. Don't forget to
diff --git a/Documentation/admin-guide/media/tuner-cardlist.rst b/Documentation/admin-guide/media/tuner-cardlist.rst
index 362617c59c5d..65ecf48ddf24 100644
--- a/Documentation/admin-guide/media/tuner-cardlist.rst
+++ b/Documentation/admin-guide/media/tuner-cardlist.rst
@@ -97,4 +97,6 @@ Tuner number Card name
89 Sony BTF-PG472Z PAL/SECAM
90 Sony BTF-PK467Z NTSC-M-JP
91 Sony BTF-PB463Z NTSC-M
+92 Silicon Labs Si2157 tuner
+93 Tena TNF931D-DFDR1
============ =====================================================
diff --git a/Documentation/admin-guide/media/v4l-drivers.rst b/Documentation/admin-guide/media/v4l-drivers.rst
index f4bb2605f07e..e8761561b2fe 100644
--- a/Documentation/admin-guide/media/v4l-drivers.rst
+++ b/Documentation/admin-guide/media/v4l-drivers.rst
@@ -16,14 +16,16 @@ Video4Linux (V4L) driver-specific documentation
imx
imx7
ipu3
+ ipu6-isys
ivtv
mgb4
omap3isp
- omap4_camera
philips
qcom_camss
+ raspberrypi-pisp-be
rcar-fdp1
rkisp1
+ raspberrypi-rp1-cfe
saa7134
si470x
si4713
diff --git a/Documentation/admin-guide/media/visl.rst b/Documentation/admin-guide/media/visl.rst
index db1ef29438e1..cd45145cde68 100644
--- a/Documentation/admin-guide/media/visl.rst
+++ b/Documentation/admin-guide/media/visl.rst
@@ -49,6 +49,10 @@ Module parameters
visl_dprintk_frame_start, visl_dprintk_nframes, but controls the dumping of
buffer data through debugfs instead.
+- tpg_verbose: Write extra information on each output frame to ease debugging
+ the API. When set to true, the output frames are not stable for a given input
+ as some information like pointers or queue status will be added to them.
+
What is the default use case for this driver?
---------------------------------------------
@@ -57,8 +61,12 @@ This assumes that a working client is run against visl and that the ftrace and
OUTPUT buffer data is subsequently used to debug a work-in-progress
implementation.
-Information on reference frames, their timestamps, the status of the OUTPUT and
-CAPTURE queues and more can be read directly from the CAPTURE buffers.
+Even though no video decoding is actually done, the output frames can be compared
+against a reference for a given input, except if tpg_verbose is set to true.
+
+Depending on the tpg_verbose parameter value, information on reference frames,
+their timestamps, the status of the OUTPUT and CAPTURE queues and more can be
+read directly from the CAPTURE buffers.
Supported codecs
----------------
diff --git a/Documentation/admin-guide/media/vivid.rst b/Documentation/admin-guide/media/vivid.rst
index 58ac25b2c385..034ca7c77fb9 100644
--- a/Documentation/admin-guide/media/vivid.rst
+++ b/Documentation/admin-guide/media/vivid.rst
@@ -60,7 +60,7 @@ all configurable using the following module options:
- node_types:
which devices should each driver instance create. An array of
- hexadecimal values, one for each instance. The default is 0x1d3d.
+ hexadecimal values, one for each instance. The default is 0xe1d3d.
Each value is a bitmask with the following meaning:
- bit 0: Video Capture node
@@ -302,6 +302,15 @@ all configurable using the following module options:
- 0: forbid hints
- 1: allow hints
+- supports_requests:
+
+ specifies if the device should support the Request API. There are
+ three possible values, default is 1:
+
+ - 0: no request
+ - 1: supports requests
+ - 2: requires requests
+
Taken together, all these module options allow you to precisely customize
the driver behavior and test your application with all sorts of permutations.
It is also very suitable to emulate hardware that is not yet available, e.g.
@@ -313,13 +322,13 @@ Video Capture
This is probably the most frequently used feature. The video capture device
can be configured by using the module options num_inputs, input_types and
-ccs_cap_mode (see section 1 for more detailed information), but by default
-four inputs are configured: a webcam, a TV tuner, an S-Video and an HDMI
-input, one input for each input type. Those are described in more detail
-below.
+ccs_cap_mode (see "Configuring the driver" for more detailed information),
+but by default four inputs are configured: a webcam, a TV tuner, an S-Video
+and an HDMI input, one input for each input type. Those are described in more
+detail below.
Special attention has been given to the rate at which new frames become
-available. The jitter will be around 1 jiffie (that depends on the HZ
+available. The jitter will be around 1 jiffy (that depends on the HZ
configuration of your kernel, so usually 1/100, 1/250 or 1/1000 of a second),
but the long-term behavior is exactly following the framerate. So a
framerate of 59.94 Hz is really different from 60 Hz. If the framerate
@@ -434,10 +443,10 @@ Video Output
------------
The video output device can be configured by using the module options
-num_outputs, output_types and ccs_out_mode (see section 1 for more detailed
-information), but by default two outputs are configured: an S-Video and an
-HDMI input, one output for each output type. Those are described in more detail
-below.
+num_outputs, output_types and ccs_out_mode (see "Configuring the driver"
+for more detailed information), but by default two outputs are configured:
+an S-Video and an HDMI output, one output for each output type. Those are
+described in more detail below.
Like with video capture the framerate is also exact in the long term.
@@ -1011,11 +1020,6 @@ Digital Video Controls
affects the reported colorspace since DVI_D outputs will always use
sRGB.
-- Display Present:
-
- sets the presence of a "display" on the HDMI output. This affects
- the tx_edid_present, tx_hotplug and tx_rxsense controls.
-
FM Radio Receiver Controls
~~~~~~~~~~~~~~~~~~~~~~~~~~
@@ -1130,35 +1134,34 @@ Metadata Capture Controls
if set, then the generated metadata stream contains Source Clock information.
-Video, VBI and RDS Looping
---------------------------
-The vivid driver supports looping of video output to video input, VBI output
-to VBI input and RDS output to RDS input. For video/VBI looping this emulates
-as if a cable was hooked up between the output and input connector. So video
-and VBI looping is only supported between S-Video and HDMI inputs and outputs.
-VBI is only valid for S-Video as it makes no sense for HDMI.
+Video, Sliced VBI and HDMI CEC Looping
+--------------------------------------
-Since radio is wireless this looping always happens if the radio receiver
-frequency is close to the radio transmitter frequency. In that case the radio
-transmitter will 'override' the emulated radio stations.
-
-Looping is currently supported only between devices created by the same
-vivid driver instance.
+Video Looping functionality is supported for devices created by the same
+vivid driver instance, as well as across multiple instances of the vivid driver.
+The vivid driver supports looping of video and Sliced VBI data between an S-Video output
+and an S-Video input. It also supports looping of video and HDMI CEC data between an
+HDMI output and an HDMI input.
+To enable looping, set the 'HDMI/S-Video XXX-N Is Connected To' control(s) to select
+whether an input uses the Test Pattern Generator, or is disconnected, or is connected
+to an output. An input can be connected to an output from any vivid instance.
+The inputs and outputs are numbered XXX-N where XXX is the vivid instance number
+(see module option n_devs). If there is only one vivid instance (the default), then
+XXX will be 000. And N is the Nth S-Video/HDMI input or output of that instance.
+If vivid is loaded without module options, then you can connect the S-Video 000-0 input
+to the S-Video 000-0 output, or the HDMI 000-0 input to the HDMI 000-0 output.
+This is the equivalent of connecting or disconnecting a cable between an input and an
+output in a physical device.
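
The XXX-N numbering described above can be sketched in a few lines of Python. This is an illustration only: the helper names are hypothetical, the three-digit zero padding is inferred from the ``000`` example in the text, and the exact control strings exposed by a given kernel should be checked with ``v4l2-ctl -l`` rather than taken from this sketch.

```python
def connector_label(instance, index):
    """Return the XXX-N label for the Nth connector of a vivid instance.

    Assumes XXX is the instance number zero-padded to three digits,
    as suggested by the '000' example for a single instance.
    """
    return f"{instance:03d}-{index}"


def connected_to_control(kind, instance, index):
    """Build a control name following the pattern quoted in the text.

    'kind' is expected to be "HDMI" or "S-Video"; the resulting string is
    an assumption based on the 'HDMI/S-Video XXX-N Is Connected To' pattern.
    """
    if kind not in ("HDMI", "S-Video"):
        raise ValueError("kind must be 'HDMI' or 'S-Video'")
    return f"{kind} {connector_label(instance, index)} Is Connected To"
```

For a default single-instance load, the first S-Video input would then be labelled ``000-0``, matching the example in the paragraph above.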
-Video and Sliced VBI looping
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+If an 'HDMI/S-Video XXX-N Is Connected To' control selects an output, then the video
+output will be looped to the video input provided that:
-The way to enable video/VBI looping is currently fairly crude. A 'Loop Video'
-control is available in the "Vivid" control class of the video
-capture and VBI capture devices. When checked the video looping will be enabled.
-Once enabled any video S-Video or HDMI input will show a static test pattern
-until the video output has started. At that time the video output will be
-looped to the video input provided that:
+- the currently selected input matches the input indicated by the control name.
-- the input type matches the output type. So the HDMI input cannot receive
- video from the S-Video output.
+- in the vivid instance of the output connector, the currently selected output matches
+ the output indicated by the control's value.
- the video resolution of the video input must match that of the video output.
So it is not possible to loop a 50 Hz (720x576) S-Video output to a 60 Hz
@@ -1185,6 +1188,8 @@ looped to the video input provided that:
"DV Timings Signal Mode" for the HDMI input should be configured so that a
valid signal is passed to the video input.
+If any condition is not valid, then the 'Noise' test pattern is shown.
+
The framerates do not have to match, although this might change in the future.
By default you will see the OSD text superimposed on top of the looped video.
@@ -1198,17 +1203,26 @@ and WSS (50 Hz formats) VBI data is looped. Teletext VBI data is not looped.
Radio & RDS Looping
-~~~~~~~~~~~~~~~~~~~
-
-As mentioned in section 6 the radio receiver emulates stations are regular
-frequency intervals. Depending on the frequency of the radio receiver a
-signal strength value is calculated (this is returned by VIDIOC_G_TUNER).
-However, it will also look at the frequency set by the radio transmitter and
-if that results in a higher signal strength than the settings of the radio
-transmitter will be used as if it was a valid station. This also includes
-the RDS data (if any) that the transmitter 'transmits'. This is received
-faithfully on the receiver side. Note that when the driver is loaded the
-frequencies of the radio receiver and transmitter are not identical, so
+-------------------
+
+The vivid driver supports looping of RDS output to RDS input.
+
+Since radio is wireless this looping always happens if the radio receiver
+frequency is close to the radio transmitter frequency. In that case the radio
+transmitter will 'override' the emulated radio stations.
+
+RDS looping is currently supported only between devices created by the same
+vivid driver instance.
+
+As mentioned in the "Radio Receiver" section, the radio receiver emulates
+stations at regular frequency intervals. Depending on the frequency of the
+radio receiver a signal strength value is calculated (this is returned by
+VIDIOC_G_TUNER). However, it will also look at the frequency set by the radio
+transmitter and, if that results in a higher signal strength, then the settings
+of the radio transmitter will be used as if it was a valid station. This also
+includes the RDS data (if any) that the transmitter 'transmits'. This is
+received faithfully on the receiver side. Note that when the driver is loaded
+the frequencies of the radio receiver and transmitter are not identical, so
initially no looping takes place.
@@ -1218,8 +1232,8 @@ Cropping, Composing, Scaling
This driver supports cropping, composing and scaling in any combination. Normally
which features are supported can be selected through the Vivid controls,
but it is also possible to hardcode it when the module is loaded through the
-ccs_cap_mode and ccs_out_mode module options. See section 1 on the details of
-these module options.
+ccs_cap_mode and ccs_out_mode module options. See "Configuring the driver" for
+the details of these module options.
This allows you to test your application for all these variations.
@@ -1260,7 +1274,8 @@ is set, then the alpha component is only used for the color red and set to
The driver has to be configured to support the multiplanar formats. By default
the driver instances are single-planar. This can be changed by setting the
-multiplanar module option, see section 1 for more details on that option.
+multiplanar module option, see "Configuring the driver" for more details on that
+option.
If the driver instance is using the multiplanar formats/API, then the first
single planar format (YUYV) and the multiplanar NV16M and NV61M formats the
@@ -1270,74 +1285,6 @@ data_offset to be non-zero, so this is a useful feature for testing applications
Video output will also honor any data_offset that the application set.
-Capture Overlay
----------------
-
-Note: capture overlay support is implemented primarily to test the existing
-V4L2 capture overlay API. In practice few if any GPUs support such overlays
-anymore, and neither are they generally needed anymore since modern hardware
-is so much more capable. By setting flag 0x10000 in the node_types module
-option the vivid driver will create a simple framebuffer device that can be
-used for testing this API. Whether this API should be used for new drivers is
-questionable.
-
-This driver has support for a destructive capture overlay with bitmap clipping
-and list clipping (up to 16 rectangles) capabilities. Overlays are not
-supported for multiplanar formats. It also honors the struct v4l2_window field
-setting: if it is set to FIELD_TOP or FIELD_BOTTOM and the capture setting is
-FIELD_ALTERNATE, then only the top or bottom fields will be copied to the overlay.
-
-The overlay only works if you are also capturing at that same time. This is a
-vivid limitation since it copies from a buffer to the overlay instead of
-filling the overlay directly. And if you are not capturing, then no buffers
-are available to fill.
-
-In addition, the pixelformat of the capture format and that of the framebuffer
-must be the same for the overlay to work. Otherwise VIDIOC_OVERLAY will return
-an error.
-
-In order to really see what it going on you will need to create two vivid
-instances: the first with a framebuffer enabled. You configure the capture
-overlay of the second instance to use the framebuffer of the first, then
-you start capturing in the second instance. For the first instance you setup
-the output overlay for the video output, turn on video looping and capture
-to see the blended framebuffer overlay that's being written to by the second
-instance. This setup would require the following commands:
-
-.. code-block:: none
-
- $ sudo modprobe vivid n_devs=2 node_types=0x10101,0x1
- $ v4l2-ctl -d1 --find-fb
- /dev/fb1 is the framebuffer associated with base address 0x12800000
- $ sudo v4l2-ctl -d2 --set-fbuf fb=1
- $ v4l2-ctl -d1 --set-fbuf fb=1
- $ v4l2-ctl -d0 --set-fmt-video=pixelformat='AR15'
- $ v4l2-ctl -d1 --set-fmt-video-out=pixelformat='AR15'
- $ v4l2-ctl -d2 --set-fmt-video=pixelformat='AR15'
- $ v4l2-ctl -d0 -i2
- $ v4l2-ctl -d2 -i2
- $ v4l2-ctl -d2 -c horizontal_movement=4
- $ v4l2-ctl -d1 --overlay=1
- $ v4l2-ctl -d0 -c loop_video=1
- $ v4l2-ctl -d2 --stream-mmap --overlay=1
-
-And from another console:
-
-.. code-block:: none
-
- $ v4l2-ctl -d1 --stream-out-mmap
-
-And yet another console:
-
-.. code-block:: none
-
- $ qv4l2
-
-and start streaming.
-
-As you can see, this is not for the faint of heart...
-
-
Output Overlay
--------------
@@ -1396,7 +1343,7 @@ Some Future Improvements
Just as a reminder and in no particular order:
- Add a virtual alsa driver to test audio
-- Add virtual sub-devices and media controller support
+- Add virtual sub-devices
- Some support for testing compressed video
- Add support to loop raw VBI output to raw VBI input
- Add support to loop teletext sliced VBI output to VBI input
@@ -1405,12 +1352,10 @@ Just as a reminder and in no particular order:
- Add ARGB888 overlay support: better testing of the alpha channel
- Improve pixel aspect support in the tpg code by passing a real v4l2_fract
- Use per-queue locks and/or per-device locks to improve throughput
-- Add support to loop from a specific output to a specific input across
- vivid instances
- The SDR radio should use the same 'frequencies' for stations as the normal
radio receiver, and give back noise if the frequency doesn't match up with
a station frequency
- Make a thread for the RDS generation, that would help in particular for the
"Controls" RDS Rx I/O Mode as the read-only RDS controls could be updated
in real-time.
-- Changing the EDID should cause hotplug detect emulation to happen.
+- Changing the EDID doesn't wait 100 ms before setting the HPD signal.
diff --git a/Documentation/admin-guide/mm/damon/reclaim.rst b/Documentation/admin-guide/mm/damon/reclaim.rst
index 343e25b252f4..af05ae617018 100644
--- a/Documentation/admin-guide/mm/damon/reclaim.rst
+++ b/Documentation/admin-guide/mm/damon/reclaim.rst
@@ -117,6 +117,33 @@ milliseconds.
1 second by default.
+quota_mem_pressure_us
+---------------------
+
+Desired level of memory pressure-stall time in microseconds.
+
+While keeping the caps that are set by other quotas, DAMON_RECLAIM automatically
+increases and decreases the effective level of the quota, aiming to incur this
+level of memory pressure. System-wide ``some`` memory PSI in microseconds
+per quota reset interval (``quota_reset_interval_ms``) is collected and
+compared to this value to see if the aim is satisfied. A value of zero disables
+this auto-tuning feature.
+
+Disabled by default.
+
+quota_autotune_feedback
+-----------------------
+
+User-specifiable feedback for auto-tuning of the effective quota.
+
+While respecting the caps set by other quotas, DAMON_RECLAIM automatically
+increases and decreases the effective level of the quota, aiming to receive a
+feedback value of ``10,000`` from the user. DAMON_RECLAIM assumes the feedback
+value and the quota are positively proportional. A value of zero disables
+this auto-tuning feature.
+
+Disabled by default.
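+
+As a minimal sketch, the pressure-based auto-tuning could be turned on at
+runtime as below. It assumes DAMON_RECLAIM is built in and exposes its
+parameters under ``/sys/module/damon_reclaim/parameters``, and that the
+existing ``commit_inputs`` parameter is used to apply the change. The value
+``1000`` is only an illustrative target; with the default 1 second quota reset
+interval it corresponds to about 0.1% ``some`` memory stall time. ::
+
+    # cd /sys/module/damon_reclaim/parameters
+    # echo 1000 > quota_mem_pressure_us
+    # echo Y > commit_inputs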
+
wmarks_interval
---------------
diff --git a/Documentation/admin-guide/mm/damon/start.rst b/Documentation/admin-guide/mm/damon/start.rst
index 7aa0071ff1c3..ede14b679d02 100644
--- a/Documentation/admin-guide/mm/damon/start.rst
+++ b/Documentation/admin-guide/mm/damon/start.rst
@@ -7,7 +7,7 @@ Getting Started
This document briefly describes how you can use DAMON by demonstrating its
default user space tool. Please note that this document describes only a part
of its features for brevity. Please refer to the usage `doc
-<https://github.com/awslabs/damo/blob/next/USAGE.md>`_ of the tool for more
+<https://github.com/damonitor/damo/blob/next/USAGE.md>`_ of the tool for more
details.
@@ -26,7 +26,7 @@ User Space Tool
For the demonstration, we will use the default user space tool for DAMON,
called DAMON Operator (DAMO). It is available at
-https://github.com/awslabs/damo. The examples below assume that ``damo`` is on
+https://github.com/damonitor/damo. The examples below assume that ``damo`` is on
your ``$PATH``. It's not mandatory, though.
Because DAMO is using the sysfs interface (refer to :doc:`usage` for the
@@ -34,18 +34,69 @@ detail) of DAMON, you should ensure :doc:`sysfs </filesystems/sysfs>` is
mounted.
+Snapshot Data Access Patterns
+=============================
+
+The commands below show the memory access pattern of a program at the moment of
+the execution. ::
+
+ $ git clone https://github.com/sjp38/masim; cd masim; make
+ $ sudo damo start "./masim ./configs/stairs.cfg --quiet"
+ $ sudo damo report access
+ heatmap: 641111111000000000000000000000000000000000000000000000[...]33333333333333335557984444[...]7
+ # min/max temperatures: -1,840,000,000, 370,010,000, column size: 3.925 MiB
+ 0 addr 86.182 TiB size 8.000 KiB access 0 % age 14.900 s
+ 1 addr 86.182 TiB size 8.000 KiB access 60 % age 0 ns
+ 2 addr 86.182 TiB size 3.422 MiB access 0 % age 4.100 s
+ 3 addr 86.182 TiB size 2.004 MiB access 95 % age 2.200 s
+ 4 addr 86.182 TiB size 29.688 MiB access 0 % age 14.100 s
+ 5 addr 86.182 TiB size 29.516 MiB access 0 % age 16.700 s
+ 6 addr 86.182 TiB size 29.633 MiB access 0 % age 17.900 s
+ 7 addr 86.182 TiB size 117.652 MiB access 0 % age 18.400 s
+ 8 addr 126.990 TiB size 62.332 MiB access 0 % age 9.500 s
+ 9 addr 126.990 TiB size 13.980 MiB access 0 % age 5.200 s
+ 10 addr 126.990 TiB size 9.539 MiB access 100 % age 3.700 s
+ 11 addr 126.990 TiB size 16.098 MiB access 0 % age 6.400 s
+ 12 addr 127.987 TiB size 132.000 KiB access 0 % age 2.900 s
+ total size: 314.008 MiB
+ $ sudo damo stop
+
+The first command of the above example downloads and builds an artificial
+memory access generator program called ``masim``. The second command asks DAMO
+to start the program via the given command and make DAMON monitor the newly
+started process. The third command retrieves the current snapshot of the
+monitored access pattern of the process from DAMON and shows the pattern in a
+human-readable format.
+
+The first line of the output shows the relative access temperature (hotness) of
+the regions in a single-row heatmap format. Each column on the heatmap
+represents regions of the same size on the monitored virtual address space. The
+position of the column on the row and the number on the column represent the
+relative location and access temperature of the region. ``[...]`` means
+unmapped huge regions on the virtual address spaces. The second line shows
+additional information for better understanding the heatmap.
+
+Each line of the output from the third line onward shows which virtual address
+range (``addr XX size XX``) of the process has been accessed how frequently
+(``access XX %``) and for how long (``age XX``). For example, the eleventh
+region of ~9.5 MiB size has been the most frequently accessed for the last 3.7
+seconds. Finally, the fourth command stops DAMON.
+
+Note that DAMON can monitor not only virtual address spaces but multiple types
+of address spaces including the physical address space.
+
+
Recording Data Access Patterns
==============================
The commands below record the memory access patterns of a program and save the
monitoring results to a file. ::
- $ git clone https://github.com/sjp38/masim
- $ cd masim; make; ./masim ./configs/zigzag.cfg &
+ $ ./masim ./configs/zigzag.cfg &
$ sudo damo record -o damon.data $(pidof masim)
-The first two lines of the commands download an artificial memory access
-generator program and run it in the background. The generator will repeatedly
+The first line of the commands runs the artificial memory access
+generator program again. The generator will repeatedly
access two 100 MiB sized memory regions one by one. You can substitute this
with your real workload. The last line asks ``damo`` to record the access
pattern in the ``damon.data`` file.
@@ -57,7 +108,7 @@ Visualizing Recorded Patterns
You can visualize the pattern in a heatmap, showing which memory region
(x-axis) got accessed when (y-axis) and how frequently (number).::
- $ sudo damo report heats --heatmap stdout
+ $ sudo damo report heatmap
22222222222222222222222222222222222222211111111111111111111111111111111111111100
44444444444444444444444444444444444444434444444444444444444444444444444444443200
44444444444444444444444444444444444444433444444444444444444444444444444444444200
@@ -122,6 +173,6 @@ Data Access Pattern Aware Memory Management
Below command makes every memory region of size >=4K that has not accessed for
>=60 seconds in your workload to be swapped out. ::
- $ sudo damo schemes --damos_access_rate 0 0 --damos_sz_region 4K max \
- --damos_age 60s max --damos_action pageout \
- <pid of your workload>
+ $ sudo damo start --damos_access_rate 0 0 --damos_sz_region 4K max \
+ --damos_age 60s max --damos_action pageout \
+ <pid of your workload>
diff --git a/Documentation/admin-guide/mm/damon/usage.rst b/Documentation/admin-guide/mm/damon/usage.rst
index 9d23144bf985..47a44bd348ab 100644
--- a/Documentation/admin-guide/mm/damon/usage.rst
+++ b/Documentation/admin-guide/mm/damon/usage.rst
@@ -7,31 +7,25 @@ Detailed Usages
DAMON provides below interfaces for different users.
- *DAMON user space tool.*
- `This <https://github.com/awslabs/damo>`_ is for privileged people such as
+ `This <https://github.com/damonitor/damo>`_ is for privileged people such as
system administrators who want a just-working human-friendly interface.
Using this, users can use the DAMON’s major features in a human-friendly way.
It may not be highly tuned for special cases, though. For more detail,
please refer to its `usage document
- <https://github.com/awslabs/damo/blob/next/USAGE.md>`_.
+ <https://github.com/damonitor/damo/blob/next/USAGE.md>`_.
- *sysfs interface.*
:ref:`This <sysfs_interface>` is for privileged user space programmers who
want more optimized use of DAMON. Using this, users can use DAMON’s major
features by reading from and writing to special sysfs files. Therefore,
you can write and use your personalized DAMON sysfs wrapper programs that
reads/writes the sysfs files instead of you. The `DAMON user space tool
- <https://github.com/awslabs/damo>`_ is one example of such programs.
+ <https://github.com/damonitor/damo>`_ is one example of such programs.
- *Kernel Space Programming Interface.*
:doc:`This </mm/damon/api>` is for kernel space programmers. Using this,
users can utilize every feature of DAMON most flexibly and efficiently by
writing kernel space DAMON application programs for you. You can even extend
DAMON for various address spaces. For detail, please refer to the interface
:doc:`document </mm/damon/api>`.
-- *debugfs interface. (DEPRECATED!)*
- :ref:`This <debugfs_interface>` is almost identical to :ref:`sysfs interface
- <sysfs_interface>`. This is deprecated, so users should move to the
- :ref:`sysfs interface <sysfs_interface>`. If you depend on this and cannot
- move, please report your usecase to damon@lists.linux.dev and
- linux-mm@kvack.org.
.. _sysfs_interface:
@@ -78,21 +72,21 @@ comma (",").
│ │ │ │ │ │ │ │ ...
│ │ │ │ │ │ ...
│ │ │ │ │ :ref:`schemes <sysfs_schemes>`/nr_schemes
- │ │ │ │ │ │ :ref:`0 <sysfs_scheme>`/action,apply_interval_us
+ │ │ │ │ │ │ :ref:`0 <sysfs_scheme>`/action,target_nid,apply_interval_us
│ │ │ │ │ │ │ :ref:`access_pattern <sysfs_access_pattern>`/
│ │ │ │ │ │ │ │ sz/min,max
│ │ │ │ │ │ │ │ nr_accesses/min,max
│ │ │ │ │ │ │ │ age/min,max
- │ │ │ │ │ │ │ :ref:`quotas <sysfs_quotas>`/ms,bytes,reset_interval_ms
+ │ │ │ │ │ │ │ :ref:`quotas <sysfs_quotas>`/ms,bytes,reset_interval_ms,effective_bytes
│ │ │ │ │ │ │ │ weights/sz_permil,nr_accesses_permil,age_permil
│ │ │ │ │ │ │ │ :ref:`goals <sysfs_schemes_quota_goals>`/nr_goals
- │ │ │ │ │ │ │ │ │ 0/target_value,current_value
+ │ │ │ │ │ │ │ │ │ 0/target_metric,target_value,current_value
│ │ │ │ │ │ │ :ref:`watermarks <sysfs_watermarks>`/metric,interval_us,high,mid,low
│ │ │ │ │ │ │ :ref:`filters <sysfs_filters>`/nr_filters
- │ │ │ │ │ │ │ │ 0/type,matching,memcg_id
- │ │ │ │ │ │ │ :ref:`stats <sysfs_schemes_stats>`/nr_tried,sz_tried,nr_applied,sz_applied,qt_exceeds
+ │ │ │ │ │ │ │ │ 0/type,matching,allow,memcg_path,addr_start,addr_end,target_idx
+ │ │ │ │ │ │ │ :ref:`stats <sysfs_schemes_stats>`/nr_tried,sz_tried,nr_applied,sz_applied,sz_ops_filter_passed,qt_exceeds
│ │ │ │ │ │ │ :ref:`tried_regions <sysfs_schemes_tried_regions>`/total_bytes
- │ │ │ │ │ │ │ │ 0/start,end,nr_accesses,age
+ │ │ │ │ │ │ │ │ 0/start,end,nr_accesses,age,sz_filter_passed
│ │ │ │ │ │ │ │ ...
│ │ │ │ │ │ ...
│ │ │ │ ...
@@ -153,6 +147,9 @@ Users can write below commands for the kdamond to the ``state`` file.
- ``clear_schemes_tried_regions``: Clear the DAMON-based operating scheme
action tried regions directory for each DAMON-based operation scheme of the
kdamond.
+- ``update_schemes_effective_quotas``: Update the contents of
+ ``effective_bytes`` files for each DAMON-based operation scheme of the
+ kdamond. For more details, refer to :ref:`quotas directory <sysfs_quotas>`.
If the state is ``on``, reading ``pid`` shows the pid of the kdamond thread.
@@ -180,19 +177,14 @@ In each context directory, two files (``avail_operations`` and ``operations``)
and three directories (``monitoring_attrs``, ``targets``, and ``schemes``)
exist.
-DAMON supports multiple types of monitoring operations, including those for
-virtual address space and the physical address space. You can get the list of
-available monitoring operations set on the currently running kernel by reading
+DAMON supports multiple types of :ref:`monitoring operations
+<damon_design_configurable_operations_set>`, including those for virtual address
+space and the physical address space. You can get the list of available
+monitoring operations set on the currently running kernel by reading
``avail_operations`` file. Based on the kernel configuration, the file will
-list some or all of below keywords.
-
- - vaddr: Monitor virtual address spaces of specific processes
- - fvaddr: Monitor fixed virtual address ranges
- - paddr: Monitor the physical address space of the system
-
-Please refer to :ref:`regions sysfs directory <sysfs_regions>` for detailed
-differences between the operations sets in terms of the monitoring target
-regions.
+list different available operation sets. Please refer to the :ref:`design
+<damon_operations_set>` for the list of all available operation sets and their
+brief explanations.
You can set and get what type of monitoring operations DAMON will use for the
context by writing one of the keywords listed in ``avail_operations`` file and
@@ -247,17 +239,11 @@ process to the ``pid_target`` file.
targets/<N>/regions
-------------------
-When ``vaddr`` monitoring operations set is being used (``vaddr`` is written to
-the ``contexts/<N>/operations`` file), DAMON automatically sets and updates the
-monitoring target regions so that entire memory mappings of target processes
-can be covered. However, users could want to set the initial monitoring region
-to specific address ranges.
-
-In contrast, DAMON do not automatically sets and updates the monitoring target
-regions when ``fvaddr`` or ``paddr`` monitoring operations sets are being used
-(``fvaddr`` or ``paddr`` have written to the ``contexts/<N>/operations``).
-Therefore, users should set the monitoring target regions by themselves in the
-cases.
+In case of ``fvaddr`` or ``paddr`` monitoring operations sets, users are
+required to set the monitoring target address ranges. In case of ``vaddr``
+operations set, it is not mandatory, but users can optionally set the initial
+monitoring region to specific address ranges. Please refer to the :ref:`design
+<damon_design_vaddr_target_regions_construction>` for more details.
For such cases, users can explicitly set the initial monitoring target regions
as they want, by writing proper values to the files under this directory.
@@ -297,32 +283,17 @@ schemes/<N>/
------------
In each scheme directory, five directories (``access_pattern``, ``quotas``,
-``watermarks``, ``filters``, ``stats``, and ``tried_regions``) and two files
-(``action`` and ``apply_interval``) exist.
+``watermarks``, ``filters``, ``stats``, and ``tried_regions``) and three files
+(``action``, ``target_nid`` and ``apply_interval_us``) exist.
The ``action`` file is for setting and getting the scheme's :ref:`action
<damon_design_damos_action>`. The keywords that can be written to and read
-from the file and their meaning are as below.
-
-Note that support of each action depends on the running DAMON operations set
-:ref:`implementation <sysfs_context>`.
-
- - ``willneed``: Call ``madvise()`` for the region with ``MADV_WILLNEED``.
- Supported by ``vaddr`` and ``fvaddr`` operations set.
- - ``cold``: Call ``madvise()`` for the region with ``MADV_COLD``.
- Supported by ``vaddr`` and ``fvaddr`` operations set.
- - ``pageout``: Call ``madvise()`` for the region with ``MADV_PAGEOUT``.
- Supported by ``vaddr``, ``fvaddr`` and ``paddr`` operations set.
- - ``hugepage``: Call ``madvise()`` for the region with ``MADV_HUGEPAGE``.
- Supported by ``vaddr`` and ``fvaddr`` operations set.
- - ``nohugepage``: Call ``madvise()`` for the region with ``MADV_NOHUGEPAGE``.
- Supported by ``vaddr`` and ``fvaddr`` operations set.
- - ``lru_prio``: Prioritize the region on its LRU lists.
- Supported by ``paddr`` operations set.
- - ``lru_deprio``: Deprioritize the region on its LRU lists.
- Supported by ``paddr`` operations set.
- - ``stat``: Do nothing but count the statistics.
- Supported by all operations sets.
+from the file and their meanings are the same as those listed in the
+:ref:`design doc <damon_design_damos_action>`.
+
+The ``target_nid`` file is for setting the migration target node, which is
+only meaningful when the ``action`` is either ``migrate_hot`` or
+``migrate_cold``.
The ``apply_interval_us`` file is for setting and getting the scheme's
:ref:`apply_interval <damon_design_damos>` in microseconds.
@@ -350,8 +321,9 @@ schemes/<N>/quotas/
The directory for the :ref:`quotas <damon_design_damos_quotas>` of the given
DAMON-based operation scheme.
-Under ``quotas`` directory, three files (``ms``, ``bytes``,
-``reset_interval_ms``) and two directores (``weights`` and ``goals``) exist.
+Under ``quotas`` directory, four files (``ms``, ``bytes``,
+``reset_interval_ms``, ``effective_bytes``) and two directories (``weights`` and
+``goals``) exist.
You can set the ``time quota`` in milliseconds, ``size quota`` in bytes, and
``reset interval`` in milliseconds by writing the values to the three files,
@@ -359,7 +331,17 @@ respectively. Then, DAMON tries to use only up to ``time quota`` milliseconds
for applying the ``action`` to memory regions of the ``access_pattern``, and to
apply the action to only up to ``bytes`` bytes of memory regions within the
``reset_interval_ms``. Setting both ``ms`` and ``bytes`` zero disables the
-quota limits.
+quota limits unless at least one :ref:`goal <sysfs_schemes_quota_goals>` is
+set.
+
+The time quota is internally transformed to a size quota. Between the
+transformed size quota and the user-specified size quota, the smaller one is
+applied. Based on the user-specified :ref:`goal <sysfs_schemes_quota_goals>`,
+the effective size quota is further adjusted. Reading ``effective_bytes``
+returns the current effective size quota. The file is not updated in real
+time, so users should ask the DAMON sysfs interface to update the content of
+the file by writing a special keyword, ``update_schemes_effective_quotas``, to
+the relevant ``kdamonds/<N>/state`` file.
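+
+For example, the update-then-read sequence could look like the sketch below.
+It assumes sysfs is mounted at ``/sys``, the kdamond of interest is ``0`` and
+is running, and the scheme of interest is scheme ``0`` of context ``0``; adjust
+the indices for your setup. ::
+
+    # cd /sys/kernel/mm/damon/admin/kdamonds/0
+    # echo update_schemes_effective_quotas > state
+    # cat contexts/0/schemes/0/quotas/effective_bytes
+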
Under ``weights`` directory, three files (``sz_permil``,
``nr_accesses_permil``, and ``age_permil``) exist.
@@ -382,11 +364,11 @@ number (``N``) to the file creates the number of child directories named ``0``
to ``N-1``. Each directory represents each goal and current achievement.
Among the multiple feedback, the best one is used.
-Each goal directory contains two files, namely ``target_value`` and
-``current_value``. Users can set and get any number to those files to set the
-feedback. User space main workload's latency or throughput, system metrics
-like free memory ratio or memory pressure stall time (PSI) could be example
-metrics for the values. Note that users should write
+Each goal directory contains three files, namely ``target_metric``,
+``target_value`` and ``current_value``. Users can set and get the three
+parameters for the quota auto-tuning goals that are specified in the
+:ref:`design doc <damon_design_damos_quotas_auto_tuning>` by writing to and
+reading from each of the files. Note that users should further write
``commit_schemes_quota_goals`` to the ``state`` file of the :ref:`kdamond
directory <sysfs_kdamond>` to pass the feedback to DAMON.
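+
+A minimal sketch of setting one goal and passing the feedback follows. It
+assumes kdamond ``0``, context ``0`` and scheme ``0`` already exist, and that
+``user_input`` is one of the ``target_metric`` keywords described in the design
+doc; the numbers are arbitrary illustrative values, not recommendations. ::
+
+    # cd /sys/kernel/mm/damon/admin/kdamonds/0
+    # echo 1 > contexts/0/schemes/0/quotas/goals/nr_goals
+    # cd contexts/0/schemes/0/quotas/goals/0
+    # echo user_input > target_metric
+    # echo 10000 > target_value
+    # echo 9500 > current_value
+    # echo commit_schemes_quota_goals > /sys/kernel/mm/damon/admin/kdamonds/0/state
+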
@@ -424,59 +406,62 @@ number (``N``) to the file creates the number of child directories named ``0``
to ``N-1``. Each directory represents each filter. The filters are evaluated
in the numeric order.
-Each filter directory contains six files, namely ``type``, ``matcing``,
-``memcg_path``, ``addr_start``, ``addr_end``, and ``target_idx``. To ``type``
-file, you can write one of four special keywords: ``anon`` for anonymous pages,
-``memcg`` for specific memory cgroup, ``addr`` for specific address range (an
-open-ended interval), or ``target`` for specific DAMON monitoring target
-filtering. In case of the memory cgroup filtering, you can specify the memory
-cgroup of the interest by writing the path of the memory cgroup from the
-cgroups mount point to ``memcg_path`` file. In case of the address range
-filtering, you can specify the start and end address of the range to
-``addr_start`` and ``addr_end`` files, respectively. For the DAMON monitoring
-target filtering, you can specify the index of the target between the list of
-the DAMON context's monitoring targets list to ``target_idx`` file. You can
-write ``Y`` or ``N`` to ``matching`` file to filter out pages that does or does
-not match to the type, respectively. Then, the scheme's action will not be
-applied to the pages that specified to be filtered out.
+Each filter directory contains seven files, namely ``type``, ``matching``,
+``allow``, ``memcg_path``, ``addr_start``, ``addr_end``, and ``target_idx``.
+To ``type`` file, you can write one of five special keywords: ``anon`` for
+anonymous pages, ``memcg`` for specific memory cgroup, ``young`` for young
+pages, ``addr`` for specific address range (an open-ended interval), or
+``target`` for specific DAMON monitoring target filtering. The meanings of the
+types are the same as described in the :ref:`design doc
+<damon_design_damos_filters>`.
+
+In case of the memory cgroup filtering, you can specify the memory cgroup of
+the interest by writing the path of the memory cgroup from the cgroups mount
+point to ``memcg_path`` file. In case of the address range filtering, you can
+specify the start and end address of the range to ``addr_start`` and
+``addr_end`` files, respectively. For the DAMON monitoring target filtering,
+you can specify the index of the target in the DAMON context's monitoring
+targets list to ``target_idx`` file.
+
+You can write ``Y`` or ``N`` to ``matching`` file to specify whether the filter
+is for memory that matches the ``type``. You can write ``Y`` or ``N`` to
+``allow`` file to specify if applying the action to the memory that satisfies
+the ``type`` and ``matching`` should be allowed or not.
For example, below restricts a DAMOS action to be applied to only non-anonymous
pages of all memory cgroups except ``/having_care_already``.::
# echo 2 > nr_filters
- # # filter out anonymous pages
+ # # disallow anonymous pages
echo anon > 0/type
echo Y > 0/matching
+ echo N > 0/allow
# # further filter out all cgroups except one at '/having_care_already'
echo memcg > 1/type
echo /having_care_already > 1/memcg_path
- echo N > 1/matching
-
-Note that ``anon`` and ``memcg`` filters are currently supported only when
-``paddr`` :ref:`implementation <sysfs_context>` is being used.
+ echo Y > 1/matching
+ echo N > 1/allow
-Also, memory regions that are filtered out by ``addr`` or ``target`` filters
-are not counted as the scheme has tried to those, while regions that filtered
-out by other type filters are counted as the scheme has tried to. The
-difference is applied to :ref:`stats <damos_stats>` and
-:ref:`tried regions <sysfs_schemes_tried_regions>`.
+Refer to the :ref:`DAMOS filters design documentation
+<damon_design_damos_filters>` for more details, including how multiple filters
+with different ``allow`` values work, when each of the filters is supported,
+and differences in stats.
.. _sysfs_schemes_stats:
schemes/<N>/stats/
------------------
-DAMON counts the total number and bytes of regions that each scheme is tried to
-be applied, the two numbers for the regions that each scheme is successfully
-applied, and the total number of the quota limit exceeds. This statistics can
-be used for online analysis or tuning of the schemes.
+DAMON counts statistics for each scheme. These statistics can be used for
+online analysis or tuning of the schemes. Refer to the :ref:`design doc
+<damon_design_damos_stat>` for more details about the stats.
The statistics can be retrieved by reading the files under ``stats`` directory
-(``nr_tried``, ``sz_tried``, ``nr_applied``, ``sz_applied``, and
-``qt_exceeds``), respectively. The files are not updated in real time, so you
-should ask DAMON sysfs interface to update the content of the files for the
-stats by writing a special keyword, ``update_schemes_stats`` to the relevant
-``kdamonds/<N>/state`` file.
+(``nr_tried``, ``sz_tried``, ``nr_applied``, ``sz_applied``,
+``sz_ops_filter_passed``, and ``qt_exceeds``), respectively. The files are not
+updated in real time, so you should ask DAMON sysfs interface to update the
+content of the files for the stats by writing a special keyword,
+``update_schemes_stats`` to the relevant ``kdamonds/<N>/state`` file.
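+
+For instance, the stats of one scheme could be refreshed and dumped as in the
+sketch below. It assumes kdamond ``0`` is running and that scheme ``0`` of
+context ``0`` is the one of interest. ``grep .`` is used only as a convenient
+way to print each file name together with its content. ::
+
+    # cd /sys/kernel/mm/damon/admin/kdamonds/0
+    # echo update_schemes_stats > state
+    # grep . contexts/0/schemes/0/stats/*
+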
.. _sysfs_schemes_tried_regions:
@@ -513,10 +498,10 @@ set the ``access pattern`` as their interested pattern that they want to query.
tried_regions/<N>/
------------------
-In each region directory, you will find four files (``start``, ``end``,
-``nr_accesses``, and ``age``). Reading the files will show the start and end
-addresses, ``nr_accesses``, and ``age`` of the region that corresponding
-DAMON-based operation scheme ``action`` has tried to be applied.
+In each region directory, you will find five files (``start``, ``end``,
+``nr_accesses``, ``age``, and ``sz_filter_passed``). Reading the files will
+show the properties of the region to which the corresponding DAMON-based
+operation scheme's ``action`` has tried to be applied.
Example
~~~~~~~
@@ -555,7 +540,7 @@ memory rate becomes larger than 60%, or lower than 30%". ::
# echo 300 > watermarks/low
Please note that it's highly recommended to use user space tools like `damo
-<https://github.com/awslabs/damo>`_ rather than manually reading and writing
+<https://github.com/damonitor/damo>`_ rather than manually reading and writing
the files as above. Above is only for an example.
.. _tracepoint:
@@ -579,11 +564,11 @@ monitoring results recording.
While the monitoring is turned on, you could record the tracepoint events and
show results using tracepoint supporting tools like ``perf``. For example::
- # echo on > monitor_on
+ # echo on > kdamonds/0/state
# perf record -e damon:damon_aggregated &
# sleep 5
# kill 9 $(pidof perf)
- # echo off > monitor_on
+ # echo off > kdamonds/0/state
# perf script
kdamond.0 46568 [027] 79357.842179: damon:damon_aggregated: target_id=0 nr_regions=11 122509119488-135708762112: 0 864
[...]
@@ -612,300 +597,3 @@ fields are as usual. It shows the index of the DAMON context (``ctx_idx=X``)
of the scheme in the list of the contexts of the context's kdamond, the index
of the scheme (``scheme_idx=X``) in the list of the schemes of the context, in
addition to the output of ``damon_aggregated`` tracepoint.
-
-
-.. _debugfs_interface:
-
-debugfs Interface (DEPRECATED!)
-===============================
-
-.. note::
-
- THIS IS DEPRECATED!
-
- DAMON debugfs interface is deprecated, so users should move to the
- :ref:`sysfs interface <sysfs_interface>`. If you depend on this and cannot
- move, please report your usecase to damon@lists.linux.dev and
- linux-mm@kvack.org.
-
-DAMON exports eight files, ``attrs``, ``target_ids``, ``init_regions``,
-``schemes``, ``monitor_on``, ``kdamond_pid``, ``mk_contexts`` and
-``rm_contexts`` under its debugfs directory, ``<debugfs>/damon/``.
-
-
-Attributes
-----------
-
-Users can get and set the ``sampling interval``, ``aggregation interval``,
-``update interval``, and min/max number of monitoring target regions by
-reading from and writing to the ``attrs`` file. To know about the monitoring
-attributes in detail, please refer to the :doc:`/mm/damon/design`. For
-example, below commands set those values to 5 ms, 100 ms, 1,000 ms, 10 and
-1000, and then check it again::
-
- # cd <debugfs>/damon
- # echo 5000 100000 1000000 10 1000 > attrs
- # cat attrs
- 5000 100000 1000000 10 1000
-
-
-Target IDs
-----------
-
-Some types of address spaces supports multiple monitoring target. For example,
-the virtual memory address spaces monitoring can have multiple processes as the
-monitoring targets. Users can set the targets by writing relevant id values of
-the targets to, and get the ids of the current targets by reading from the
-``target_ids`` file. In case of the virtual address spaces monitoring, the
-values should be pids of the monitoring target processes. For example, below
-commands set processes having pids 42 and 4242 as the monitoring targets and
-check it again::
-
- # cd <debugfs>/damon
- # echo 42 4242 > target_ids
- # cat target_ids
- 42 4242
-
-Users can also monitor the physical memory address space of the system by
-writing a special keyword, "``paddr\n``" to the file. Because physical address
-space monitoring doesn't support multiple targets, reading the file will show a
-fake value, ``42``, as below::
-
- # cd <debugfs>/damon
- # echo paddr > target_ids
- # cat target_ids
- 42
-
-Note that setting the target ids doesn't start the monitoring.
-
-
-Initial Monitoring Target Regions
----------------------------------
-
-In case of the virtual address space monitoring, DAMON automatically sets and
-updates the monitoring target regions so that entire memory mappings of target
-processes can be covered. However, users can want to limit the monitoring
-region to specific address ranges, such as the heap, the stack, or specific
-file-mapped area. Or, some users can know the initial access pattern of their
-workloads and therefore want to set optimal initial regions for the 'adaptive
-regions adjustment'.
-
-In contrast, DAMON do not automatically sets and updates the monitoring target
-regions in case of physical memory monitoring. Therefore, users should set the
-monitoring target regions by themselves.
-
-In such cases, users can explicitly set the initial monitoring target regions
-as they want, by writing proper values to the ``init_regions`` file. The input
-should be a sequence of three integers separated by white spaces that represent
-one region in below form.::
-
- <target idx> <start address> <end address>
-
-The ``target idx`` should be the index of the target in ``target_ids`` file,
-starting from ``0``, and the regions should be passed in address order. For
-example, below commands will set a couple of address ranges, ``1-100`` and
-``100-200`` as the initial monitoring target region of pid 42, which is the
-first one (index ``0``) in ``target_ids``, and another couple of address
-ranges, ``20-40`` and ``50-100`` as that of pid 4242, which is the second one
-(index ``1``) in ``target_ids``.::
-
- # cd <debugfs>/damon
- # cat target_ids
- 42 4242
- # echo "0 1 100 \
- 0 100 200 \
- 1 20 40 \
- 1 50 100" > init_regions
-
-Note that this sets the initial monitoring target regions only. In case of
-virtual memory monitoring, DAMON will automatically updates the boundary of the
-regions after one ``update interval``. Therefore, users should set the
-``update interval`` large enough in this case, if they don't want the
-update.
-
-
-Schemes
--------
-
-Users can get and set the DAMON-based operation :ref:`schemes
-<damon_design_damos>` by reading from and writing to ``schemes`` debugfs file.
-Reading the file also shows the statistics of each scheme. To the file, each
-of the schemes should be represented in each line in below form::
-
- <target access pattern> <action> <quota> <watermarks>
-
-You can disable schemes by simply writing an empty string to the file.
-
-Target Access Pattern
-~~~~~~~~~~~~~~~~~~~~~
-
-The target access :ref:`pattern <damon_design_damos_access_pattern>` of the
-scheme. The ``<target access pattern>`` is constructed with three ranges in
-below form::
-
- min-size max-size min-acc max-acc min-age max-age
-
-Specifically, bytes for the size of regions (``min-size`` and ``max-size``),
-number of monitored accesses per aggregate interval for access frequency
-(``min-acc`` and ``max-acc``), number of aggregate intervals for the age of
-regions (``min-age`` and ``max-age``) are specified. Note that the ranges are
-closed interval.
-
-Action
-~~~~~~
-
-The ``<action>`` is a predefined integer for memory management :ref:`actions
-<damon_design_damos_action>`. The supported numbers and their meanings are as
-below.
-
- - 0: Call ``madvise()`` for the region with ``MADV_WILLNEED``. Ignored if
- ``target`` is ``paddr``.
- - 1: Call ``madvise()`` for the region with ``MADV_COLD``. Ignored if
- ``target`` is ``paddr``.
- - 2: Call ``madvise()`` for the region with ``MADV_PAGEOUT``.
- - 3: Call ``madvise()`` for the region with ``MADV_HUGEPAGE``. Ignored if
- ``target`` is ``paddr``.
- - 4: Call ``madvise()`` for the region with ``MADV_NOHUGEPAGE``. Ignored if
- ``target`` is ``paddr``.
- - 5: Do nothing but count the statistics
-
-Quota
-~~~~~
-
-Users can set the :ref:`quotas <damon_design_damos_quotas>` of the given scheme
-via the ``<quota>`` in below form::
-
- <ms> <sz> <reset interval> <priority weights>
-
-This makes DAMON to try to use only up to ``<ms>`` milliseconds for applying
-the action to memory regions of the ``target access pattern`` within the
-``<reset interval>`` milliseconds, and to apply the action to only up to
-``<sz>`` bytes of memory regions within the ``<reset interval>``. Setting both
-``<ms>`` and ``<sz>`` zero disables the quota limits.
-
-For the :ref:`prioritization <damon_design_damos_quotas_prioritization>`, users
-can set the weights for the three properties in ``<priority weights>`` in below
-form::
-
- <size weight> <access frequency weight> <age weight>
-
-Watermarks
-~~~~~~~~~~
-
-Users can specify :ref:`watermarks <damon_design_damos_watermarks>` of the
-given scheme via ``<watermarks>`` in below form::
-
- <metric> <check interval> <high mark> <middle mark> <low mark>
-
-``<metric>`` is a predefined integer for the metric to be checked. The
-supported numbers and their meanings are as below.
-
- - 0: Ignore the watermarks
- - 1: System's free memory rate (per thousand)
-
-The value of the metric is checked every ``<check interval>`` microseconds.
-
-If the value is higher than ``<high mark>`` or lower than ``<low mark>``, the
-scheme is deactivated. If the value is lower than ``<mid mark>``, the scheme
-is activated.
-
-.. _damos_stats:
-
-Statistics
-~~~~~~~~~~
-
-It also counts the total number and bytes of regions that each scheme is tried
-to be applied, the two numbers for the regions that each scheme is successfully
-applied, and the total number of the quota limit exceeds. This statistics can
-be used for online analysis or tuning of the schemes.
-
-The statistics can be shown by reading the ``schemes`` file. Reading the file
-will show each scheme you entered in each line, and the five numbers for the
-statistics will be added at the end of each line.
-
-Example
-~~~~~~~
-
-Below commands applies a scheme saying "If a memory region of size in [4KiB,
-8KiB] is showing accesses per aggregate interval in [0, 5] for aggregate
-interval in [10, 20], page out the region. For the paging out, use only up to
-10ms per second, and also don't page out more than 1GiB per second. Under the
-limitation, page out memory regions having longer age first. Also, check the
-free memory rate of the system every 5 seconds, start the monitoring and paging
-out when the free memory rate becomes lower than 50%, but stop it if the free
-memory rate becomes larger than 60%, or lower than 30%".::
-
- # cd <debugfs>/damon
- # scheme="4096 8192 0 5 10 20 2" # target access pattern and action
- # scheme+=" 10 $((1024*1024*1024)) 1000" # quotas
- # scheme+=" 0 0 100" # prioritization weights
- # scheme+=" 1 5000000 600 500 300" # watermarks
- # echo "$scheme" > schemes
-
-
-Turning On/Off
---------------
-
-Setting the files as described above doesn't incur effect unless you explicitly
-start the monitoring. You can start, stop, and check the current status of the
-monitoring by writing to and reading from the ``monitor_on`` file. Writing
-``on`` to the file starts the monitoring of the targets with the attributes.
-Writing ``off`` to the file stops those. DAMON also stops if every target
-process is terminated. Below example commands turn on, off, and check the
-status of DAMON::
-
- # cd <debugfs>/damon
- # echo on > monitor_on
- # echo off > monitor_on
- # cat monitor_on
- off
-
-Please note that you cannot write to the above-mentioned debugfs files while
-the monitoring is turned on. If you write to the files while DAMON is running,
-an error code such as ``-EBUSY`` will be returned.
-
-
-Monitoring Thread PID
----------------------
-
-DAMON does requested monitoring with a kernel thread called ``kdamond``. You
-can get the pid of the thread by reading the ``kdamond_pid`` file. When the
-monitoring is turned off, reading the file returns ``none``. ::
-
- # cd <debugfs>/damon
- # cat monitor_on
- off
- # cat kdamond_pid
- none
- # echo on > monitor_on
- # cat kdamond_pid
- 18594
-
-
-Using Multiple Monitoring Threads
----------------------------------
-
-One ``kdamond`` thread is created for each monitoring context. You can create
-and remove monitoring contexts for multiple ``kdamond`` required use case using
-the ``mk_contexts`` and ``rm_contexts`` files.
-
-Writing the name of the new context to the ``mk_contexts`` file creates a
-directory of the name on the DAMON debugfs directory. The directory will have
-DAMON debugfs files for the context. ::
-
- # cd <debugfs>/damon
- # ls foo
- # ls: cannot access 'foo': No such file or directory
- # echo foo > mk_contexts
- # ls foo
- # attrs init_regions kdamond_pid schemes target_ids
-
-If the context is not needed anymore, you can remove it and the corresponding
-directory by putting the name of the context to the ``rm_contexts`` file. ::
-
- # echo foo > rm_contexts
- # ls foo
- # ls: cannot access 'foo': No such file or directory
-
-Note that ``mk_contexts``, ``rm_contexts``, and ``monitor_on`` files are in the
-root directory only.
diff --git a/Documentation/admin-guide/mm/hugetlbpage.rst b/Documentation/admin-guide/mm/hugetlbpage.rst
index e4d4b4a8dc97..f34a0d798d5b 100644
--- a/Documentation/admin-guide/mm/hugetlbpage.rst
+++ b/Documentation/admin-guide/mm/hugetlbpage.rst
@@ -376,6 +376,13 @@ Note that the number of overcommit and reserve pages remain global quantities,
as we don't know until fault time, when the faulting task's mempolicy is
applied, from which node the huge page allocation will be attempted.
+A hugetlb page may be migrated between the per-node hugepage pools in the following
+scenarios: memory offline, memory failure, longterm pinning, syscalls (mbind,
+migrate_pages and move_pages), alloc_contig_range() and alloc_contig_pages().
+Currently, only memory offline, memory failure and the syscalls can fall back to
+allocating a new hugetlb page on a different node when the current node cannot
+allocate one during migration, so these 3 cases can break the per-node hugepage pool.
+
.. _using_huge_pages:
Using Huge Pages
diff --git a/Documentation/admin-guide/mm/index.rst b/Documentation/admin-guide/mm/index.rst
index 1f883abf3f00..8b35795b664b 100644
--- a/Documentation/admin-guide/mm/index.rst
+++ b/Documentation/admin-guide/mm/index.rst
@@ -10,7 +10,7 @@ processes address space and many other cool things.
Linux memory management is a complex system with many configurable
settings. Most of these settings are available via ``/proc``
-filesystem and can be quired and adjusted using ``sysctl``. These APIs
+filesystem and can be queried and adjusted using ``sysctl``. These APIs
are described in Documentation/admin-guide/sysctl/vm.rst and in `man 5 proc`_.
.. _man 5 proc: http://man7.org/linux/man-pages/man5/proc.5.html
diff --git a/Documentation/admin-guide/mm/ksm.rst b/Documentation/admin-guide/mm/ksm.rst
index a639cac12477..ad8e7a41f3b5 100644
--- a/Documentation/admin-guide/mm/ksm.rst
+++ b/Documentation/admin-guide/mm/ksm.rst
@@ -308,7 +308,7 @@ limited by the ``advisor_max_cpu`` parameter. In addition there is also the
``advisor_target_scan_time`` parameter. This parameter sets the target time to
scan all the KSM candidate pages. The parameter ``advisor_target_scan_time``
decides how aggressive the scan time advisor scans candidate pages. Lower
-values make the scan time advisor to scan more aggresively. This is the most
+values make the scan time advisor scan more aggressively. This is the most
important parameter for the configuration of the scan time advisor.
The initial value and the maximum value can be changed with
diff --git a/Documentation/admin-guide/mm/memory-hotplug.rst b/Documentation/admin-guide/mm/memory-hotplug.rst
index 098f14d83e99..33c886f3d198 100644
--- a/Documentation/admin-guide/mm/memory-hotplug.rst
+++ b/Documentation/admin-guide/mm/memory-hotplug.rst
@@ -280,8 +280,8 @@ The following files are currently defined:
blocks; configure auto-onlining.
The default value depends on the
- CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE kernel configuration
- option.
+ CONFIG_MHP_DEFAULT_ONLINE_TYPE kernel configuration
+ options.
See the ``state`` property of memory blocks for details.
``block_size_bytes`` read-only: the size in bytes of a memory block.
@@ -294,8 +294,9 @@ The following files are currently defined:
``crash_hotplug`` read-only: when changes to the system memory map
occur due to hot un/plug of memory, this file contains
'1' if the kernel updates the kdump capture kernel memory
- map itself (via elfcorehdr), or '0' if userspace must update
- the kdump capture kernel memory map.
+ map itself (via elfcorehdr and other relevant kexec
+ segments), or '0' if userspace must update the kdump
+ capture kernel memory map.
Availability depends on the CONFIG_MEMORY_HOTPLUG kernel
configuration option.
diff --git a/Documentation/admin-guide/mm/numa_memory_policy.rst b/Documentation/admin-guide/mm/numa_memory_policy.rst
index eca38fa81e0f..a70f20ce1ffb 100644
--- a/Documentation/admin-guide/mm/numa_memory_policy.rst
+++ b/Documentation/admin-guide/mm/numa_memory_policy.rst
@@ -250,6 +250,15 @@ MPOL_PREFERRED_MANY
can fall back to all existing numa nodes. This is effectively
MPOL_PREFERRED allowed for a mask rather than a single node.
+MPOL_WEIGHTED_INTERLEAVE
+ This mode operates the same as MPOL_INTERLEAVE, except that
+ interleaving behavior is executed based on weights set in
+ /sys/kernel/mm/mempolicy/weighted_interleave/
+
+ Weighted interleave allocates pages on nodes according to a
+ weight. For example if nodes [0,1] are weighted [5,2], 5 pages
+ will be allocated on node0 for every 2 pages allocated on node1.
+
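As a rough illustration of the 5:2 example above, the following shell sketch
splits a number of pages across nodes in proportion to their weights. The
function name ``weighted_split`` is hypothetical, and handing leftover pages to
the nodes in order is a simplifying assumption, not a description of the
kernel's allocator.

```shell
# weighted_split <pages> <weight>...
# Print how many pages each node would receive, in node order.
# Hypothetical helper: whole rounds are split by weight, and any
# leftover pages are handed out in node order (an assumption).
weighted_split() (
    pages=$1; shift
    total=0
    for w in "$@"; do total=$((total + w)); done
    rounds=$((pages / total))      # complete passes over all nodes
    rem=$((pages % total))         # pages left after the last full pass
    out=""
    for w in "$@"; do
        take=$rem
        if [ "$take" -gt "$w" ]; then take=$w; fi
        out="$out${out:+ }$((rounds * w + take))"
        rem=$((rem - take))
    done
    echo "$out"
)
```

For instance, ``weighted_split 14 5 2`` prints ``10 4``: two full rounds of 5
pages on node0 and 2 pages on node1.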
NUMA memory policy supports the following optional mode flags:
MPOL_F_STATIC_NODES
diff --git a/Documentation/admin-guide/mm/pagemap.rst b/Documentation/admin-guide/mm/pagemap.rst
index f5f065c67615..caba0f52dd36 100644
--- a/Documentation/admin-guide/mm/pagemap.rst
+++ b/Documentation/admin-guide/mm/pagemap.rst
@@ -118,7 +118,7 @@ Short descriptions to the page flags
21 - KSM
Identical memory pages dynamically shared between one or more processes.
22 - THP
- Contiguous pages which construct transparent hugepages.
+ Contiguous pages which construct a THP of any size, mapped at any granularity.
23 - OFFLINE
The page is logically offline.
24 - ZERO_PAGE
@@ -173,27 +173,6 @@ LRU related page flags
The page-types tool in the tools/mm directory can be used to query the
above flags.
-Using pagemap to do something useful
-====================================
-
-The general procedure for using pagemap to find out about a process' memory
-usage goes like this:
-
- 1. Read ``/proc/pid/maps`` to determine which parts of the memory space are
- mapped to what.
- 2. Select the maps you are interested in -- all of them, or a particular
- library, or the stack or the heap, etc.
- 3. Open ``/proc/pid/pagemap`` and seek to the pages you would like to examine.
- 4. Read a u64 for each page from pagemap.
- 5. Open ``/proc/kpagecount`` and/or ``/proc/kpageflags``. For each PFN you
- just read, seek to that entry in the file, and read the data you want.
-
-For example, to find the "unique set size" (USS), which is the amount of
-memory that a process is using that is not shared with any other process,
-you can go through every map in the process, find the PFNs, look those up
-in kpagecount, and tally up the number of pages that are only referenced
-once.
-
Exceptions for Shared Memory
============================
@@ -252,7 +231,7 @@ Following flags about pages are currently supported:
- ``PAGE_IS_PRESENT`` - Page is present in the memory
- ``PAGE_IS_SWAPPED`` - Page is in swapped
- ``PAGE_IS_PFNZERO`` - Page has zero PFN
-- ``PAGE_IS_HUGE`` - Page is THP or Hugetlb backed
+- ``PAGE_IS_HUGE`` - Page is PMD-mapped THP or Hugetlb backed
- ``PAGE_IS_SOFT_DIRTY`` - Page is soft-dirty
The ``struct pm_scan_arg`` is used as the argument of the IOCTL.
diff --git a/Documentation/admin-guide/mm/transhuge.rst b/Documentation/admin-guide/mm/transhuge.rst
index 04eb45a2f940..dff8d5985f0f 100644
--- a/Documentation/admin-guide/mm/transhuge.rst
+++ b/Documentation/admin-guide/mm/transhuge.rst
@@ -202,12 +202,21 @@ PMD-mappable transparent hugepage::
cat /sys/kernel/mm/transparent_hugepage/hpage_pmd_size
-khugepaged will be automatically started when one or more hugepage
-sizes are enabled (either by directly setting "always" or "madvise",
-or by setting "inherit" while the top-level enabled is set to "always"
-or "madvise"), and it'll be automatically shutdown when the last
-hugepage size is disabled (either by directly setting "never", or by
-setting "inherit" while the top-level enabled is set to "never").
+All THPs at fault and collapse time will be added to _deferred_list,
+and will therefore be split under memory pressure if they are considered
+"underused". A THP is underused if the number of zero-filled pages in
+the THP is above max_ptes_none (see below). It is possible to disable
+this behaviour by writing 0 to shrink_underused, and enable it by writing
+1 to it::
+
+ echo 0 > /sys/kernel/mm/transparent_hugepage/shrink_underused
+ echo 1 > /sys/kernel/mm/transparent_hugepage/shrink_underused
+
+khugepaged will be automatically started when PMD-sized THP is enabled
+(either the per-size anon control or the top-level control is set
+to "always" or "madvise"), and it'll be automatically shut down when
+PMD-sized THP is disabled (when both the per-size anon control and the
+top-level control are "never").
Khugepaged controls
-------------------
@@ -278,25 +287,92 @@ collapsed, resulting fewer pages being collapsed into
THPs, and lower memory access performance.
``max_ptes_shared`` specifies how many pages can be shared across multiple
-processes. Exceeding the number would block the collapse::
+processes. khugepaged might treat pages of THPs as shared if any page of
+that THP is shared. Exceeding the number would block the collapse::
/sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_shared
A higher value may increase memory footprint for some workloads.
-Boot parameter
-==============
-
-You can change the sysfs boot time defaults of Transparent Hugepage
-Support by passing the parameter ``transparent_hugepage=always`` or
-``transparent_hugepage=madvise`` or ``transparent_hugepage=never``
-to the kernel command line.
+Boot parameters
+===============
+
+You can change the sysfs boot time default for the top-level "enabled"
+control by passing the parameter ``transparent_hugepage=always`` or
+``transparent_hugepage=madvise`` or ``transparent_hugepage=never`` to the
+kernel command line.
+
+Alternatively, each supported anonymous THP size can be controlled by
+passing ``thp_anon=<size>[KMG],<size>[KMG]:<state>;<size>[KMG]-<size>[KMG]:<state>``,
+where ``<size>`` is the THP size (must be a power of 2 of PAGE_SIZE and
+supported anonymous THP) and ``<state>`` is one of ``always``, ``madvise``,
+``never`` or ``inherit``.
+
+For example, the following will set 16K, 32K, 64K THP to ``always``,
+set 128K, 512K to ``inherit``, set 256K to ``madvise`` and 1M, 2M
+to ``never``::
+
+ thp_anon=16K-64K:always;128K,512K:inherit;256K:madvise;1M-2M:never
+
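To make the ``thp_anon=`` syntax above concrete, the following shell sketch
expands such a value into one ``<size>kB=<state>`` line per THP size. The
helper names ``to_kb`` and ``expand_thp_anon`` are hypothetical, every size is
assumed to carry a ``K``, ``M`` or ``G`` suffix, and no check is made that the
running kernel actually supports a given size.

```shell
# to_kb <size>[KMG]: convert a size with a unit suffix to KiB.
# Hypothetical helper; sizes without a suffix are rejected.
to_kb() {
    num=${1%[KMG]}
    case $1 in
        *K) echo "$num" ;;
        *M) echo $((num * 1024)) ;;
        *G) echo $((num * 1024 * 1024)) ;;
        *)  return 1 ;;
    esac
}

# expand_thp_anon <value>: print "<size>kB=<state>" for every size named
# by the value, walking ranges in powers of two.
expand_thp_anon() (
    set -f                         # no globbing while splitting on IFS
    IFS=';'
    for group in $1; do            # e.g. "16K-64K:always"
        sizes=${group%%:*}
        state=${group##*:}
        IFS=','
        for item in $sizes; do     # e.g. "16K-64K" or "256K"
            case $item in
                *-*) lo=$(to_kb "${item%-*}"); hi=$(to_kb "${item#*-}") ;;
                *)   lo=$(to_kb "$item"); hi=$lo ;;
            esac
            kb=$lo
            while [ "$kb" -le "$hi" ]; do
                echo "${kb}kB=${state}"
                kb=$((kb * 2))
            done
        done
    done
)
```

Calling ``expand_thp_anon '16K-64K:always;256K:madvise'`` lists 16kB, 32kB and
64kB as ``always`` and 256kB as ``madvise``, matching the expansion described
for the example above.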
+``thp_anon=`` may be specified multiple times to configure all THP sizes as
+required. If ``thp_anon=`` is specified at least once, any anon THP sizes
+not explicitly configured on the command line are implicitly set to
+``never``.
+
+``transparent_hugepage`` setting only affects the global toggle. If
+``thp_anon`` is not specified, PMD_ORDER THP will default to ``inherit``.
+However, if a valid ``thp_anon`` setting is provided by the user, the
+PMD_ORDER THP policy will be overridden. If the policy for PMD_ORDER
+is not defined within a valid ``thp_anon``, its policy will default to
+``never``.
+
+Similarly to ``transparent_hugepage``, you can control the hugepage
+allocation policy for the internal shmem mount by using the kernel parameter
+``transparent_hugepage_shmem=<policy>``, where ``<policy>`` is one of the
+six valid policies for shmem (``always``, ``within_size``, ``advise``,
+``never``, ``deny``, and ``force``).
+
+Similarly to ``transparent_hugepage_shmem``, you can control the default
+hugepage allocation policy for the tmpfs mount by using the kernel parameter
+``transparent_hugepage_tmpfs=<policy>``, where ``<policy>`` is one of the
+four valid policies for tmpfs (``always``, ``within_size``, ``advise``,
+``never``). The tmpfs mount default policy is ``never``.
+
+In the same manner as ``thp_anon`` controls each supported anonymous THP
+size, ``thp_shmem`` controls each supported shmem THP size. ``thp_shmem``
+has the same format as ``thp_anon``, but also supports the policy
+``within_size``.
+
+``thp_shmem=`` may be specified multiple times to configure all THP sizes
+as required. If ``thp_shmem=`` is specified at least once, any shmem THP
+sizes not explicitly configured on the command line are implicitly set to
+``never``.
+
+``transparent_hugepage_shmem`` setting only affects the global toggle. If
+``thp_shmem`` is not specified, PMD_ORDER hugepage will default to
+``inherit``. However, if a valid ``thp_shmem`` setting is provided by the
+user, the PMD_ORDER hugepage policy will be overridden. If the policy for
+PMD_ORDER is not defined within a valid ``thp_shmem``, its policy will
+default to ``never``.
Hugepages in tmpfs/shmem
========================
-You can control hugepage allocation policy in tmpfs with mount option
-``huge=``. It can have following values:
+Traditionally, tmpfs only supported a single huge page size ("PMD"). Today,
+it also supports smaller sizes just like anonymous memory, often referred
+to as "multi-size THP" (mTHP). Huge pages of any size are commonly
+represented in the kernel as "large folios".
+
+While there is fine control over the huge page sizes to use for the internal
+shmem mount (see below), ordinary tmpfs mounts will make use of all available
+huge page sizes without any control over the exact sizes, behaving more like
+other file systems.
+
+tmpfs mounts
+------------
+
+The THP allocation policy for tmpfs mounts can be adjusted using the mount
+option: ``huge=``. It can have following values:
always
Attempt to allocate huge pages every time we need a new page;
@@ -306,24 +382,24 @@ never
within_size
Only allocate huge page if it will be fully within i_size.
- Also respect fadvise()/madvise() hints;
+ Also respect madvise() hints;
advise
- Only allocate huge pages if requested with fadvise()/madvise();
+ Only allocate huge pages if requested with madvise();
-The default policy is ``never``.
+Remember that the kernel may use huge pages of all available sizes, and that
+no fine-grained control, as there is for the internal tmpfs mount, is available.
+
+The default policy in the past was ``never``, but it can now be adjusted
+using the kernel parameter ``transparent_hugepage_tmpfs=<policy>``.
``mount -o remount,huge= /mountpoint`` works fine after mount: remounting
``huge=never`` will not attempt to break up huge pages at all, just stop more
from being allocated.
-There's also sysfs knob to control hugepage allocation policy for internal
-shmem mount: /sys/kernel/mm/transparent_hugepage/shmem_enabled. The mount
-is used for SysV SHM, memfds, shared anonymous mmaps (of /dev/zero or
-MAP_ANONYMOUS), GPU drivers' DRM objects, Ashmem.
-
-In addition to policies listed above, shmem_enabled allows two further
-values:
+In addition to policies listed above, the sysfs knob
+/sys/kernel/mm/transparent_hugepage/shmem_enabled will affect the
+allocation policy of tmpfs mounts, when set to the following values:
deny
For use in emergencies, to force the huge option off from
@@ -331,6 +407,42 @@ deny
force
Force the huge option on for all - very useful for testing;
+shmem / internal tmpfs
+----------------------
+The internal tmpfs mount is used for SysV SHM, memfds, shared anonymous
+mmaps (of /dev/zero or MAP_ANONYMOUS), GPU drivers' DRM objects, Ashmem.
+
+To control the THP allocation policy for this internal tmpfs mount, the
+sysfs knob /sys/kernel/mm/transparent_hugepage/shmem_enabled and the knobs
+per THP size in
+'/sys/kernel/mm/transparent_hugepage/hugepages-<size>kB/shmem_enabled'
+can be used.
+
+The global knob has the same semantics as the ``huge=`` mount options
+for tmpfs mounts, except that the different huge page sizes can be controlled
+individually, and will only use the setting of the global knob when the
+per-size knob is set to 'inherit'.
+
+The options 'force' and 'deny' are dropped for the individual sizes; they
+are testing artifacts from earlier days.
+
+always
+ Attempt to allocate <size> huge pages every time we need a new page;
+
+inherit
+ Inherit the top-level "shmem_enabled" value. By default, PMD-sized hugepages
+ have enabled="inherit" and all other hugepage sizes have enabled="never";
+
+never
+ Do not allocate <size> huge pages;
+
+within_size
+ Only allocate <size> huge page if it will be fully within i_size.
+ Also respect madvise() hints;
+
+advise
+ Only allocate <size> huge pages if requested with madvise();
+
Need of application restart
===========================
@@ -343,10 +455,6 @@ also applies to the regions registered in khugepaged.
Monitoring usage
================
-.. note::
- Currently the below counters only record events relating to
- PMD-sized THP. Events relating to other THP sizes are not included.
-
The number of PMD-sized anonymous transparent huge pages currently used by the
system is available by reading the AnonHugePages field in ``/proc/meminfo``.
To identify what applications are using PMD-sized anonymous transparent huge
@@ -358,7 +466,7 @@ AnonHugePmdMapped).
The number of file transparent huge pages mapped to userspace is available
by reading ShmemPmdMapped and ShmemHugePages fields in ``/proc/meminfo``.
To identify what applications are mapping file transparent huge pages, it
-is necessary to read ``/proc/PID/smaps`` and count the FileHugeMapped fields
+is necessary to read ``/proc/PID/smaps`` and count the FilePmdMapped fields
for each mapping.
Note that reading the smaps file is expensive and reading it
@@ -369,7 +477,7 @@ monitor how successfully the system is providing huge pages for use.
thp_fault_alloc
is incremented every time a huge page is successfully
- allocated to handle a page fault.
+ allocated and charged to handle a page fault.
thp_collapse_alloc
is incremented by khugepaged when it has found
@@ -377,7 +485,7 @@ thp_collapse_alloc
successfully allocated a new huge page to store the data.
thp_fault_fallback
- is incremented if a page fault fails to allocate
+ is incremented if a page fault fails to allocate or charge
a huge page and instead falls back to using small pages.
thp_fault_fallback_charge
@@ -391,20 +499,23 @@ thp_collapse_alloc_failed
the allocation.
thp_file_alloc
- is incremented every time a file huge page is successfully
- allocated.
+ is incremented every time a shmem huge page is successfully
+ allocated (Note that despite being named after "file", the counter
+ measures only shmem).
thp_file_fallback
- is incremented if a file huge page is attempted to be allocated
- but fails and instead falls back to using small pages.
+ is incremented if a shmem huge page is attempted to be allocated
+ but fails and instead falls back to using small pages. (Note that
+ despite being named after "file", the counter measures only shmem).
thp_file_fallback_charge
- is incremented if a file huge page cannot be charged and instead
+ is incremented if a shmem huge page cannot be charged and instead
falls back to using small pages even though the allocation was
- successful.
+ successful. (Note that despite being named after "file", the
+ counter measures only shmem).
thp_file_mapped
- is incremented every time a file huge page is mapped into
+ is incremented every time a file or shmem huge page is mapped into
user address space.
thp_split_page
@@ -423,6 +534,12 @@ thp_deferred_split_page
splitting it would free up some memory. Pages on split queue are
going to be split under memory pressure.
+thp_underused_split_page
+ is incremented when a huge page on the split queue was split
+ because it was underused. A THP is underused if the number of
+ zero pages in the THP is above a certain threshold
+ (/sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_none).
+
thp_split_pmd
is incremented every time a PMD split into table of PTEs.
This can happen, for instance, when application calls mprotect() or
@@ -447,6 +564,92 @@ thp_swpout_fallback
Usually because failed to allocate some continuous swap space
for the huge page.
+In /sys/kernel/mm/transparent_hugepage/hugepages-<size>kB/stats, there are
+also individual counters for each huge page size, which can be utilized to
+monitor the system's effectiveness in providing huge pages for usage. Each
+counter has its own corresponding file.
+
+anon_fault_alloc
+ is incremented every time a huge page is successfully
+ allocated and charged to handle a page fault.
+
+anon_fault_fallback
+ is incremented if a page fault fails to allocate or charge
+ a huge page and instead falls back to using huge pages with
+ lower orders or small pages.
+
+anon_fault_fallback_charge
+ is incremented if a page fault fails to charge a huge page and
+ instead falls back to using huge pages with lower orders or
+ small pages even though the allocation was successful.
+
+zswpout
+ is incremented every time a huge page is swapped out to zswap in one
+ piece without splitting.
+
+swpin
+ is incremented every time a huge page is swapped in from a non-zswap
+ swap device in one piece.
+
+swpin_fallback
+ is incremented if swapin fails to allocate or charge a huge page
+ and instead falls back to using huge pages with lower orders or
+ small pages.
+
+swpin_fallback_charge
+ is incremented if swapin fails to charge a huge page and instead
+ falls back to using huge pages with lower orders or small pages
+ even though the allocation was successful.
+
+swpout
+ is incremented every time a huge page is swapped out to a non-zswap
+ swap device in one piece without splitting.
+
+swpout_fallback
+ is incremented if a huge page has to be split before swapout.
+ Usually this is because the kernel failed to allocate contiguous
+ swap space for the huge page.
+
+shmem_alloc
+ is incremented every time a shmem huge page is successfully
+ allocated.
+
+shmem_fallback
+ is incremented if a shmem huge page is attempted to be allocated
+ but fails and instead falls back to using small pages.
+
+shmem_fallback_charge
+ is incremented if a shmem huge page cannot be charged and instead
+ falls back to using small pages even though the allocation was
+ successful.
+
+split
+ is incremented every time a huge page is successfully split into
+ smaller orders. This can happen for a variety of reasons but a
+ common reason is that a huge page is old and is being reclaimed.
+
+split_failed
+ is incremented if kernel fails to split huge
+ page. This can happen if the page was pinned by somebody.
+
+split_deferred
+ is incremented when a huge page is put onto split queue.
+ This happens when a huge page is partially unmapped and splitting
+ it would free up some memory. Pages on split queue are going to
+ be split under memory pressure, if splitting is possible.
+
+nr_anon
+ the number of anonymous THP we have in the whole system. These THPs
+ might be currently entirely mapped or have partially unmapped/unused
+ subpages.
+
+nr_anon_partially_mapped
+ the number of anonymous THP which are likely partially mapped, possibly
+ wasting memory, and have been queued for deferred memory reclamation.
+ Note that in some corner cases (e.g., failed migration), we might detect
+ an anonymous THP as "partially mapped" and count it here, even though it
+ is not actually partially mapped anymore.
+
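Since each per-size counter lives in its own file, they can be collected with a
short loop. The sketch below is one possible way to do that; the function name
``dump_thp_stats`` and its optional root-directory argument are hypothetical
conveniences, and the ``stats`` directories are assumed to exist only on
kernels that provide these counters.

```shell
# dump_thp_stats [root]
# Print "<size> <counter> <value>" for every per-size THP counter found
# under root (defaults to the sysfs directory named in the text above).
dump_thp_stats() (
    root=${1:-/sys/kernel/mm/transparent_hugepage}
    for f in "$root"/hugepages-*kB/stats/*; do
        [ -r "$f" ] || continue          # skip unmatched glob or unreadable file
        size=${f%/stats/*}               # .../hugepages-64kB
        size=${size##*/hugepages-}       # 64kB
        echo "$size ${f##*/} $(cat "$f")"
    done
)
```

Running ``dump_thp_stats`` with no argument reads the live sysfs tree; passing a
directory is only useful for trying the loop against a copy of that tree.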
As the system ages, allocating huge pages may be expensive as the
system uses memory compaction to copy data around memory to free a
huge page for use. There are some counters in ``/proc/vmstat`` to help
diff --git a/Documentation/admin-guide/mm/zswap.rst b/Documentation/admin-guide/mm/zswap.rst
index b42132969e31..3598dcd7dbe7 100644
--- a/Documentation/admin-guide/mm/zswap.rst
+++ b/Documentation/admin-guide/mm/zswap.rst
@@ -111,35 +111,6 @@ checked if it is a same-value filled page before compressing it. If true, the
compressed length of the page is set to zero and the pattern or same-filled
value is stored.
-Same-value filled pages identification feature is enabled by default and can be
-disabled at boot time by setting the ``same_filled_pages_enabled`` attribute
-to 0, e.g. ``zswap.same_filled_pages_enabled=0``. It can also be enabled and
-disabled at runtime using the sysfs ``same_filled_pages_enabled``
-attribute, e.g.::
-
- echo 1 > /sys/module/zswap/parameters/same_filled_pages_enabled
-
-When zswap same-filled page identification is disabled at runtime, it will stop
-checking for the same-value filled pages during store operation.
-In other words, every page will be then considered non-same-value filled.
-However, the existing pages which are marked as same-value filled pages remain
-stored unchanged in zswap until they are either loaded or invalidated.
-
-In some circumstances it might be advantageous to make use of just the zswap
-ability to efficiently store same-filled pages without enabling the whole
-compressed page storage.
-In this case the handling of non-same-value pages by zswap (enabled by default)
-can be disabled by setting the ``non_same_filled_pages_enabled`` attribute
-to 0, e.g. ``zswap.non_same_filled_pages_enabled=0``.
-It can also be enabled and disabled at runtime using the sysfs
-``non_same_filled_pages_enabled`` attribute, e.g.::
-
- echo 1 > /sys/module/zswap/parameters/non_same_filled_pages_enabled
-
-Disabling both ``zswap.same_filled_pages_enabled`` and
-``zswap.non_same_filled_pages_enabled`` effectively disables accepting any new
-pages by zswap.
-
To prevent zswap from shrinking pool when zswap is full and there's a high
pressure on swap (this will result in flipping pages in and out zswap pool
without any real benefit but with a performance drop for the system), a
@@ -155,7 +126,7 @@ Setting this parameter to 100 will disable the hysteresis.
Some users cannot tolerate the swapping that comes with zswap store failures
and zswap writebacks. Swapping can be disabled entirely (without disabling
-zswap itself) on a cgroup-basis as follows:
+zswap itself) on a cgroup-basis as follows::
echo 0 > /sys/fs/cgroup/<cgroup-name>/memory.zswap.writeback
@@ -166,7 +137,7 @@ writeback (because the same pages might be rejected again and again).
When there is a sizable amount of cold memory residing in the zswap pool, it
can be advantageous to proactively write these cold pages to swap and reclaim
the memory for other use cases. By default, the zswap shrinker is disabled.
-User can enable it as follows:
+User can enable it as follows::
echo Y > /sys/module/zswap/parameters/shrinker_enabled
diff --git a/Documentation/admin-guide/nvme-multipath.rst b/Documentation/admin-guide/nvme-multipath.rst
new file mode 100644
index 000000000000..97ca1ccef459
--- /dev/null
+++ b/Documentation/admin-guide/nvme-multipath.rst
@@ -0,0 +1,72 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+====================
+Linux NVMe multipath
+====================
+
+This document describes NVMe multipath and its path selection policies supported
+by the Linux NVMe host driver.
+
+
+Introduction
+============
+
+The NVMe multipath feature in Linux integrates namespaces with the same
+identifier into a single block device. Using multipath enhances the reliability
+and stability of I/O access while improving bandwidth performance. When a user
+sends I/O to this merged block device, the multipath mechanism selects one of
+the underlying block devices (paths) according to the configured policy.
+Different policies result in different path selections.
+
+
+Policies
+========
+
+All policies follow the ANA (Asymmetric Namespace Access) mechanism, meaning
+that when an optimized path is available, it will be chosen over a non-optimized
+one. Currently, the NVMe multipath policies include numa (default), round-robin and
+queue-depth.
+
+To set the desired policy (e.g., round-robin), use one of the following methods:
+ 1. echo -n "round-robin" > /sys/module/nvme_core/parameters/iopolicy
+ 2. or add "nvme_core.iopolicy=round-robin" to the kernel command line.
+
+
+NUMA
+----
+
+The NUMA policy selects the path closest to the NUMA node of the current CPU for
+I/O distribution. This policy maintains the nearest paths to each NUMA node
+based on network interface connections.
+
+When to use the NUMA policy:
+ 1. Multi-core Systems: Optimizes memory access in multi-core and
+ multi-processor systems, especially under NUMA architecture.
+ 2. High Affinity Workloads: Binds I/O processing to the CPU to reduce
+ communication and data transfer delays across nodes.
+
+
+Round-Robin
+-----------
+
+The round-robin policy distributes I/O requests evenly across all paths to
+enhance throughput and resource utilization. Each I/O operation is sent to the
+next path in sequence.
+
+When to use the round-robin policy:
+ 1. Balanced Workloads: Effective for balanced and predictable workloads with
+ similar I/O size and type.
+ 2. Homogeneous Path Performance: Utilizes all paths efficiently when
+ performance characteristics (e.g., latency, bandwidth) are similar.
+
+
+Queue-Depth
+-----------
+
+The queue-depth policy manages I/O requests based on the current queue depth
+of each path, selecting the path with the least number of in-flight I/Os.
+
+When to use the queue-depth policy:
+ 1. High load with small I/Os: Effectively balances load across paths when
+ the load is high, and I/O operations consist of small, relatively
+ fixed-sized requests.
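
The three policies can be compared with a small model. The following Python sketch is illustrative only and is not the driver's implementation: the `NvmePath` class, its `numa_distance` and `inflight` fields, and the selector names are hypothetical, chosen to mirror the descriptions above.

```python
from dataclasses import dataclass


@dataclass
class NvmePath:
    name: str
    optimized: bool = True   # ANA state: optimized or non-optimized
    inflight: int = 0        # I/Os currently outstanding on this path
    numa_distance: int = 10  # hypothetical distance to the submitting CPU's node


def usable(paths):
    """Prefer ANA-optimized paths; fall back to all paths if none is optimized."""
    optimized = [p for p in paths if p.optimized]
    return optimized or list(paths)


def select_numa(paths):
    """numa: pick the path closest to the current NUMA node."""
    return min(usable(paths), key=lambda p: p.numa_distance)


def select_queue_depth(paths):
    """queue-depth: pick the path with the fewest in-flight I/Os."""
    return min(usable(paths), key=lambda p: p.inflight)


class RoundRobin:
    """round-robin: hand out the usable paths one after another."""

    def __init__(self):
        self._next = 0

    def select(self, paths):
        candidates = usable(paths)
        path = candidates[self._next % len(candidates)]
        self._next += 1
        return path
```

In this model every selector filters on the ANA state first, so a non-optimized path is only chosen once no optimized path remains, regardless of the configured policy.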
diff --git a/Documentation/admin-guide/perf/arm-ni.rst b/Documentation/admin-guide/perf/arm-ni.rst
new file mode 100644
index 000000000000..d26a8f697c36
--- /dev/null
+++ b/Documentation/admin-guide/perf/arm-ni.rst
@@ -0,0 +1,17 @@
+====================================
+Arm Network-on-Chip Interconnect PMU
+====================================
+
+NI-700 and friends implement a distinct PMU for each clock domain within the
+interconnect. Correspondingly, the driver exposes multiple PMU devices named
+arm_ni_<x>_cd_<y>, where <x> is an (arbitrary) instance identifier and <y> is
+the clock domain ID within that particular instance. If multiple NI instances
+exist within a system, the PMU devices can be correlated with the underlying
+hardware instance via sysfs parentage.
+
+Each PMU exposes base event aliases for the interface types present in its clock
+domain. These require qualifying with the "eventid" and "nodeid" parameters
+to specify the event code to count and the interface at which to count it
+(per the configured hardware ID as reflected in the xxNI_NODE_INFO register).
+The exception is the "cycles" alias for the PMU cycle counter, which is encoded
+with the PMU node type and needs no further qualification.
diff --git a/Documentation/admin-guide/perf/dwc_pcie_pmu.rst b/Documentation/admin-guide/perf/dwc_pcie_pmu.rst
index d47cd229d710..cb376f335f40 100644
--- a/Documentation/admin-guide/perf/dwc_pcie_pmu.rst
+++ b/Documentation/admin-guide/perf/dwc_pcie_pmu.rst
@@ -46,41 +46,41 @@ Some of the events only exist for specific configurations.
DesignWare Cores (DWC) PCIe PMU Driver
=======================================
-This driver adds PMU devices for each PCIe Root Port named based on the BDF of
+This driver adds PMU devices for each PCIe Root Port named based on the SBDF of
the Root Port. For example,
- 30:03.0 PCI bridge: Device 1ded:8000 (rev 01)
+ 0001:30:03.0 PCI bridge: Device 1ded:8000 (rev 01)
-the PMU device name for this Root Port is dwc_rootport_3018.
+the PMU device name for this Root Port is dwc_rootport_13018.
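
The name in this example is consistent with packing the SBDF into one integer and printing it in hexadecimal. The Python sketch below reproduces that mapping; the packing (segment << 16 | bus << 8 | device << 3 | function) is an assumption inferred from the example above, and `dwc_pmu_name` is a hypothetical helper, not part of the driver.

```python
def dwc_pmu_name(sbdf: str) -> str:
    """Build the PMU name from a 'SSSS:BB:DD.F' address string."""
    segment, bus, devfn = sbdf.split(":")
    device, function = devfn.split(".")
    # Assumed layout: segment in bits 16 and up, bus in bits 8-15,
    # device in bits 3-7, function in bits 0-2.
    value = (
        (int(segment, 16) << 16)
        | (int(bus, 16) << 8)
        | (int(device, 16) << 3)
        | int(function, 16)
    )
    return f"dwc_rootport_{value:x}"


print(dwc_pmu_name("0001:30:03.0"))  # dwc_rootport_13018
```

With segment 0000 the same helper yields dwc_rootport_3018, which matches the name used before the segment was included.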
The DWC PCIe PMU driver registers a perf PMU driver, which provides
description of available events and configuration options in sysfs, see
-/sys/bus/event_source/devices/dwc_rootport_{bdf}.
+/sys/bus/event_source/devices/dwc_rootport_{sbdf}.
The "format" directory describes format of the config fields of the
perf_event_attr structure. The "events" directory provides configuration
templates for all documented events. For example,
-"Rx_PCIe_TLP_Data_Payload" is an equivalent of "eventid=0x22,type=0x1".
+"rx_pcie_tlp_data_payload" is an equivalent of "eventid=0x21,type=0x0".
The "perf list" command shall list the available events from sysfs, e.g.::
$# perf list | grep dwc_rootport
<...>
- dwc_rootport_3018/Rx_PCIe_TLP_Data_Payload/ [Kernel PMU event]
+ dwc_rootport_13018/rx_pcie_tlp_data_payload/ [Kernel PMU event]
<...>
- dwc_rootport_3018/rx_memory_read,lane=?/ [Kernel PMU event]
+ dwc_rootport_13018/rx_memory_read,lane=?/ [Kernel PMU event]
Time Based Analysis Event Usage
-------------------------------
Example usage of counting PCIe RX TLP data payload (Units of bytes)::
- $# perf stat -a -e dwc_rootport_3018/Rx_PCIe_TLP_Data_Payload/
+ $# perf stat -a -e dwc_rootport_13018/rx_pcie_tlp_data_payload/
The average RX/TX bandwidth can be calculated using the following formula:
- PCIe RX Bandwidth = Rx_PCIe_TLP_Data_Payload / Measure_Time_Window
- PCIe TX Bandwidth = Tx_PCIe_TLP_Data_Payload / Measure_Time_Window
+ PCIe RX Bandwidth = rx_pcie_tlp_data_payload / Measure_Time_Window
+ PCIe TX Bandwidth = tx_pcie_tlp_data_payload / Measure_Time_Window
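
As a worked instance of the formula, the Python sketch below divides a payload count by the measurement window. The numbers are made up for illustration, and `average_bandwidth` is a hypothetical helper; in practice the payload value would come from the counter reported by "perf stat" and the window from its elapsed time.

```python
def average_bandwidth(payload_bytes: int, window_seconds: float) -> float:
    """Average bandwidth in bytes per second over the measurement window."""
    if window_seconds <= 0:
        raise ValueError("measurement window must be positive")
    return payload_bytes / window_seconds


# Hypothetical reading: 2,000,000,000 payload bytes counted over 5 seconds.
print(average_bandwidth(2_000_000_000, 5.0))  # 400000000.0
```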
Lane Event Usage
-------------------------------
@@ -88,7 +88,7 @@ Lane Event Usage
Each lane has the same event set and to avoid generating a list of hundreds
of events, the user need to specify the lane ID explicitly, e.g.::
- $# perf stat -a -e dwc_rootport_3018/rx_memory_read,lane=4/
+ $# perf stat -a -e dwc_rootport_13018/rx_memory_read,lane=4/
The driver does not support sampling, therefore "perf record" will not
work. Per-task (without "-a") perf sessions are not supported.
diff --git a/Documentation/admin-guide/perf/hisi-pcie-pmu.rst b/Documentation/admin-guide/perf/hisi-pcie-pmu.rst
index 7e863662e2d4..083ca50de896 100644
--- a/Documentation/admin-guide/perf/hisi-pcie-pmu.rst
+++ b/Documentation/admin-guide/perf/hisi-pcie-pmu.rst
@@ -28,7 +28,9 @@ The "identifier" sysfs file allows users to identify the version of the
PMU hardware device.
The "bus" sysfs file allows users to get the bus number of Root Ports
-monitored by PMU.
+monitored by PMU. Furthermore, users can get the range of Root Ports
+[bdf_min, bdf_max] from the "bdf_min" and "bdf_max" sysfs attributes
+respectively.
Example usage of perf::
@@ -37,9 +39,21 @@ Example usage of perf::
hisi_pcie0_core0/rx_mwr_cnt/ [kernel PMU event]
------------------------------------------
- $# perf stat -e hisi_pcie0_core0/rx_mwr_latency/
- $# perf stat -e hisi_pcie0_core0/rx_mwr_cnt/
- $# perf stat -g -e hisi_pcie0_core0/rx_mwr_latency/ -e hisi_pcie0_core0/rx_mwr_cnt/
+ $# perf stat -e hisi_pcie0_core0/rx_mwr_latency,port=0xffff/
+ $# perf stat -e hisi_pcie0_core0/rx_mwr_cnt,port=0xffff/
+
+Related events are usually used together to calculate bandwidth, latency or
+other metrics. They need to start and stop counting at the same time, so they
+are best placed in the same event group to get the expected value. There are
+two ways to tell whether events are related:
+
+a) By event name, such as the latency events "xxx_latency, xxx_cnt" or
+ bandwidth events "xxx_flux, xxx_time".
+b) By event type, such as "event=0xXXXX, event=0x1XXXX".
+
+Example usage of perf group::
+
+ $# perf stat -e "{hisi_pcie0_core0/rx_mwr_latency,port=0xffff/,hisi_pcie0_core0/rx_mwr_cnt,port=0xffff/}"
The current driver does not support sampling. So "perf record" is unsupported.
Also attach to a task is unsupported for PCIe PMU.
@@ -51,8 +65,12 @@ Filter options
PMU could only monitor the performance of traffic downstream target Root
Ports or downstream target Endpoint. PCIe PMU driver support "port" and
- "bdf" interfaces for users, and these two interfaces aren't supported at the
- same time.
+ "bdf" interfaces for users.
+ Please note that exactly one of these two interfaces must be set; they
+ cannot be used at the same time. If both are set, only the "port" filter
+ takes effect.
+ If the "port" filter is not set, or is explicitly set to zero (the default),
+ the "bdf" filter is in effect, because "bdf=0" means 0000:000:00.0.
- port
@@ -95,7 +113,7 @@ Filter options
Example usage of perf::
- $# perf stat -e hisi_pcie0_core0/rx_mrd_flux,trig_len=0x4,trig_mode=1/ sleep 5
+ $# perf stat -e hisi_pcie0_core0/rx_mrd_flux,port=0xffff,trig_len=0x4,trig_mode=1/ sleep 5
3. Threshold filter
@@ -109,7 +127,7 @@ Filter options
Example usage of perf::
- $# perf stat -e hisi_pcie0_core0/rx_mrd_flux,thr_len=0x4,thr_mode=1/ sleep 5
+ $# perf stat -e hisi_pcie0_core0/rx_mrd_flux,port=0xffff,thr_len=0x4,thr_mode=1/ sleep 5
4. TLP Length filter
@@ -127,4 +145,4 @@ Filter options
Example usage of perf::
- $# perf stat -e hisi_pcie0_core0/rx_mrd_flux,len_mode=0x1/ sleep 5
+ $# perf stat -e hisi_pcie0_core0/rx_mrd_flux,port=0xffff,len_mode=0x1/ sleep 5
diff --git a/Documentation/admin-guide/perf/hisi-pmu.rst b/Documentation/admin-guide/perf/hisi-pmu.rst
index e0174d20809a..48992a0b8e94 100644
--- a/Documentation/admin-guide/perf/hisi-pmu.rst
+++ b/Documentation/admin-guide/perf/hisi-pmu.rst
@@ -20,7 +20,6 @@ interrupt, and the PMU driver shall register perf PMU drivers like L3C,
HHA and DDRC etc. The available events and configuration options shall
be described in the sysfs, see:
-/sys/devices/hisi_sccl{X}_<l3c{Y}/hha{Y}/ddrc{Y}>/, or
/sys/bus/event_source/devices/hisi_sccl{X}_<l3c{Y}/hha{Y}/ddrc{Y}>.
The "perf list" command shall list the available events from sysfs.
@@ -36,7 +35,10 @@ e.g. hisi_sccl1_hha0/rx_operations is RX_OPERATIONS event of HHA index #0 in
SCCL ID #1.
The driver also provides a "cpumask" sysfs attribute, which shows the CPU core
-ID used to count the uncore PMU event.
+ID used to count the uncore PMU event. An "associated_cpus" sysfs attribute is
+also provided to show the CPUs associated with this PMU. The "cpumask" indicates
+the CPU on which to open the events, usually as a hint for userspace tools like
+perf. It contains exactly one CPU from "associated_cpus".
Example usage of perf::
diff --git a/Documentation/admin-guide/perf/hns3-pmu.rst b/Documentation/admin-guide/perf/hns3-pmu.rst
index 75a40846d47f..1195e570f2d6 100644
--- a/Documentation/admin-guide/perf/hns3-pmu.rst
+++ b/Documentation/admin-guide/perf/hns3-pmu.rst
@@ -16,7 +16,7 @@ HNS3 PMU driver
The HNS3 PMU driver registers a perf PMU with the name of its sicl id.::
- /sys/devices/hns3_pmu_sicl_<sicl_id>
+ /sys/bus/event_source/devices/hns3_pmu_sicl_<sicl_id>
PMU driver provides description of available events, filter modes, format,
identifier and cpumask in sysfs.
@@ -40,9 +40,9 @@ device.
Example usage of checking event code and subevent code::
- $# cat /sys/devices/hns3_pmu_sicl_0/events/dly_tx_normal_to_mac_time
+ $# cat /sys/bus/event_source/devices/hns3_pmu_sicl_0/events/dly_tx_normal_to_mac_time
config=0x00204
- $# cat /sys/devices/hns3_pmu_sicl_0/events/dly_tx_normal_to_mac_packet_num
+ $# cat /sys/bus/event_source/devices/hns3_pmu_sicl_0/events/dly_tx_normal_to_mac_packet_num
config=0x10204
Each performance statistic has a pair of events to get two values to
@@ -60,7 +60,7 @@ computation to calculate real performance data is:::
Example usage of checking supported filter mode::
- $# cat /sys/devices/hns3_pmu_sicl_0/filtermode/bw_ssu_rpu_byte_num
+ $# cat /sys/bus/event_source/devices/hns3_pmu_sicl_0/filtermode/bw_ssu_rpu_byte_num
filter mode supported: global/port/port-tc/func/func-queue/
Example usage of perf::
diff --git a/Documentation/admin-guide/perf/index.rst b/Documentation/admin-guide/perf/index.rst
index f4a4513c526f..072b510385c4 100644
--- a/Documentation/admin-guide/perf/index.rst
+++ b/Documentation/admin-guide/perf/index.rst
@@ -13,8 +13,12 @@ Performance monitor support
imx-ddr
qcom_l2_pmu
qcom_l3_pmu
+ starfive_starlink_pmu
+ mrvl-odyssey-ddr-pmu
+ mrvl-odyssey-tad-pmu
arm-ccn
arm-cmn
+ arm-ni
xgene-pmu
arm_dsu_pmu
thunderx2-pmu
@@ -24,3 +28,4 @@ Performance monitor support
meson-ddr-pmu
cxl
ampere_cspmu
+ mrvl-pem-pmu
diff --git a/Documentation/admin-guide/perf/mrvl-odyssey-ddr-pmu.rst b/Documentation/admin-guide/perf/mrvl-odyssey-ddr-pmu.rst
new file mode 100644
index 000000000000..2e817593a4d9
--- /dev/null
+++ b/Documentation/admin-guide/perf/mrvl-odyssey-ddr-pmu.rst
@@ -0,0 +1,80 @@
+===================================================================
+Marvell Odyssey DDR PMU Performance Monitoring Unit (PMU UNCORE)
+===================================================================
+
+Odyssey DRAM Subsystem supports eight counters for monitoring performance
+and software can program those counters to monitor any of the defined
+performance events. Supported performance events include those counted
+at the interface between the DDR controller and the PHY, interface between
+the DDR Controller and the CHI interconnect, or within the DDR Controller.
+
+Additionally, the DRAM Subsystem (DSS) supports two fixed performance event
+counters, one for DDR reads and the other for DDR writes.
+
+The counters operate in either manual or auto mode.
+
+The PMU driver exposes the available events and format options under sysfs::
+
+ /sys/bus/event_source/devices/mrvl_ddr_pmu_<>/events/
+ /sys/bus/event_source/devices/mrvl_ddr_pmu_<>/format/
+
+Examples::
+
+ $ perf list | grep ddr
+ mrvl_ddr_pmu_<>/ddr_act_bypass_access/ [Kernel PMU event]
+ mrvl_ddr_pmu_<>/ddr_bsm_alloc/ [Kernel PMU event]
+ mrvl_ddr_pmu_<>/ddr_bsm_starvation/ [Kernel PMU event]
+ mrvl_ddr_pmu_<>/ddr_cam_active_access/ [Kernel PMU event]
+ mrvl_ddr_pmu_<>/ddr_cam_mwr/ [Kernel PMU event]
+ mrvl_ddr_pmu_<>/ddr_cam_rd_active_access/ [Kernel PMU event]
+ mrvl_ddr_pmu_<>/ddr_cam_rd_or_wr_access/ [Kernel PMU event]
+ mrvl_ddr_pmu_<>/ddr_cam_read/ [Kernel PMU event]
+ mrvl_ddr_pmu_<>/ddr_cam_wr_access/ [Kernel PMU event]
+ mrvl_ddr_pmu_<>/ddr_cam_write/ [Kernel PMU event]
+ mrvl_ddr_pmu_<>/ddr_capar_error/ [Kernel PMU event]
+ mrvl_ddr_pmu_<>/ddr_crit_ref/ [Kernel PMU event]
+ mrvl_ddr_pmu_<>/ddr_ddr_reads/ [Kernel PMU event]
+ mrvl_ddr_pmu_<>/ddr_ddr_writes/ [Kernel PMU event]
+ mrvl_ddr_pmu_<>/ddr_dfi_cmd_is_retry/ [Kernel PMU event]
+ mrvl_ddr_pmu_<>/ddr_dfi_cycles/ [Kernel PMU event]
+ mrvl_ddr_pmu_<>/ddr_dfi_parity_poison/ [Kernel PMU event]
+ mrvl_ddr_pmu_<>/ddr_dfi_rd_data_access/ [Kernel PMU event]
+ mrvl_ddr_pmu_<>/ddr_dfi_wr_data_access/ [Kernel PMU event]
+ mrvl_ddr_pmu_<>/ddr_dqsosc_mpc/ [Kernel PMU event]
+ mrvl_ddr_pmu_<>/ddr_dqsosc_mrr/ [Kernel PMU event]
+ mrvl_ddr_pmu_<>/ddr_enter_mpsm/ [Kernel PMU event]
+ mrvl_ddr_pmu_<>/ddr_enter_powerdown/ [Kernel PMU event]
+ mrvl_ddr_pmu_<>/ddr_enter_selfref/ [Kernel PMU event]
+ mrvl_ddr_pmu_<>/ddr_hif_pri_rdaccess/ [Kernel PMU event]
+ mrvl_ddr_pmu_<>/ddr_hif_rd_access/ [Kernel PMU event]
+ mrvl_ddr_pmu_<>/ddr_hif_rd_or_wr_access/ [Kernel PMU event]
+ mrvl_ddr_pmu_<>/ddr_hif_rmw_access/ [Kernel PMU event]
+ mrvl_ddr_pmu_<>/ddr_hif_wr_access/ [Kernel PMU event]
+ mrvl_ddr_pmu_<>/ddr_hpri_sched_rd_crit_access/ [Kernel PMU event]
+ mrvl_ddr_pmu_<>/ddr_load_mode/ [Kernel PMU event]
+ mrvl_ddr_pmu_<>/ddr_lpri_sched_rd_crit_access/ [Kernel PMU event]
+ mrvl_ddr_pmu_<>/ddr_precharge/ [Kernel PMU event]
+ mrvl_ddr_pmu_<>/ddr_precharge_for_other/ [Kernel PMU event]
+ mrvl_ddr_pmu_<>/ddr_precharge_for_rdwr/ [Kernel PMU event]
+ mrvl_ddr_pmu_<>/ddr_raw_hazard/ [Kernel PMU event]
+ mrvl_ddr_pmu_<>/ddr_rd_bypass_access/ [Kernel PMU event]
+ mrvl_ddr_pmu_<>/ddr_rd_crc_error/ [Kernel PMU event]
+ mrvl_ddr_pmu_<>/ddr_rd_uc_ecc_error/ [Kernel PMU event]
+ mrvl_ddr_pmu_<>/ddr_rdwr_transitions/ [Kernel PMU event]
+ mrvl_ddr_pmu_<>/ddr_refresh/ [Kernel PMU event]
+ mrvl_ddr_pmu_<>/ddr_retry_fifo_full/ [Kernel PMU event]
+ mrvl_ddr_pmu_<>/ddr_spec_ref/ [Kernel PMU event]
+ mrvl_ddr_pmu_<>/ddr_tcr_mrr/ [Kernel PMU event]
+ mrvl_ddr_pmu_<>/ddr_war_hazard/ [Kernel PMU event]
+ mrvl_ddr_pmu_<>/ddr_waw_hazard/ [Kernel PMU event]
+ mrvl_ddr_pmu_<>/ddr_win_limit_reached_rd/ [Kernel PMU event]
+ mrvl_ddr_pmu_<>/ddr_win_limit_reached_wr/ [Kernel PMU event]
+ mrvl_ddr_pmu_<>/ddr_wr_crc_error/ [Kernel PMU event]
+ mrvl_ddr_pmu_<>/ddr_wr_trxn_crit_access/ [Kernel PMU event]
+ mrvl_ddr_pmu_<>/ddr_write_combine/ [Kernel PMU event]
+ mrvl_ddr_pmu_<>/ddr_zqcl/ [Kernel PMU event]
+ mrvl_ddr_pmu_<>/ddr_zqlatch/ [Kernel PMU event]
+ mrvl_ddr_pmu_<>/ddr_zqstart/ [Kernel PMU event]
+
+ $ perf stat -e ddr_cam_read,ddr_cam_write,ddr_cam_active_access,ddr_cam_rd_or_wr_access,ddr_cam_rd_active_access,ddr_cam_mwr <workload>
diff --git a/Documentation/admin-guide/perf/mrvl-odyssey-tad-pmu.rst b/Documentation/admin-guide/perf/mrvl-odyssey-tad-pmu.rst
new file mode 100644
index 000000000000..ad1975b14087
--- /dev/null
+++ b/Documentation/admin-guide/perf/mrvl-odyssey-tad-pmu.rst
@@ -0,0 +1,37 @@
+====================================================================
+Marvell Odyssey LLC-TAD Performance Monitoring Unit (PMU UNCORE)
+====================================================================
+
+Each TAD provides eight 64-bit counters for monitoring
+cache behavior. The driver always configures the same counter for
+all the TADs. The user would end up effectively reserving one of
+eight counters in every TAD to look across all TADs.
+The occurrences of events are aggregated and presented to the user
+at the end of running the workload. The driver does not provide a
+way for the user to partition TADs so that different TADs are used for
+different applications.
+
+The performance events reflect various internal or interface activities.
+By combining the values from multiple performance counters, cache
+performance can be measured in terms such as: cache miss rate, cache
+allocations, interface retry rate, internal resource occupancy, etc.
+
+The PMU driver exposes the available events and format options under sysfs::
+
+ /sys/bus/event_source/devices/tad/events/
+ /sys/bus/event_source/devices/tad/format/
+
+Examples::
+
+ $ perf list | grep tad
+ tad/tad_alloc_any/ [Kernel PMU event]
+ tad/tad_alloc_dtg/ [Kernel PMU event]
+ tad/tad_alloc_ltg/ [Kernel PMU event]
+ tad/tad_hit_any/ [Kernel PMU event]
+ tad/tad_hit_dtg/ [Kernel PMU event]
+ tad/tad_hit_ltg/ [Kernel PMU event]
+ tad/tad_req_msh_in_exlmn/ [Kernel PMU event]
+ tad/tad_tag_rd/ [Kernel PMU event]
+ tad/tad_tot_cycle/ [Kernel PMU event]
+
+ $ perf stat -e tad_alloc_dtg,tad_alloc_ltg,tad_alloc_any,tad_hit_dtg,tad_hit_ltg,tad_hit_any,tad_tag_rd <workload>
diff --git a/Documentation/admin-guide/perf/mrvl-pem-pmu.rst b/Documentation/admin-guide/perf/mrvl-pem-pmu.rst
new file mode 100644
index 000000000000..c39007149b97
--- /dev/null
+++ b/Documentation/admin-guide/perf/mrvl-pem-pmu.rst
@@ -0,0 +1,56 @@
+=================================================================
+Marvell Odyssey PEM Performance Monitoring Unit (PMU UNCORE)
+=================================================================
+
+The PCI Express Interface Units (PEM) are each associated with a corresponding
+monitoring unit. This includes performance counters to track various
+characteristics of the data that is transmitted over the PCIe link.
+
+The counters track inbound and outbound transactions, with separate
+counters for posted/non-posted/completion TLPs. Inbound and outbound
+memory read requests, along with their latencies, can also be monitored.
+Address Translation Services (ATS) events such as ATS Translation, ATS Page
+Request and ATS Invalidation, along with their corresponding latencies, are
+also tracked.
+
+There are separate 64-bit counters to measure posted/non-posted/completion
+TLPs in inbound and outbound transactions. ATS events are measured by
+different counters.
+
+The PMU driver exposes the available events and format options under sysfs::
+
+ /sys/bus/event_source/devices/mrvl_pcie_rc_pmu_<>/events/
+ /sys/bus/event_source/devices/mrvl_pcie_rc_pmu_<>/format/
+
+Examples::
+
+ # perf list | grep mrvl_pcie_rc_pmu
+ mrvl_pcie_rc_pmu_<>/ats_inv/ [Kernel PMU event]
+ mrvl_pcie_rc_pmu_<>/ats_inv_latency/ [Kernel PMU event]
+ mrvl_pcie_rc_pmu_<>/ats_pri/ [Kernel PMU event]
+ mrvl_pcie_rc_pmu_<>/ats_pri_latency/ [Kernel PMU event]
+ mrvl_pcie_rc_pmu_<>/ats_trans/ [Kernel PMU event]
+ mrvl_pcie_rc_pmu_<>/ats_trans_latency/ [Kernel PMU event]
+ mrvl_pcie_rc_pmu_<>/ib_inflight/ [Kernel PMU event]
+ mrvl_pcie_rc_pmu_<>/ib_reads/ [Kernel PMU event]
+ mrvl_pcie_rc_pmu_<>/ib_req_no_ro_ebus/ [Kernel PMU event]
+ mrvl_pcie_rc_pmu_<>/ib_req_no_ro_ncb/ [Kernel PMU event]
+ mrvl_pcie_rc_pmu_<>/ib_tlp_cpl_partid/ [Kernel PMU event]
+ mrvl_pcie_rc_pmu_<>/ib_tlp_dwords_cpl_partid/ [Kernel PMU event]
+ mrvl_pcie_rc_pmu_<>/ib_tlp_dwords_npr/ [Kernel PMU event]
+ mrvl_pcie_rc_pmu_<>/ib_tlp_dwords_pr/ [Kernel PMU event]
+ mrvl_pcie_rc_pmu_<>/ib_tlp_npr/ [Kernel PMU event]
+ mrvl_pcie_rc_pmu_<>/ib_tlp_pr/ [Kernel PMU event]
+ mrvl_pcie_rc_pmu_<>/ob_inflight_partid/ [Kernel PMU event]
+ mrvl_pcie_rc_pmu_<>/ob_merges_cpl_partid/ [Kernel PMU event]
+ mrvl_pcie_rc_pmu_<>/ob_merges_npr_partid/ [Kernel PMU event]
+ mrvl_pcie_rc_pmu_<>/ob_merges_pr_partid/ [Kernel PMU event]
+ mrvl_pcie_rc_pmu_<>/ob_reads_partid/ [Kernel PMU event]
+ mrvl_pcie_rc_pmu_<>/ob_tlp_cpl_partid/ [Kernel PMU event]
+ mrvl_pcie_rc_pmu_<>/ob_tlp_dwords_cpl_partid/ [Kernel PMU event]
+ mrvl_pcie_rc_pmu_<>/ob_tlp_dwords_npr_partid/ [Kernel PMU event]
+ mrvl_pcie_rc_pmu_<>/ob_tlp_dwords_pr_partid/ [Kernel PMU event]
+ mrvl_pcie_rc_pmu_<>/ob_tlp_npr_partid/ [Kernel PMU event]
+ mrvl_pcie_rc_pmu_<>/ob_tlp_pr_partid/ [Kernel PMU event]
+
+
+ # perf stat -e ib_inflight,ib_reads,ib_req_no_ro_ebus,ib_req_no_ro_ncb <workload>
diff --git a/Documentation/admin-guide/perf/nvidia-pmu.rst b/Documentation/admin-guide/perf/nvidia-pmu.rst
index 2e0d47cfe7ea..f538ef67e0e8 100644
--- a/Documentation/admin-guide/perf/nvidia-pmu.rst
+++ b/Documentation/admin-guide/perf/nvidia-pmu.rst
@@ -34,7 +34,7 @@ strongly-ordered (SO) PCIE write traffic to local/remote memory. Please see
traffic coverage.
The events and configuration options of this PMU device are described in sysfs,
-see /sys/bus/event_sources/devices/nvidia_scf_pmu_<socket-id>.
+see /sys/bus/event_source/devices/nvidia_scf_pmu_<socket-id>.
Example usage:
@@ -66,7 +66,7 @@ Please see :ref:`NVIDIA_Uncore_PMU_Traffic_Coverage_Section` for more info about
the PMU traffic coverage.
The events and configuration options of this PMU device are described in sysfs,
-see /sys/bus/event_sources/devices/nvidia_nvlink_c2c0_pmu_<socket-id>.
+see /sys/bus/event_source/devices/nvidia_nvlink_c2c0_pmu_<socket-id>.
Example usage:
@@ -86,6 +86,22 @@ Example usage:
perf stat -a -e nvidia_nvlink_c2c0_pmu_3/event=0x0/
+The NVLink-C2C has two ports that can be connected to one GPU (occupying both
+ports) or to two GPUs (one GPU per port). The user can use the "port" bitmap
+parameter to select the port(s) to monitor. Each bit represents a port number,
+e.g. "port=0x1" corresponds to port 0 and "port=0x3" is for port 0 and 1. The
+PMU will monitor both ports by default if the parameter is not specified.
+
+Example for port filtering:
+
+* Count event id 0x0 from the GPU connected with socket 0 on port 0::
+
+ perf stat -a -e nvidia_nvlink_c2c0_pmu_0/event=0x0,port=0x1/
+
+* Count event id 0x0 from the GPUs connected with socket 0 on port 0 and port 1::
+
+ perf stat -a -e nvidia_nvlink_c2c0_pmu_0/event=0x0,port=0x3/
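
The bitmap parameters in this file ("port", "rem_socket", "root_port") all follow the same convention: bit N selects item N. The Python sketch below converts between a bitmap and the list of selected numbers; both helper names are hypothetical and the code does not touch any PMU.

```python
def selected_indices(bitmap: int) -> list:
    """Return the port/socket numbers whose bits are set in the bitmap."""
    return [bit for bit in range(bitmap.bit_length()) if (bitmap >> bit) & 1]


def make_bitmap(indices) -> int:
    """Build the bitmap value that selects the given port/socket numbers."""
    bitmap = 0
    for index in indices:
        bitmap |= 1 << index
    return bitmap


print(selected_indices(0x1))         # [0]
print(selected_indices(0x3))         # [0, 1]
print(hex(make_bitmap([1, 2, 3])))   # 0xe
```

The last line shows how a value such as "rem_socket=0xE" comes about when sockets 1 to 3 are selected.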
+
NVLink-C2C1 PMU
-------------------
@@ -96,7 +112,7 @@ Please see :ref:`NVIDIA_Uncore_PMU_Traffic_Coverage_Section` for more info about
the PMU traffic coverage.
The events and configuration options of this PMU device are described in sysfs,
-see /sys/bus/event_sources/devices/nvidia_nvlink_c2c1_pmu_<socket-id>.
+see /sys/bus/event_source/devices/nvidia_nvlink_c2c1_pmu_<socket-id>.
Example usage:
@@ -116,6 +132,22 @@ Example usage:
perf stat -a -e nvidia_nvlink_c2c1_pmu_3/event=0x0/
+The NVLink-C2C has two ports that can be connected to one GPU (occupying both
+ports) or to two GPUs (one GPU per port). The user can use the "port" bitmap
+parameter to select the port(s) to monitor. Each bit represents a port number,
+e.g. "port=0x1" corresponds to port 0 and "port=0x3" is for port 0 and 1. The
+PMU will monitor both ports by default if the parameter is not specified.
+
+Example for port filtering:
+
+* Count event id 0x0 from the GPU connected with socket 0 on port 0::
+
+ perf stat -a -e nvidia_nvlink_c2c1_pmu_0/event=0x0,port=0x1/
+
+* Count event id 0x0 from the GPUs connected with socket 0 on port 0 and port 1::
+
+ perf stat -a -e nvidia_nvlink_c2c1_pmu_0/event=0x0,port=0x3/
+
CNVLink PMU
---------------
@@ -125,13 +157,14 @@ to local memory. For PCIE traffic, this PMU captures read and relaxed ordered
for more info about the PMU traffic coverage.
The events and configuration options of this PMU device are described in sysfs,
-see /sys/bus/event_sources/devices/nvidia_cnvlink_pmu_<socket-id>.
+see /sys/bus/event_source/devices/nvidia_cnvlink_pmu_<socket-id>.
Each SoC socket can be connected to one or more sockets via CNVLink. The user can
use "rem_socket" bitmap parameter to select the remote socket(s) to monitor.
Each bit represents the socket number, e.g. "rem_socket=0xE" corresponds to
-socket 1 to 3.
-/sys/bus/event_sources/devices/nvidia_cnvlink_pmu_<socket-id>/format/rem_socket
+socket 1 to 3. The PMU will monitor all remote sockets by default if not
+specified.
+/sys/bus/event_source/devices/nvidia_cnvlink_pmu_<socket-id>/format/rem_socket
shows the valid bits that can be set in the "rem_socket" parameter.
The PMU can not distinguish the remote traffic initiator, therefore it does not
@@ -165,12 +198,13 @@ local/remote memory. Please see :ref:`NVIDIA_Uncore_PMU_Traffic_Coverage_Section
for more info about the PMU traffic coverage.
The events and configuration options of this PMU device are described in sysfs,
-see /sys/bus/event_sources/devices/nvidia_pcie_pmu_<socket-id>.
+see /sys/bus/event_source/devices/nvidia_pcie_pmu_<socket-id>.
Each SoC socket can support multiple root ports. The user can use
"root_port" bitmap parameter to select the port(s) to monitor, i.e.
-"root_port=0xF" corresponds to root port 0 to 3.
-/sys/bus/event_sources/devices/nvidia_pcie_pmu_<socket-id>/format/root_port
+"root_port=0xF" corresponds to root port 0 to 3. The PMU will monitor all root
+ports by default if not specified.
+/sys/bus/event_source/devices/nvidia_pcie_pmu_<socket-id>/format/root_port
shows the valid bits that can be set in the "root_port" parameter.
Example usage:
diff --git a/Documentation/admin-guide/perf/qcom_l2_pmu.rst b/Documentation/admin-guide/perf/qcom_l2_pmu.rst
index c130178a4a55..c37c6be9b8d8 100644
--- a/Documentation/admin-guide/perf/qcom_l2_pmu.rst
+++ b/Documentation/admin-guide/perf/qcom_l2_pmu.rst
@@ -10,7 +10,7 @@ There is one logical L2 PMU exposed, which aggregates the results from
the physical PMUs.
The driver provides a description of its available events and configuration
-options in sysfs, see /sys/devices/l2cache_0.
+options in sysfs, see /sys/bus/event_source/devices/l2cache_0.
The "format" directory describes the format of the events.
diff --git a/Documentation/admin-guide/perf/qcom_l3_pmu.rst b/Documentation/admin-guide/perf/qcom_l3_pmu.rst
index a3d014a46bfd..a66556b7e985 100644
--- a/Documentation/admin-guide/perf/qcom_l3_pmu.rst
+++ b/Documentation/admin-guide/perf/qcom_l3_pmu.rst
@@ -9,7 +9,7 @@ PMU with device name l3cache_<socket>_<instance>. User space is responsible
for aggregating across slices.
The driver provides a description of its available events and configuration
-options in sysfs, see /sys/devices/l3cache*. Given that these are uncore PMUs
+options in sysfs, see /sys/bus/event_source/devices/l3cache*. Given that these are uncore PMUs
the driver also exposes a "cpumask" sysfs attribute which contains a mask
consisting of one CPU per socket which will be used to handle all the PMU
events on that socket.
diff --git a/Documentation/admin-guide/perf/starfive_starlink_pmu.rst b/Documentation/admin-guide/perf/starfive_starlink_pmu.rst
new file mode 100644
index 000000000000..2932ddb4eb76
--- /dev/null
+++ b/Documentation/admin-guide/perf/starfive_starlink_pmu.rst
@@ -0,0 +1,46 @@
+================================================
+StarFive StarLink Performance Monitor Unit (PMU)
+================================================
+
+StarFive StarLink Performance Monitor Unit (PMU) exists within the
+StarLink Coherent Network on Chip (CNoC) that connects multiple CPU
+clusters with an L3 memory system.
+
+The uncore PMU supports an overflow interrupt, up to 16 programmable 64-bit
+event counters, and an independent 64-bit cycle counter.
+The PMU can only be accessed via Memory Mapped I/O and is common to the
+cores connected to the same PMU.
+
+Driver exposes supported PMU events in sysfs "events" directory under::
+
+ /sys/bus/event_source/devices/starfive_starlink_pmu/events/
+
+Driver exposes the CPU used to handle PMU events in the sysfs "cpumask"
+attribute at::
+
+ /sys/bus/event_source/devices/starfive_starlink_pmu/cpumask
+
+Driver describes the format of config (event ID) in sysfs "format" directory
+under::
+
+ /sys/bus/event_source/devices/starfive_starlink_pmu/format/
+
+Example of perf usage::
+
+ $ perf list
+
+ starfive_starlink_pmu/cycles/ [Kernel PMU event]
+ starfive_starlink_pmu/read_hit/ [Kernel PMU event]
+ starfive_starlink_pmu/read_miss/ [Kernel PMU event]
+ starfive_starlink_pmu/read_request/ [Kernel PMU event]
+ starfive_starlink_pmu/release_request/ [Kernel PMU event]
+ starfive_starlink_pmu/write_hit/ [Kernel PMU event]
+ starfive_starlink_pmu/write_miss/ [Kernel PMU event]
+ starfive_starlink_pmu/write_request/ [Kernel PMU event]
+ starfive_starlink_pmu/writeback/ [Kernel PMU event]
+
+
+ $ perf stat -a -e starfive_starlink_pmu/cycles/ sleep 1
+
+Sampling is not supported. As a result, "perf record" is not supported.
+Attaching to a task is not supported; only system-wide counting is supported.
diff --git a/Documentation/admin-guide/perf/thunderx2-pmu.rst b/Documentation/admin-guide/perf/thunderx2-pmu.rst
index 01f158238ae1..9255f7bf9452 100644
--- a/Documentation/admin-guide/perf/thunderx2-pmu.rst
+++ b/Documentation/admin-guide/perf/thunderx2-pmu.rst
@@ -22,7 +22,7 @@ The thunderx2_pmu driver registers per-socket perf PMUs for the DMC and
L3C devices. Each PMU can be used to count up to 4 (DMC/L3C) or up to 8
(CCPI2) events simultaneously. The PMUs provide a description of their
available events and configuration options under sysfs, see
-/sys/devices/uncore_<l3c_S/dmc_S/ccpi2_S/>; S is the socket id.
+/sys/bus/event_source/devices/uncore_<l3c_S/dmc_S/ccpi2_S/>; S is the socket id.
The driver does not support sampling, therefore "perf record" will not
work. Per-task perf sessions are also not supported.
diff --git a/Documentation/admin-guide/perf/xgene-pmu.rst b/Documentation/admin-guide/perf/xgene-pmu.rst
index 644f8ed89152..98ccb8e777c4 100644
--- a/Documentation/admin-guide/perf/xgene-pmu.rst
+++ b/Documentation/admin-guide/perf/xgene-pmu.rst
@@ -13,7 +13,7 @@ PMU (perf) driver
The xgene-pmu driver registers several perf PMU drivers. Each of the perf
driver provides description of its available events and configuration options
-in sysfs, see /sys/devices/<l3cX/iobX/mcbX/mcX>/.
+in sysfs, see /sys/bus/event_source/devices/<l3cX/iobX/mcbX/mcX>/.
The "format" directory describes format of the config (event ID),
config1 (agent ID) fields of the perf_event_attr structure. The "events"
diff --git a/Documentation/admin-guide/pm/amd-pstate.rst b/Documentation/admin-guide/pm/amd-pstate.rst
index 9eb26014d34b..412423c54f25 100644
--- a/Documentation/admin-guide/pm/amd-pstate.rst
+++ b/Documentation/admin-guide/pm/amd-pstate.rst
@@ -262,6 +262,17 @@ lowest non-linear performance in `AMD CPPC Performance Capability
<perf_cap_>`_.)
This attribute is read-only.
+``amd_pstate_hw_prefcore``
+
+Whether the platform supports the preferred core feature and it has been
+enabled. This attribute is read-only.
+
+``amd_pstate_prefcore_ranking``
+
+The performance ranking of the core. This number doesn't have any unit, but
+larger numbers are preferred at the time of reading. This can change at
+runtime based on platform conditions. This attribute is read-only.
+
``energy_performance_available_preferences``
A list of all the supported EPP preferences that could be used for
@@ -281,6 +292,22 @@ integer values defined between 0 to 255 when EPP feature is enabled by platform
firmware, if EPP feature is disabled, driver will ignore the written value
This attribute is read-write.
+``boost``
+The `boost` sysfs attribute provides control over the CPU core
+performance boost, allowing users to manage the maximum frequency limitation
+of the CPU. This attribute can be used to enable or disable the boost feature
+on individual CPUs.
+
+When the boost feature is enabled, the CPU can dynamically increase its frequency
+beyond the base frequency, providing enhanced performance for demanding workloads.
+On the other hand, disabling the boost feature restricts the CPU to operate at the
+base frequency, which may be desirable in certain scenarios to prioritize power
+efficiency or manage temperature.
+
+To manipulate the `boost` attribute, users can write a value of `0` to disable the
+boost or `1` to enable it, for the respective CPU using the sysfs path
+`/sys/devices/system/cpu/cpuX/cpufreq/boost`, where `X` represents the CPU number.
+
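The ``boost`` attribute described above can be toggled with a small shell helper. This is a minimal sketch, not part of the driver or of any existing tool: the function name ``set_boost`` and its optional third argument (an alternative sysfs root, added only so the helper can be tried against a scratch directory) are assumptions made for illustration.

```shell
#!/bin/sh
# Hypothetical helper: set_boost CPU VALUE [SYSFS_ROOT]
# Writes 0 (disable) or 1 (enable) to the per-CPU boost attribute.
# SYSFS_ROOT is an illustrative assumption; it defaults to /sys.
set_boost() {
    cpu=$1
    value=$2
    root=${3:-/sys}

    # Only the two values named in the text are accepted.
    case $value in
        0|1) ;;
        *) echo "boost value must be 0 or 1" >&2; return 1 ;;
    esac

    attr="$root/devices/system/cpu/cpu$cpu/cpufreq/boost"
    if [ ! -w "$attr" ]; then
        echo "$attr is missing or not writable" >&2
        return 1
    fi

    echo "$value" > "$attr"
}
```

For example, ``set_boost 3 0`` would disable boost on CPU 3. Writing to sysfs normally requires root privileges.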
Other performance and frequency values can be read back from
``/sys/devices/system/cpu/cpuX/acpi_cppc/``, see :ref:`cppc_sysfs`.
@@ -300,8 +327,8 @@ platforms. The AMD P-States mechanism is the more performance and energy
efficiency frequency management method on AMD processors.
-AMD Pstate Driver Operation Modes
-=================================
+``amd-pstate`` Driver Operation Modes
+======================================
``amd_pstate`` CPPC has 3 operation modes: autonomous (active) mode,
non-autonomous (passive) mode and guided autonomous (guided) mode.
@@ -353,6 +380,48 @@ is activated. In this mode, driver requests minimum and maximum performance
level and the platform autonomously selects a performance level in this range
and appropriate to the current workload.
+``amd-pstate`` Preferred Core
+=================================
+
+The core frequency is subject to process variation in semiconductors.
+Not all cores are able to reach the maximum frequency while respecting the
+infrastructure limits. Consequently, AMD has redefined the concept of
+maximum frequency of a part. This means that only a fraction of cores can reach
+maximum frequency. To find the best process scheduling policy for a given
+scenario, the OS needs to know the core ordering, which the platform reports
+through the highest performance capability register of the CPPC interface.
+
+``amd-pstate`` preferred core enables the scheduler to prefer scheduling on
+cores that can achieve a higher frequency with lower voltage. The preferred
+core rankings can dynamically change based on the workload, platform conditions,
+thermals and ageing.
+
+The priority metric will be initialized by the ``amd-pstate`` driver. The ``amd-pstate``
+driver will also determine whether or not ``amd-pstate`` preferred core is
+supported by the platform.
+
+The ``amd-pstate`` driver provides an initial core ordering when the system boots.
+The platform uses the CPPC interfaces to communicate the core ranking to the
+operating system and scheduler, so that the OS chooses the highest-performance
+cores first when scheduling processes. When the ``amd-pstate`` driver receives
+a message indicating a highest performance change, it updates the core ranking
+and sets the CPU's priority.
+
+``amd-pstate`` Preferred Core Switch
+=====================================
+Kernel Parameters
+-----------------
+
+``amd-pstate`` preferred core has two states: enable and disable.
+Enable/disable states can be chosen by different kernel parameters.
+``amd-pstate`` preferred core is enabled by default.
+
+``amd_prefcore=disable``
+
+For systems that support ``amd-pstate`` preferred core, the core rankings will
+always be advertised by the platform. But OS can choose to ignore that via the
+kernel parameter ``amd_prefcore=disable``.
+
User Space Interface in ``sysfs`` - General
===========================================
@@ -364,7 +433,7 @@ control its functionality at the system level. They are located in the
``/sys/devices/system/cpu/amd_pstate/`` directory and affect all CPUs.
``status``
- Operation mode of the driver: "active", "passive" or "disable".
+ Operation mode of the driver: "active", "passive", "guided" or "disable".
"active"
The driver is functional and in the ``active mode``
@@ -385,6 +454,19 @@ control its functionality at the system level. They are located in the
to the operation mode represented by that string - or to be
unregistered in the "disable" case.
+``prefcore``
+ Preferred core state of the driver: "enabled" or "disabled".
+
+ "enabled"
+ Enable the ``amd-pstate`` preferred core.
+
+ "disabled"
+ Disable the ``amd-pstate`` preferred core.
+
+
+ This attribute is read-only to check the state of preferred core set
+ by the kernel parameter.
+
``cpupower`` tool support for ``amd-pstate``
===============================================
diff --git a/Documentation/admin-guide/pm/cpufreq.rst b/Documentation/admin-guide/pm/cpufreq.rst
index 6adb7988e0eb..a21369eba034 100644
--- a/Documentation/admin-guide/pm/cpufreq.rst
+++ b/Documentation/admin-guide/pm/cpufreq.rst
@@ -267,6 +267,10 @@ are the following:
``related_cpus``
List of all (online and offline) CPUs belonging to this policy.
+``scaling_available_frequencies``
+ List of available frequencies of the CPUs belonging to this policy
+ (in kHz).
+
``scaling_available_governors``
List of ``CPUFreq`` scaling governors present in the kernel that can
be attached to this policy or (if the |intel_pstate| scaling driver is
@@ -421,8 +425,8 @@ This governor exposes only one tunable:
``rate_limit_us``
Minimum time (in microseconds) that has to pass between two consecutive
- runs of governor computations (default: 1000 times the scaling driver's
- transition latency).
+ runs of governor computations (default: 1.5 times the scaling driver's
+ transition latency or the maximum 2ms).
The purpose of this tunable is to reduce the scheduler context overhead
of the governor which might be excessive without it.
@@ -470,17 +474,17 @@ This governor exposes the following tunables:
This is how often the governor's worker routine should run, in
microseconds.
- Typically, it is set to values of the order of 10000 (10 ms). Its
- default value is equal to the value of ``cpuinfo_transition_latency``
- for each policy this governor is attached to (but since the unit here
- is greater by 1000, this means that the time represented by
- ``sampling_rate`` is 1000 times greater than the transition latency by
- default).
+ Typically, it is set to values of the order of 2000 (2 ms). By
+ default, it equals ``cpuinfo_transition_latency`` plus 50%
+ breathing room for each policy this governor is attached to.
+ The minimum is typically the length of two scheduler
+ ticks.
If this tunable is per-policy, the following shell command sets the time
- represented by it to be 750 times as high as the transition latency::
+ represented by it to be 1.5 times as high as the transition latency
+ (the default)::
- # echo `$(($(cat cpuinfo_transition_latency) * 750 / 1000)) > ondemand/sampling_rate
+ # echo $(($(cat cpuinfo_transition_latency) * 3 / 2)) > ondemand/sampling_rate
``up_threshold``
If the estimated CPU load is above this value (in percent), the governor
diff --git a/Documentation/admin-guide/pm/cpuidle.rst b/Documentation/admin-guide/pm/cpuidle.rst
index 19754beb5a4e..eb58d7a5affd 100644
--- a/Documentation/admin-guide/pm/cpuidle.rst
+++ b/Documentation/admin-guide/pm/cpuidle.rst
@@ -269,27 +269,7 @@ Namely, when invoked to select an idle state for a CPU (i.e. an idle state that
the CPU will ask the processor hardware to enter), it attempts to predict the
idle duration and uses the predicted value for idle state selection.
-It first obtains the time until the closest timer event with the assumption
-that the scheduler tick will be stopped. That time, referred to as the *sleep
-length* in what follows, is the upper bound on the time before the next CPU
-wakeup. It is used to determine the sleep length range, which in turn is needed
-to get the sleep length correction factor.
-
-The ``menu`` governor maintains two arrays of sleep length correction factors.
-One of them is used when tasks previously running on the given CPU are waiting
-for some I/O operations to complete and the other one is used when that is not
-the case. Each array contains several correction factor values that correspond
-to different sleep length ranges organized so that each range represented in the
-array is approximately 10 times wider than the previous one.
-
-The correction factor for the given sleep length range (determined before
-selecting the idle state for the CPU) is updated after the CPU has been woken
-up and the closer the sleep length is to the observed idle duration, the closer
-to 1 the correction factor becomes (it must fall between 0 and 1 inclusive).
-The sleep length is multiplied by the correction factor for the range that it
-falls into to obtain the first approximation of the predicted idle duration.
-
-Next, the governor uses a simple pattern recognition algorithm to refine its
+It first uses a simple pattern recognition algorithm to obtain a preliminary
idle duration prediction. Namely, it saves the last 8 observed idle duration
values and, when predicting the idle duration next time, it computes the average
and variance of them. If the variance is small (smaller than 400 square
@@ -301,29 +281,39 @@ Again, if the variance of them is small (in the above sense), the average is
taken as the "typical interval" value and so on, until either the "typical
interval" is determined or too many data points are disregarded, in which case
the "typical interval" is assumed to equal "infinity" (the maximum unsigned
-integer value). The "typical interval" computed this way is compared with the
-sleep length multiplied by the correction factor and the minimum of the two is
-taken as the predicted idle duration.
-
-Then, the governor computes an extra latency limit to help "interactive"
-workloads. It uses the observation that if the exit latency of the selected
-idle state is comparable with the predicted idle duration, the total time spent
-in that state probably will be very short and the amount of energy to save by
-entering it will be relatively small, so likely it is better to avoid the
-overhead related to entering that state and exiting it. Thus selecting a
-shallower state is likely to be a better option then. The first approximation
-of the extra latency limit is the predicted idle duration itself which
-additionally is divided by a value depending on the number of tasks that
-previously ran on the given CPU and now they are waiting for I/O operations to
-complete. The result of that division is compared with the latency limit coming
-from the power management quality of service, or `PM QoS <cpu-pm-qos_>`_,
-framework and the minimum of the two is taken as the limit for the idle states'
-exit latency.
+integer value).
+
+If the "typical interval" computed this way is long enough, the governor obtains
+the time until the closest timer event with the assumption that the scheduler
+tick will be stopped. That time, referred to as the *sleep length* in what follows,
+is the upper bound on the time before the next CPU wakeup. It is used to determine
+the sleep length range, which in turn is needed to get the sleep length correction
+factor.
+
+The ``menu`` governor maintains an array containing several correction factor
+values that correspond to different sleep length ranges organized so that each
+range represented in the array is approximately 10 times wider than the previous
+one.
+
+The correction factor for the given sleep length range (determined before
+selecting the idle state for the CPU) is updated after the CPU has been woken
+up and the closer the sleep length is to the observed idle duration, the closer
+to 1 the correction factor becomes (it must fall between 0 and 1 inclusive).
+The sleep length is multiplied by the correction factor for the range that it
+falls into to obtain an approximation of the predicted idle duration that is
+compared to the "typical interval" determined previously and the minimum of
+the two is taken as the idle duration prediction.
+
+If the "typical interval" value is small, which means that the CPU is likely
+to be woken up soon enough, the sleep length computation is skipped as it may
+be costly and the idle duration is simply predicted to equal the "typical
+interval" value.
Now, the governor is ready to walk the list of idle states and choose one of
them. For this purpose, it compares the target residency of each state with
-the predicted idle duration and the exit latency of it with the computed latency
-limit. It selects the state with the target residency closest to the predicted
+the predicted idle duration and the exit latency of it with the latency
+limit coming from the power management quality of service, or `PM QoS <cpu-pm-qos_>`_,
+framework. It selects the state with the target residency closest to the predicted
idle duration, but still below it, and exit latency that does not exceed the
limit.
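The prediction and selection steps described above can be sketched in shell. This is only an illustration of the described logic, not the kernel's implementation: the function names, the per-mille representation of the correction factor, the "long enough" threshold argument and the idle state figures in the usage note are all assumptions. The text says the selected target residency is below the prediction; the sketch also accepts an exact match.

```shell
#!/bin/sh
# Hypothetical: predict_idle_us TYPICAL_US SLEEP_LEN_US FACTOR_PERMILLE THRESHOLD_US
# FACTOR_PERMILLE is the correction factor scaled to 0..1000 (assumption,
# because shell arithmetic is integer-only). THRESHOLD_US stands in for
# the unspecified "long enough" limit on the typical interval.
predict_idle_us() {
    typical=$1
    sleep_len=$2
    factor=$3
    threshold=$4

    # Small typical interval: skip the sleep length and use it directly.
    if [ "$typical" -lt "$threshold" ]; then
        echo "$typical"
        return
    fi

    # Otherwise take the minimum of the typical interval and the
    # sleep length multiplied by the correction factor.
    corrected=$(( sleep_len * factor / 1000 ))
    if [ "$corrected" -lt "$typical" ]; then
        echo "$corrected"
    else
        echo "$typical"
    fi
}

# Hypothetical: select_state PREDICTED_US LATENCY_LIMIT_US
# Reads "name target_residency_us exit_latency_us" lines from stdin and
# prints the state whose target residency is closest to, but not above,
# the prediction and whose exit latency does not exceed the limit.
select_state() {
    predicted=$1
    limit=$2
    best=""
    best_res=-1

    while read -r name residency latency; do
        [ -n "$name" ] || continue
        [ "$residency" -le "$predicted" ] || continue
        [ "$latency" -le "$limit" ] || continue
        if [ "$residency" -gt "$best_res" ]; then
            best=$name
            best_res=$residency
        fi
    done

    echo "$best"
}
```

With made-up states ``C1 2 2``, ``C1E 20 10`` and ``C6 600 130`` on stdin, a prediction of 700 us and a latency limit of 100 us, the deepest state is rejected because of its exit latency and the middle one is chosen.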
diff --git a/Documentation/admin-guide/pm/intel_uncore_frequency_scaling.rst b/Documentation/admin-guide/pm/intel_uncore_frequency_scaling.rst
index 5ab3440e6cee..5151ec312dc0 100644
--- a/Documentation/admin-guide/pm/intel_uncore_frequency_scaling.rst
+++ b/Documentation/admin-guide/pm/intel_uncore_frequency_scaling.rst
@@ -113,3 +113,62 @@ to apply at each uncore* level.
Support for "current_freq_khz" is available only at each fabric cluster
level (i.e., in uncore* directory).
+
+Efficiency vs. Latency Tradeoff
+-------------------------------
+
+The Efficiency Latency Control (ELC) feature improves performance
+per watt. With this feature, hardware power management algorithms
+optimize the trade-off between latency and power consumption. For some
+latency-sensitive workloads, further tuning can be done by software to
+get the desired performance.
+
+The hardware monitors the average CPU utilization across all cores
+in a power domain at regular intervals and decides an uncore frequency.
+While this may result in the best performance per watt, a workload may
+expect higher performance at the expense of power. Consider an
+application that intermittently wakes up to perform memory reads on an
+otherwise idle system. In such cases, if the hardware lowers the uncore
+frequency, there may be a delay in ramping the frequency back up to meet
+the target performance.
+
+The ELC control defines some parameters which can be changed from SW.
+If the average CPU utilization is below a user-defined threshold
+(elc_low_threshold_percent attribute below), the user-defined uncore
+floor frequency will be used (elc_floor_freq_khz attribute below)
+instead of hardware calculated minimum.
+
+Similarly, in a high-load scenario where the CPU utilization goes above
+the high threshold value (elc_high_threshold_percent attribute below),
+the frequency is increased in 100MHz steps instead of jumping to the
+maximum uncore frequency. This avoids consuming unnecessarily high power
+immediately when CPU utilization spikes.
+
+Attributes for efficiency latency control:
+
+``elc_floor_freq_khz``
+ This attribute is used to get/set the efficiency latency floor frequency.
+ If this variable is lower than the 'min_freq_khz', it is ignored by
+ the firmware.
+
+``elc_low_threshold_percent``
+ This attribute is used to get/set the efficiency latency control low
+ threshold. This attribute is in percentages of CPU utilization.
+
+``elc_high_threshold_percent``
+ This attribute is used to get/set the efficiency latency control high
+ threshold. This attribute is in percentages of CPU utilization.
+
+``elc_high_threshold_enable``
+ This attribute is used to enable/disable the efficiency latency control
+ high threshold. Write '1' to enable, '0' to disable.
+
+The example system configuration below does the following:
+ * when CPU utilization is less than 10%: sets uncore frequency to 800MHz
+ * when CPU utilization is higher than 95%: increases uncore frequency in
+ 100MHz steps, until power limit is reached
+
+ elc_floor_freq_khz:800000
+ elc_high_threshold_percent:95
+ elc_high_threshold_enable:1
+ elc_low_threshold_percent:10
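The example configuration above could be applied with a loop over the four attributes. This is a hedged sketch: the helper name ``apply_elc`` is hypothetical, and the directory passed to it is assumed to be one of the uncore* directories that exposes the ELC attributes.

```shell
#!/bin/sh
# Hypothetical helper: apply_elc UNCORE_DIR
# Writes the example ELC values into the given directory. The enable
# switch is written last so the thresholds are in place first (a
# design choice of this sketch, not a documented requirement).
apply_elc() {
    dir=$1
    if [ ! -d "$dir" ]; then
        echo "$dir is not a directory" >&2
        return 1
    fi

    for setting in \
        elc_floor_freq_khz:800000 \
        elc_low_threshold_percent:10 \
        elc_high_threshold_percent:95 \
        elc_high_threshold_enable:1
    do
        attr=${setting%%:*}     # text before the colon
        value=${setting#*:}     # text after the colon
        echo "$value" > "$dir/$attr" || return 1
    done
}
```

A call might look like ``apply_elc /sys/devices/system/cpu/intel_uncore_frequency/uncore00``, where the ``uncore00`` directory name is an assumed example and root privileges are normally needed.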
diff --git a/Documentation/admin-guide/pmf.rst b/Documentation/admin-guide/pmf.rst
deleted file mode 100644
index 9ee729ffc19b..000000000000
--- a/Documentation/admin-guide/pmf.rst
+++ /dev/null
@@ -1,24 +0,0 @@
-.. SPDX-License-Identifier: GPL-2.0
-
-Set udev rules for PMF Smart PC Builder
----------------------------------------
-
-AMD PMF(Platform Management Framework) Smart PC Solution builder has to set the system states
-like S0i3, Screen lock, hibernate etc, based on the output actions provided by the PMF
-TA (Trusted Application).
-
-In order for this to work the PMF driver generates a uevent for userspace to react to. Below are
-sample udev rules that can facilitate this experience when a machine has PMF Smart PC solution builder
-enabled.
-
-Please add the following line(s) to
-``/etc/udev/rules.d/99-local.rules``::
-
- DRIVERS=="amd-pmf", ACTION=="change", ENV{EVENT_ID}=="0", RUN+="/usr/bin/systemctl suspend"
- DRIVERS=="amd-pmf", ACTION=="change", ENV{EVENT_ID}=="1", RUN+="/usr/bin/systemctl hibernate"
- DRIVERS=="amd-pmf", ACTION=="change", ENV{EVENT_ID}=="2", RUN+="/bin/loginctl lock-sessions"
-
-EVENT_ID values:
-0= Put the system to S0i3/S2Idle
-1= Put the system to hibernate
-2= Lock the screen
diff --git a/Documentation/admin-guide/quickly-build-trimmed-linux.rst b/Documentation/admin-guide/quickly-build-trimmed-linux.rst
index f08149bc53f8..07cfd8863b46 100644
--- a/Documentation/admin-guide/quickly-build-trimmed-linux.rst
+++ b/Documentation/admin-guide/quickly-build-trimmed-linux.rst
@@ -733,7 +733,7 @@ can easily happen that your self-built kernel will lack modules for tasks you
did not perform before utilizing this make target. That's because those tasks
require kernel modules that are normally autoloaded when you perform that task
for the first time; if you didn't perform that task at least once before using
-localmodonfig, the latter will thus assume these modules are superfluous and
+localmodconfig, the latter will thus assume these modules are superfluous and
disable them.
You can try to avoid this by performing typical tasks that often will autoload
diff --git a/Documentation/admin-guide/ramoops.rst b/Documentation/admin-guide/ramoops.rst
index e9f85142182d..2eabef31220d 100644
--- a/Documentation/admin-guide/ramoops.rst
+++ b/Documentation/admin-guide/ramoops.rst
@@ -23,6 +23,8 @@ and type of the memory area are set using three variables:
* ``mem_size`` for the size. The memory size will be rounded down to a
power of two.
* ``mem_type`` to specify if the memory type (default is pgprot_writecombine).
+ * ``mem_name`` to specify a memory region defined by ``reserve_mem`` command
+ line parameter.
Typically the default value of ``mem_type=0`` should be used as that sets the pstore
mapping to pgprot_writecombine. Setting ``mem_type=1`` attempts to use
@@ -118,6 +120,17 @@ Setting the ramoops parameters can be done in several different manners:
return ret;
}
+ D. Using a region of memory reserved via ``reserve_mem`` command line
+ parameter. The address and size will be defined by the ``reserve_mem``
+ parameter. Note, that ``reserve_mem`` may not always allocate memory
+ in the same location, and cannot be relied upon. Testing will need
+ to be done, and it may not work on every machine, nor every kernel.
+ Consider this a "best effort" approach. The ``reserve_mem`` option
+ takes a size, alignment and name as arguments. The name is used
+ to map the memory to a label that can be retrieved by ramoops.
+
+ reserve_mem=2M:4096:oops ramoops.mem_name=oops
+
You can specify either RAM memory or peripheral devices' memory. However, when
specifying RAM, be sure to reserve the memory by issuing memblock_reserve()
very early in the architecture code, e.g.::
diff --git a/Documentation/admin-guide/reporting-regressions.rst b/Documentation/admin-guide/reporting-regressions.rst
index d8adccdae23f..946518355a2c 100644
--- a/Documentation/admin-guide/reporting-regressions.rst
+++ b/Documentation/admin-guide/reporting-regressions.rst
@@ -31,7 +31,7 @@ The important bits (aka "TL;DR")
Linux kernel regression tracking bot "regzbot" track the issue by specifying
when the regression started like this::
- #regzbot introduced v5.13..v5.14-rc1
+ #regzbot introduced: v5.13..v5.14-rc1
All the details on Linux kernel regressions relevant for users
@@ -42,12 +42,12 @@ The important basics
--------------------
-What is a "regression" and what is the "no regressions rule"?
+What is a "regression" and what is the "no regressions" rule?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
It's a regression if some application or practical use case running fine with
one Linux kernel works worse or not at all with a newer version compiled using a
-similar configuration. The "no regressions rule" forbids this to take place; if
+similar configuration. The "no regressions" rule forbids this to take place; if
it happens by accident, developers that caused it are expected to quickly fix
the issue.
@@ -173,7 +173,7 @@ Additional details about regressions
------------------------------------
-What is the goal of the "no regressions rule"?
+What is the goal of the "no regressions" rule?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Users should feel safe when updating kernel versions and not have to worry
@@ -199,8 +199,8 @@ Exceptions to this rule are extremely rare; in the past developers almost always
turned out to be wrong when they assumed a particular situation was warranting
an exception.
-Who ensures the "no regressions" is actually followed?
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+Who ensures the "no regressions" rule is actually followed?
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The subsystem maintainers should take care of that, which are watched and
supported by the tree maintainers -- e.g. Linus Torvalds for mainline and
diff --git a/Documentation/admin-guide/sysctl/fs.rst b/Documentation/admin-guide/sysctl/fs.rst
index 47499a1742bd..08e89e031714 100644
--- a/Documentation/admin-guide/sysctl/fs.rst
+++ b/Documentation/admin-guide/sysctl/fs.rst
@@ -38,6 +38,11 @@ requests. ``aio-max-nr`` allows you to change the maximum value
``aio-max-nr`` does not result in the
pre-allocation or re-sizing of any kernel data structures.
+dentry-negative
+----------------------------
+
+Policy for negative dentries. Set to 1 to always delete the dentry when a
+file is removed, and 0 to disable it. By default, this behavior is disabled.
dentry-state
------------
@@ -332,3 +337,13 @@ Each "watch" costs roughly 90 bytes on a 32-bit kernel, and roughly 160 bytes
on a 64-bit one.
The current default value for ``max_user_watches`` is 4% of the
available low memory, divided by the "watch" cost in bytes.
+
+5. /proc/sys/fs/fuse - Configuration options for FUSE filesystems
+=====================================================================
+
+This directory contains the following configuration options for FUSE
+filesystems:
+
+``/proc/sys/fs/fuse/max_pages_limit`` is a read/write file for
+setting/getting the maximum number of pages that can be used for servicing
+requests in FUSE.
diff --git a/Documentation/admin-guide/sysctl/kernel.rst b/Documentation/admin-guide/sysctl/kernel.rst
index 6584a1f9bfe3..dd49a89a62d3 100644
--- a/Documentation/admin-guide/sysctl/kernel.rst
+++ b/Documentation/admin-guide/sysctl/kernel.rst
@@ -212,6 +212,17 @@ pid>/``).
This value defaults to 0.
+core_sort_vma
+=============
+
+The default coredump writes VMAs in address order. By setting
+``core_sort_vma`` to 1, VMAs will be written from smallest size
+to largest size. This is known to break at least elfutils, but
+can be handy when dealing with very large (and truncated)
+coredumps where the more useful debugging details are included
+in the smaller VMAs.
+
+
core_uses_pid
=============
@@ -296,12 +307,30 @@ kernel panic). This will output the contents of the ftrace buffers to
the console. This is very useful for capturing traces that lead to
crashes and outputting them to a serial console.
-= ===================================================
-0 Disabled (default).
-1 Dump buffers of all CPUs.
-2 Dump the buffer of the CPU that triggered the oops.
-= ===================================================
+======================= ===========================================
+0 Disabled (default).
+1 Dump buffers of all CPUs.
+2(orig_cpu) Dump the buffer of the CPU that triggered the
+ oops.
+<instance> Dump the specific instance buffer on all CPUs.
+<instance>=2(orig_cpu) Dump the specific instance buffer on the CPU
+ that triggered the oops.
+======================= ===========================================
+
+Dumping multiple instances is also supported; instances are separated
+by commas. If the global buffer also needs to be dumped, specify
+the dump mode (1/2/orig_cpu) for the global buffer first.
+For example, to dump the "foo" and "bar" instance buffers on all CPUs,
+the user can::
+
+ echo "foo,bar" > /proc/sys/kernel/ftrace_dump_on_oops
+
+To dump the global buffer and the "foo" instance buffer on all
+CPUs, along with the "bar" instance buffer on the CPU that triggered the
+oops, the user can::
+
+ echo "1,foo,bar=2" > /proc/sys/kernel/ftrace_dump_on_oops
ftrace_enabled, stack_tracer_enabled
====================================
@@ -383,6 +412,15 @@ The upper bound on the number of tasks that are checked.
This file shows up if ``CONFIG_DETECT_HUNG_TASK`` is enabled.
+hung_task_detect_count
+======================
+
+Indicates the total number of tasks that have been detected as hung since
+the system boot.
+
+This file shows up if ``CONFIG_DETECT_HUNG_TASK`` is enabled.
+
+
hung_task_timeout_secs
======================
@@ -436,7 +474,7 @@ ignore-unaligned-usertrap
On architectures where unaligned accesses cause traps, and where this
feature is supported (``CONFIG_SYSCTL_ARCH_UNALIGN_NO_WARN``;
-currently, ``arc`` and ``loongarch``), controls whether all
+currently, ``arc``, ``parisc`` and ``loongarch``), controls whether all
unaligned traps are logged.
= =============================================================
@@ -594,6 +632,9 @@ default (``MSGMNB``).
``msgmni`` is the maximum number of IPC queues. 32000 by default
(``MSGMNI``).
+All of these parameters are set per ipc namespace. The maximum number of bytes
+in POSIX message queues is limited by ``RLIMIT_MSGQUEUE``. This limit is
+respected hierarchically in each user namespace.
msg_next_id, sem_next_id, and shm_next_id (System V IPC)
========================================================
@@ -850,6 +891,7 @@ bit 3 print locks info if ``CONFIG_LOCKDEP`` is on
bit 4 print ftrace buffer
bit 5 print all printk messages in buffer
bit 6 print all CPUs backtrace (if available in the arch)
+bit 7 print only tasks in uninterruptible (blocked) state
===== ============================================
So for example to print tasks and memory info on panic, user can::
@@ -1274,15 +1316,20 @@ are doing anyway :)
shmall
======
-This parameter sets the total amount of shared memory pages that
-can be used system wide. Hence, ``shmall`` should always be at least
-``ceil(shmmax/PAGE_SIZE)``.
+This parameter sets the total amount of shared memory pages that can be used
+inside an ipc namespace. Shared memory pages are counted separately for each
+ipc namespace, and the counting is not inherited. Hence, ``shmall`` should
+always be at least ``ceil(shmmax/PAGE_SIZE)``.
If you are not sure what the default ``PAGE_SIZE`` is on your Linux
system, you can run the following command::
# getconf PAGE_SIZE
+To reduce or disable the ability to allocate shared memory, either create a
+new ipc namespace, set this parameter to the required value and prohibit the
+creation of a new ipc namespace in the current user namespace, or use
+cgroups.
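The lower bound ``ceil(shmmax/PAGE_SIZE)`` mentioned above can be computed with integer arithmetic. This is a small sketch with a hypothetical function name; it assumes the ``shmmax`` value fits in the shell's signed arithmetic, which may not hold for very large settings.

```shell
#!/bin/sh
# Hypothetical helper: min_shmall_pages SHMMAX_BYTES [PAGE_SIZE]
# Prints ceil(SHMMAX_BYTES / PAGE_SIZE). PAGE_SIZE defaults to the
# value reported by getconf. Assumes SHMMAX_BYTES fits in the shell's
# signed integer arithmetic.
min_shmall_pages() {
    shmmax=$1
    page_size=${2:-$(getconf PAGE_SIZE)}

    # Integer ceiling division: add (divisor - 1) before dividing.
    echo $(( (shmmax + page_size - 1) / page_size ))
}
```

For instance, with a 4096-byte page, a ``shmmax`` of 8193 bytes needs three pages, so ``shmall`` should be at least 3 in that case.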
shmmax
======
@@ -1508,6 +1555,13 @@ constant ``FUTEX_TID_MASK`` (0x3fffffff).
If a value outside of this range is written to ``threads-max`` an
``EINVAL`` error occurs.
+timer_migration
+===============
+
+When set to a non-zero value, attempt to migrate timers away from idle CPUs to
+allow them to remain in low power states longer.
+
+Default is set (1).
traceoff_on_warning
===================
diff --git a/Documentation/admin-guide/sysctl/net.rst b/Documentation/admin-guide/sysctl/net.rst
index 396091651955..7b0c4291c686 100644
--- a/Documentation/admin-guide/sysctl/net.rst
+++ b/Documentation/admin-guide/sysctl/net.rst
@@ -72,6 +72,7 @@ two flavors of JITs, the newer eBPF JIT currently supported on:
- riscv64
- riscv32
- loongarch64
+ - arc
And the older cBPF JIT supported on the following archs:
@@ -206,6 +207,11 @@ Will increase power usage.
Default: 0 (off)
+mem_pcpu_rsv
+------------
+
+Per-cpu reserved forward alloc cache size in page units. Default 1MB per CPU.
+
rmem_default
------------
diff --git a/Documentation/admin-guide/sysctl/vm.rst b/Documentation/admin-guide/sysctl/vm.rst
index c59889de122b..f48eaa98d22d 100644
--- a/Documentation/admin-guide/sysctl/vm.rst
+++ b/Documentation/admin-guide/sysctl/vm.rst
@@ -36,6 +36,7 @@ Currently, these files are in /proc/sys/vm:
- dirtytime_expire_seconds
- dirty_writeback_centisecs
- drop_caches
+- enable_soft_offline
- extfrag_threshold
- highmem_is_dirtyable
- hugetlb_shm_group
@@ -43,6 +44,7 @@ Currently, these files are in /proc/sys/vm:
- legacy_va_layout
- lowmem_reserve_ratio
- max_map_count
+- mem_profiling (only if CONFIG_MEM_ALLOC_PROFILING=y)
- memory_failure_early_kill
- memory_failure_recovery
- min_free_kbytes
@@ -266,6 +268,43 @@ used::
These are informational only. They do not mean that anything is wrong
with your system. To disable them, echo 4 (bit 2) into drop_caches.
+enable_soft_offline
+===================
+Correctable memory errors are very common on servers. Soft-offline is the kernel's
+solution for memory pages having (excessive) corrected memory errors.
+
+For different types of page, soft-offline has different behaviors / costs.
+
+- For a raw error page, soft-offline migrates the in-use page's content to
+ a new raw page.
+
+- For a page that is part of a transparent hugepage, soft-offline splits the
+ transparent hugepage into raw pages, then migrates only the raw error page.
+ As a result, the user is transparently backed by one fewer hugepage, impacting
+ memory access performance.
+
+- For a page that is part of a HugeTLB hugepage, soft-offline first migrates
+ the entire HugeTLB hugepage, during which a free hugepage will be consumed
+ as migration target. Then the original hugepage is dissolved into raw
+ pages without compensation, reducing the capacity of the HugeTLB pool by 1.
+
+It is the user's call to choose between reliability (staying away from fragile
+physical memory) vs performance / capacity implications in transparent and
+HugeTLB cases.
+
+For all architectures, enable_soft_offline controls whether to soft offline
+memory pages. When set to 1, the kernel attempts to soft offline the pages
+whenever it thinks this is needed. When set to 0, the kernel returns EOPNOTSUPP
+to the request to soft offline the pages. Its default value is 1.
+
+It is worth mentioning that after setting enable_soft_offline to 0, the
+following requests to soft offline pages will not be performed:
+
+- Request to soft offline pages from RAS Correctable Errors Collector.
+
+- On ARM, the request to soft offline pages from GHES driver.
+
+- On PARISC, the request to soft offline pages from Page Deallocation Table.
extfrag_threshold
=================
@@ -425,6 +464,21 @@ e.g., up to one or two maps per allocation.
The default value is 65530.
+mem_profiling
+==============
+
+Enable memory profiling (when CONFIG_MEM_ALLOC_PROFILING=y)
+
+1: Enable memory profiling.
+
+0: Disable memory profiling.
+
+Enabling memory profiling introduces a small performance overhead for all
+memory allocations.
+
+The default value depends on CONFIG_MEM_ALLOC_PROFILING_ENABLED_BY_DEFAULT.
+
+
memory_failure_early_kill:
==========================
diff --git a/Documentation/admin-guide/sysrq.rst b/Documentation/admin-guide/sysrq.rst
index 2f2e5bd440f9..9c7aa817adc7 100644
--- a/Documentation/admin-guide/sysrq.rst
+++ b/Documentation/admin-guide/sysrq.rst
@@ -49,26 +49,26 @@ How do I use the magic SysRq key?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
On x86
- You press the key combo :kbd:`ALT-SysRq-<command key>`.
+ You press the key combo `ALT-SysRq-<command key>`.
.. note::
Some
keyboards may not have a key labeled 'SysRq'. The 'SysRq' key is
also known as the 'Print Screen' key. Also some keyboards cannot
handle so many keys being pressed at the same time, so you might
- have better luck with press :kbd:`Alt`, press :kbd:`SysRq`,
- release :kbd:`SysRq`, press :kbd:`<command key>`, release everything.
+ have better luck with press `Alt`, press `SysRq`,
+ release `SysRq`, press `<command key>`, release everything.
On SPARC
- You press :kbd:`ALT-STOP-<command key>`, I believe.
+ You press `ALT-STOP-<command key>`, I believe.
On the serial console (PC style standard serial ports only)
You send a ``BREAK``, then within 5 seconds a command key. Sending
``BREAK`` twice is interpreted as a normal BREAK.
On PowerPC
- Press :kbd:`ALT - Print Screen` (or :kbd:`F13`) - :kbd:`<command key>`.
- :kbd:`Print Screen` (or :kbd:`F13`) - :kbd:`<command key>` may suffice.
+ Press `ALT - Print Screen` (or `F13`) - `<command key>`.
+ `Print Screen` (or `F13`) - `<command key>` may suffice.
On other
If you know of the key combos for other architectures, please
@@ -88,7 +88,7 @@ On all
echo _reisub > /proc/sysrq-trigger
-The :kbd:`<command key>` is case sensitive.
+The `<command key>` is case sensitive.
What are the 'command' keys?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~
@@ -161,6 +161,8 @@ Command Function
will be printed to your console. (``0``, for example would make
it so that only emergency messages like PANICs or OOPSes would
make it to your console.)
+
+``R`` Replay the kernel log messages on consoles.
=========== ===================================================================
Okay, so what can I use them for?
@@ -211,14 +213,21 @@ processes.
"just thaw ``it(j)``" is useful if your system becomes unresponsive due to a
frozen (probably root) filesystem via the FIFREEZE ioctl.
+``Replay logs(R)`` is useful to view the kernel log messages when the system is
+hung or you are not able to use the dmesg command to view the messages in the
+printk buffer. The user may have to press the key combination multiple times if
+the console system is busy. If it is completely locked up, then messages won't
+be printed. Output messages depend on the current console loglevel, which can
+be modified using sysrq[0-9] (see above).
+
Sometimes SysRq seems to get 'stuck' after using it, what can I do?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
When this happens, try tapping shift, alt and control on both sides of the
keyboard, and hitting an invalid sysrq sequence again. (i.e., something like
-:kbd:`alt-sysrq-z`).
+`alt-sysrq-z`).
-Switching to another virtual console (:kbd:`ALT+Fn`) and then back again
+Switching to another virtual console (`ALT+Fn`) and then back again
should also help.
I hit SysRq, but nothing seems to happen, what's wrong?
@@ -281,7 +290,7 @@ exception the header line from the sysrq command is passed to all console
consumers as if the current loglevel was maximum. If only the header
is emitted it is almost certain that the kernel loglevel is too low.
Should you require the output on the console channel then you will need
-to temporarily up the console loglevel using :kbd:`alt-sysrq-8` or::
+to temporarily up the console loglevel using `alt-sysrq-8` or::
echo 8 > /proc/sysrq-trigger
diff --git a/Documentation/admin-guide/tainted-kernels.rst b/Documentation/admin-guide/tainted-kernels.rst
index 92a8a07f5c43..700aa72eecb1 100644
--- a/Documentation/admin-guide/tainted-kernels.rst
+++ b/Documentation/admin-guide/tainted-kernels.rst
@@ -34,7 +34,7 @@ name of the command ('Comm:') that triggered the event::
You'll find a 'Not tainted: ' there if the kernel was not tainted at the
time of the event; if it was, then it will print 'Tainted: ' and characters
-either letters or blanks. In above example it looks like this::
+either letters or blanks. In the example above it looks like this::
Tainted: P W O
@@ -52,7 +52,7 @@ At runtime, you can query the tainted state by reading
tainted; any other number indicates the reasons why it is. The easiest way to
decode that number is the script ``tools/debugging/kernel-chktaint``, which your
distribution might ship as part of a package called ``linux-tools`` or
-``kernel-tools``; if it doesn't you can download the script from
+``kernel-tools``; if it doesn't, you can download the script from
`git.kernel.org <https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/plain/tools/debugging/kernel-chktaint>`_
and execute it with ``sh kernel-chktaint``, which would print something like
this on the machine that had the statements in the logs that were quoted earlier::
@@ -182,3 +182,5 @@ More detailed explanation for tainting
produce extremely unusual kernel structure layouts (even performance
pathological ones), which is important to know when debugging. Set at
build time.
+
+ 18) ``N`` if an in-kernel test, such as a KUnit test, has been run.
diff --git a/Documentation/admin-guide/verify-bugs-and-bisect-regressions.rst b/Documentation/admin-guide/verify-bugs-and-bisect-regressions.rst
new file mode 100644
index 000000000000..03c55151346c
--- /dev/null
+++ b/Documentation/admin-guide/verify-bugs-and-bisect-regressions.rst
@@ -0,0 +1,2222 @@
+.. SPDX-License-Identifier: (GPL-2.0+ OR CC-BY-4.0)
+.. [see the bottom of this file for redistribution information]
+
+=========================================
+How to verify bugs and bisect regressions
+=========================================
+
+This document describes how to check if some Linux kernel problem occurs in code
+currently supported by developers -- to then explain how to locate the change
+causing the issue, if it is a regression (e.g. did not happen with earlier
+versions).
+
+The text aims at people running kernels from mainstream Linux distributions on
+commodity hardware who want to report a kernel bug to the upstream Linux
+developers. Despite this intent, the instructions work just as well for users
+who are already familiar with building their own kernels: they help avoid
+mistakes occasionally made even by experienced developers.
+
+..
+ Note: if you see this note, you are reading the text's source file. You
+ might want to switch to a rendered version: it makes it a lot easier to
+ read and navigate this document -- especially when you want to look something
+ up in the reference section, then jump back to where you left off.
+..
+ Find the latest rendered version of this text here:
+ https://docs.kernel.org/admin-guide/verify-bugs-and-bisect-regressions.html
+
+The essence of the process (aka 'TL;DR')
+========================================
+
+*[If you are new to building or bisecting Linux, ignore this section and head
+over to the* ':ref:`step-by-step guide <introguide_bissbs>`' *below. It utilizes
+the same commands as this section while describing them in brief fashion. The
+steps are nevertheless easy to follow and together with accompanying entries
+in a reference section mention many alternatives, pitfalls, and additional
+aspects, all of which might be essential in your present case.]*
+
+**In case you want to check if a bug is present in code currently supported by
+developers**, execute just the *preparations* and *segment 1*; while doing so,
+consider the newest Linux kernel you regularly use to be the 'working' kernel.
+In the following example that's assumed to be 6.0, which is why its sources
+will be used to prepare the .config file.
+
+**In case you face a regression**, follow the steps at least till the end of
+*segment 2*. Then you can submit a preliminary report -- or continue with
+*segment 3*, which describes how to perform a bisection needed for a
+full-fledged regression report. In the following example 6.0.13 is assumed to be
+the 'working' kernel and 6.1.5 to be the first 'broken', which is why 6.0
+will be considered the 'good' release and used to prepare the .config file.
+
+* **Preparations**: set up everything to build your own kernels::
+
+ # * Remove any software that depends on externally maintained kernel modules
+ # or builds any automatically during bootup.
+ # * Ensure Secure Boot permits booting self-compiled Linux kernels.
+ # * If you are not already running the 'working' kernel, reboot into it.
+ # * Install compilers and everything else needed for building Linux.
+ # * Ensure to have 15 Gigabyte free space in your home directory.
+ git clone -o mainline --no-checkout \
+ https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git ~/linux/
+ cd ~/linux/
+ git remote add -t master stable \
+ https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git
+ git switch --detach v6.0
+ # * Hint: if you used an existing clone, ensure no stale .config is around.
+ make olddefconfig
+ # * Ensure the former command picked the .config of the 'working' kernel.
+ # * Connect external hardware (USB keys, tokens, ...), start a VM, bring up
+ # VPNs, mount network shares, and briefly try the feature that is broken.
+ yes '' | make localmodconfig
+ ./scripts/config --set-str CONFIG_LOCALVERSION '-local'
+ ./scripts/config -e CONFIG_LOCALVERSION_AUTO
+ # * Note, when short on storage space, check the guide for an alternative:
+ ./scripts/config -d DEBUG_INFO_NONE -e KALLSYMS_ALL -e DEBUG_KERNEL \
+ -e DEBUG_INFO -e DEBUG_INFO_DWARF_TOOLCHAIN_DEFAULT -e KALLSYMS
+ # * Hint: at this point you might want to adjust the build configuration;
+ # you'll have to, if you are running Debian.
+ make olddefconfig
+ cp .config ~/kernel-config-working
+
+* **Segment 1**: build a kernel from the latest mainline codebase.
+
+ This among others checks if the problem was fixed already and which developers
+ later need to be told about the problem; in case of a regression, this rules
+ out a .config change as root of the problem.
+
+ a) Checking out latest mainline code::
+
+ cd ~/linux/
+ git switch --discard-changes --detach mainline/master
+
+ b) Build, install, and boot a kernel::
+
+ cp ~/kernel-config-working .config
+ make olddefconfig
+ make -j $(nproc --all)
+ # * Make sure there is enough disk space to hold another kernel:
+ df -h /boot/ /lib/modules/
+ # * Note: on Arch Linux, its derivatives and a few other distributions
+ # the following commands will do nothing at all or only part of the
+ # job. See the step-by-step guide for further details.
+ sudo make modules_install
+ command -v installkernel && sudo make install
+ # * Check how much space your self-built kernel actually needs, which
+ # enables you to make better estimates later:
+ du -ch /boot/*$(make -s kernelrelease)* | tail -n 1
+ du -sh /lib/modules/$(make -s kernelrelease)/
+ # * Hint: the output of the following command will help you pick the
+ # right kernel from the boot menu:
+ make -s kernelrelease | tee -a ~/kernels-built
+ reboot
+ # * Once booted, ensure you are running the kernel you just built by
+ # checking if the output of the next two commands matches:
+ tail -n 1 ~/kernels-built
+ uname -r
+ cat /proc/sys/kernel/tainted
+
+ c) Check if the problem occurs with this kernel as well.
+
+* **Segment 2**: ensure the 'good' kernel is also a 'working' kernel.
+
+ This among others verifies the trimmed .config file actually works well, as
+ bisecting with it otherwise would be a waste of time:
+
+ a) Start by checking out the sources of the 'good' version::
+
+ cd ~/linux/
+ git switch --discard-changes --detach v6.0
+
+ b) Build, install, and boot a kernel as described earlier in *segment 1,
+ section b* -- just feel free to skip the 'du' commands, as you have a rough
+ estimate already.
+
+ c) Ensure the feature that regressed with the 'broken' kernel actually works
+ with this one.
+
+* **Segment 3**: perform and validate the bisection.
+
+ a) Retrieve the sources for your 'bad' version::
+
+ git remote set-branches --add stable linux-6.1.y
+ git fetch stable
+
+ b) Initialize the bisection::
+
+ cd ~/linux/
+ git bisect start
+ git bisect good v6.0
+ git bisect bad v6.1.5
+
+ c) Build, install, and boot a kernel as described earlier in *segment 1,
+ section b*.
+
+ In case building or booting the kernel fails for unrelated reasons, run
+ ``git bisect skip``. In all other outcomes, check if the regressed feature
+ works with the newly built kernel. If it does, tell Git by executing
+ ``git bisect good``; if it does not, run ``git bisect bad`` instead.
+
+ All three commands will make Git check out another commit; then re-execute
+ this step (e.g. build, install, boot, and test a kernel to then tell Git
+ the outcome). Do so again and again until Git shows which commit broke
+ things. If you run short of disk space during this process, check the
+ section 'Complementary tasks: cleanup during and after the process'
+ below.
+
+ d) Once you have finished the bisection, put a few things away::
+
+ cd ~/linux/
+ git bisect log > ~/bisect-log
+ cp .config ~/bisection-config-culprit
+ git bisect reset
+
+ e) Try to verify the bisection result::
+
+ git switch --discard-changes --detach mainline/master
+ git revert --no-edit cafec0cacaca0
+ cp ~/kernel-config-working .config
+ ./scripts/config --set-str CONFIG_LOCALVERSION '-local-cafec0cacaca0-reverted'
+
+ This is optional, as some commits are impossible to revert. But if the
+ second command worked flawlessly, build, install, and boot one more
+ kernel; just this time skip the first command copying the base .config file
+ over, as that already has been taken care of.
+
+* **Complementary tasks**: cleanup during and after the process.
+
+ a) To avoid running out of disk space during a bisection, you might need to
+ remove some kernels you built earlier. You most likely want to keep those
+ you built during segment 1 and 2 around for a while, but you will most
+ likely no longer need kernels tested during the actual bisection
+ (Segment 3 c). You can list them in build order using::
+
+ ls -ltr /lib/modules/*-local*
+
+ To then for example erase a kernel that identifies itself as
+ '6.0-rc1-local-gcafec0cacaca0', use this::
+
+ sudo rm -rf /lib/modules/6.0-rc1-local-gcafec0cacaca0
+ sudo kernel-install -v remove 6.0-rc1-local-gcafec0cacaca0
+ # * Note, on some distributions kernel-install is missing
+ # or does only part of the job.
+
+ b) If you performed a bisection and successfully validated the result, feel
+ free to remove all kernels built during the actual bisection (Segment 3 c);
+ the kernels you built earlier and later you might want to keep around for
+ a week or two.
+
+* **Optional task**: test a debug patch or a proposed fix later::
+
+ git fetch mainline
+ git switch --discard-changes --detach mainline/master
+ git apply /tmp/foobars-proposed-fix-v1.patch
+ cp ~/kernel-config-working .config
+ ./scripts/config --set-str CONFIG_LOCALVERSION '-local-foobars-fix-v1'
+
+ Build, install, and boot a kernel as described in *segment 1, section b* --
+ but this time omit the first command copying the build configuration over,
+ as that has been taken care of already.
+
+.. _introguide_bissbs:
+
+Step-by-step guide on how to verify bugs and bisect regressions
+===============================================================
+
+This guide describes how to set up your own Linux kernels for investigating bugs
+or regressions you intend to report. How far you want to follow the instructions
+depends on your issue:
+
+Execute all steps till the end of *segment 1* to **verify if your kernel problem
+is present in code supported by Linux kernel developers**. If it is, you are all
+set to report the bug -- unless it did not happen with earlier kernel versions,
+as then you want to at least continue with *segment 2* to **check if the issue
+qualifies as a regression**, which receives priority treatment. Depending on the
+outcome you then are ready to report a bug or submit a preliminary regression
+report; instead of the latter you could also head straight on and follow
+*segment 3* to **perform a bisection** for a full-fledged regression report
+developers are obliged to act upon.
+
+ :ref:`Preparations: set up everything to build your own kernels <introprep_bissbs>`.
+
+ :ref:`Segment 1: try to reproduce the problem with the latest codebase <introlatestcheck_bissbs>`.
+
+ :ref:`Segment 2: check if the kernels you build work fine <introworkingcheck_bissbs>`.
+
+ :ref:`Segment 3: perform a bisection and validate the result <introbisect_bissbs>`.
+
+ :ref:`Complementary tasks: cleanup during and after following this guide <introclosure_bissbs>`.
+
+ :ref:`Optional tasks: test reverts, patches, or later versions <introoptional_bissbs>`.
+
+The steps in each segment illustrate the important aspects of the process, while
+a comprehensive reference section holds additional details for almost all of the
+steps. The reference section sometimes also outlines alternative approaches,
+pitfalls, as well as problems that might occur at the particular step -- and how
+to get things rolling again.
+
+For further details on how to report Linux kernel issues or regressions check
+out Documentation/admin-guide/reporting-issues.rst, which works in conjunction
+with this document. It among others explains why you need to verify bugs with
+the latest 'mainline' kernel (e.g. versions like 6.0, 6.1-rc1, or 6.1-rc6),
+even if you face a problem with a kernel from a 'stable/longterm' series
+(say 6.0.13).
+
+For users facing a regression that document also explains why sending a
+preliminary report after segment 2 might be wise, as the regression and its
+culprit might be known already. For further details on what actually qualifies
+as a regression check out Documentation/admin-guide/reporting-regressions.rst.
+
+If you run into any problems while following this guide or have ideas how to
+improve it, :ref:`please let the kernel developers know <submit_improvements>`.
+
+.. _introprep_bissbs:
+
+Preparations: set up everything to build your own kernels
+---------------------------------------------------------
+
+The following steps lay the groundwork for all further tasks.
+
+Note: the instructions assume you are building and testing on the same
+machine; if you want to compile the kernel on another system, check
+:ref:`Build kernels on a different machine <buildhost_bis>` below.
+
+.. _backup_bissbs:
+
+* Create a fresh backup and put system repair and restore tools at hand, just
+ to be prepared for the unlikely case of something going sideways.
+
+ [:ref:`details <backup_bisref>`]
+
+.. _vanilla_bissbs:
+
+* Remove all software that depends on externally developed kernel drivers or
+ builds them automatically. That includes but is not limited to DKMS, openZFS,
+ VirtualBox, and Nvidia's graphics drivers (including the GPLed kernel module).
+
+ [:ref:`details <vanilla_bisref>`]
+
+.. _secureboot_bissbs:
+
+* On platforms with 'Secure Boot' or similar solutions, prepare everything to
+ ensure the system will permit your self-compiled kernel to boot. The
+ quickest and easiest way to achieve this on commodity x86 systems is to
+ disable such techniques in the BIOS setup utility; alternatively, remove
+ their restrictions through a process initiated by
+ ``mokutil --disable-validation``.
+
+ [:ref:`details <secureboot_bisref>`]
+
+.. _rangecheck_bissbs:
+
+* Determine the kernel versions considered 'good' and 'bad' throughout this
+ guide:
+
+ * Do you follow this guide to verify if a bug is present in the code the
+ primary developers care for? Then consider the version of the newest kernel
+ you regularly use currently as 'good' (e.g. 6.0, 6.0.13, or 6.1-rc2).
+
+ * Do you face a regression, e.g. something broke or works worse after
+ switching to a newer kernel version? In that case it depends on the version
+ range during which the problem appeared:
+
+ * Something regressed when updating from a stable/longterm release
+ (say 6.0.13) to a newer mainline series (like 6.1-rc7 or 6.1) or a
+ stable/longterm version based on one (say 6.1.5)? Then consider the
+ mainline release your working kernel is based on to be the 'good'
+ version (e.g. 6.0) and the first version to be broken as the 'bad' one
+ (e.g. 6.1-rc7, 6.1, or 6.1.5). Note, at this point it is merely assumed
+ that 6.0 is fine; this hypothesis will be checked in segment 2.
+
+ * Something regressed when switching from one mainline version (say 6.0) to
+ a later one (like 6.1-rc1) or a stable/longterm release based on it
+ (say 6.1.5)? Then regard the last working version (e.g. 6.0) as 'good' and
+ the first broken (e.g. 6.1-rc1 or 6.1.5) as 'bad'.
+
+ * Something regressed when updating within a stable/longterm series (say
+ from 6.0.13 to 6.0.15)? Then consider those versions as 'good' and 'bad'
+ (e.g. 6.0.13 and 6.0.15), as you need to bisect within that series.
+
+ *Note, do not confuse 'good' version with 'working' kernel; the latter term
+ throughout this guide will refer to the last kernel that has been working
+ fine.*
+
+ [:ref:`details <rangecheck_bisref>`]
+
+.. _bootworking_bissbs:
+
+* Boot into the 'working' kernel and briefly use the apparently broken feature.
+
+ [:ref:`details <bootworking_bisref>`]
+
+.. _diskspace_bissbs:
+
+* Ensure to have enough free space for building Linux. 15 Gigabyte in your home
+ directory should typically suffice. If you have less available, be sure to pay
+ attention to later steps about retrieving the Linux sources and handling of
+ debug symbols: both explain approaches reducing the amount of space, which
+ should allow you to master these tasks with about 4 Gigabytes free space.
+
+ [:ref:`details <diskspace_bisref>`]
+
+.. _buildrequires_bissbs:
+
+* Install all software required to build a Linux kernel. Often you will need:
+ 'bc', 'binutils' ('ld' et al.), 'bison', 'flex', 'gcc', 'git', 'openssl',
+ 'pahole', 'perl', and the development headers for 'libelf' and 'openssl'. The
+ reference section shows how to quickly install those on various popular Linux
+ distributions.
+
+ [:ref:`details <buildrequires_bisref>`]
+
+.. _sources_bissbs:
+
+* Retrieve the mainline Linux sources; then change into the directory holding
+ them, as all further commands in this guide are meant to be executed from
+ there.
+
+ *Note, the following describes how to retrieve the sources using a full
+ mainline clone, which downloads about 2.75 GByte as of early 2024. The*
+ :ref:`reference section describes two alternatives <sources_bisref>` *:
+ one downloads less than 500 MByte, the other works better with unreliable
+ internet connections.*
+
+ Execute the following command to retrieve a fresh mainline codebase while
+ preparing things to add branches for stable/longterm series later::
+
+ git clone -o mainline --no-checkout \
+ https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git ~/linux/
+ cd ~/linux/
+ git remote add -t master stable \
+ https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git
+
+ [:ref:`details <sources_bisref>`]
+
+.. _stablesources_bissbs:
+
+* Is one of the versions you earlier established as 'good' or 'bad' a stable or
+ longterm release (say 6.1.5)? Then download the code for the series it belongs
+ to ('linux-6.1.y' in this example)::
+
+ git remote set-branches --add stable linux-6.1.y
+ git fetch stable
+
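
The naming rules used so far (a 'v' prefix for tags, 'linux-X.Y.y' for stable series branches) can be sketched as two small helpers. Both function names are hypothetical and not part of the kernel tree; they only cover plain version strings such as 6.0.13 or 6.1-rc7.

```shell
# Hypothetical helpers illustrating the naming conventions of this guide.

# mainline_tag 6.0.13 prints v6.0, the tag of the mainline release
# a stable/longterm version is based on.
mainline_tag() {
    echo "v$(echo "$1" | cut -d. -f1,2)"
}

# stable_branch 6.1.5 prints linux-6.1.y, the branch holding that series
# in the stable repository.
stable_branch() {
    echo "linux-$(echo "$1" | cut -d. -f1,2).y"
}
```

With these, `git remote set-branches --add stable "$(stable_branch 6.1.5)"` would expand to the command shown above.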
+.. _oldconfig_bissbs:
+
+* Start preparing a kernel build configuration (the '.config' file).
+
+ Before doing so, ensure you are still running the 'working' kernel an earlier
+ step told you to boot; if you are unsure, check the current kernelrelease
+ identifier using ``uname -r``.
+
+ Afterwards check out the source code for the version earlier established as
+ 'good'. In the following example command this is assumed to be 6.0; note that
+ the version number in this and all later Git commands needs to be prefixed
+ with a 'v'::
+
+ git switch --discard-changes --detach v6.0
+
+ Now create a build configuration file::
+
+ make olddefconfig
+
+ The kernel build scripts then will try to locate the build configuration file
+ for the running kernel and then adjust it for the needs of the kernel sources
+ you checked out. While doing so, it will print a few lines you need to check.
+
+ Look out for a line starting with '# using defaults found in'. It should be
+ followed by a path to a file in '/boot/' that contains the release identifier
+ of your currently working kernel. If the line instead continues with something
+ like 'arch/x86/configs/x86_64_defconfig', then the build infra failed to find
+ the .config file for your running kernel -- in which case you have to put one
+ there manually, as explained in the reference section.
+
+ In case you can not find such a line, look for one containing '# configuration
+ written to .config'. If that's the case you have a stale build configuration
+ lying around. Unless you intend to use it, delete it; afterwards run
+ 'make olddefconfig' again and check if it now picked up the right config file
+ as base.
+
+ [:ref:`details <oldconfig_bisref>`]
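
The checks described in this step can be automated roughly as follows. This is a sketch under assumptions: the function name and the result labels are made up for illustration, and the output of 'make olddefconfig' is assumed to have been saved to a log file first.

```shell
# Hypothetical helper: classify which base 'make olddefconfig' used,
# given a file holding the command's output.
classify_config_base() {
    log="$1"
    if grep -q '^# using defaults found in /boot/' "$log"; then
        # The config of a kernel installed in /boot/ was picked up.
        echo "boot-config"
    elif grep -q '^# using defaults found in ' "$log"; then
        # Something else was used, for example an in-tree defconfig.
        echo "fallback-defconfig"
    elif grep -q 'configuration written to \.config' "$log"; then
        # No defaults line at all: a stale .config was most likely reused.
        echo "stale-config"
    else
        echo "unknown"
    fi
}
```

One possible use is `make olddefconfig 2>&1 | tee /tmp/olddefconfig.log` followed by `classify_config_base /tmp/olddefconfig.log`. Even for the "boot-config" result you still need to confirm that the path printed contains the release identifier of your 'working' kernel.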
+
+.. _localmodconfig_bissbs:
+
+* Disable any kernel modules apparently superfluous for your setup. This is
+ optional, but especially wise for bisections, as it speeds up the build
+ process enormously -- at least unless the .config file picked up in the
+ previous step was already tailored to your and your hardware's needs, in which
+ case you should skip this step.
+
+ To prepare the trimming, connect external hardware you occasionally use (USB
+ keys, tokens, ...), quickly start a VM, and bring up VPNs. And if you rebooted
+ since you started this guide, ensure that you tried using the feature causing
+ trouble since you started the system. Only then trim your .config::
+
+ yes '' | make localmodconfig
+
+ There is a catch to this, as the 'apparently' in the initial sentence of this step
+ and the preparation instructions already hinted at:
+
+ The 'localmodconfig' target easily disables kernel modules for features only
+ used occasionally -- like modules for external peripherals not yet connected
+ since booting, virtualization software not yet utilized, VPN tunnels, and a
+ few other things. That's because some tasks rely on kernel modules Linux only
+ loads when you execute tasks like the aforementioned ones for the first time.
+
+ This drawback of localmodconfig is nothing you should lose sleep over, but
+ something to keep in mind: if something is misbehaving with the kernels built
+ during this guide, this is most likely the reason. You can reduce or nearly
+ eliminate the risk with tricks outlined in the reference section; but when
+ building a kernel just for quick testing purposes this is usually not worth
+ spending much effort on, as long as it boots and allows to properly test the
+ feature that causes trouble.
+
+ [:ref:`details <localmodconfig_bisref>`]
+
+.. _tagging_bissbs:
+
+* Ensure all the kernels you will build are clearly identifiable using a special
+ tag and a unique version number::
+
+ ./scripts/config --set-str CONFIG_LOCALVERSION '-local'
+ ./scripts/config -e CONFIG_LOCALVERSION_AUTO
+
+ [:ref:`details <tagging_bisref>`]
+
+.. _debugsymbols_bissbs:
+
+* Decide how to handle debug symbols.
+
+ In the context of this document it is often wise to enable them, as there is a
+ decent chance you will need to decode a stack trace from a 'panic', 'Oops',
+ 'warning', or 'BUG'::
+
+ ./scripts/config -d DEBUG_INFO_NONE -e KALLSYMS_ALL -e DEBUG_KERNEL \
+ -e DEBUG_INFO -e DEBUG_INFO_DWARF_TOOLCHAIN_DEFAULT -e KALLSYMS
+
+ But if you are extremely short on storage space, you might want to disable
+ debug symbols instead::
+
+ ./scripts/config -d DEBUG_INFO -d DEBUG_INFO_DWARF_TOOLCHAIN_DEFAULT \
+ -d DEBUG_INFO_DWARF4 -d DEBUG_INFO_DWARF5 -e CONFIG_DEBUG_INFO_NONE
+
+ [:ref:`details <debugsymbols_bisref>`]
+
+.. _configmods_bissbs:
+
+* Check if you may want or need to adjust some other kernel configuration
+ options:
+
+ * Are you running Debian? Then you want to avoid known problems by performing
+ additional adjustments explained in the reference section.
+
+ [:ref:`details <configmods_distros_bisref>`].
+
+ * If you want to influence other aspects of the configuration, do so now using
+ your preferred tool. Note, to use make targets like 'menuconfig' or
+ 'nconfig', you will need to install the development files of ncurses; for
+ 'xconfig' you likewise need the Qt5 or Qt6 headers.
+
+ [:ref:`details <configmods_individual_bisref>`].
+
+.. _saveconfig_bissbs:
+
+* Reprocess the .config after the latest adjustments and store it in a safe
+ place::
+
+ make olddefconfig
+ cp .config ~/kernel-config-working
+
+ [:ref:`details <saveconfig_bisref>`]
+
+.. _introlatestcheck_bissbs:
+
+Segment 1: try to reproduce the problem with the latest codebase
+----------------------------------------------------------------
+
+The following steps verify if the problem occurs with the code currently
+supported by developers. In case you face a regression, it also checks that the
+problem is not caused by some .config change, as reporting the issue then would
+be a waste of time. [:ref:`details <introlatestcheck_bisref>`]
+
+.. _checkoutmaster_bissbs:
+
+* Check out the latest Linux codebase.
+
+ * Are your 'good' and 'bad' versions from the same stable or longterm series?
+ Then check the `front page of kernel.org <https://kernel.org/>`_: if it
+ lists a release from that series without an '[EOL]' tag, check out the series'
+ latest version ('linux-6.1.y' in the following example)::
+
+ cd ~/linux/
+ git switch --discard-changes --detach stable/linux-6.1.y
+
+ Your series is unsupported if it is not listed or carries an 'end of life'
+ tag. In that case you might want to check if a successor series (say
+ linux-6.2.y) or mainline (see next point) fixes the bug.
+
+ * In all other cases, run::
+
+ cd ~/linux/
+ git switch --discard-changes --detach mainline/master
+
+ [:ref:`details <checkoutmaster_bisref>`]
+
+.. _build_bissbs:
+
+* Build the image and the modules of your first kernel using the config file you
+ prepared::
+
+ cp ~/kernel-config-working .config
+ make olddefconfig
+ make -j $(nproc --all)
+
+ If you want your kernel packaged up as deb, rpm, or tar file, see the
+ reference section for alternatives, which obviously will require other
+ steps to install as well.
+
+ [:ref:`details <build_bisref>`]
+
+.. _install_bissbs:
+
+* Install your newly built kernel.
+
+ Before doing so, consider checking if there is still enough space for it::
+
+ df -h /boot/ /lib/modules/
+
+ For now assume 150 MByte in /boot/ and 200 in /lib/modules/ will suffice; how
+ much your kernels actually require will be determined later during this guide.
+
+ Now install the kernel's modules and its image, which will be stored in
+ parallel to your Linux distribution's kernels::
+
+ sudo make modules_install
+ command -v installkernel && sudo make install
+
+ The second command ideally will take care of three steps required at this
+ point: copying the kernel's image to /boot/, generating an initramfs, and
+ adding an entry for both to the boot loader's configuration.
+
+ Sadly some distributions (among them Arch Linux, its derivatives, and many
+ immutable Linux distributions) will perform none or only some of those tasks.
+ You therefore want to check if all of them were taken care of and manually
+ perform those that were not. The reference section provides further details on
+ that; your distribution's documentation might help, too.
+
+ Once you have figured out the steps needed at this point, consider writing
+ them down: if you build more kernels as described in segments 2 and 3, you
+ will have to perform them again after executing
+ ``command -v installkernel [...]``.
+
+ [:ref:`details <install_bisref>`]
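+
+ The free-space check from this step can be wrapped in a small helper. This is
+ only a sketch: the function name ``has_free_mb`` is made up for illustration
+ and is not part of the kernel tree, and the thresholds in the usage comment
+ are the rough estimates mentioned above, not measured values:
+
```shell
# has_free_mb DIR MB (hypothetical helper, not part of the kernel tree):
# succeed if the filesystem holding DIR has at least MB mebibytes available.
has_free_mb() {
    # 'df -Pk' prints POSIX-format output in 1024-byte blocks; the fourth
    # column of the second line is the 'Available' value.
    avail_kb=$(df -Pk "$1" | awk 'NR == 2 { print $4 }')
    # Treat a failed lookup as 'no space available'.
    [ "${avail_kb:-0}" -ge $(( $2 * 1024 )) ]
}

# Possible use before installing, with the rough numbers mentioned above:
#   has_free_mb /boot/ 150 && has_free_mb /lib/modules/ 200 ||
#       echo "Free some space before installing another kernel"
```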
+
+.. _storagespace_bissbs:
+
+* In case you plan to follow this guide further, check how much storage space
+ the kernel, its modules, and other related files like the initramfs consume::
+
+ du -ch /boot/*$(make -s kernelrelease)* | tail -n 1
+ du -sh /lib/modules/$(make -s kernelrelease)/
+
+ Write down or remember those two values for later: they enable you to prevent
+ running out of disk space accidentally during a bisection.
+
+ [:ref:`details <storagespace_bisref>`]
+
+.. _kernelrelease_bissbs:
+
+* Show and store the kernelrelease identifier of the kernel you just built::
+
+ make -s kernelrelease | tee -a ~/kernels-built
+
+ Remember the identifier momentarily, as it will help you pick the right kernel
+ from the boot menu upon restarting.
+
+* Reboot into your newly built kernel. To ensure you actually started the one
+ you just built, you might want to verify that the output of these commands
+ matches::
+
+ tail -n 1 ~/kernels-built
+ uname -r
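+
+ If you prefer not to compare the two lines by eye, the comparison can be
+ scripted. A minimal sketch, assuming the release identifiers were stored in
+ '~/kernels-built' as described above; the helper name ``last_built_matches``
+ is hypothetical:
+
```shell
# last_built_matches FILE RELEASE (hypothetical helper): succeed if the last
# line of FILE is identical to RELEASE.
last_built_matches() {
    [ "$(tail -n 1 "$1")" = "$2" ]
}

# Possible use after rebooting:
#   last_built_matches ~/kernels-built "$(uname -r)" ||
#       echo "You booted $(uname -r), not the kernel you built last"
```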
+
+.. _tainted_bissbs:
+
+* Check if the kernel marked itself as 'tainted'::
+
+ cat /proc/sys/kernel/tainted
+
+ If that command does not return '0', check the reference section, as the cause
+ for this might interfere with your testing.
+
+ [:ref:`details <tainted_bisref>`]
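+
+ The value printed is a bit field, so a non-zero number can be hard to
+ interpret at a glance. The following sketch only lists which bit positions
+ are set; what each bit means is explained in
+ Documentation/admin-guide/tainted-kernels.rst. The helper name
+ ``tainted_bits`` is made up for this example:
+
```shell
# tainted_bits VALUE (hypothetical helper): print the positions of all bits
# set in VALUE, lowest first, separated by spaces. Prints an empty line for 0.
tainted_bits() {
    value=$1
    bit=0
    out=""
    while [ "$value" -gt 0 ]; do
        if [ $(( value & 1 )) -eq 1 ]; then
            # Append the bit position, adding a separator if needed.
            out="${out:+$out }$bit"
        fi
        value=$(( value >> 1 ))
        bit=$(( bit + 1 ))
    done
    echo "$out"
}

# Possible use:
#   tainted_bits "$(cat /proc/sys/kernel/tainted)"
```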
+
+.. _recheckbroken_bissbs:
+
+* Verify if your bug occurs with the newly built kernel. If it does not, check
+ out the instructions in the reference section to ensure nothing went sideways
+ during your tests.
+
+ [:ref:`details <recheckbroken_bisref>`]
+
+.. _recheckstablebroken_bissbs:
+
+* Did you just build a stable or longterm kernel? And were you able to reproduce
+ the regression with it? Then you should test the latest mainline codebase as
+ well, because the result determines which developers the bug must be submitted
+ to.
+
+ To prepare that test, check out current mainline::
+
+ cd ~/linux/
+ git switch --discard-changes --detach mainline/master
+
+ Now use the checked out code to build and install another kernel using the
+ commands the earlier steps already described in more detail::
+
+ cp ~/kernel-config-working .config
+ make olddefconfig
+ make -j $(nproc --all)
+ # * Check if the free space suffices holding another kernel:
+ df -h /boot/ /lib/modules/
+ sudo make modules_install
+ command -v installkernel && sudo make install
+ make -s kernelrelease | tee -a ~/kernels-built
+ reboot
+
+ Confirm you booted the kernel you intended to start and check its tainted
+ status::
+
+ tail -n 1 ~/kernels-built
+ uname -r
+ cat /proc/sys/kernel/tainted
+
+ Now verify if this kernel is showing the problem. If it does, then you need
+ to report the bug to the primary developers; if it does not, report it to the
+ stable team. See Documentation/admin-guide/reporting-issues.rst for details.
+
+ [:ref:`details <recheckstablebroken_bisref>`]
+
+Do you follow this guide to verify if a problem is present in the code
+currently supported by Linux kernel developers? Then you are done at this
+point. If you later want to remove the kernel you just built, check out
+:ref:`Complementary tasks: cleanup during and after following this guide <introclosure_bissbs>`.
+
+In case you face a regression, move on and execute at least the next segment
+as well.
+
+.. _introworkingcheck_bissbs:
+
+Segment 2: check if the kernels you build work fine
+---------------------------------------------------
+
+In case of a regression, you now want to ensure the trimmed configuration file
+you created earlier works as expected; a bisection with the .config file
+otherwise would be a waste of time. [:ref:`details <introworkingcheck_bisref>`]
+
+.. _recheckworking_bissbs:
+
+* Build your own variant of the 'working' kernel and check if the feature that
+ regressed works as expected with it.
+
+ Start by checking out the sources for the version earlier established as
+ 'good' (once again assumed to be 6.0 here)::
+
+ cd ~/linux/
+ git switch --discard-changes --detach v6.0
+
+ Now use the checked out code to configure, build, and install another kernel
+ using the commands the previous subsection explained in more detail::
+
+ cp ~/kernel-config-working .config
+ make olddefconfig
+ make -j $(nproc --all)
+ # * Check if the free space suffices holding another kernel:
+ df -h /boot/ /lib/modules/
+ sudo make modules_install
+ command -v installkernel && sudo make install
+ make -s kernelrelease | tee -a ~/kernels-built
+ reboot
+
+ When the system booted, you may want to verify once again that the
+ kernel you started is the one you just built::
+
+ tail -n 1 ~/kernels-built
+ uname -r
+
+ Now check if this kernel works as expected; if not, consult the reference
+ section for further instructions.
+
+ [:ref:`details <recheckworking_bisref>`]
+
+.. _introbisect_bissbs:
+
+Segment 3: perform the bisection and validate the result
+--------------------------------------------------------
+
+With all the preparations and precaution builds taken care of, you are now ready
+to begin the bisection. This will make you build quite a few kernels -- usually
+about 15 in case you encountered a regression when updating to a newer series
+(say from 6.0.13 to 6.1.5). But do not worry, due to the trimmed build
+configuration created earlier this works a lot faster than many people assume:
+on average it often takes only about 10 to 15 minutes to compile each kernel on
+commodity x86 machines.
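+
+The number of kernels to build grows only slowly with the size of the range,
+because every test roughly halves the number of commits left to check. The
+following sketch counts those halvings. It is a rough illustration and not the
+estimate Git itself prints; the function name and the commit count in the usage
+comment are assumptions made for this example:
+
```shell
# bisect_steps_estimate COUNT (hypothetical helper): print how often COUNT
# commits must be halved (rounding up) until a single commit remains.
bisect_steps_estimate() {
    remaining=$1
    steps=0
    while [ "$remaining" -gt 1 ]; do
        remaining=$(( (remaining + 1) / 2 ))
        steps=$(( steps + 1 ))
    done
    echo "$steps"
}

# Possible use, with an assumed range of 16000 commits:
#   bisect_steps_estimate 16000
```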
+
+.. _bisectstart_bissbs:
+
+* Start the bisection and tell Git about the versions earlier established as
+ 'good' (6.0 in the following example command) and 'bad' (6.1.5)::
+
+ cd ~/linux/
+ git bisect start
+ git bisect good v6.0
+ git bisect bad v6.1.5
+
+ [:ref:`details <bisectstart_bisref>`]
+
+.. _bisectbuild_bissbs:
+
+* Now use the code Git checked out to build, install, and boot a kernel using
+ the commands introduced earlier::
+
+ cp ~/kernel-config-working .config
+ make olddefconfig
+ make -j $(nproc --all)
+ # * Check if the free space suffices holding another kernel:
+ df -h /boot/ /lib/modules/
+ sudo make modules_install
+ command -v installkernel && sudo make install
+ make -s kernelrelease | tee -a ~/kernels-built
+ reboot
+
+ If compilation fails for some reason, run ``git bisect skip`` and restart
+ executing the stack of commands from the beginning.
+
+ In case you skipped the 'test latest codebase' step in the guide, check its
+ description as for why the 'df [...]' and 'make -s kernelrelease [...]'
+ commands are here.
+
+ Important note: the latter command from this point on will print release
+ identifiers that might look odd or wrong to you -- which they are not, as it's
+ totally normal to see release identifiers like '6.0-rc1-local-gcafec0cacaca0'
+ if you bisect between versions 6.1 and 6.2 for example.
+
+ [:ref:`details <bisectbuild_bisref>`]
+
+.. _bisecttest_bissbs:
+
+* Now check if the feature that regressed works in the kernel you just built.
+
+ You again might want to start by making sure the kernel you booted is the one
+ you just built::
+
+ cd ~/linux/
+ tail -n 1 ~/kernels-built
+ uname -r
+
+ Now verify if the feature that regressed works at this kernel bisection point.
+ If it does, run this::
+
+ git bisect good
+
+ If it does not, run this::
+
+ git bisect bad
+
+ Be sure about what you tell Git, as getting this wrong just once will send the
+ rest of the bisection totally off course.
+
+ While the bisection is ongoing, Git will use the information you provided to
+ find and check out another bisection point for you to test. While doing so, it
+ will print something like 'Bisecting: 675 revisions left to test after this
+ (roughly 10 steps)' to indicate how many further changes it expects to be
+ tested. Now build and install another kernel using the instructions from the
+ previous step; afterwards follow the instructions in this step again.
+
+ Repeat this again and again until you finish the bisection -- that's the case
+ when Git after tagging a change as 'good' or 'bad' prints something like
+ 'cafecaca0c0dacafecaca0c0dacafecaca0c0da is the first bad commit'; right
+ afterwards it will show some details about the culprit including the patch
+ description of the change. The latter might fill your terminal screen, so you
+ might need to scroll up to see the message mentioning the culprit;
+ alternatively, run ``git bisect log > ~/bisection-log``.
+
+ [:ref:`details <bisecttest_bisref>`]
+
+.. _bisectlog_bissbs:
+
+* Store Git's bisection log and the current .config file in a safe place before
+ telling Git to reset the sources to the state before the bisection::
+
+ cd ~/linux/
+ git bisect log > ~/bisection-log
+ cp .config ~/bisection-config-culprit
+ git bisect reset
+
+ [:ref:`details <bisectlog_bisref>`]
+
+.. _revert_bissbs:
+
+* Try reverting the culprit on top of latest mainline to see if this fixes your
+ regression.
+
+ This is optional, as it might be impossible or hard to realize. The former is
+ the case if the bisection determined a merge commit as the culprit; the
+ latter happens if other changes depend on the culprit. But if the revert
+ succeeds, it is worth building another kernel, as it validates the result of
+ a bisection, which can easily go off course; it furthermore will let kernel
+ developers know if they can resolve the regression with a quick revert.
+
+ Begin by checking out the latest codebase depending on the range you bisected:
+
+ * Did you face a regression within a stable/longterm series (say between
+ 6.0.13 and 6.0.15) that does not happen in mainline? Then check out the
+ latest codebase for the affected series like this::
+
+ git fetch stable
+ git switch --discard-changes --detach stable/linux-6.0.y
+
+ * In all other cases check out latest mainline::
+
+ git fetch mainline
+ git switch --discard-changes --detach mainline/master
+
+ If you bisected a regression within a stable/longterm series that also
+ happens in mainline, there is one more thing to do: look up the mainline
+ commit-id. To do so, use a command like ``git show abcdcafecabcd`` to
+ view the patch description of the culprit. There will be a line near
+ the top which looks like 'commit cafec0cacaca0 upstream.' or
+ 'Upstream commit cafec0cacaca0'; use that commit-id in the next command
+ and not the one the bisection blamed.
+
+ Now try reverting the culprit by specifying its commit id::
+
+ git revert --no-edit cafec0cacaca0
+
+ If that fails, give up trying and move on to the next step; if it works,
+ adjust the tag to facilitate the identification and prevent accidentally
+ overwriting another kernel::
+
+ cp ~/kernel-config-working .config
+ ./scripts/config --set-str CONFIG_LOCALVERSION '-local-cafec0cacaca0-reverted'
+
+ Build a kernel using the familiar command sequence, just without copying the
+ base .config over::
+
+ make olddefconfig &&
+ make -j $(nproc --all)
+ # * Check if the free space suffices holding another kernel:
+ df -h /boot/ /lib/modules/
+ sudo make modules_install
+ command -v installkernel && sudo make install
+ make -s kernelrelease | tee -a ~/kernels-built
+ reboot
+
+ Now check one last time if the feature that made you perform a bisection works
+ with that kernel: if everything went well, it should not show the regression.
+
+ [:ref:`details <revert_bisref>`]
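+
+ Looking up the mainline commit-id in the patch description of a stable
+ commit, as described above, can be scripted. The following is a sketch that
+ only handles the two line formats quoted in this step; the helper name
+ ``upstream_commit_id`` is hypothetical, and ``sed -E`` is assumed to be
+ available:
+
```shell
# upstream_commit_id (hypothetical helper): read a commit message on stdin
# and print the first mainline commit-id found in a line of the form
# 'commit <id> upstream.' or 'Upstream commit <id>'. Prints nothing if no
# such line exists.
upstream_commit_id() {
    sed -n -E \
        -e 's/^[[:space:]]*commit ([0-9a-f]{7,40}) upstream\.?[[:space:]]*$/\1/p' \
        -e 's/^[[:space:]]*\[?[[:space:]]*Upstream commit ([0-9a-f]{7,40}).*$/\1/p' |
        head -n 1
}

# Possible use with the placeholder commit-id from the text above:
#   git show --no-patch abcdcafecabcd | upstream_commit_id
```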
+
+.. _introclosure_bissbs:
+
+Complementary tasks: cleanup during and after the bisection
+-----------------------------------------------------------
+
+During and after following this guide you might want or need to remove some of
+the kernels you installed: the boot menu otherwise will become confusing or
+space might run out.
+
+.. _makeroom_bissbs:
+
+* To remove one of the kernels you installed, look up its 'kernelrelease'
+ identifier. This guide stores them in '~/kernels-built', but the following
+ command will print them as well::
+
+ ls -ltr /lib/modules/*-local*
+
+ You in most situations want to remove the oldest kernels built during the
+ actual bisection (i.e. segment 3 of this guide). The two ones you created
+ beforehand (i.e. to test the latest codebase and the version considered
+ 'good') might become handy to verify something later -- thus better keep them
+ around, unless you are really short on storage space.
+
+ To remove the modules of a kernel with the kernelrelease identifier
+ '*6.0-rc1-local-gcafec0cacaca0*', start by removing the directory holding its
+ modules::
+
+ sudo rm -rf /lib/modules/6.0-rc1-local-gcafec0cacaca0
+
+ Afterwards try the following command::
+
+ sudo kernel-install -v remove 6.0-rc1-local-gcafec0cacaca0
+
+ On quite a few distributions this will delete all other kernel files installed
+ while also removing the kernel's entry from the boot menu. But on some
+ distributions kernel-install does not exist or leaves boot-loader entries or
+ kernel image and related files behind; in that case remove them as described
+ in the reference section.
+
+ [:ref:`details <makeroom_bisref>`]
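+
+ To decide which kernels to remove first, it can help to list the release
+ identifiers of your self-built kernels from oldest to newest. A minimal
+ sketch, assuming all of them carry the '-local' tag used throughout this
+ guide; the helper name ``list_local_kernels`` is made up for illustration:
+
```shell
# list_local_kernels [DIR] (hypothetical helper): print the names of all
# entries in DIR (default: /lib/modules) containing '-local', oldest first.
list_local_kernels() {
    # -d lists the directories themselves, -t sorts by modification time,
    # -r reverses the order so the oldest entry comes first.
    ls -dtr "${1:-/lib/modules}"/*-local* 2>/dev/null |
        while read -r dir; do
            basename "$dir"
        done
}

# Possible use:
#   list_local_kernels
```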
+
+.. _finishingtouch_bissbs:
+
+* Once you have finished the bisection, do not immediately remove anything you
+ set up, as you might need a few things again. What is safe to remove depends
+ on the outcome of the bisection:
+
+ * Could you initially reproduce the regression with the latest codebase and
+ after the bisection were able to fix the problem by reverting the culprit on
+ top of the latest codebase? Then you want to keep those two kernels around
+ for a while, but safely remove all others with a '-local' in the release
+ identifier.
+
+ * Did the bisection end on a merge-commit or seem questionable for other
+ reasons? Then you want to keep as many kernels as possible around for a few
+ days: it's pretty likely that you will be asked to recheck something.
+
+ * In other cases it likely is a good idea to keep the following kernels around
+ for some time: the one built from the latest codebase, the one created from
+ the version considered 'good', and the last three or four you compiled
+ during the actual bisection process.
+
+ [:ref:`details <finishingtouch_bisref>`]
+
+.. _introoptional_bissbs:
+
+Optional: test reverts, patches, or later versions
+--------------------------------------------------
+
+While or after reporting a bug, you might want or potentially will be asked to
+test reverts, debug patches, proposed fixes, or other versions. In that case
+follow these instructions.
+
+* Update your Git clone and check out the latest code.
+
+ * In case you want to test mainline, fetch its latest changes before checking
+ its code out::
+
+ git fetch mainline
+ git switch --discard-changes --detach mainline/master
+
+ * In case you want to test a stable or longterm kernel, first add the branch
+ holding the series you are interested in (6.2 in the example), unless you
+ already did so earlier::
+
+ git remote set-branches --add stable linux-6.2.y
+
+ Then fetch the latest changes and check out the latest version from the
+ series::
+
+ git fetch stable
+ git switch --discard-changes --detach stable/linux-6.2.y
+
+* Copy your kernel build configuration over::
+
+ cp ~/kernel-config-working .config
+
+* Your next step depends on what you want to do:
+
+ * In case you just want to test the latest codebase, head to the next step,
+ you are already all set.
+
+ * In case you want to test if a revert fixes an issue, revert one or multiple
+ changes by specifying their commit ids::
+
+ git revert --no-edit cafec0cacaca0
+
+ Now give that kernel a special tag to facilitate its identification and
+ prevent accidentally overwriting another kernel::
+
+ ./scripts/config --set-str CONFIG_LOCALVERSION '-local-cafec0cacaca0-reverted'
+
+ * In case you want to test a patch, store the patch in a file like
+ '/tmp/foobars-proposed-fix-v1.patch' and apply it like this::
+
+ git apply /tmp/foobars-proposed-fix-v1.patch
+
+ In case of multiple patches, repeat this step with the others.
+
+ Now give that kernel a special tag to facilitate its identification and
+ prevent accidentally overwriting another kernel::
+
+ ./scripts/config --set-str CONFIG_LOCALVERSION '-local-foobars-fix-v1'
+
+* Build a kernel using the familiar commands, just without copying the kernel
+ build configuration over, as that has been taken care of already::
+
+ make olddefconfig &&
+ make -j $(nproc --all)
+ # * Check if the free space suffices holding another kernel:
+ df -h /boot/ /lib/modules/
+ sudo make modules_install
+ command -v installkernel && sudo make install
+ make -s kernelrelease | tee -a ~/kernels-built
+ reboot
+
+* Now verify you booted the newly built kernel and check it.
+
+[:ref:`details <introoptional_bisref>`]
+
+.. _submit_improvements:
+
+Conclusion
+----------
+
+You have reached the end of the step-by-step guide.
+
+Did you run into trouble following any of the above steps not cleared up by the
+reference section below? Did you spot errors? Or do you have ideas how to
+improve the guide?
+
+If any of that applies, please take a moment and let the maintainer of this
+document know by email (Thorsten Leemhuis <linux@leemhuis.info>), ideally while
+CCing the Linux docs mailing list (linux-doc@vger.kernel.org). Such feedback is
+vital to improve this text further, which is in everybody's interest, as it
+will enable more people to master the task described here -- and hopefully also
+improve similar guides inspired by this one.
+
+
+Reference section for the step-by-step guide
+============================================
+
+This section holds additional information for almost all the items in the above
+step-by-step guide.
+
+Preparations for building your own kernels
+------------------------------------------
+
+ *The steps in this section lay the groundwork for all further tests.*
+ [:ref:`... <introprep_bissbs>`]
+
+The steps in all later sections of this guide depend on those described here.
+
+[:ref:`back to step-by-step guide <introprep_bissbs>`].
+
+.. _backup_bisref:
+
+Prepare for emergencies
+~~~~~~~~~~~~~~~~~~~~~~~
+
+ *Create a fresh backup and put system repair and restore tools at hand.*
+ [:ref:`... <backup_bissbs>`]
+
+Remember, you are dealing with computers, which sometimes do unexpected things
+-- especially if you fiddle with crucial parts like the kernel of an operating
+system. That's what you are about to do in this process. Hence, better prepare
+for something going sideways, even if that should not happen.
+
+[:ref:`back to step-by-step guide <backup_bissbs>`]
+
+.. _vanilla_bisref:
+
+Remove anything related to externally maintained kernel modules
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+ *Remove all software that depends on externally developed kernel drivers or
+ builds them automatically.* [:ref:`...<vanilla_bissbs>`]
+
+Externally developed kernel modules can easily cause trouble during a bisection.
+
+But there is a more important reason why this guide contains this step: most
+kernel developers will not care about reports about regressions occurring with
+kernels that utilize such modules. That's because such kernels are not
+considered 'vanilla' anymore, as Documentation/admin-guide/reporting-issues.rst
+explains in more detail.
+
+[:ref:`back to step-by-step guide <vanilla_bissbs>`]
+
+.. _secureboot_bisref:
+
+Deal with techniques like Secure Boot
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+ *On platforms with 'Secure Boot' or similar techniques, prepare everything to
+ ensure the system will permit your self-compiled kernel to boot later.*
+ [:ref:`... <secureboot_bissbs>`]
+
+Many modern systems allow only certain operating systems to start; that's why
+they reject booting self-compiled kernels by default.
+
+You ideally deal with this by making your platform trust your self-built kernels
+with the help of a certificate. How to do that is not described
+here, as it requires various steps that would take the text too far away from
+its purpose; 'Documentation/admin-guide/module-signing.rst' and various
+websites already explain everything needed in more detail.
+
+Temporarily disabling solutions like Secure Boot is another way to make your own
+Linux boot. On commodity x86 systems it is possible to do this in the BIOS Setup
+utility; the required steps vary a lot between machines and therefore cannot be
+described here.
+
+On mainstream x86 Linux distributions there is a third and universal option:
+disable all Secure Boot restrictions for your Linux environment. You can
+initiate this process by running ``mokutil --disable-validation``; this will
+tell you to create a one-time password, which is safe to write down. Now
+restart; right after your BIOS performed all self-tests the bootloader Shim will
+show a blue box with a message 'Press any key to perform MOK management'. Hit
+some key before the countdown expires, which will open a menu. Choose 'Change
+Secure Boot state'. Shim's 'MokManager' will now ask you to enter three
+randomly chosen characters from the one-time password specified earlier. Once
+you provided them, confirm you really want to disable the validation.
+Afterwards, permit MokManager to reboot the machine.
+
+[:ref:`back to step-by-step guide <secureboot_bissbs>`]
+
+.. _bootworking_bisref:
+
+Boot the last kernel that was working
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+ *Boot into the last working kernel and briefly recheck if the feature that
+ regressed really works.* [:ref:`...<bootworking_bissbs>`]
+
+This will make later steps that cover creating and trimming the configuration do
+the right thing.
+
+[:ref:`back to step-by-step guide <bootworking_bissbs>`]
+
+.. _diskspace_bisref:
+
+Space requirements
+~~~~~~~~~~~~~~~~~~
+
+ *Ensure to have enough free space for building Linux.*
+ [:ref:`... <diskspace_bissbs>`]
+
+The numbers mentioned are rough estimates with a big extra charge to be on the
+safe side, so often you will need less.
+
+If you have space constraints, be sure to pay attention to the :ref:`step about
+debug symbols <debugsymbols_bissbs>` and its :ref:`accompanying reference
+section <debugsymbols_bisref>`, as disabling them will reduce the consumed disk
+space by quite a few gigabytes.
+
+[:ref:`back to step-by-step guide <diskspace_bissbs>`]
+
+.. _rangecheck_bisref:
+
+Bisection range
+~~~~~~~~~~~~~~~
+
+ *Determine the kernel versions considered 'good' and 'bad' throughout this
+ guide.* [:ref:`...<rangecheck_bissbs>`]
+
+Establishing the range of commits to be checked is mostly straightforward,
+except when a regression occurred when switching from a release of one stable
+series to a release of a later series (e.g. from 6.0.13 to 6.1.5). In that case
+Git will need some hand holding, as there is no straight line of descent.
+
+That's because with the release of 6.0 mainline carried on to 6.1 while the
+stable series 6.0.y branched to the side. It's therefore theoretically possible
+that the feature broken in 6.1.5 only worked in 6.0.13, as it was fixed by a
+commit that went into one of the 6.0.y releases, but never hit mainline or the
+6.1.y series. Thankfully that normally should not happen due to the way the
+stable/longterm maintainers maintain the code. It's thus pretty safe to assume
+6.0 as a 'good' kernel. That assumption will be tested anyway, as that kernel
+will be built and tested in segment 2 of this guide; Git would force you to do
+this as well, if you tried bisecting between 6.0.13 and 6.1.5.
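+
+In the example discussed here, the 'good' version handed to Git is therefore
+the mainline release the stable version descends from. Deriving that tag from a
+version number can be sketched as follows; the helper name ``mainline_base`` is
+hypothetical, and the sketch assumes version numbers of the form 'X.Y' or
+'X.Y.Z' as used in the examples of this guide:
+
```shell
# mainline_base VERSION (hypothetical helper): print the Git tag of the
# mainline release a stable version like '6.0.13' is based on.
mainline_base() {
    # Accept both '6.0.13' and 'v6.0.13'.
    version=${1#v}
    # Keep only the first two components of the version number.
    echo "v$(echo "$version" | cut -d. -f1,2)"
}

# Possible use:
#   git bisect good "$(mainline_base 6.0.13)"
```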
+
+[:ref:`back to step-by-step guide <rangecheck_bissbs>`]
+
+.. _buildrequires_bisref:
+
+Install build requirements
+~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+ *Install all software required to build a Linux kernel.*
+ [:ref:`...<buildrequires_bissbs>`]
+
+The kernel is pretty stand-alone, but besides tools like the compiler you will
+sometimes need a few libraries to build one. How to install everything needed
+depends on your Linux distribution and the configuration of the kernel you are
+about to build.
+
+Here are a few examples what you typically need on some mainstream
+distributions:
+
+* Arch Linux and derivatives::
+
+ sudo pacman --needed -S bc binutils bison flex gcc git kmod libelf openssl \
+ pahole perl zlib ncurses qt6-base
+
+* Debian, Ubuntu, and derivatives::
+
+ sudo apt install bc binutils bison dwarves flex gcc git kmod libelf-dev \
+ libssl-dev make openssl pahole perl-base pkg-config zlib1g-dev \
+ libncurses-dev qt6-base-dev g++
+
+* Fedora and derivatives::
+
+ sudo dnf install binutils \
+ /usr/bin/{bc,bison,flex,gcc,git,openssl,make,perl,pahole,rpmbuild} \
+ /usr/include/{libelf.h,openssl/pkcs7.h,zlib.h,ncurses.h,qt6/QtGui/QAction}
+
+* openSUSE and derivatives::
+
+ sudo zypper install bc binutils bison dwarves flex gcc git \
+ kernel-install-tools libelf-devel make modutils openssl openssl-devel \
+ perl-base zlib-devel rpm-build ncurses-devel qt6-base-devel
+
+These commands install a few packages that are often, but not always needed. You
+for example might want to skip installing the development headers for ncurses,
+which you will only need in case you later might want to adjust the kernel build
+configuration using the make targets 'menuconfig' or 'nconfig'; likewise omit
+the headers of Qt6 if you do not plan to adjust the .config using 'xconfig'.
+
+You furthermore might need additional libraries and their development headers
+for tasks not covered in this guide -- for example when building utilities from
+the kernel's tools/ directory.
+
+[:ref:`back to step-by-step guide <buildrequires_bissbs>`]
+
+.. _sources_bisref:
+
+Download the sources using Git
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+ *Retrieve the Linux mainline sources.*
+ [:ref:`...<sources_bissbs>`]
+
+The step-by-step guide outlines how to download the Linux sources using a full
+Git clone of Linus' mainline repository. There is nothing more to say about
+that -- but there are two alternative ways to retrieve the sources that might
+work better for you:
+
+* If you have an unreliable internet connection, consider
+ :ref:`using a 'Git bundle'<sources_bundle_bisref>`.
+
+* If downloading the complete repository would take too long or requires too
+ much storage space, consider :ref:`using a 'shallow
+ clone'<sources_shallow_bisref>`.
+
+.. _sources_bundle_bisref:
+
+Downloading Linux mainline sources using a bundle
+"""""""""""""""""""""""""""""""""""""""""""""""""
+
+Use the following commands to retrieve the Linux mainline sources using a
+bundle::
+
+ wget -c \
+ https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/clone.bundle
+ git clone --no-checkout clone.bundle ~/linux/
+ cd ~/linux/
+ git remote remove origin
+ git remote add mainline \
+ https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
+ git fetch mainline
+ git remote add -t master stable \
+ https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git
+
+In case the 'wget' command fails, just re-execute it; it will pick up where
+it left off.
+
+[:ref:`back to step-by-step guide <sources_bissbs>`]
+[:ref:`back to section intro <sources_bisref>`]
+
+.. _sources_shallow_bisref:
+
+Downloading Linux mainline sources using a shallow clone
+""""""""""""""""""""""""""""""""""""""""""""""""""""""""
+
+First, execute the following command to retrieve the latest mainline codebase::
+
+ git clone -o mainline --no-checkout --depth 1 -b master \
+ https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git ~/linux/
+ cd ~/linux/
+ git remote add -t master stable \
+ https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git
+
+Now deepen your clone's history to the second predecessor of the mainline
+release of your 'good' version. In case the latter are 6.0 or 6.0.13, 5.19 would
+be the first predecessor and 5.18 the second -- hence deepen the history up to
+that version::
+
+ git fetch --shallow-exclude=v5.18 mainline
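+
+Picking the second predecessor can be error-prone around major version bumps
+(5.19 was followed by 6.0). The following sketch selects it from a list of tags
+instead of calculating it. It assumes mainline release tags of the form 'vX.Y'
+and a ``sort`` that supports ``-V``; the helper name ``second_predecessor`` is
+made up for this example:
+
```shell
# second_predecessor TAG (hypothetical helper): read Git tags on stdin, keep
# only mainline release tags of the form 'vX.Y', and print the one two
# releases before TAG. Prints nothing if TAG or its predecessors are missing.
second_predecessor() {
    grep -E '^v[0-9]+\.[0-9]+$' | sort -V |
        awk -v good="$1" '
            { tags[NR] = $0 }
            $0 == good && NR > 2 { print tags[NR - 2] }'
}

# Possible use with a list of tag names, one per line:
#   git tag -l | second_predecessor v6.0
```
+
+Note that a fresh shallow clone might not hold the older tags locally yet, so
+the list of tags might have to come from elsewhere.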
+
+Afterwards add the stable Git repository as remote and all required stable
+branches as explained in the step-by-step guide.
+
+Note, shallow clones have a few peculiar characteristics:
+
+* For bisections the history needs to be deepened a few mainline versions
+ farther than it seems necessary, as explained above already. That's because
+ Git otherwise will be unable to revert or describe most of the commits within
+ a range (say 6.1..6.2), as they are internally based on earlier kernel
+ releases (like 6.0-rc2 or 5.19-rc3).
+
+* This document in most places uses ``git fetch`` with ``--shallow-exclude=``
+ to specify the earliest version you care about (or to be precise: its git
+ tag). You alternatively can use the parameter ``--shallow-since=`` to specify
+ an absolute (say ``'2023-07-15'``) or relative (``'12 months'``) date to
+ define the depth of the history you want to download. When using it while
+ bisecting mainline, be sure to deepen the history to at least 7 months before
+ the release of the mainline version your 'good' kernel is based on.
+
+* Be warned, when deepening your clone you might encounter an error like
+ 'fatal: error in object: unshallow cafecaca0c0dacafecaca0c0dacafecaca0c0da'.
+ In that case run ``git repack -d`` and try again.
+
+[:ref:`back to step-by-step guide <sources_bissbs>`]
+[:ref:`back to section intro <sources_bisref>`]
+
+.. _oldconfig_bisref:
+
+Start defining the build configuration for your kernel
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+ *Start preparing a kernel build configuration (the '.config' file).*
+ [:ref:`... <oldconfig_bissbs>`]
+
+*Note, this is the first of multiple steps in this guide that create or modify
+build artifacts. The commands used in this guide store them right in the source
+tree to keep things simple. In case you prefer storing the build artifacts
+separately, create a directory like '~/linux-builddir/' and add the parameter
+``O=~/linux-builddir/`` to all make calls used throughout this guide. You will
+have to point other commands there as well -- among them the ``./scripts/config
+[...]`` commands, which will require ``--file ~/linux-builddir/.config`` to
+locate the right build configuration.*
+
+Two things can easily go wrong when creating a .config file as advised:
+
+* The olddefconfig target will use a .config file from your build directory, if
+ one is already present there (e.g. '~/linux/.config'). That's totally fine if
+ that's what you intend (see next step), but in all other cases you want to
+ delete it. This for example is important in case you followed this guide
+ further, but due to problems come back here to redo the configuration from
+ scratch.
+
+* Sometimes olddefconfig is unable to locate the .config file for your running
+ kernel and will use defaults, as briefly outlined in the guide. In that case
+ check if your distribution ships the configuration somewhere and manually put
+ it in the right place (e.g. '~/linux/.config') if it does. On distributions
+ where /proc/config.gz exists this can be achieved using this command::
+
+ zcat /proc/config.gz > .config
+
+ Once you put it there, run ``make olddefconfig`` again to adjust it to the
+ needs of the kernel about to be built.
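+
+Checking the usual places for the configuration of the running kernel can be
+sketched as below. Only the two locations mentioned in this section are used in
+the usage comment; your distribution might store the file elsewhere. The helper
+name ``first_existing`` is hypothetical:
+
```shell
# first_existing FILE... (hypothetical helper): print the first readable
# file among the arguments; fail if none of them is readable.
first_existing() {
    for candidate in "$@"; do
        if [ -r "$candidate" ]; then
            echo "$candidate"
            return 0
        fi
    done
    return 1
}

# Possible use:
#   config=$(first_existing "/boot/config-$(uname -r)" /proc/config.gz) &&
#       echo "Found kernel configuration: $config"
```
+
+Keep in mind that /proc/config.gz is compressed and thus needs to be unpacked
+with ``zcat`` as shown above, while a file found in /boot/ can be copied as is.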
+
+Note, the olddefconfig target will set any undefined build options to their
+default value. If you prefer to set such configuration options manually, use
+``make oldconfig`` instead. Then for each undefined configuration option you
+will be asked how to proceed; in case you are unsure what to answer, simply hit
+'enter' to apply the default value. Note though that for bisections you normally
+want to go with the defaults, as you otherwise might enable a new feature that
+causes a problem looking like a regression (for example due to security
+restrictions).
+
+Occasionally odd things happen when trying to use a config file prepared for one
+kernel (say 6.1) on an older mainline release -- especially if it is much older
+(say 5.15). That's one of the reasons why the previous step in the guide told
+you to boot the kernel where everything works. If you manually add a .config
+file you thus want to ensure it's from the working kernel and not from one
+that shows the regression.
+
+In case you want to build kernels for another machine, locate its kernel build
+configuration; usually ``ls /boot/config-$(uname -r)`` will print its name. Copy
+that file to the build machine and store it as ~/linux/.config; afterwards run
+``make olddefconfig`` to adjust it.
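+
+The lookup described in this section can be sketched as a small shell helper.
+It is only an illustration: the function name ``find_kernel_config`` and its
+optional second parameter (a root directory to search below) are made up for
+this example, and it only checks the two locations mentioned above.
+
```shell
# Hypothetical helper: print the path of a build configuration for a kernel
# release, or return non-zero if none of the known locations has one.
# $1: kernel release (defaults to the running kernel)
# $2: root directory to search below (an assumption of this sketch; empty
#     means the real root, a different value is handy for dry runs)
find_kernel_config() {
    release="${1:-$(uname -r)}"
    root="${2:-}"
    # /proc/config.gz always describes the running kernel, so only consider
    # it when that is the release asked for.
    if [ "$release" = "$(uname -r)" ] && [ -r "$root/proc/config.gz" ]; then
        printf '%s\n' "$root/proc/config.gz"
        return 0
    fi
    if [ -r "$root/boot/config-$release" ]; then
        printf '%s\n' "$root/boot/config-$release"
        return 0
    fi
    return 1
}
```
+
+If the helper prints '/proc/config.gz', decompress it with ``zcat`` as shown
+earlier instead of copying it; in all cases run ``make olddefconfig``
+afterwards.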
+
+[:ref:`back to step-by-step guide <oldconfig_bissbs>`]
+
+.. _localmodconfig_bisref:
+
+Trim the build configuration for your kernel
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+ *Disable any kernel modules apparently superfluous for your setup.*
+ [:ref:`... <localmodconfig_bissbs>`]
+
+As explained briefly in the step-by-step guide already: with localmodconfig it
+can easily happen that your self-built kernels will lack modules for tasks you
+did not perform at least once before utilizing this make target. That happens
+when a task requires kernel modules which are only autoloaded when you execute
+it for the first time. So when you never performed that task since starting your
+kernel the modules will not have been loaded -- and from localmodconfig's point
+of view look superfluous, which thus disables them to reduce the amount of code
+to be compiled.
+
+You can try to avoid this by performing typical tasks that often will autoload
+additional kernel modules: start a VM, establish VPN connections, loop-mount a
+CD/DVD ISO, mount network shares (CIFS, NFS, ...), and connect all external
+devices (2FA keys, headsets, webcams, ...) as well as storage devices with file
+systems you otherwise do not utilize (btrfs, ext4, FAT, NTFS, XFS, ...). But it
+is hard to think of everything that might be needed -- even kernel developers
+often forget one thing or another at this point.
+
+Do not let that risk bother you, especially when compiling a kernel only for
+testing purposes: everything typically crucial will be there. And if you forget
+something important you can turn on a missing feature manually later and quickly
+run the commands again to compile and install a kernel that has everything you
+need.
+
+But if you plan to build and use self-built kernels regularly, you might want to
+reduce the risk by recording which modules your system loads over the course of
+a few weeks. You can automate this with `modprobed-db
+<https://github.com/graysky2/modprobed-db>`_. Afterwards use ``LSMOD=<path>`` to
+point localmodconfig to the list of modules modprobed-db noticed being used::
+
+ yes '' | make LSMOD='${HOME}'/.config/modprobed.db localmodconfig
+
+That parameter also allows you to build trimmed kernels for another machine in
+case you copied a suitable .config over to use as base (see previous step). Just
+run ``lsmod > lsmod_foo-machine`` on that system and copy the generated file to
+your build host's home directory. Then run this command instead of the one the
+step-by-step guide mentions::
+
+ yes '' | make LSMOD=~/lsmod_foo-machine localmodconfig
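+
+If you prefer not to install modprobed-db, a rough substitute is to save the
+output of ``lsmod`` from time to time and merge the snapshots into one file for
+``LSMOD=``. The sketch below does that; the function name
+``merge_lsmod_snapshots`` is hypothetical, and the approach assumes that
+localmodconfig only evaluates the module name in the first column of each line
+after the header.
+
```shell
# Hypothetical helper: merge several saved 'lsmod' outputs into one list.
# Prints a single header line followed by one line per module name, keeping
# the first line seen for each module.
# $@: files created earlier with 'lsmod > <file>'
merge_lsmod_snapshots() {
    printf 'Module                  Size  Used by\n'
    # FNR > 1 skips the header of every input file; seen[] drops duplicates.
    awk 'FNR > 1 && !seen[$1]++ { print }' "$@"
}
```
+
+The merged output can be redirected to a file, which then is passed to
+localmodconfig via ``LSMOD=<path>`` as shown above.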
+
+[:ref:`back to step-by-step guide <localmodconfig_bissbs>`]
+
+.. _tagging_bisref:
+
+Tag the kernels about to be built
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+ *Ensure all the kernels you will build are clearly identifiable using a
+ special tag and a unique version identifier.* [:ref:`... <tagging_bissbs>`]
+
+This allows you to differentiate your distribution's kernels from those created
+during this process, as the files or directories for the latter will contain
+'-local' in the name; it also helps you pick the right entry in the boot menu
+and not lose track of your kernels, as their version numbers will look slightly
+confusing during the bisection.
+
+[:ref:`back to step-by-step guide <tagging_bissbs>`]
+
+.. _debugsymbols_bisref:
+
+Decide to enable or disable debug symbols
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+ *Decide how to handle debug symbols.* [:ref:`... <debugsymbols_bissbs>`]
+
+Having debug symbols available can be important when your kernel throws a
+'panic', 'Oops', 'warning', or 'BUG' later when running, as then you will be
+able to find the exact place where the problem occurred in the code. But
+collecting and embedding the needed debug information takes time and consumes
+quite a bit of space: in late 2022 the build artifacts for a typical x86 kernel
+trimmed with localmodconfig consumed around 5 gigabytes of space with debug
+symbols, but less than 1 gigabyte when they were disabled. The resulting kernel
+image and modules are bigger as well, which increases storage requirements for
+/boot/ and load times.
+
+In case you want a small kernel and are unlikely to decode a stack trace later,
+you thus might want to disable debug symbols to avoid those downsides. If it
+later turns out that you need them, just enable them as shown and rebuild the
+kernel.
+
+On the other hand, you definitely want to enable them for this process if there
+is a decent chance that you need to decode a stack trace later. The section
+'Decode failure messages' in Documentation/admin-guide/reporting-issues.rst
+explains this process in more detail.
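+
+To double-check which way a prepared configuration went, you can look for the
+``CONFIG_DEBUG_INFO`` symbol in the .config file. The following is a minimal
+sketch; the function name ``config_has_debug_info`` is made up for this
+example.
+
```shell
# Hypothetical helper: succeed if the given build configuration enables
# debug symbols, i.e. contains the line 'CONFIG_DEBUG_INFO=y'.
# $1: path to the configuration file (defaults to ./.config)
config_has_debug_info() {
    grep -q '^CONFIG_DEBUG_INFO=y$' "${1:-.config}"
}
```
+
+For example, ``config_has_debug_info ~/linux/.config && echo enabled`` prints
+'enabled' only when debug symbols will be built.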
+
+[:ref:`back to step-by-step guide <debugsymbols_bissbs>`]
+
+.. _configmods_bisref:
+
+Adjust build configuration
+~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+ *Check if you may want or need to adjust some other kernel configuration
+ options:*
+
+Depending on your needs you at this point might want or have to adjust some
+kernel configuration options.
+
+.. _configmods_distros_bisref:
+
+Distro specific adjustments
+"""""""""""""""""""""""""""
+
+ *Are you running* [:ref:`... <configmods_bissbs>`]
+
+The following sections help you to avoid build problems that are known to occur
+when following this guide on a few commodity distributions.
+
+**Debian:**
+
+* Remove a stale reference to a certificate file that would cause your build to
+ fail::
+
+ ./scripts/config --set-str SYSTEM_TRUSTED_KEYS ''
+
+ Alternatively, download the needed certificate and make that configuration
+ option point to it, as `the Debian handbook explains in more detail
+ <https://debian-handbook.info/browse/stable/sect.kernel-compilation.html>`_
+ -- or generate your own, as explained in
+ Documentation/admin-guide/module-signing.rst.
+
+[:ref:`back to step-by-step guide <configmods_bissbs>`]
+
+.. _configmods_individual_bisref:
+
+Individual adjustments
+""""""""""""""""""""""
+
+ *If you want to influence the other aspects of the configuration, do so
+ now.* [:ref:`... <configmods_bissbs>`]
+
+At this point you can use a command like ``make menuconfig`` or ``make nconfig``
+to enable or disable certain features using a text-based user interface; to use
+a graphical configuration utility, run ``make xconfig`` instead. All of them
+require development libraries from the toolkits they rely on (ncurses for the
+text-based ones, Qt5 or Qt6 for xconfig); an error message will tell you if
+something required is missing.
+
+[:ref:`back to step-by-step guide <configmods_bissbs>`]
+
+.. _saveconfig_bisref:
+
+Put the .config file aside
+~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+ *Reprocess the .config after the latest changes and store it in a safe place.*
+ [:ref:`... <saveconfig_bissbs>`]
+
+Put the .config you prepared aside, as you want to copy it back to the build
+directory every time during this guide before you start building another
+kernel. That's because going back and forth between different versions can alter
+.config files in odd ways; those occasionally cause side effects that could
+confuse testing or in some cases render the result of your bisection
+meaningless.
+
+[:ref:`back to step-by-step guide <saveconfig_bissbs>`]
+
+.. _introlatestcheck_bisref:
+
+Try to reproduce the problem with the latest codebase
+-----------------------------------------------------
+
+ *Verify the regression is not caused by some .config change and check if it
+ still occurs with the latest codebase.* [:ref:`... <introlatestcheck_bissbs>`]
+
+For some readers it might seem unnecessary to check the latest codebase at this
+point, especially if you did that already with a kernel prepared by your
+distributor or face a regression within a stable/longterm series. But it's
+highly recommended for these reasons:
+
+* You will run into any problems caused by your setup before you actually begin
+ a bisection. That will make it a lot easier to differentiate between 'this
+ most likely is some problem in my setup' and 'this change needs to be skipped
+ during the bisection, as the kernel sources at that stage contain an unrelated
+ problem that causes building or booting to fail'.
+
+* These steps will reveal whether your problem is caused by some change in the
+ build configuration between the 'working' and the 'broken' kernel. This for
+ example can happen when your distributor enabled an additional security
+ feature in the newer kernel which was disabled or not yet supported by the
+ older kernel. That security feature might get in the way of something you
+ do -- in which case your problem from the perspective of the Linux kernel
+ upstream developers is not a regression, as
+ Documentation/admin-guide/reporting-regressions.rst explains in more detail.
+ You thus would waste your time if you'd try to bisect this.
+
+* If the cause for your regression was already fixed in the latest mainline
+ codebase, you'd perform the bisection for nothing. This holds true for a
+ regression you encountered with a stable/longterm release as well, as they are
+ often caused by problems in mainline changes that were backported -- in which
+ case the problem will have to be fixed in mainline first. Maybe it already was
+ fixed there and the fix is already in the process of being backported.
+
+* For regressions within a stable/longterm series it's furthermore crucial to
+ know if the issue is specific to that series or also happens in the mainline
+ kernel, as the report needs to be sent to different people:
+
+ * Regressions specific to a stable/longterm series are the stable team's
+ responsibility; mainline Linux developers might or might not care.
+
+ * Regressions also happening in mainline are something the regular Linux
+ developers and maintainers have to handle. The stable team does not care
+ and does not need to be involved in the report; they just should be told
+ to backport the fix once it's ready.
+
+ Your report might be ignored if you send it to the wrong party -- and even
+ when you get a reply there is a decent chance that developers tell you to
+ evaluate which of the two cases it is before they take a closer look.
+
+[:ref:`back to step-by-step guide <introlatestcheck_bissbs>`]
+
+.. _checkoutmaster_bisref:
+
+Check out the latest Linux codebase
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+ *Check out the latest Linux codebase.*
+ [:ref:`... <checkoutmaster_bissbs>`]
+
+In case you later want to recheck if an even newer codebase might fix the
+problem, remember to run the ``git fetch --shallow-exclude [...]`` command
+mentioned earlier again to update your local Git repository.
+
+[:ref:`back to step-by-step guide <checkoutmaster_bissbs>`]
+
+.. _build_bisref:
+
+Build your kernel
+~~~~~~~~~~~~~~~~~
+
+ *Build the image and the modules of your first kernel using the config file
+ you prepared.* [:ref:`... <build_bissbs>`]
+
+A lot can go wrong at this stage, but the instructions below will help you help
+yourself. Another subsection explains how to directly package your kernel up as
+deb, rpm or tar file.
+
+Dealing with build errors
+"""""""""""""""""""""""""
+
+When a build error occurs, it might be caused by some aspect of your machine's
+setup that often can be fixed quickly; other times though the problem lies in
+the code and can only be fixed by a developer. A close examination of the
+failure messages coupled with some research on the internet will often tell you
+which of the two it is. To perform such investigation, restart the build
+process like this::
+
+ make V=1
+
+The ``V=1`` activates verbose output, which might be needed to see the actual
+error. To make it easier to spot, this command also omits the ``-j $(nproc
+--all)`` used earlier to utilize every CPU core in the system for the job: that
+parallelism also results in some clutter when failures occur.
+
+After a few seconds the build process should run into the error again. Now try
+to find the most crucial line describing the problem. Then search the internet
+for the most important and non-generic section of that line (say 4 to 8 words);
+avoid or remove anything that looks remotely system-specific, like your username
+or local path names like ``/home/username/linux/``. First try your regular
+internet search engine with that string, afterwards search Linux kernel mailing
+lists via `lore.kernel.org/all/ <https://lore.kernel.org/all/>`_.
+
+This most of the time will find something that will explain what is wrong; quite
+often one of the hits will provide a solution for your problem, too. If you
+do not find anything that matches your problem, try again from a different angle
+by modifying your search terms or using another line from the error messages.
+
+In the end, most issues you run into have likely been encountered and
+reported by others already. That includes issues where the cause is not your
+system, but lies in the code. If you run into one of those, you might thus find
+a solution (e.g. a patch) or workaround for your issue, too.
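+
+Removing system-specific path names from an error line before searching for it
+can be automated. The following sketch strips a given literal prefix from
+every line it reads; the function name ``strip_local_prefix`` is hypothetical
+and not part of any kernel tooling.
+
```shell
# Hypothetical helper: read lines on stdin and print them with every literal
# occurrence of a prefix removed, e.g. the path to your local source tree.
# $1: prefix to remove (treated as a plain string, not as a pattern)
strip_local_prefix() {
    prefix="$1" awk '
        BEGIN { p = ENVIRON["prefix"] }
        {
            # Cut out the prefix wherever it shows up in the line.
            while (p != "" && (i = index($0, p)) > 0)
                $0 = substr($0, 1, i - 1) substr($0, i + length(p))
            print
        }'
}
```
+
+You could for example paste the crucial line into
+``strip_local_prefix "$HOME/linux/"`` and search for the result.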
+
+Package your kernel up
+""""""""""""""""""""""
+
+The step-by-step guide uses the default make targets (e.g. 'bzImage' and
+'modules' on x86) to build the image and the modules of your kernel, which later
+steps of the guide then install. You instead can also directly build everything
+and directly package it up by using one of the following targets:
+
+* ``make -j $(nproc --all) bindeb-pkg`` to generate a deb package
+
+* ``make -j $(nproc --all) binrpm-pkg`` to generate a rpm package
+
+* ``make -j $(nproc --all) tarbz2-pkg`` to generate a bz2 compressed tarball
+
+This is just a selection of available make targets for this purpose, see
+``make help`` for others. You can also use these targets after running
+``make -j $(nproc --all)``, as they will pick up everything already built.
+
+If you employ the targets to generate deb or rpm packages, ignore the
+step-by-step guide's instructions on installing and removing your kernel;
+instead install and remove the packages using the package utility for the format
+(e.g. dpkg and rpm) or a package management utility built on top of them (apt,
+aptitude, dnf/yum, zypper, ...). Be aware that the packages generated using
+these two make targets are designed to work on various distributions utilizing
+those formats; they thus will sometimes behave differently than your
+distribution's kernel packages.
+
+[:ref:`back to step-by-step guide <build_bissbs>`]
+
+.. _install_bisref:
+
+Put the kernel in place
+~~~~~~~~~~~~~~~~~~~~~~~
+
+ *Install the kernel you just built.* [:ref:`... <install_bissbs>`]
+
+What you need to do after executing the command in the step-by-step guide
+depends on the existence and the implementation of the ``/sbin/installkernel``
+executable on your distribution.
+
+If installkernel is found, the kernel's build system will delegate the actual
+installation of your kernel image to this executable, which then performs some
+or all of these tasks:
+
+* On almost all Linux distributions installkernel will store your kernel's
+ image in /boot/, usually as '/boot/vmlinuz-<kernelrelease_id>'; often it will
+ put a 'System.map-<kernelrelease_id>' alongside it.
+
+* On most distributions installkernel will then generate an 'initramfs'
+ (sometimes also called 'initrd'), which usually is stored as
+ '/boot/initramfs-<kernelrelease_id>.img' or
+ '/boot/initrd-<kernelrelease_id>'. Commodity distributions rely on this file
+ for booting, hence make sure to execute the make target 'modules_install'
+ first, as your distribution's initramfs generator otherwise will be unable
+ to find the modules that go into the image.
+
+* On some distributions installkernel will then add an entry for your kernel
+ to your bootloader's configuration.
+
+You have to take care of some or all of the tasks yourself if your
+distribution lacks an installkernel script or only handles part of them.
+Consult the distribution's documentation for details. If in doubt, install the
+kernel manually::
+
+ sudo install -m 0600 $(make -s image_name) /boot/vmlinuz-$(make -s kernelrelease)
+ sudo install -m 0600 System.map /boot/System.map-$(make -s kernelrelease)
+
+Now generate your initramfs using the tools your distribution provides for this
+process. Afterwards add your kernel to your bootloader configuration and reboot.
+
+[:ref:`back to step-by-step guide <install_bissbs>`]
+
+.. _storagespace_bisref:
+
+Storage requirements per kernel
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+ *Check how much storage space the kernel, its modules, and other related files
+ like the initramfs consume.* [:ref:`... <storagespace_bissbs>`]
+
+The kernels built during a bisection consume quite a bit of space in /boot/ and
+/lib/modules/, especially if you enabled debug symbols. That makes it easy to
+fill up volumes during a bisection -- and due to that even kernels which used to
+work earlier might fail to boot. To prevent that you will need to know how much
+space each installed kernel typically requires.
+
+Note, most of the time the pattern '/boot/*$(make -s kernelrelease)*' used in
+the guide will match all files needed to boot your kernel -- but neither the
+path nor the naming scheme are mandatory. On some distributions you thus will
+need to look in different places.
+
+[:ref:`back to step-by-step guide <storagespace_bissbs>`]
+
+.. _tainted_bisref:
+
+Check if your newly built kernel considers itself 'tainted'
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+ *Check if the kernel marked itself as 'tainted'.*
+ [:ref:`... <tainted_bissbs>`]
+
+Linux marks itself as tainted when something happens that potentially leads to
+follow-up errors that look totally unrelated. That is why developers might
+ignore or react scantly to reports from tainted kernels -- unless of course the
+kernel set the flag right when the reported bug occurred.
+
+That's why you want to check why a kernel is tainted as explained in
+Documentation/admin-guide/tainted-kernels.rst; doing so is also in your own
+interest, as your testing might be flawed otherwise.
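+
+A quick check whether the running kernel considers itself tainted can be
+sketched like this. The kernel exposes the taint state as a number in
+/proc/sys/kernel/tainted, where 0 means 'not tainted'; the function name
+``kernel_is_tainted`` and its optional file parameter are assumptions made for
+this example.
+
```shell
# Hypothetical helper: succeed if the taint value is non-zero.
# $1: file holding the taint value (defaults to /proc/sys/kernel/tainted;
#     the parameter only exists to allow trying the helper on sample data)
kernel_is_tainted() {
    taint_file="${1:-/proc/sys/kernel/tainted}"
    [ "$(cat "$taint_file")" -ne 0 ]
}
```
+
+This only tells you that a kernel is tainted, not why; for the latter decode
+the value as the document mentioned above describes.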
+
+[:ref:`back to step-by-step guide <tainted_bissbs>`]
+
+.. _recheckbroken_bisref:
+
+Check the kernel built from a recent mainline codebase
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+ *Verify if your bug occurs with the newly built kernel.*
+ [:ref:`... <recheckbroken_bissbs>`]
+
+There are a couple of reasons why your bug or regression might not show up with
+the kernel you built from the latest codebase. These are the most frequent:
+
+* The bug was fixed meanwhile.
+
+* What you suspected to be a regression was caused by a change in the build
+ configuration the provider of your kernel carried out.
+
+* Your problem might be a race condition that does not show up with your kernel;
+ the trimmed build configuration, a different setting for debug symbols, the
+ compiler used, and various other things can cause this.
+
+* In case you encountered the regression with a stable/longterm kernel it might
+ be a problem that is specific to that series; the next step in this guide will
+ check this.
+
+[:ref:`back to step-by-step guide <recheckbroken_bissbs>`]
+
+.. _recheckstablebroken_bisref:
+
+Check the kernel built from the latest stable/longterm codebase
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+ *Are you facing a regression within a stable/longterm release, but failed to
+ reproduce it with the kernel you just built using the latest mainline sources?
+ Then check if the latest codebase for the particular series might already fix
+ the problem.* [:ref:`... <recheckstablebroken_bissbs>`]
+
+If this kernel does not show the regression either, there most likely is no need
+for a bisection.
+
+[:ref:`back to step-by-step guide <recheckstablebroken_bissbs>`]
+
+.. _introworkingcheck_bisref:
+
+Ensure the 'good' version is really working well
+------------------------------------------------
+
+ *Check if the kernels you build work fine.*
+ [:ref:`... <introworkingcheck_bissbs>`]
+
+This section will reestablish a known working base. Skipping it might be
+appealing, but is usually a bad idea, as it does something important:
+
+It will ensure the .config file you prepared earlier actually works as expected.
+That is in your own interest, as trimming the configuration is not foolproof --
+and you might be building and testing ten or more kernels for nothing before
+starting to suspect something might be wrong with the build configuration.
+
+That alone is reason enough to spend the time on this, but not the only reason.
+
+Many readers of this guide normally run kernels that are patched, use add-on
+modules, or both. Those kernels thus are not considered 'vanilla' -- therefore
+it's possible that the thing that regressed might never have worked in vanilla
+builds of the 'good' version in the first place.
+
+There is a third reason for those that noticed a regression between
+stable/longterm kernels of different series (e.g. 6.0.13..6.1.5): it will
+ensure the kernel version you assumed to be 'good' earlier in the process (e.g.
+6.0) actually is working.
+
+[:ref:`back to step-by-step guide <introworkingcheck_bissbs>`]
+
+.. _recheckworking_bisref:
+
+Build your own version of the 'good' kernel
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+ *Build your own variant of the working kernel and check if the feature that
+ regressed works as expected with it.* [:ref:`... <recheckworking_bissbs>`]
+
+In case the feature that broke with newer kernels does not work with your first
+self-built kernel, find and resolve the cause before moving on. There are a
+multitude of reasons why this might happen. Some ideas where to look:
+
+* Check the taint status and the output of ``dmesg``, maybe something unrelated
+ went wrong.
+
+* Maybe localmodconfig did something odd and disabled the module required to
+ test the feature? Then you might want to recreate a .config file based on the
+ one from the last working kernel and skip trimming it down; manually disabling
+ some features in the .config might work as well to reduce the build time.
+
+* Maybe it's not a kernel regression, but something caused by some fluke,
+ a broken initramfs (also known as initrd), new firmware files, or an updated
+ userland software?
+
+* Maybe it was a feature added to your distributor's kernel which vanilla Linux
+ at that point never supported?
+
+Note, if you found and fixed problems with the .config file, you want to use it
+to build another kernel from the latest codebase, as your earlier tests with
+mainline and the latest version from an affected stable/longterm series were
+most likely flawed.
+
+[:ref:`back to step-by-step guide <recheckworking_bissbs>`]
+
+Perform a bisection and validate the result
+-------------------------------------------
+
+ *With all the preparations and precaution builds taken care of, you are now
+ ready to begin the bisection.* [:ref:`... <introbisect_bissbs>`]
+
+The steps in this segment perform and validate the bisection.
+
+[:ref:`back to step-by-step guide <introbisect_bissbs>`]
+
+.. _bisectstart_bisref:
+
+Start the bisection
+~~~~~~~~~~~~~~~~~~~
+
+ *Start the bisection and tell Git about the versions earlier established as
+ 'good' and 'bad'.* [:ref:`... <bisectstart_bissbs>`]
+
+This will start the bisection process; the last of the commands will make Git
+check out a commit round about half-way between the 'good' and the 'bad' changes
+for you to test.
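+
+Because each test roughly halves the range of suspect changes, the number of
+kernels to build grows only with the logarithm of the number of commits
+between 'good' and 'bad'. The sketch below computes that estimate; the
+function name ``bisect_steps`` is made up for this example, and the figure Git
+prints itself might differ slightly, for example when commits get skipped.
+
```shell
# Hypothetical helper: estimate how many test rounds a bisection needs when
# every round halves the number of remaining candidate commits.
# $1: number of commits between the 'good' and the 'bad' version
bisect_steps() {
    commits="$1"
    steps=0
    while [ "$commits" -gt 1 ]; do
        # Round up, as the bigger half might be the one that remains.
        commits=$(( (commits + 1) / 2 ))
        steps=$(( steps + 1 ))
    done
    printf '%s\n' "$steps"
}
```
+
+With this estimate a range of 15000 commits needs about 14 rounds.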
+
+[:ref:`back to step-by-step guide <bisectstart_bissbs>`]
+
+.. _bisectbuild_bisref:
+
+Build a kernel from the bisection point
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+ *Build, install, and boot a kernel from the code Git checked out using the
+ same commands you used earlier.* [:ref:`... <bisectbuild_bissbs>`]
+
+There are two things worth noting here:
+
+* Occasionally building the kernel will fail or it might not boot due to some
+ problem in the code at the bisection point. In that case run this command::
+
+ git bisect skip
+
+ Git will then check out another commit nearby which with a bit of luck should
+ work better. Afterwards restart executing this step.
+
+* The slightly odd looking version identifiers you see during bisections come
+ up because the Linux kernel subsystems prepare their changes for a new
+ mainline release (say 6.2) before its predecessor (e.g. 6.1) is finished.
+ They thus base them on a somewhat earlier point like 6.1-rc1 or even 6.0 --
+ and those changes then get merged for 6.2 without being rebased or squashed
+ once 6.1 is out.
+
+[:ref:`back to step-by-step guide <bisectbuild_bissbs>`]
+
+.. _bisecttest_bisref:
+
+Bisection checkpoint
+~~~~~~~~~~~~~~~~~~~~
+
+ *Check if the feature that regressed works in the kernel you just built.*
+ [:ref:`... <bisecttest_bissbs>`]
+
+Ensure what you tell Git is accurate: getting it wrong just one time will bring
+the rest of the bisection totally off course, hence all testing after that point
+will be for nothing.
+
+[:ref:`back to step-by-step guide <bisecttest_bissbs>`]
+
+.. _bisectlog_bisref:
+
+Put the bisection log away
+~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+ *Store Git's bisection log and the current .config file in a safe place.*
+ [:ref:`... <bisectlog_bissbs>`]
+
+As indicated above: declaring just one kernel wrongly as 'good' or 'bad' will
+render the end result of a bisection useless. In that case you'd normally have
+to restart the bisection from scratch. The log can prevent that, as it might
+allow someone to point out where a bisection likely went sideways -- and then
+instead of testing ten or more kernels you might only have to build a few to
+resolve things.
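+
+When reviewing a stored log, a quick tally of the verdicts recorded so far can
+help to spot oddities. The sketch below counts the ``git bisect good``,
+``git bisect bad``, and ``git bisect skip`` lines in a file written with
+``git bisect log``; the function name ``summarize_bisect_log`` is
+hypothetical, and versions handed directly to ``git bisect start`` are not
+counted.
+
```shell
# Hypothetical helper: count the verdicts recorded in a saved bisection log.
# $1: file created earlier with 'git bisect log > <file>'
summarize_bisect_log() {
    awk '
        # Only command lines count; lines starting with "#" are comments.
        $1 == "git" && $2 == "bisect" &&
        ($3 == "good" || $3 == "bad" || $3 == "skip") { count[$3]++ }
        END {
            printf "good=%d bad=%d skip=%d\n",
                   count["good"], count["bad"], count["skip"]
        }
    ' "$1"
}
```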
+
+The .config file is put aside, as there is a decent chance that developers might
+ask for it after you report the regression.
+
+[:ref:`back to step-by-step guide <bisectlog_bissbs>`]
+
+.. _revert_bisref:
+
+Try reverting the culprit
+~~~~~~~~~~~~~~~~~~~~~~~~~
+
+ *Try reverting the culprit on top of the latest codebase to see if this fixes
+ your regression.* [:ref:`... <revert_bissbs>`]
+
+This is an optional step, but whenever possible one you should try: there is a
+decent chance that developers will ask you to perform this step when you bring
+the bisection result up. So give it a try, you are in the flow already, building
+one more kernel shouldn't be a big deal at this point.
+
+The step-by-step guide covers everything relevant already except one slightly
+rare thing: did you bisect a regression that also happened with mainline using
+a stable/longterm series, but Git failed to revert the commit in mainline? Then
+try to revert the culprit in the affected stable/longterm series -- and if that
+succeeds, test that kernel version instead.
+
+[:ref:`back to step-by-step guide <revert_bissbs>`]
+
+Cleanup steps during and after following this guide
+---------------------------------------------------
+
+ *During and after following this guide you might want or need to remove some
+ of the kernels you installed.* [:ref:`... <introclosure_bissbs>`]
+
+The steps in this section describe clean-up procedures.
+
+[:ref:`back to step-by-step guide <introclosure_bissbs>`]
+
+.. _makeroom_bisref:
+
+Cleaning up during the bisection
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+ *To remove one of the kernels you installed, look up its 'kernelrelease'
+ identifier.* [:ref:`... <makeroom_bissbs>`]
+
+The kernels you install during this process are easy to remove later, as their
+parts are only stored in two places and clearly identifiable. You thus do not
+need to worry about messing up your machine when you install a kernel manually
+(and thus bypass your distribution's packaging system).
+
+One of the two places is a directory in /lib/modules/, which holds the modules
+for each installed kernel. This directory is named after the kernel's release
+identifier; hence, to remove all modules for one of the kernels you built,
+simply remove its modules directory in /lib/modules/.
+
+The other place is /boot/, where typically two up to five files will be placed
+during installation of a kernel. All of them usually contain the release name in
+their file name, but how many files and their exact names depend somewhat on
+your distribution's installkernel executable and its initramfs generator. On
+some distributions the ``kernel-install remove...`` command mentioned in the
+step-by-step guide will delete all of these files for you while also removing
+the menu entry for the kernel from your bootloader configuration. On others you
+have to take care of these two tasks yourself. The following command should
+interactively remove the three main files of a kernel with the release name
+'6.0-rc1-local-gcafec0cacaca0'::
+
+ rm -i /boot/{System.map,vmlinuz,initr}-6.0-rc1-local-gcafec0cacaca0
+
+Afterwards check for other files in /boot/ that have
+'6.0-rc1-local-gcafec0cacaca0' in their name and consider deleting them as well.
+Now remove the boot entry for the kernel from your bootloader's configuration;
+the steps to do that vary quite a bit between Linux distributions.
+
+Note, be careful with wildcards like '*' when deleting files or directories
+for kernels manually: you might accidentally remove files of a 6.0.13 kernel
+when all you want is to remove 6.0 or 6.0.1.
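+
+To see which files belong to exactly one release without risking such a
+mix-up, you can list them before deleting anything. The sketch below only
+matches names that end in the release name or in the release name followed by
+'.img'; the function name ``list_kernel_files`` and its optional directory
+parameter are assumptions made for this example, and distributions using
+other naming schemes will need different patterns.
+
```shell
# Hypothetical helper: print the files in a boot directory that belong to
# exactly the given kernel release (and not to one whose name merely starts
# with the same characters).
# $1: kernel release name, e.g. 6.0-rc1-local-gcafec0cacaca0
# $2: directory to look in (defaults to /boot)
list_kernel_files() {
    release="$1"
    bootdir="${2:-/boot}"
    for f in "$bootdir"/*-"$release" "$bootdir"/*-"$release".img; do
        # Unmatched patterns stay as-is, so only print what really exists.
        [ -e "$f" ] && printf '%s\n' "$f"
    done
    return 0
}
```
+
+Review the list, then remove the files it shows one by one with ``rm -i``.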
+
+[:ref:`back to step-by-step guide <makeroom_bissbs>`]
+
+.. _finishingtouch_bisref:
+
+Cleaning up after the bisection
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+ *Once you have finished the bisection, do not immediately remove anything
+ you set up, as you might need a few things again.*
+ [:ref:`... <finishingtouch_bissbs>`]
+
+When you are really short of storage space, removing the kernels as described
+in the step-by-step guide might not free as much space as you would like. In
+that case consider running ``rm -rf ~/linux/*`` as well now. This will remove
+the build artifacts and the Linux sources, but will leave the Git repository
+(~/linux/.git/) behind -- a simple ``git reset --hard`` thus will bring the
+sources back.
+
+Removing the repository as well would likely be unwise at this point: there
+is a decent chance developers will ask you to build another kernel to
+perform additional tests -- like testing a debug patch or a proposed fix.
+Details on how to perform those can be found in the section :ref:`Optional
+tasks: test reverts, patches, or later versions <introoptional_bissbs>`.
+
+Additional tests are also the reason why you want to keep the
+~/kernel-config-working file around for a few weeks.
+
+[:ref:`back to step-by-step guide <finishingtouch_bissbs>`]
+
+.. _introoptional_bisref:
+
+Test reverts, patches, or later versions
+----------------------------------------
+
+ *While or after reporting a bug, you might want or potentially will be asked
+ to test reverts, patches, proposed fixes, or other versions.*
+ [:ref:`... <introoptional_bissbs>`]
+
+All the commands used in this section should be pretty straightforward, so
+there is not much to add except one thing: when setting a kernel tag as
+instructed, ensure it is not much longer than the one used in the example, as
+problems will arise if the kernelrelease identifier exceeds 63 characters.
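+
+The length limit is easy to check before starting a build. The following is a
+minimal sketch; the function name ``kernelrelease_fits`` is made up for this
+example.
+
```shell
# Hypothetical helper: succeed if a kernelrelease identifier is at most
# 63 characters long.
# $1: the identifier, e.g. the output of 'make -s kernelrelease'
kernelrelease_fits() {
    [ "${#1}" -le 63 ]
}
```
+
+In the build directory you could run
+``kernelrelease_fits "$(make -s kernelrelease)" || echo 'tag too long'``.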
+
+[:ref:`back to step-by-step guide <introoptional_bissbs>`]
+
+
+Additional information
+======================
+
+.. _buildhost_bis:
+
+Build kernels on a different machine
+------------------------------------
+
+To compile kernels on another system, slightly alter the step-by-step guide's
+instructions:
+
+* Start following the guide on the machine where you want to install and test
+ the kernels later.
+
+* After executing ':ref:`Boot into the working kernel and briefly use the
+ apparently broken feature <bootworking_bissbs>`', save the list of loaded
+ modules to a file using ``lsmod > ~/test-machine-lsmod``. Then locate the
+ build configuration for the running kernel (see ':ref:`Start defining the
+ build configuration for your kernel <oldconfig_bisref>`' for hints on where
+ to find it) and store it as '~/test-machine-config-working'. Transfer both
+ files to the home directory of your build host.
+
+* Continue the guide on the build host (e.g. with ':ref:`Ensure to have enough
+ free space for building [...] <diskspace_bissbs>`').
+
+* When you reach ':ref:`Start preparing a kernel build configuration[...]
+ <oldconfig_bissbs>`': before running ``make olddefconfig`` for the first time,
+ execute the following command to base your configuration on the one from the
+ test machine's 'working' kernel::
+
+ cp ~/test-machine-config-working ~/linux/.config
+
+* During the next step to ':ref:`disable any apparently superfluous kernel
+ modules <localmodconfig_bissbs>`' use the following command instead::
+
+ yes '' | make LSMOD=~/test-machine-lsmod localmodconfig
+
+* Continue the guide, but ignore the instructions outlining how to compile,
+ install, and reboot into a kernel every time they come up. Instead build
+ like this::
+
+ cp ~/kernel-config-working .config
+ make olddefconfig &&
+ make -j $(nproc --all) targz-pkg
+
+ This will generate a gzipped tar file whose name is printed in the last
+ line shown; for example, a kernel with the kernelrelease identifier
+ '6.0.0-rc1-local-g928a87efa423' built for x86 machines usually will
+ be stored as '~/linux/linux-6.0.0-rc1-local-g928a87efa423-x86.tar.gz'.
+
+ Copy that file to your test machine's home directory.
+
+* Switch to the test machine to check if you have enough space to hold another
+ kernel. Then extract the file you transferred::
+
+ sudo tar -xvzf ~/linux-6.0.0-rc1-local-g928a87efa423-x86.tar.gz -C /
+
+ Afterwards :ref:`generate the initramfs and add the kernel to your boot
+ loader's configuration <install_bisref>`; on some distributions the following
+ command will take care of both these tasks::
+
+ sudo /sbin/installkernel 6.0.0-rc1-local-g928a87efa423 /boot/vmlinuz-6.0.0-rc1-local-g928a87efa423
+
+ Now reboot and ensure you started the intended kernel.
+
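One way to confirm after the reboot that the intended kernel is running is to compare the output of ``uname -r`` with the kernelrelease identifier of the kernel you just installed. The sketch below assumes nothing beyond that; ``verify_running_kernel`` is a hypothetical helper, and its optional second argument only exists so the comparison can be tried without rebooting.

```shell
#!/bin/sh
# Sketch only: compare the running kernel's release string with the
# expected one. verify_running_kernel is a hypothetical helper.
verify_running_kernel() {
    expected=$1
    # Fall back to the actually running kernel when no second argument is given
    running=${2:-$(uname -r)}
    if [ "$running" = "$expected" ]; then
        echo "running the intended kernel: $running"
    else
        echo "expected $expected, but running $running"
        return 1
    fi
}

# Trivial self-check: the running kernel always matches itself
verify_running_kernel "$(uname -r)"
```

On the test machine you would pass the identifier of the freshly installed kernel instead, for example ``verify_running_kernel 6.0.0-rc1-local-g928a87efa423``.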
+This approach even works when building for another architecture: just install
+cross-compilers and add the appropriate parameters to every invocation of make
+(e.g. ``make ARCH=arm64 CROSS_COMPILE=aarch64-linux-gnu- [...]``).
+
+Additional reading material
+---------------------------
+
+* The `man page for 'git bisect' <https://git-scm.com/docs/git-bisect>`_ and
+ `fighting regressions with 'git bisect' <https://git-scm.com/docs/git-bisect-lk2009.html>`_
+ in the Git documentation.
+* `Working with git bisect <https://nathanchance.dev/posts/working-with-git-bisect/>`_
+ from kernel developer Nathan Chancellor.
+* `Using Git bisect to figure out when brokenness was introduced <http://webchick.net/node/99>`_.
+* `Fully automated bisecting with 'git bisect run' <https://lwn.net/Articles/317154>`_.
+
+..
+ end-of-content
+..
+ This document is maintained by Thorsten Leemhuis <linux@leemhuis.info>. If
+ you spot a typo or small mistake, feel free to let him know directly and
+ he'll fix it. You are free to do the same in a mostly informal way if you
+ want to contribute changes to the text -- but for copyright reasons please CC
+ linux-doc@vger.kernel.org and 'sign-off' your contribution as
+ Documentation/process/submitting-patches.rst explains in the section 'Sign
+ your work - the Developer's Certificate of Origin'.
+..
+ This text is available under GPL-2.0+ or CC-BY-4.0, as stated at the top
+ of the file. If you want to distribute this text under CC-BY-4.0 only,
+ please use 'The Linux kernel development community' for author attribution
+ and link this as source:
+ https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/plain/Documentation/admin-guide/verify-bugs-and-bisect-regressions.rst
+
+..
+ Note: Only the content of this RST file as found in the Linux kernel sources
+ is available under CC-BY-4.0, as versions of this text that were processed
+ (for example by the kernel's build system) might contain content taken from
+ files which use a more restrictive license.
diff --git a/Documentation/admin-guide/workload-tracing.rst b/Documentation/admin-guide/workload-tracing.rst
index b2e254ec8ee8..6be38c1b9c5b 100644
--- a/Documentation/admin-guide/workload-tracing.rst
+++ b/Documentation/admin-guide/workload-tracing.rst
@@ -83,7 +83,7 @@ scripts/ver_linux is a good way to check if your system already has
the necessary tools::
sudo apt-get build-essentials flex bison yacc
- sudo apt install libelf-dev systemtap-sdt-dev libaudit-dev libslang2-dev libperl-dev libdw-dev
+ sudo apt install libelf-dev systemtap-sdt-dev libslang2-dev libperl-dev libdw-dev
cscope is a good tool to browse kernel sources. Let's install it now::