diff options
Diffstat (limited to 'Documentation/admin-guide')
133 files changed, 7074 insertions, 1802 deletions
diff --git a/Documentation/admin-guide/LSM/apparmor.rst b/Documentation/admin-guide/LSM/apparmor.rst index 6cf81bbd7ce8..47939ee89d74 100644 --- a/Documentation/admin-guide/LSM/apparmor.rst +++ b/Documentation/admin-guide/LSM/apparmor.rst @@ -18,8 +18,11 @@ set ``CONFIG_SECURITY_APPARMOR=y`` If AppArmor should be selected as the default security module then set:: - CONFIG_DEFAULT_SECURITY="apparmor" - CONFIG_SECURITY_APPARMOR_BOOTPARAM_VALUE=1 + CONFIG_DEFAULT_SECURITY_APPARMOR=y + +The CONFIG_LSM parameter manages the order and selection of LSMs. +Specify apparmor as the first "major" module (e.g. AppArmor, SELinux, Smack) +in the list. Build the kernel diff --git a/Documentation/admin-guide/LSM/index.rst b/Documentation/admin-guide/LSM/index.rst index a6ba95fbaa9f..b44ef68f6e4d 100644 --- a/Documentation/admin-guide/LSM/index.rst +++ b/Documentation/admin-guide/LSM/index.rst @@ -47,3 +47,5 @@ subdirectories. tomoyo Yama SafeSetID + ipe + landlock diff --git a/Documentation/admin-guide/LSM/ipe.rst b/Documentation/admin-guide/LSM/ipe.rst new file mode 100644 index 000000000000..dc7088451f9d --- /dev/null +++ b/Documentation/admin-guide/LSM/ipe.rst @@ -0,0 +1,824 @@ +.. SPDX-License-Identifier: GPL-2.0 + +Integrity Policy Enforcement (IPE) +================================== + +.. NOTE:: + + This is the documentation for admins, system builders, or individuals + attempting to use IPE. If you're looking for more developer-focused + documentation about IPE please see :doc:`the design docs </security/ipe>`. + +Overview +-------- + +Integrity Policy Enforcement (IPE) is a Linux Security Module that takes a +complementary approach to access control. Unlike traditional access control +mechanisms that rely on labels and paths for decision-making, IPE focuses +on the immutable security properties inherent to system components. These +properties are fundamental attributes or features of a system component +that cannot be altered, ensuring a consistent and reliable basis for +security decisions. + +To elaborate, in the context of IPE, system components primarily refer to +files or the devices these files reside on. However, this is just a +starting point. The concept of system components is flexible and can be +extended to include new elements as the system evolves. The immutable +properties include the origin of a file, which remains constant and +unchangeable over time. For example, IPE policies can be crafted to trust +files originating from the initramfs. Since initramfs is typically verified +by the bootloader, its files are deemed trustworthy; "file is from +initramfs" becomes an immutable property under IPE's consideration. + +The immutable property concept extends to the security features enabled on +a file's origin, such as dm-verity or fs-verity, which provide a layer of +integrity and trust. For example, IPE allows the definition of policies +that trust files from a dm-verity protected device. dm-verity ensures the +integrity of an entire device by providing a verifiable and immutable state +of its contents. Similarly, fs-verity offers filesystem-level integrity +checks, allowing IPE to enforce policies that trust files protected by +fs-verity. These two features cannot be turned off once established, so +they are considered immutable properties. These examples demonstrate how +IPE leverages immutable properties, such as a file's origin and its +integrity protection mechanisms, to make access control decisions. + +For the IPE policy, specifically, it grants the ability to enforce +stringent access controls by assessing security properties against +reference values defined within the policy. This assessment can be based on +the existence of a security property (e.g., verifying if a file originates +from initramfs) or evaluating the internal state of an immutable security +property. The latter includes checking the roothash of a dm-verity +protected device, determining whether dm-verity possesses a valid +signature, assessing the digest of a fs-verity protected file, or +determining whether fs-verity possesses a valid built-in signature. This +nuanced approach to policy enforcement enables a highly secure and +customizable system defense mechanism, tailored to specific security +requirements and trust models. + +To enable IPE, ensure that ``CONFIG_SECURITY_IPE`` (under +:menuselection:`Security -> Integrity Policy Enforcement (IPE)`) config +option is enabled. + +Use Cases +--------- + +IPE works best in fixed-function devices: devices in which their purpose +is clearly defined and not supposed to be changed (e.g. network firewall +device in a data center, an IoT device, etcetera), where all software and +configuration is built and provisioned by the system owner. + +IPE is a long-way off for use in general-purpose computing: the Linux +community as a whole tends to follow a decentralized trust model (known as +the web of trust), which IPE has no support for it yet. Instead, IPE +supports PKI (public key infrastructure), which generally designates a +set of trusted entities that provide a measure of absolute trust. + +Additionally, while most packages are signed today, the files inside +the packages (for instance, the executables), tend to be unsigned. This +makes it difficult to utilize IPE in systems where a package manager is +expected to be functional, without major changes to the package manager +and ecosystem behind it. + +The digest_cache LSM [#digest_cache_lsm]_ is a system that when combined with IPE, +could be used to enable and support general-purpose computing use cases. + +Known Limitations +----------------- + +IPE cannot verify the integrity of anonymous executable memory, such as +the trampolines created by gcc closures and libffi (<3.4.2), or JIT'd code. +Unfortunately, as this is dynamically generated code, there is no way +for IPE to ensure the integrity of this code to form a trust basis. + +IPE cannot verify the integrity of programs written in interpreted +languages when these scripts are invoked by passing these program files +to the interpreter. This is because the way interpreters execute these +files; the scripts themselves are not evaluated as executable code +through one of IPE's hooks, but they are merely text files that are read +(as opposed to compiled executables) [#interpreters]_. + +Threat Model +------------ + +IPE specifically targets the risk of tampering with user-space executable +code after the kernel has initially booted, including the kernel modules +loaded from userspace via ``modprobe`` or ``insmod``. + +To illustrate, consider a scenario where an untrusted binary, possibly +malicious, is downloaded along with all necessary dependencies, including a +loader and libc. The primary function of IPE in this context is to prevent +the execution of such binaries and their dependencies. + +IPE achieves this by verifying the integrity and authenticity of all +executable code before allowing them to run. It conducts a thorough +check to ensure that the code's integrity is intact and that they match an +authorized reference value (digest, signature, etc) as per the defined +policy. If a binary does not pass this verification process, either +because its integrity has been compromised or it does not meet the +authorization criteria, IPE will deny its execution. Additionally, IPE +generates audit logs which may be utilized to detect and analyze failures +resulting from policy violation. + +Tampering threat scenarios include modification or replacement of +executable code by a range of actors including: + +- Actors with physical access to the hardware +- Actors with local network access to the system +- Actors with access to the deployment system +- Compromised internal systems under external control +- Malicious end users of the system +- Compromised end users of the system +- Remote (external) compromise of the system + +IPE does not mitigate threats arising from malicious but authorized +developers (with access to a signing certificate), or compromised +developer tools used by them (i.e. return-oriented programming attacks). +Additionally, IPE draws hard security boundary between userspace and +kernelspace. As a result, kernel-level exploits are considered outside +the scope of IPE and mitigation is left to other mechanisms. + +Policy +------ + +IPE policy is a plain-text [#devdoc]_ policy composed of multiple statements +over several lines. There is one required line, at the top of the +policy, indicating the policy name, and the policy version, for +instance:: + + policy_name=Ex_Policy policy_version=0.0.0 + +The policy name is a unique key identifying this policy in a human +readable name. This is used to create nodes under securityfs as well as +uniquely identify policies to deploy new policies vs update existing +policies. + +The policy version indicates the current version of the policy (NOT the +policy syntax version). This is used to prevent rollback of policy to +potentially insecure previous versions of the policy. + +The next portion of IPE policy are rules. Rules are formed by key=value +pairs, known as properties. IPE rules require two properties: ``action``, +which determines what IPE does when it encounters a match against the +rule, and ``op``, which determines when the rule should be evaluated. +The ordering is significant, a rule must start with ``op``, and end with +``action``. Thus, a minimal rule is:: + + op=EXECUTE action=ALLOW + +This example will allow any execution. Additional properties are used to +assess immutable security properties about the files being evaluated. +These properties are intended to be descriptions of systems within the +kernel that can provide a measure of integrity verification, such that IPE +can determine the trust of the resource based on the value of the property. + +Rules are evaluated top-to-bottom. As a result, any revocation rules, +or denies should be placed early in the file to ensure that these rules +are evaluated before a rule with ``action=ALLOW``. + +IPE policy supports comments. The character '#' will function as a +comment, ignoring all characters to the right of '#' until the newline. + +The default behavior of IPE evaluations can also be expressed in policy, +through the ``DEFAULT`` statement. This can be done at a global level, +or a per-operation level:: + + # Global + DEFAULT action=ALLOW + + # Operation Specific + DEFAULT op=EXECUTE action=ALLOW + +A default must be set for all known operations in IPE. If you want to +preserve older policies being compatible with newer kernels that can introduce +new operations, set a global default of ``ALLOW``, then override the +defaults on a per-operation basis (as above). + +With configurable policy-based LSMs, there's several issues with +enforcing the configurable policies at startup, around reading and +parsing the policy: + +1. The kernel *should* not read files from userspace, so directly reading + the policy file is prohibited. +2. The kernel command line has a character limit, and one kernel module + should not reserve the entire character limit for its own + configuration. +3. There are various boot loaders in the kernel ecosystem, so handing + off a memory block would be costly to maintain. + +As a result, IPE has addressed this problem through a concept of a "boot +policy". A boot policy is a minimal policy which is compiled into the +kernel. This policy is intended to get the system to a state where +userspace is set up and ready to receive commands, at which point a more +complex policy can be deployed via securityfs. The boot policy can be +specified via ``SECURITY_IPE_BOOT_POLICY`` config option, which accepts +a path to a plain-text version of the IPE policy to apply. This policy +will be compiled into the kernel. If not specified, IPE will be disabled +until a policy is deployed and activated through securityfs. + +Deploying Policies +~~~~~~~~~~~~~~~~~~ + +Policies can be deployed from userspace through securityfs. These policies +are signed through the PKCS#7 message format to enforce some level of +authorization of the policies (prohibiting an attacker from gaining +unconstrained root, and deploying an "allow all" policy). These +policies must be signed by a certificate that chains to the +``SYSTEM_TRUSTED_KEYRING``, or to the secondary and/or platform keyrings if +``CONFIG_IPE_POLICY_SIG_SECONDARY_KEYRING`` and/or +``CONFIG_IPE_POLICY_SIG_PLATFORM_KEYRING`` are enabled, respectively. +With openssl, the policy can be signed by:: + + openssl smime -sign \ + -in "$MY_POLICY" \ + -signer "$MY_CERTIFICATE" \ + -inkey "$MY_PRIVATE_KEY" \ + -noattr \ + -nodetach \ + -nosmimecap \ + -outform der \ + -out "$MY_POLICY.p7b" + +Deploying the policies is done through securityfs, through the +``new_policy`` node. To deploy a policy, simply cat the file into the +securityfs node:: + + cat "$MY_POLICY.p7b" > /sys/kernel/security/ipe/new_policy + +Upon success, this will create one subdirectory under +``/sys/kernel/security/ipe/policies/``. The subdirectory will be the +``policy_name`` field of the policy deployed, so for the example above, +the directory will be ``/sys/kernel/security/ipe/policies/Ex_Policy``. +Within this directory, there will be seven files: ``pkcs7``, ``policy``, +``name``, ``version``, ``active``, ``update``, and ``delete``. + +The ``pkcs7`` file is read-only. Reading it returns the raw PKCS#7 data +that was provided to the kernel, representing the policy. If the policy being +read is the boot policy, this will return ``ENOENT``, as it is not signed. + +The ``policy`` file is read only. Reading it returns the PKCS#7 inner +content of the policy, which will be the plain text policy. + +The ``active`` file is used to set a policy as the currently active policy. +This file is rw, and accepts a value of ``"1"`` to set the policy as active. +Since only a single policy can be active at one time, all other policies +will be marked inactive. The policy being marked active must have a policy +version greater or equal to the currently-running version. + +The ``update`` file is used to update a policy that is already present +in the kernel. This file is write-only and accepts a PKCS#7 signed +policy. Two checks will always be performed on this policy: First, the +``policy_names`` must match with the updated version and the existing +version. Second the updated policy must have a policy version greater than +the currently-running version. This is to prevent rollback attacks. + +The ``delete`` file is used to remove a policy that is no longer needed. +This file is write-only and accepts a value of ``1`` to delete the policy. +On deletion, the securityfs node representing the policy will be removed. +However, delete the current active policy is not allowed and will return +an operation not permitted error. + +Similarly, writing to both ``update`` and ``new_policy`` could result in +bad message(policy syntax error) or file exists error. The latter error happens +when trying to deploy a policy with a ``policy_name`` while the kernel already +has a deployed policy with the same ``policy_name``. + +Deploying a policy will *not* cause IPE to start enforcing the policy. IPE will +only enforce the policy marked active. Note that only one policy can be active +at a time. + +Once deployment is successful, the policy can be activated, by writing file +``/sys/kernel/security/ipe/policies/$policy_name/active``. +For example, the ``Ex_Policy`` can be activated by:: + + echo 1 > "/sys/kernel/security/ipe/policies/Ex_Policy/active" + +From above point on, ``Ex_Policy`` is now the enforced policy on the +system. + +IPE also provides a way to delete policies. This can be done via the +``delete`` securityfs node, +``/sys/kernel/security/ipe/policies/$policy_name/delete``. +Writing ``1`` to that file deletes the policy:: + + echo 1 > "/sys/kernel/security/ipe/policies/$policy_name/delete" + +There is only one requirement to delete a policy: the policy being deleted +must be inactive. + +.. NOTE:: + + If a traditional MAC system is enabled (SELinux, apparmor, smack), all + writes to ipe's securityfs nodes require ``CAP_MAC_ADMIN``. + +Modes +~~~~~ + +IPE supports two modes of operation: permissive (similar to SELinux's +permissive mode) and enforced. In permissive mode, all events are +checked and policy violations are logged, but the policy is not really +enforced. This allows users to test policies before enforcing them. + +The default mode is enforce, and can be changed via the kernel command +line parameter ``ipe.enforce=(0|1)``, or the securityfs node +``/sys/kernel/security/ipe/enforce``. + +.. NOTE:: + + If a traditional MAC system is enabled (SELinux, apparmor, smack, etcetera), + all writes to ipe's securityfs nodes require ``CAP_MAC_ADMIN``. + +Audit Events +~~~~~~~~~~~~ + +1420 AUDIT_IPE_ACCESS +^^^^^^^^^^^^^^^^^^^^^ +Event Examples:: + + type=1420 audit(1653364370.067:61): ipe_op=EXECUTE ipe_hook=MMAP enforcing=1 pid=2241 comm="ld-linux.so" path="/deny/lib/libc.so.6" dev="sda2" ino=14549020 rule="DEFAULT action=DENY" + type=1300 audit(1653364370.067:61): SYSCALL arch=c000003e syscall=9 success=no exit=-13 a0=7f1105a28000 a1=195000 a2=5 a3=812 items=0 ppid=2219 pid=2241 auid=0 uid=0 gid=0 euid=0 suid=0 fsuid=0 egid=0 sgid=0 fsgid=0 tty=pts0 ses=2 comm="ld-linux.so" exe="/tmp/ipe-test/lib/ld-linux.so" subj=unconfined key=(null) + type=1327 audit(1653364370.067:61): 707974686F6E3300746573742F6D61696E2E7079002D6E00 + + type=1420 audit(1653364735.161:64): ipe_op=EXECUTE ipe_hook=MMAP enforcing=1 pid=2472 comm="mmap_test" path=? dev=? ino=? rule="DEFAULT action=DENY" + type=1300 audit(1653364735.161:64): SYSCALL arch=c000003e syscall=9 success=no exit=-13 a0=0 a1=1000 a2=4 a3=21 items=0 ppid=2219 pid=2472 auid=0 uid=0 gid=0 euid=0 suid=0 fsuid=0 egid=0 sgid=0 fsgid=0 tty=pts0 ses=2 comm="mmap_test" exe="/root/overlake_test/upstream_test/vol_fsverity/bin/mmap_test" subj=unconfined key=(null) + type=1327 audit(1653364735.161:64): 707974686F6E3300746573742F6D61696E2E7079002D6E00 + +This event indicates that IPE made an access control decision; the IPE +specific record (1420) is always emitted in conjunction with a +``AUDITSYSCALL`` record. + +Determining whether IPE is in permissive or enforced mode can be derived +from ``success`` property and exit code of the ``AUDITSYSCALL`` record. + + +Field descriptions: + ++-----------+------------+-----------+---------------------------------------------------------------------------------+ +| Field | Value Type | Optional? | Description of Value | ++===========+============+===========+=================================================================================+ +| ipe_op | string | No | The IPE operation name associated with the log | ++-----------+------------+-----------+---------------------------------------------------------------------------------+ +| ipe_hook | string | No | The name of the LSM hook that triggered the IPE event | ++-----------+------------+-----------+---------------------------------------------------------------------------------+ +| enforcing | integer | No | The current IPE enforcing state 1 is in enforcing mode, 0 is in permissive mode | ++-----------+------------+-----------+---------------------------------------------------------------------------------+ +| pid | integer | No | The pid of the process that triggered the IPE event. | ++-----------+------------+-----------+---------------------------------------------------------------------------------+ +| comm | string | No | The command line program name of the process that triggered the IPE event | ++-----------+------------+-----------+---------------------------------------------------------------------------------+ +| path | string | Yes | The absolute path to the evaluated file | ++-----------+------------+-----------+---------------------------------------------------------------------------------+ +| ino | integer | Yes | The inode number of the evaluated file | ++-----------+------------+-----------+---------------------------------------------------------------------------------+ +| dev | string | Yes | The device name of the evaluated file, e.g. vda | ++-----------+------------+-----------+---------------------------------------------------------------------------------+ +| rule | string | No | The matched policy rule | ++-----------+------------+-----------+---------------------------------------------------------------------------------+ + +1421 AUDIT_IPE_CONFIG_CHANGE +^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +Event Example:: + + type=1421 audit(1653425583.136:54): old_active_pol_name="Allow_All" old_active_pol_version=0.0.0 old_policy_digest=sha256:E3B0C44298FC1C149AFBF4C8996FB92427AE41E4649B934CA495991B7852B855 new_active_pol_name="boot_verified" new_active_pol_version=0.0.0 new_policy_digest=sha256:820EEA5B40CA42B51F68962354BA083122A20BB846F26765076DD8EED7B8F4DB auid=4294967295 ses=4294967295 lsm=ipe res=1 + type=1300 audit(1653425583.136:54): SYSCALL arch=c000003e syscall=1 success=yes exit=2 a0=3 a1=5596fcae1fb0 a2=2 a3=2 items=0 ppid=184 pid=229 auid=4294967295 uid=0 gid=0 euid=0 suid=0 fsuid=0 egid=0 sgid=0 fsgid=0 tty=pts0 ses=4294967295 comm="python3" exe="/usr/bin/python3.10" key=(null) + type=1327 audit(1653425583.136:54): PROCTITLE proctitle=707974686F6E3300746573742F6D61696E2E7079002D66002E2 + +This event indicates that IPE switched the active poliy from one to another +along with the version and the hash digest of the two policies. +Note IPE can only have one policy active at a time, all access decision +evaluation is based on the current active policy. +The normal procedure to deploy a new policy is loading the policy to deploy +into the kernel first, then switch the active policy to it. + +This record will always be emitted in conjunction with a ``AUDITSYSCALL`` record for the ``write`` syscall. + +Field descriptions: + ++------------------------+------------+-----------+---------------------------------------------------+ +| Field | Value Type | Optional? | Description of Value | ++========================+============+===========+===================================================+ +| old_active_pol_name | string | Yes | The name of previous active policy | ++------------------------+------------+-----------+---------------------------------------------------+ +| old_active_pol_version | string | Yes | The version of previous active policy | ++------------------------+------------+-----------+---------------------------------------------------+ +| old_policy_digest | string | Yes | The hash of previous active policy | ++------------------------+------------+-----------+---------------------------------------------------+ +| new_active_pol_name | string | No | The name of current active policy | ++------------------------+------------+-----------+---------------------------------------------------+ +| new_active_pol_version | string | No | The version of current active policy | ++------------------------+------------+-----------+---------------------------------------------------+ +| new_policy_digest | string | No | The hash of current active policy | ++------------------------+------------+-----------+---------------------------------------------------+ +| auid | integer | No | The login user ID | ++------------------------+------------+-----------+---------------------------------------------------+ +| ses | integer | No | The login session ID | ++------------------------+------------+-----------+---------------------------------------------------+ +| lsm | string | No | The lsm name associated with the event | ++------------------------+------------+-----------+---------------------------------------------------+ +| res | integer | No | The result of the audited operation(success/fail) | ++------------------------+------------+-----------+---------------------------------------------------+ + +1422 AUDIT_IPE_POLICY_LOAD +^^^^^^^^^^^^^^^^^^^^^^^^^^ + +Event Example:: + + type=1422 audit(1653425529.927:53): policy_name="boot_verified" policy_version=0.0.0 policy_digest=sha256:820EEA5B40CA42B51F68962354BA083122A20BB846F26765076DD8EED7B8F4DB auid=4294967295 ses=4294967295 lsm=ipe res=1 errno=0 + type=1300 audit(1653425529.927:53): arch=c000003e syscall=1 success=yes exit=2567 a0=3 a1=5596fcae1fb0 a2=a07 a3=2 items=0 ppid=184 pid=229 auid=4294967295 uid=0 gid=0 euid=0 suid=0 fsuid=0 egid=0 sgid=0 fsgid=0 tty=pts0 ses=4294967295 comm="python3" exe="/usr/bin/python3.10" key=(null) + type=1327 audit(1653425529.927:53): PROCTITLE proctitle=707974686F6E3300746573742F6D61696E2E7079002D66002E2E + +This record indicates a new policy has been loaded into the kernel with the policy name, policy version and policy hash. + +This record will always be emitted in conjunction with a ``AUDITSYSCALL`` record for the ``write`` syscall. + +Field descriptions: + ++----------------+------------+-----------+-------------------------------------------------------------+ +| Field | Value Type | Optional? | Description of Value | ++================+============+===========+=============================================================+ +| policy_name | string | Yes | The policy_name | ++----------------+------------+-----------+-------------------------------------------------------------+ +| policy_version | string | Yes | The policy_version | ++----------------+------------+-----------+-------------------------------------------------------------+ +| policy_digest | string | Yes | The policy hash | ++----------------+------------+-----------+-------------------------------------------------------------+ +| auid | integer | No | The login user ID | ++----------------+------------+-----------+-------------------------------------------------------------+ +| ses | integer | No | The login session ID | ++----------------+------------+-----------+-------------------------------------------------------------+ +| lsm | string | No | The lsm name associated with the event | ++----------------+------------+-----------+-------------------------------------------------------------+ +| res | integer | No | The result of the audited operation(success/fail) | ++----------------+------------+-----------+-------------------------------------------------------------+ +| errno | integer | No | Error code from policy loading operations (see table below) | ++----------------+------------+-----------+-------------------------------------------------------------+ + +Policy error codes (errno): + +The following table lists the error codes that may appear in the errno field while loading or updating the policy: + ++----------------+--------------------------------------------------------+ +| Error Code | Description | ++================+========================================================+ +| 0 | Success | ++----------------+--------------------------------------------------------+ +| -EPERM | Insufficient permission | ++----------------+--------------------------------------------------------+ +| -EEXIST | Same name policy already deployed | ++----------------+--------------------------------------------------------+ +| -EBADMSG | Policy is invalid | ++----------------+--------------------------------------------------------+ +| -ENOMEM | Out of memory (OOM) | ++----------------+--------------------------------------------------------+ +| -ERANGE | Policy version number overflow | ++----------------+--------------------------------------------------------+ +| -EINVAL | Policy version parsing error | ++----------------+--------------------------------------------------------+ +| -ENOKEY | Key used to sign the IPE policy not found in keyring | ++----------------+--------------------------------------------------------+ +| -EKEYREJECTED | Policy signature verification failed | ++----------------+--------------------------------------------------------+ +| -ESTALE | Attempting to update an IPE policy with older version | ++----------------+--------------------------------------------------------+ +| -ENOENT | Policy was deleted while updating | ++----------------+--------------------------------------------------------+ + +1404 AUDIT_MAC_STATUS +^^^^^^^^^^^^^^^^^^^^^ + +Event Examples:: + + type=1404 audit(1653425689.008:55): enforcing=0 old_enforcing=1 auid=4294967295 ses=4294967295 enabled=1 old-enabled=1 lsm=ipe res=1 + type=1300 audit(1653425689.008:55): arch=c000003e syscall=1 success=yes exit=2 a0=1 a1=55c1065e5c60 a2=2 a3=0 items=0 ppid=405 pid=441 auid=0 uid=0 gid=0 euid=0 suid=0 fsuid=0 egid=0 sgid=) + type=1327 audit(1653425689.008:55): proctitle="-bash" + + type=1404 audit(1653425689.008:55): enforcing=1 old_enforcing=0 auid=4294967295 ses=4294967295 enabled=1 old-enabled=1 lsm=ipe res=1 + type=1300 audit(1653425689.008:55): arch=c000003e syscall=1 success=yes exit=2 a0=1 a1=55c1065e5c60 a2=2 a3=0 items=0 ppid=405 pid=441 auid=0 uid=0 gid=0 euid=0 suid=0 fsuid=0 egid=0 sgid=) + type=1327 audit(1653425689.008:55): proctitle="-bash" + +This record will always be emitted in conjunction with a ``AUDITSYSCALL`` record for the ``write`` syscall. + +Field descriptions: + ++---------------+------------+-----------+-------------------------------------------------------------------------------------------------+ +| Field | Value Type | Optional? | Description of Value | ++===============+============+===========+=================================================================================================+ +| enforcing | integer | No | The enforcing state IPE is being switched to, 1 is in enforcing mode, 0 is in permissive mode | ++---------------+------------+-----------+-------------------------------------------------------------------------------------------------+ +| old_enforcing | integer | No | The enforcing state IPE is being switched from, 1 is in enforcing mode, 0 is in permissive mode | ++---------------+------------+-----------+-------------------------------------------------------------------------------------------------+ +| auid | integer | No | The login user ID | ++---------------+------------+-----------+-------------------------------------------------------------------------------------------------+ +| ses | integer | No | The login session ID | ++---------------+------------+-----------+-------------------------------------------------------------------------------------------------+ +| enabled | integer | No | The new TTY audit enabled setting | ++---------------+------------+-----------+-------------------------------------------------------------------------------------------------+ +| old-enabled | integer | No | The old TTY audit enabled setting | ++---------------+------------+-----------+-------------------------------------------------------------------------------------------------+ +| lsm | string | No | The lsm name associated with the event | ++---------------+------------+-----------+-------------------------------------------------------------------------------------------------+ +| res | integer | No | The result of the audited operation(success/fail) | ++---------------+------------+-----------+-------------------------------------------------------------------------------------------------+ + + +Success Auditing +^^^^^^^^^^^^^^^^ + +IPE supports success auditing. When enabled, all events that pass IPE +policy and are not blocked will emit an audit event. This is disabled by +default, and can be enabled via the kernel command line +``ipe.success_audit=(0|1)`` or +``/sys/kernel/security/ipe/success_audit`` securityfs file. + +This is *very* noisy, as IPE will check every userspace binary on the +system, but is useful for debugging policies. + +.. NOTE:: + + If a traditional MAC system is enabled (SELinux, apparmor, smack, etcetera), + all writes to ipe's securityfs nodes require ``CAP_MAC_ADMIN``. + +Properties +---------- + +As explained above, IPE properties are ``key=value`` pairs expressed in IPE +policy. Two properties are built-into the policy parser: 'op' and 'action'. +The other properties are used to restrict immutable security properties +about the files being evaluated. Currently those properties are: +'``boot_verified``', '``dmverity_signature``', '``dmverity_roothash``', +'``fsverity_signature``', '``fsverity_digest``'. A description of all +properties supported by IPE are listed below: + +op +~~ + +Indicates the operation for a rule to apply to. Must be in every rule, +as the first token. IPE supports the following operations: + + ``EXECUTE`` + + Pertains to any file attempting to be executed, or loaded as an + executable. + + ``FIRMWARE``: + + Pertains to firmware being loaded via the firmware_class interface. + This covers both the preallocated buffer and the firmware file + itself. + + ``KMODULE``: + + Pertains to loading kernel modules via ``modprobe`` or ``insmod``. + + ``KEXEC_IMAGE``: + + Pertains to kernel images loading via ``kexec``. + + ``KEXEC_INITRAMFS`` + + Pertains to initrd images loading via ``kexec --initrd``. + + ``POLICY``: + + Controls loading policies via reading a kernel-space initiated read. + + An example of such is loading IMA policies by writing the path + to the policy file to ``$securityfs/ima/policy`` + + ``X509_CERT``: + + Controls loading IMA certificates through the Kconfigs, + ``CONFIG_IMA_X509_PATH`` and ``CONFIG_EVM_X509_PATH``. + +action +~~~~~~ + + Determines what IPE should do when a rule matches. Must be in every + rule, as the final clause. Can be one of: + + ``ALLOW``: + + If the rule matches, explicitly allow access to the resource to proceed + without executing any more rules. + + ``DENY``: + + If the rule matches, explicitly prohibit access to the resource to + proceed without executing any more rules. + +boot_verified +~~~~~~~~~~~~~ + + This property can be utilized for authorization of files from initramfs. + The format of this property is:: + + boot_verified=(TRUE|FALSE) + + + .. WARNING:: + + This property will trust files from initramfs(rootfs). It should + only be used during early booting stage. Before mounting the real + rootfs on top of the initramfs, initramfs script will recursively + remove all files and directories on the initramfs. This is typically + implemented by using switch_root(8) [#switch_root]_. Therefore the + initramfs will be empty and not accessible after the real + rootfs takes over. It is advised to switch to a different policy + that doesn't rely on the property after this point. + This ensures that the trust policies remain relevant and effective + throughout the system's operation. + +dmverity_roothash +~~~~~~~~~~~~~~~~~ + + This property can be utilized for authorization or revocation of + specific dm-verity volumes, identified via their root hashes. It has a + dependency on the DM_VERITY module. This property is controlled by + the ``IPE_PROP_DM_VERITY`` config option, it will be automatically + selected when ``SECURITY_IPE`` and ``DM_VERITY`` are all enabled. + The format of this property is:: + + dmverity_roothash=DigestName:HexadecimalString + + The supported DigestNames for dmverity_roothash are [#dmveritydigests]_ + + + blake2b-512 + + blake2s-256 + + sha256 + + sha384 + + sha512 + + sha3-224 + + sha3-256 + + sha3-384 + + sha3-512 + + sm3 + + rmd160 + +dmverity_signature +~~~~~~~~~~~~~~~~~~ + + This property can be utilized for authorization of all dm-verity + volumes that have a signed roothash that validated by a keyring + specified by dm-verity's configuration, either the system trusted + keyring, or the secondary keyring. It depends on + ``DM_VERITY_VERIFY_ROOTHASH_SIG`` config option and is controlled by + the ``IPE_PROP_DM_VERITY_SIGNATURE`` config option, it will be automatically + selected when ``SECURITY_IPE``, ``DM_VERITY`` and + ``DM_VERITY_VERIFY_ROOTHASH_SIG`` are all enabled. + The format of this property is:: + + dmverity_signature=(TRUE|FALSE) + +fsverity_digest +~~~~~~~~~~~~~~~ + + This property can be utilized for authorization of specific fsverity + enabled files, identified via their fsverity digests. + It depends on ``FS_VERITY`` config option and is controlled by + the ``IPE_PROP_FS_VERITY`` config option, it will be automatically + selected when ``SECURITY_IPE`` and ``FS_VERITY`` are all enabled. + The format of this property is:: + + fsverity_digest=DigestName:HexadecimalString + + The supported DigestNames for fsverity_digest are [#fsveritydigest]_ + + + sha256 + + sha512 + +fsverity_signature +~~~~~~~~~~~~~~~~~~ + + This property is used to authorize all fs-verity enabled files that have + been verified by fs-verity's built-in signature mechanism. The signature + verification relies on a key stored within the ".fs-verity" keyring. It + depends on ``FS_VERITY_BUILTIN_SIGNATURES`` config option and + it is controlled by the ``IPE_PROP_FS_VERITY`` config option, + it will be automatically selected when ``SECURITY_IPE``, ``FS_VERITY`` + and ``FS_VERITY_BUILTIN_SIGNATURES`` are all enabled. + The format of this property is:: + + fsverity_signature=(TRUE|FALSE) + +Policy Examples +--------------- + +Allow all +~~~~~~~~~ + +:: + + policy_name=Allow_All policy_version=0.0.0 + DEFAULT action=ALLOW + +Allow only initramfs +~~~~~~~~~~~~~~~~~~~~ + +:: + + policy_name=Allow_Initramfs policy_version=0.0.0 + DEFAULT action=DENY + + op=EXECUTE boot_verified=TRUE action=ALLOW + +Allow any signed and validated dm-verity volume and the initramfs +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +:: + + policy_name=Allow_Signed_DMV_And_Initramfs policy_version=0.0.0 + DEFAULT action=DENY + + op=EXECUTE boot_verified=TRUE action=ALLOW + op=EXECUTE dmverity_signature=TRUE action=ALLOW + +Prohibit execution from a specific dm-verity volume +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +:: + + policy_name=Deny_DMV_By_Roothash policy_version=0.0.0 + DEFAULT action=DENY + + op=EXECUTE dmverity_roothash=sha256:cd2c5bae7c6c579edaae4353049d58eb5f2e8be0244bf05345bc8e5ed257baff action=DENY + + op=EXECUTE boot_verified=TRUE action=ALLOW + op=EXECUTE dmverity_signature=TRUE action=ALLOW + +Allow only a specific dm-verity volume +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +:: + + policy_name=Allow_DMV_By_Roothash policy_version=0.0.0 + DEFAULT action=DENY + + op=EXECUTE dmverity_roothash=sha256:401fcec5944823ae12f62726e8184407a5fa9599783f030dec146938 action=ALLOW + +Allow any fs-verity file with a valid built-in signature +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +:: + + policy_name=Allow_Signed_And_Validated_FSVerity policy_version=0.0.0 + DEFAULT action=DENY + + op=EXECUTE fsverity_signature=TRUE action=ALLOW + +Allow execution of a specific fs-verity file +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +:: + + policy_name=ALLOW_FSV_By_Digest policy_version=0.0.0 + DEFAULT action=DENY + + op=EXECUTE fsverity_digest=sha256:fd88f2b8824e197f850bf4c5109bea5cf0ee38104f710843bb72da796ba5af9e action=ALLOW + +Additional Information +---------------------- + +- `Github Repository <https://github.com/microsoft/ipe>`_ +- :doc:`Developer and design docs for IPE </security/ipe>` + +FAQ +--- + +Q: + What's the difference between other LSMs which provide a measure of + trust-based access control? + +A: + + In general, there's two other LSMs that can provide similar functionality: + IMA, and Loadpin. + + IMA and IPE are functionally very similar. The significant difference between + the two is the policy. [#devdoc]_ + + Loadpin and IPE differ fairly dramatically, as Loadpin only covers the IPE's + kernel read operations, whereas IPE is capable of controlling execution + on top of kernel read. The trust model is also different; Loadpin roots its + trust in the initial super-block, whereas trust in IPE is stemmed from kernel + itself (via ``SYSTEM_TRUSTED_KEYS``). + +----------- + +.. [#digest_cache_lsm] https://lore.kernel.org/lkml/20240415142436.2545003-1-roberto.sassu@huaweicloud.com/ + +.. [#interpreters] There is `some interest in solving this issue <https://lore.kernel.org/lkml/20220321161557.495388-1-mic@digikod.net/>`_. + +.. [#devdoc] Please see :doc:`the design docs </security/ipe>` for more on + this topic. + +.. [#switch_root] https://man7.org/linux/man-pages/man8/switch_root.8.html + +.. [#dmveritydigests] These hash algorithms are based on values accepted by + the Linux crypto API; IPE does not impose any + restrictions on the digest algorithm itself; + thus, this list may be out of date. + +.. [#fsveritydigest] These hash algorithms are based on values accepted by the + kernel's fsverity support; IPE does not impose any + restrictions on the digest algorithm itself; + thus, this list may be out of date. diff --git a/Documentation/admin-guide/LSM/landlock.rst b/Documentation/admin-guide/LSM/landlock.rst new file mode 100644 index 000000000000..9e61607def08 --- /dev/null +++ b/Documentation/admin-guide/LSM/landlock.rst @@ -0,0 +1,158 @@ +.. SPDX-License-Identifier: GPL-2.0 +.. Copyright © 2025 Microsoft Corporation + +================================ +Landlock: system-wide management +================================ + +:Author: Mickaël Salaün +:Date: March 2025 + +Landlock can leverage the audit framework to log events. + +User space documentation can be found here: +Documentation/userspace-api/landlock.rst. + +Audit +===== + +Denied access requests are logged by default for a sandboxed program if `audit` +is enabled. This default behavior can be changed with the +sys_landlock_restrict_self() flags (cf. +Documentation/userspace-api/landlock.rst). Landlock logs can also be masked +thanks to audit rules. Landlock can generate 2 audit record types. + +Record types +------------ + +AUDIT_LANDLOCK_ACCESS + This record type identifies a denied access request to a kernel resource. + The ``domain`` field indicates the ID of the domain that blocked the + request. The ``blockers`` field indicates the cause(s) of this denial + (separated by a comma), and the following fields identify the kernel object + (similar to SELinux). There may be more than one of this record type per + audit event. + + Example with a file link request generating two records in the same event:: + + domain=195ba459b blockers=fs.refer path="/usr/bin" dev="vda2" ino=351 + domain=195ba459b blockers=fs.make_reg,fs.refer path="/usr/local" dev="vda2" ino=365 + +AUDIT_LANDLOCK_DOMAIN + This record type describes the status of a Landlock domain. The ``status`` + field can be either ``allocated`` or ``deallocated``. + + The ``allocated`` status is part of the same audit event and follows + the first logged ``AUDIT_LANDLOCK_ACCESS`` record of a domain. It identifies + Landlock domain information at the time of the sys_landlock_restrict_self() + call with the following fields: + + - the ``domain`` ID + - the enforcement ``mode`` + - the domain creator's ``pid`` + - the domain creator's ``uid`` + - the domain creator's executable path (``exe``) + - the domain creator's command line (``comm``) + + Example:: + + domain=195ba459b status=allocated mode=enforcing pid=300 uid=0 exe="/root/sandboxer" comm="sandboxer" + + The ``deallocated`` status is an event on its own and it identifies a + Landlock domain release. After such event, it is guarantee that the + related domain ID will never be reused during the lifetime of the system. + The ``domain`` field indicates the ID of the domain which is released, and + the ``denials`` field indicates the total number of denied access request, + which might not have been logged according to the audit rules and + sys_landlock_restrict_self()'s flags. + + Example:: + + domain=195ba459b status=deallocated denials=3 + + +Event samples +-------------- + +Here are two examples of log events (see serial numbers). + +In this example a sandboxed program (``kill``) tries to send a signal to the +init process, which is denied because of the signal scoping restriction +(``LL_SCOPED=s``):: + + $ LL_FS_RO=/ LL_FS_RW=/ LL_SCOPED=s LL_FORCE_LOG=1 ./sandboxer kill 1 + +This command generates two events, each identified with a unique serial +number following a timestamp (``msg=audit(1729738800.268:30)``). The first +event (serial ``30``) contains 4 records. The first record +(``type=LANDLOCK_ACCESS``) shows an access denied by the domain `1a6fdc66f`. +The cause of this denial is signal scopping restriction +(``blockers=scope.signal``). The process that would have receive this signal +is the init process (``opid=1 ocomm="systemd"``). + +The second record (``type=LANDLOCK_DOMAIN``) describes (``status=allocated``) +domain `1a6fdc66f`. This domain was created by process ``286`` executing the +``/root/sandboxer`` program launched by the root user. + +The third record (``type=SYSCALL``) describes the syscall, its provided +arguments, its result (``success=no exit=-1``), and the process that called it. + +The fourth record (``type=PROCTITLE``) shows the command's name as an +hexadecimal value. This can be translated with ``python -c +'print(bytes.fromhex("6B696C6C0031"))'``. + +Finally, the last record (``type=LANDLOCK_DOMAIN``) is also the only one from +the second event (serial ``31``). It is not tied to a direct user space action +but an asynchronous one to free resources tied to a Landlock domain +(``status=deallocated``). This can be useful to know that the following logs +will not concern the domain ``1a6fdc66f`` anymore. This record also summarize +the number of requests this domain denied (``denials=1``), whether they were +logged or not. + +.. code-block:: + + type=LANDLOCK_ACCESS msg=audit(1729738800.268:30): domain=1a6fdc66f blockers=scope.signal opid=1 ocomm="systemd" + type=LANDLOCK_DOMAIN msg=audit(1729738800.268:30): domain=1a6fdc66f status=allocated mode=enforcing pid=286 uid=0 exe="/root/sandboxer" comm="sandboxer" + type=SYSCALL msg=audit(1729738800.268:30): arch=c000003e syscall=62 success=no exit=-1 [..] ppid=272 pid=286 auid=0 uid=0 gid=0 [...] comm="kill" [...] + type=PROCTITLE msg=audit(1729738800.268:30): proctitle=6B696C6C0031 + type=LANDLOCK_DOMAIN msg=audit(1729738800.324:31): domain=1a6fdc66f status=deallocated denials=1 + +Here is another example showcasing filesystem access control:: + + $ LL_FS_RO=/ LL_FS_RW=/tmp LL_FORCE_LOG=1 ./sandboxer sh -c "echo > /etc/passwd" + +The related audit logs contains 8 records from 3 different events (serials 33, +34 and 35) created by the same domain `1a6fdc679`:: + + type=LANDLOCK_ACCESS msg=audit(1729738800.221:33): domain=1a6fdc679 blockers=fs.write_file path="/dev/tty" dev="devtmpfs" ino=9 + type=LANDLOCK_DOMAIN msg=audit(1729738800.221:33): domain=1a6fdc679 status=allocated mode=enforcing pid=289 uid=0 exe="/root/sandboxer" comm="sandboxer" + type=SYSCALL msg=audit(1729738800.221:33): arch=c000003e syscall=257 success=no exit=-13 [...] ppid=272 pid=289 auid=0 uid=0 gid=0 [...] comm="sh" [...] + type=PROCTITLE msg=audit(1729738800.221:33): proctitle=7368002D63006563686F203E202F6574632F706173737764 + type=LANDLOCK_ACCESS msg=audit(1729738800.221:34): domain=1a6fdc679 blockers=fs.write_file path="/etc/passwd" dev="vda2" ino=143821 + type=SYSCALL msg=audit(1729738800.221:34): arch=c000003e syscall=257 success=no exit=-13 [...] ppid=272 pid=289 auid=0 uid=0 gid=0 [...] comm="sh" [...] + type=PROCTITLE msg=audit(1729738800.221:34): proctitle=7368002D63006563686F203E202F6574632F706173737764 + type=LANDLOCK_DOMAIN msg=audit(1729738800.261:35): domain=1a6fdc679 status=deallocated denials=2 + + +Event filtering +--------------- + +If you get spammed with audit logs related to Landlock, this is either an +attack attempt or a bug in the security policy. We can put in place some +filters to limit noise with two complementary ways: + +- with sys_landlock_restrict_self()'s flags if we can fix the sandboxed + programs, +- or with audit rules (see :manpage:`auditctl(8)`). + +Additional documentation +======================== + +* `Linux Audit Documentation`_ +* Documentation/userspace-api/landlock.rst +* Documentation/security/landlock.rst +* https://landlock.io + +.. Links +.. _Linux Audit Documentation: + https://github.com/linux-audit/audit-documentation/wiki diff --git a/Documentation/admin-guide/LSM/tomoyo.rst b/Documentation/admin-guide/LSM/tomoyo.rst index 4bc9c2b4da6f..bdb2c2e2a1b2 100644 --- a/Documentation/admin-guide/LSM/tomoyo.rst +++ b/Documentation/admin-guide/LSM/tomoyo.rst @@ -9,8 +9,8 @@ TOMOYO is a name-based MAC extension (LSM module) for the Linux kernel. LiveCD-based tutorials are available at -http://tomoyo.sourceforge.jp/1.8/ubuntu12.04-live.html -http://tomoyo.sourceforge.jp/1.8/centos6-live.html +https://tomoyo.sourceforge.net/1.8/ubuntu12.04-live.html +https://tomoyo.sourceforge.net/1.8/centos6-live.html Though these tutorials use non-LSM version of TOMOYO, they are useful for you to know what TOMOYO is. @@ -21,45 +21,32 @@ How to enable TOMOYO? Build the kernel with ``CONFIG_SECURITY_TOMOYO=y`` and pass ``security=tomoyo`` on kernel's command line. -Please see http://tomoyo.osdn.jp/2.5/ for details. +Please see https://tomoyo.sourceforge.net/2.6/ for details. Where is documentation? ======================= User <-> Kernel interface documentation is available at -https://tomoyo.osdn.jp/2.5/policy-specification/index.html . +https://tomoyo.sourceforge.net/2.6/policy-specification/index.html . Materials we prepared for seminars and symposiums are available at -https://osdn.jp/projects/tomoyo/docs/?category_id=532&language_id=1 . +https://sourceforge.net/projects/tomoyo/files/docs/ . Below lists are chosen from three aspects. What is TOMOYO? TOMOYO Linux Overview - https://osdn.jp/projects/tomoyo/docs/lca2009-takeda.pdf + https://sourceforge.net/projects/tomoyo/files/docs/lca2009-takeda.pdf TOMOYO Linux: pragmatic and manageable security for Linux - https://osdn.jp/projects/tomoyo/docs/freedomhectaipei-tomoyo.pdf + https://sourceforge.net/projects/tomoyo/files/docs/freedomhectaipei-tomoyo.pdf TOMOYO Linux: A Practical Method to Understand and Protect Your Own Linux Box - https://osdn.jp/projects/tomoyo/docs/PacSec2007-en-no-demo.pdf + https://sourceforge.net/projects/tomoyo/files/docs/PacSec2007-en-no-demo.pdf What can TOMOYO do? Deep inside TOMOYO Linux - https://osdn.jp/projects/tomoyo/docs/lca2009-kumaneko.pdf + https://sourceforge.net/projects/tomoyo/files/docs/lca2009-kumaneko.pdf The role of "pathname based access control" in security. - https://osdn.jp/projects/tomoyo/docs/lfj2008-bof.pdf + https://sourceforge.net/projects/tomoyo/files/docs/lfj2008-bof.pdf History of TOMOYO? Realities of Mainlining - https://osdn.jp/projects/tomoyo/docs/lfj2008.pdf - -What is future plan? -==================== - -We believe that inode based security and name based security are complementary -and both should be used together. But unfortunately, so far, we cannot enable -multiple LSM modules at the same time. We feel sorry that you have to give up -SELinux/SMACK/AppArmor etc. when you want to use TOMOYO. - -We hope that LSM becomes stackable in future. Meanwhile, you can use non-LSM -version of TOMOYO, available at http://tomoyo.osdn.jp/1.8/ . -LSM version of TOMOYO is a subset of non-LSM version of TOMOYO. We are planning -to port non-LSM version's functionalities to LSM versions. + https://sourceforge.net/projects/tomoyo/files/docs/lfj2008.pdf diff --git a/Documentation/admin-guide/README.rst b/Documentation/admin-guide/README.rst index f2bebff6a733..05301f03b717 100644 --- a/Documentation/admin-guide/README.rst +++ b/Documentation/admin-guide/README.rst @@ -165,7 +165,7 @@ Configuring the kernel "make xconfig" Qt based configuration tool. - "make gconfig" GTK+ based configuration tool. + "make gconfig" GTK based configuration tool. "make oldconfig" Default all questions based on the contents of your existing ./.config file and asking about @@ -176,7 +176,7 @@ Configuring the kernel values without prompting. "make defconfig" Create a ./.config file by using the default - symbol values from either arch/$ARCH/defconfig + symbol values from either arch/$ARCH/configs/defconfig or arch/$ARCH/configs/${PLATFORM}_defconfig, depending on the architecture. @@ -259,7 +259,7 @@ Configuring the kernel Compiling the kernel -------------------- - - Make sure you have at least gcc 5.1 available. + - Make sure you have at least gcc 8.1 available. For more information, refer to :ref:`Documentation/process/changes.rst <changes>`. - Do a ``make`` to create a compressed kernel image. It is also possible to do @@ -356,5 +356,5 @@ instructions at 'Documentation/admin-guide/reporting-issues.rst'. Hints on understanding kernel bug reports are in 'Documentation/admin-guide/bug-hunting.rst'. More on debugging the kernel -with gdb is in 'Documentation/dev-tools/gdb-kernel-debugging.rst' and -'Documentation/dev-tools/kgdb.rst'. +with gdb is in 'Documentation/process/debugging/gdb-kernel-debugging.rst' and +'Documentation/process/debugging/kgdb.rst'. diff --git a/Documentation/admin-guide/abi-obsolete-files.rst b/Documentation/admin-guide/abi-obsolete-files.rst new file mode 100644 index 000000000000..3061a916b4b5 --- /dev/null +++ b/Documentation/admin-guide/abi-obsolete-files.rst @@ -0,0 +1,7 @@ +.. SPDX-License-Identifier: GPL-2.0 + +Obsolete ABI Files +================== + +.. kernel-abi:: obsolete + :no-symbols: diff --git a/Documentation/admin-guide/abi-obsolete.rst b/Documentation/admin-guide/abi-obsolete.rst index 594e697aa1b2..640f3903e847 100644 --- a/Documentation/admin-guide/abi-obsolete.rst +++ b/Documentation/admin-guide/abi-obsolete.rst @@ -1,3 +1,5 @@ +.. SPDX-License-Identifier: GPL-2.0 + ABI obsolete symbols ==================== @@ -7,5 +9,5 @@ marked to be removed at some later point in time. The description of the interface will document the reason why it is obsolete and when it can be expected to be removed. -.. kernel-abi:: ABI/obsolete - :rst: +.. kernel-abi:: obsolete + :no-files: diff --git a/Documentation/admin-guide/abi-removed-files.rst b/Documentation/admin-guide/abi-removed-files.rst new file mode 100644 index 000000000000..f1bdfadd2ec4 --- /dev/null +++ b/Documentation/admin-guide/abi-removed-files.rst @@ -0,0 +1,7 @@ +.. SPDX-License-Identifier: GPL-2.0 + +Removed ABI Files +================= + +.. kernel-abi:: removed + :no-symbols: diff --git a/Documentation/admin-guide/abi-removed.rst b/Documentation/admin-guide/abi-removed.rst index f9e000c81828..88832d3eacd6 100644 --- a/Documentation/admin-guide/abi-removed.rst +++ b/Documentation/admin-guide/abi-removed.rst @@ -1,5 +1,7 @@ +.. SPDX-License-Identifier: GPL-2.0 + ABI removed symbols =================== -.. kernel-abi:: ABI/removed - :rst: +.. kernel-abi:: removed + :no-files: diff --git a/Documentation/admin-guide/abi-stable-files.rst b/Documentation/admin-guide/abi-stable-files.rst new file mode 100644 index 000000000000..f867738fc178 --- /dev/null +++ b/Documentation/admin-guide/abi-stable-files.rst @@ -0,0 +1,7 @@ +.. SPDX-License-Identifier: GPL-2.0 + +Stable ABI Files +================ + +.. kernel-abi:: stable + :no-symbols: diff --git a/Documentation/admin-guide/abi-stable.rst b/Documentation/admin-guide/abi-stable.rst index fc3361d847b1..528c68401f4b 100644 --- a/Documentation/admin-guide/abi-stable.rst +++ b/Documentation/admin-guide/abi-stable.rst @@ -1,3 +1,5 @@ +.. SPDX-License-Identifier: GPL-2.0 + ABI stable symbols ================== @@ -10,5 +12,5 @@ for at least 2 years. Most interfaces (like syscalls) are expected to never change and always be available. -.. kernel-abi:: ABI/stable - :rst: +.. kernel-abi:: stable + :no-files: diff --git a/Documentation/admin-guide/abi-testing-files.rst b/Documentation/admin-guide/abi-testing-files.rst new file mode 100644 index 000000000000..1da868e42fdb --- /dev/null +++ b/Documentation/admin-guide/abi-testing-files.rst @@ -0,0 +1,7 @@ +.. SPDX-License-Identifier: GPL-2.0 + +Testing ABI Files +================= + +.. kernel-abi:: testing + :no-symbols: diff --git a/Documentation/admin-guide/abi-testing.rst b/Documentation/admin-guide/abi-testing.rst index 19767926b344..6153ebd38e2d 100644 --- a/Documentation/admin-guide/abi-testing.rst +++ b/Documentation/admin-guide/abi-testing.rst @@ -1,3 +1,5 @@ +.. SPDX-License-Identifier: GPL-2.0 + ABI testing symbols =================== @@ -16,5 +18,5 @@ Programs that use these interfaces are strongly encouraged to add their name to the description of these interfaces, so that the kernel developers can easily notify them if any changes occur. -.. kernel-abi:: ABI/testing - :rst: +.. kernel-abi:: testing + :no-files: diff --git a/Documentation/admin-guide/abi.rst b/Documentation/admin-guide/abi.rst index bcab3ef2597c..c6039359e585 100644 --- a/Documentation/admin-guide/abi.rst +++ b/Documentation/admin-guide/abi.rst @@ -1,7 +1,14 @@ +.. SPDX-License-Identifier: GPL-2.0 + ===================== Linux ABI description ===================== +.. kernel-abi:: README + +ABI symbols +----------- + .. toctree:: :maxdepth: 2 @@ -9,3 +16,14 @@ Linux ABI description abi-testing abi-obsolete abi-removed + +ABI files +--------- + +.. toctree:: + :maxdepth: 2 + + abi-stable-files + abi-testing-files + abi-obsolete-files + abi-removed-files diff --git a/Documentation/admin-guide/blockdev/index.rst b/Documentation/admin-guide/blockdev/index.rst index 957ccf617797..3262397ebe8f 100644 --- a/Documentation/admin-guide/blockdev/index.rst +++ b/Documentation/admin-guide/blockdev/index.rst @@ -11,6 +11,7 @@ Block Devices nbd paride ramdisk + zoned_loop zram drbd/index diff --git a/Documentation/admin-guide/blockdev/zoned_loop.rst b/Documentation/admin-guide/blockdev/zoned_loop.rst new file mode 100644 index 000000000000..9c7aa3b482f3 --- /dev/null +++ b/Documentation/admin-guide/blockdev/zoned_loop.rst @@ -0,0 +1,169 @@ +.. SPDX-License-Identifier: GPL-2.0 + +======================= +Zoned Loop Block Device +======================= + +.. Contents: + + 1) Overview + 2) Creating a Zoned Device + 3) Deleting a Zoned Device + 4) Example + + +1) Overview +----------- + +The zoned loop block device driver (zloop) allows a user to create a zoned block +device using one regular file per zone as backing storage. This driver does not +directly control any hardware and uses read, write and truncate operations to +regular files of a file system to emulate a zoned block device. + +Using zloop, zoned block devices with a configurable capacity, zone size and +number of conventional zones can be created. The storage for each zone of the +device is implemented using a regular file with a maximum size equal to the zone +size. The size of a file backing a conventional zone is always equal to the zone +size. The size of a file backing a sequential zone indicates the amount of data +sequentially written to the file, that is, the size of the file directly +indicates the position of the write pointer of the zone. + +When resetting a sequential zone, its backing file size is truncated to zero. +Conversely, for a zone finish operation, the backing file is truncated to the +zone size. With this, the maximum capacity of a zloop zoned block device created +can be larger configured to be larger than the storage space available on the +backing file system. Of course, for such configuration, writing more data than +the storage space available on the backing file system will result in write +errors. + +The zoned loop block device driver implements a complete zone transition state +machine. That is, zones can be empty, implicitly opened, explicitly opened, +closed or full. The current implementation does not support any limits on the +maximum number of open and active zones. + +No user tools are necessary to create and delete zloop devices. + +2) Creating a Zoned Device +-------------------------- + +Once the zloop module is loaded (or if zloop is compiled in the kernel), the +character device file /dev/zloop-control can be used to add a zloop device. +This is done by writing an "add" command directly to the /dev/zloop-control +device:: + + $ modprobe zloop + $ ls -l /dev/zloop* + crw-------. 1 root root 10, 123 Jan 6 19:18 /dev/zloop-control + + $ mkdir -p <base directory/<device ID> + $ echo "add [options]" > /dev/zloop-control + +The options available for the add command can be listed by reading the +/dev/zloop-control device:: + + $ cat /dev/zloop-control + add id=%d,capacity_mb=%u,zone_size_mb=%u,zone_capacity_mb=%u,conv_zones=%u,base_dir=%s,nr_queues=%u,queue_depth=%u,buffered_io + remove id=%d + +In more details, the options that can be used with the "add" command are as +follows. + +================ =========================================================== +id Device number (the X in /dev/zloopX). + Default: automatically assigned. +capacity_mb Device total capacity in MiB. This is always rounded up to + the nearest higher multiple of the zone size. + Default: 16384 MiB (16 GiB). +zone_size_mb Device zone size in MiB. Default: 256 MiB. +zone_capacity_mb Device zone capacity (must always be equal to or lower than + the zone size. Default: zone size. +conv_zones Total number of conventioanl zones starting from sector 0. + Default: 8. +base_dir Path to the base directoy where to create the directory + containing the zone files of the device. + Default=/var/local/zloop. + The device directory containing the zone files is always + named with the device ID. E.g. the default zone file + directory for /dev/zloop0 is /var/local/zloop/0. +nr_queues Number of I/O queues of the zoned block device. This value is + always capped by the number of online CPUs + Default: 1 +queue_depth Maximum I/O queue depth per I/O queue. + Default: 64 +buffered_io Do buffered IOs instead of direct IOs (default: false) +================ =========================================================== + +3) Deleting a Zoned Device +-------------------------- + +Deleting an unused zoned loop block device is done by issuing the "remove" +command to /dev/zloop-control, specifying the ID of the device to remove:: + + $ echo "remove id=X" > /dev/zloop-control + +The remove command does not have any option. + +A zoned device that was removed can be re-added again without any change to the +state of the device zones: the device zones are restored to their last state +before the device was removed. Adding again a zoned device after it was removed +must always be done using the same configuration as when the device was first +added. If a zone configuration change is detected, an error will be returned and +the zoned device will not be created. + +To fully delete a zoned device, after executing the remove operation, the device +base directory containing the backing files of the device zones must be deleted. + +4) Example +---------- + +The following sequence of commands creates a 2GB zoned device with zones of 64 +MB and a zone capacity of 63 MB:: + + $ modprobe zloop + $ mkdir -p /var/local/zloop/0 + $ echo "add capacity_mb=2048,zone_size_mb=64,zone_capacity=63MB" > /dev/zloop-control + +For the device created (/dev/zloop0), the zone backing files are all created +under the default base directory (/var/local/zloop):: + + $ ls -l /var/local/zloop/0 + total 0 + -rw-------. 1 root root 67108864 Jan 6 22:23 cnv-000000 + -rw-------. 1 root root 67108864 Jan 6 22:23 cnv-000001 + -rw-------. 1 root root 67108864 Jan 6 22:23 cnv-000002 + -rw-------. 1 root root 67108864 Jan 6 22:23 cnv-000003 + -rw-------. 1 root root 67108864 Jan 6 22:23 cnv-000004 + -rw-------. 1 root root 67108864 Jan 6 22:23 cnv-000005 + -rw-------. 1 root root 67108864 Jan 6 22:23 cnv-000006 + -rw-------. 1 root root 67108864 Jan 6 22:23 cnv-000007 + -rw-------. 1 root root 0 Jan 6 22:23 seq-000008 + -rw-------. 1 root root 0 Jan 6 22:23 seq-000009 + ... + +The zoned device created (/dev/zloop0) can then be used normally:: + + $ lsblk -z + NAME ZONED ZONE-SZ ZONE-NR ZONE-AMAX ZONE-OMAX ZONE-APP ZONE-WGRAN + zloop0 host-managed 64M 32 0 0 1M 4K + $ blkzone report /dev/zloop0 + start: 0x000000000, len 0x020000, cap 0x020000, wptr 0x000000 reset:0 non-seq:0, zcond: 0(nw) [type: 1(CONVENTIONAL)] + start: 0x000020000, len 0x020000, cap 0x020000, wptr 0x000000 reset:0 non-seq:0, zcond: 0(nw) [type: 1(CONVENTIONAL)] + start: 0x000040000, len 0x020000, cap 0x020000, wptr 0x000000 reset:0 non-seq:0, zcond: 0(nw) [type: 1(CONVENTIONAL)] + start: 0x000060000, len 0x020000, cap 0x020000, wptr 0x000000 reset:0 non-seq:0, zcond: 0(nw) [type: 1(CONVENTIONAL)] + start: 0x000080000, len 0x020000, cap 0x020000, wptr 0x000000 reset:0 non-seq:0, zcond: 0(nw) [type: 1(CONVENTIONAL)] + start: 0x0000a0000, len 0x020000, cap 0x020000, wptr 0x000000 reset:0 non-seq:0, zcond: 0(nw) [type: 1(CONVENTIONAL)] + start: 0x0000c0000, len 0x020000, cap 0x020000, wptr 0x000000 reset:0 non-seq:0, zcond: 0(nw) [type: 1(CONVENTIONAL)] + start: 0x0000e0000, len 0x020000, cap 0x020000, wptr 0x000000 reset:0 non-seq:0, zcond: 0(nw) [type: 1(CONVENTIONAL)] + start: 0x000100000, len 0x020000, cap 0x01f800, wptr 0x000000 reset:0 non-seq:0, zcond: 1(em) [type: 2(SEQ_WRITE_REQUIRED)] + start: 0x000120000, len 0x020000, cap 0x01f800, wptr 0x000000 reset:0 non-seq:0, zcond: 1(em) [type: 2(SEQ_WRITE_REQUIRED)] + ... + +Deleting this device is done using the command:: + + $ echo "remove id=0" > /dev/zloop-control + +The removed device can be re-added again using the same "add" command as when +the device was first created. To fully delete a zoned device, its backing files +should also be deleted after executing the remove command:: + + $ rm -r /var/local/zloop/0 diff --git a/Documentation/admin-guide/blockdev/zram.rst b/Documentation/admin-guide/blockdev/zram.rst index ee2b0030d416..3e273c1bb749 100644 --- a/Documentation/admin-guide/blockdev/zram.rst +++ b/Documentation/admin-guide/blockdev/zram.rst @@ -47,12 +47,14 @@ The list of possible return codes: -ENOMEM zram was not able to allocate enough memory to fulfil your needs. -EINVAL invalid input has been provided. +-EAGAIN re-try operation later (e.g. when attempting to run recompress + and writeback simultaneously). ======== ============================================================= If you use 'echo', the returned value is set by the 'echo' utility, and, in general case, something like:: - echo 3 > /sys/block/zram0/max_comp_streams + echo foo > /sys/block/zram0/comp_algorithm if [ $? -ne 0 ]; then handle_error fi @@ -71,21 +73,7 @@ This creates 4 devices: /dev/zram{0,1,2,3} num_devices parameter is optional and tells zram how many devices should be pre-created. Default: 1. -2) Set max number of compression streams -======================================== - -Regardless of the value passed to this attribute, ZRAM will always -allocate multiple compression streams - one per online CPU - thus -allowing several concurrent compression operations. The number of -allocated compression streams goes down when some of the CPUs -become offline. There is no single-compression-stream mode anymore, -unless you are running a UP system or have only 1 CPU online. - -To find out how many streams are currently available:: - - cat /sys/block/zram0/max_comp_streams - -3) Select compression algorithm +2) Select compression algorithm =============================== Using comp_algorithm device attribute one can see available and @@ -102,15 +90,39 @@ Examples:: #select lzo compression algorithm echo lzo > /sys/block/zram0/comp_algorithm -For the time being, the `comp_algorithm` content does not necessarily -show every compression algorithm supported by the kernel. We keep this -list primarily to simplify device configuration and one can configure -a new device with a compression algorithm that is not listed in -`comp_algorithm`. The thing is that, internally, ZRAM uses Crypto API -and, if some of the algorithms were built as modules, it's impossible -to list all of them using, for instance, /proc/crypto or any other -method. This, however, has an advantage of permitting the usage of -custom crypto compression modules (implementing S/W or H/W compression). +For the time being, the `comp_algorithm` content shows only compression +algorithms that are supported by zram. + +3) Set compression algorithm parameters: Optional +================================================= + +Compression algorithms may support specific parameters which can be +tweaked for particular dataset. ZRAM has an `algorithm_params` device +attribute which provides a per-algorithm params configuration. + +For example, several compression algorithms support `level` parameter. +In addition, certain compression algorithms support pre-trained dictionaries, +which significantly change algorithms' characteristics. In order to configure +compression algorithm to use external pre-trained dictionary, pass full +path to the `dict` along with other parameters:: + + #pass path to pre-trained zstd dictionary + echo "algo=zstd dict=/etc/dictionary" > /sys/block/zram0/algorithm_params + + #same, but using algorithm priority + echo "priority=1 dict=/etc/dictionary" > \ + /sys/block/zram0/algorithm_params + + #pass path to pre-trained zstd dictionary and compression level + echo "algo=zstd level=8 dict=/etc/dictionary" > \ + /sys/block/zram0/algorithm_params + +Parameters are algorithm specific: not all algorithms support pre-trained +dictionaries, not all algorithms support `level`. Furthermore, for certain +algorithms `level` controls the compression level (the higher the value the +better the compression ratio, it even can take negatives values for some +algorithms), for other algorithms `level` is acceleration level (the higher +the value the lower the compression ratio). 4) Set Disksize =============== @@ -202,9 +214,8 @@ mem_limit WO specifies the maximum amount of memory ZRAM can writeback_limit WO specifies the maximum amount of write IO zram can write out to backing device as 4KB unit writeback_limit_enable RW show and set writeback_limit feature -max_comp_streams RW the number of possible concurrent compress - operations comp_algorithm RW show and change the compression algorithm +algorithm_params WO setup compression algorithm parameters compact WO trigger memory compaction debug_stat RO this file is used for zram debugging purposes backing_dev RW set up backend storage for zram to write out @@ -284,7 +295,7 @@ a single line of text and contains the following stats separated by whitespace: ============== ============================================================= 9) Deactivate -============= +============== :: @@ -306,6 +317,26 @@ a single line of text and contains the following stats separated by whitespace: Optional Feature ================ +IDLE pages tracking +------------------- + +zram has built-in support for idle pages tracking (that is, allocated but +not used pages). This feature is useful for e.g. zram writeback and +recompression. In order to mark pages as idle, execute the following command:: + + echo all > /sys/block/zramX/idle + +This will mark all allocated zram pages as idle. The idle mark will be +removed only when the page (block) is accessed (e.g. overwritten or freed). +Additionally, when CONFIG_ZRAM_TRACK_ENTRY_ACTIME is enabled, pages can be +marked as idle based on how many seconds have passed since the last access to +a particular zram page:: + + echo 86400 > /sys/block/zramX/idle + +In this example, all pages which haven't been accessed in more than 86400 +seconds (one day) will be marked idle. + writeback --------- @@ -320,24 +351,7 @@ If admin wants to use incompressible page writeback, they could do it via:: echo huge > /sys/block/zramX/writeback -To use idle page writeback, first, user need to declare zram pages -as idle:: - - echo all > /sys/block/zramX/idle - -From now on, any pages on zram are idle pages. The idle mark -will be removed until someone requests access of the block. -IOW, unless there is access request, those pages are still idle pages. -Additionally, when CONFIG_ZRAM_TRACK_ENTRY_ACTIME is enabled pages can be -marked as idle based on how long (in seconds) it's been since they were -last accessed:: - - echo 86400 > /sys/block/zramX/idle - -In this example all pages which haven't been accessed in more than 86400 -seconds (one day) will be marked idle. - -Admin can request writeback of those idle pages at right timing via:: +Admin can request writeback of idle pages at right timing via:: echo idle > /sys/block/zramX/writeback @@ -358,6 +372,23 @@ they could write a page index into the interface:: echo "page_index=1251" > /sys/block/zramX/writeback +In Linux 6.16 this interface underwent some rework. First, the interface +now supports `key=value` format for all of its parameters (`type=huge_idle`, +etc.) Second, the support for `page_indexes` was introduced, which specify +`LOW-HIGH` range (or ranges) of pages to be written-back. This reduces the +number of syscalls, but more importantly this enables optimal post-processing +target selection strategy. Usage example:: + + echo "type=idle" > /sys/block/zramX/writeback + echo "page_indexes=1-100 page_indexes=200-300" > \ + /sys/block/zramX/writeback + +We also now permit multiple page_index params per call and a mix of +single pages and page ranges:: + + echo page_index=42 page_index=99 page_indexes=100-200 \ + page_indexes=500-700 > /sys/block/zramX/writeback + If there are lots of write IO with flash device, potentially, it has flash wearout problem so that admin needs to design write limitation to guarantee storage health for entire product life. @@ -466,7 +497,10 @@ of equal or greater size::: #recompress idle pages larger than 2000 bytes echo "type=idle threshold=2000" > /sys/block/zramX/recompress -Recompression of idle pages requires memory tracking. +It is also possible to limit the number of pages zram re-compression will +attempt to recompress::: + + echo "type=huge_idle max_pages=42" > /sys/block/zramX/recompress During re-compression for every page, that matches re-compression criteria, ZRAM iterates the list of registered alternative compression algorithms in @@ -482,11 +516,14 @@ registered compression algorithms, increases our chances of finding the algorithm that successfully compresses a particular page. Sometimes, however, it is convenient (and sometimes even necessary) to limit recompression to only one particular algorithm so that it will not try any other algorithms. -This can be achieved by providing a algo=NAME parameter::: +This can be achieved by providing a `algo` or `priority` parameter::: #use zstd algorithm only (if registered) echo "type=huge algo=zstd" > /sys/block/zramX/recompress + #use zstd algorithm only (if zstd was registered under priority 1) + echo "type=huge priority=1" > /sys/block/zramX/recompress + memory tracking =============== diff --git a/Documentation/admin-guide/braille-console.rst b/Documentation/admin-guide/braille-console.rst index 18e79337dcfd..153472e93cae 100644 --- a/Documentation/admin-guide/braille-console.rst +++ b/Documentation/admin-guide/braille-console.rst @@ -21,8 +21,8 @@ override the baud rate to 115200, etc. By default, the braille device will just show the last kernel message (console mode). To review previous messages, press the Insert key to switch to the VT review mode. In review mode, the arrow keys permit to browse in the VT content, -:kbd:`PAGE-UP`/:kbd:`PAGE-DOWN` keys go at the top/bottom of the screen, and -the :kbd:`HOME` key goes back +`PAGE-UP`/`PAGE-DOWN` keys go at the top/bottom of the screen, and +the `HOME` key goes back to the cursor, hence providing very basic screen reviewing facility. Sound feedback can be obtained by adding the ``braille_console.sound=1`` kernel diff --git a/Documentation/admin-guide/bug-bisect.rst b/Documentation/admin-guide/bug-bisect.rst index 325c5d0ed34a..f4f867cabb17 100644 --- a/Documentation/admin-guide/bug-bisect.rst +++ b/Documentation/admin-guide/bug-bisect.rst @@ -1,76 +1,165 @@ -Bisecting a bug -+++++++++++++++ +.. SPDX-License-Identifier: (GPL-2.0+ OR CC-BY-4.0) +.. [see the bottom of this file for redistribution information] -Last updated: 28 October 2016 +====================== +Bisecting a regression +====================== -Introduction -============ +This document describes how to use a ``git bisect`` to find the source code +change that broke something -- for example when some functionality stopped +working after upgrading from Linux 6.0 to 6.1. -Always try the latest kernel from kernel.org and build from source. If you are -not confident in doing that please report the bug to your distribution vendor -instead of to a kernel developer. +The text focuses on the gist of the process. If you are new to bisecting the +kernel, better follow Documentation/admin-guide/verify-bugs-and-bisect-regressions.rst +instead: it depicts everything from start to finish while covering multiple +aspects even kernel developers occasionally forget. This includes detecting +situations early where a bisection would be a waste of time, as nobody would +care about the result -- for example, because the problem happens after the +kernel marked itself as 'tainted', occurs in an abandoned version, was already +fixed, or is caused by a .config change you or your Linux distributor performed. -Finding bugs is not always easy. Have a go though. If you can't find it don't -give up. Report as much as you have found to the relevant maintainer. See -MAINTAINERS for who that is for the subsystem you have worked on. +Finding the change causing a kernel issue using a bisection +=========================================================== -Before you submit a bug report read -'Documentation/admin-guide/reporting-issues.rst'. +*Note: the following process assumes you prepared everything for a bisection. +This includes having a Git clone with the appropriate sources, installing the +software required to build and install kernels, as well as a .config file stored +in a safe place (the following example assumes '~/prepared_kernel_.config') to +use as pristine base at each bisection step; ideally, you have also worked out +a fully reliable and straight-forward way to reproduce the regression, too.* -Devices not appearing -===================== +* Preparation: start the bisection and tell Git about the points in the history + you consider to be working and broken, which Git calls 'good' and 'bad':: -Often this is caused by udev/systemd. Check that first before blaming it -on the kernel. + git bisect start + git bisect good v6.0 + git bisect bad v6.1 -Finding patch that caused a bug -=============================== + Instead of Git tags like 'v6.0' and 'v6.1' you can specify commit-ids, too. -Using the provided tools with ``git`` makes finding bugs easy provided the bug -is reproducible. +1. Copy your prepared .config into the build directory and adjust it to the + needs of the codebase Git checked out for testing:: -Steps to do it: + cp ~/prepared_kernel_.config .config + make olddefconfig -- build the Kernel from its git source -- start bisect with [#f1]_:: - - $ git bisect start - -- mark the broken changeset with:: - - $ git bisect bad [commit] - -- mark a changeset where the code is known to work with:: - - $ git bisect good [commit] - -- rebuild the Kernel and test -- interact with git bisect by using either:: - - $ git bisect good - - or:: - - $ git bisect bad - - depending if the bug happened on the changeset you're testing -- After some interactions, git bisect will give you the changeset that - likely caused the bug. - -- For example, if you know that the current version is bad, and version - 4.8 is good, you could do:: - - $ git bisect start - $ git bisect bad # Current version is bad - $ git bisect good v4.8 - - -.. [#f1] You can, optionally, provide both good and bad arguments at git - start with ``git bisect start [BAD] [GOOD]`` - -For further references, please read: - -- The man page for ``git-bisect`` -- `Fighting regressions with git bisect <https://www.kernel.org/pub/software/scm/git/docs/git-bisect-lk2009.html>`_ -- `Fully automated bisecting with "git bisect run" <https://lwn.net/Articles/317154>`_ -- `Using Git bisect to figure out when brokenness was introduced <http://webchick.net/node/99>`_ +2. Now build, install, and boot a kernel. This might fail for unrelated reasons, + for example, when a compile error happens at the current stage of the + bisection a later change resolves. In such cases run ``git bisect skip`` and + go back to step 1. + +3. Check if the functionality that regressed works in the kernel you just built. + + If it works, execute:: + + git bisect good + + If it is broken, run:: + + git bisect bad + + Note, getting this wrong just once will send the rest of the bisection + totally off course. To prevent having to start anew later you thus want to + ensure what you tell Git is correct; it is thus often wise to spend a few + minutes more on testing in case your reproducer is unreliable. + + After issuing one of these two commands, Git will usually check out another + bisection point and print something like 'Bisecting: 675 revisions left to + test after this (roughly 10 steps)'. In that case go back to step 1. + + If Git instead prints something like 'cafecaca0c0dacafecaca0c0dacafecaca0c0da + is the first bad commit', then you have finished the bisection. In that case + move to the next point below. Note, right after displaying that line Git will + show some details about the culprit including its patch description; this can + easily fill your terminal, so you might need to scroll up to see the message + mentioning the culprit's commit-id. + + In case you missed Git's output, you can always run ``git bisect log`` to + print the status: it will show how many steps remain or mention the result of + the bisection. + +* Recommended complementary task: put the bisection log and the current .config + file aside for the bug report; furthermore tell Git to reset the sources to + the state before the bisection:: + + git bisect log > ~/bisection-log + cp .config ~/bisection-config-culprit + git bisect reset + +* Recommended optional task: try reverting the culprit on top of the latest + codebase and check if that fixes your bug; if that is the case, it validates + the bisection and enables developers to resolve the regression through a + revert. + + To try this, update your clone and check out latest mainline. Then tell Git + to revert the change by specifying its commit-id:: + + git revert --no-edit cafec0cacaca0 + + Git might reject this, for example when the bisection landed on a merge + commit. In that case, abandon the attempt. Do the same, if Git fails to revert + the culprit on its own because later changes depend on it -- at least unless + you bisected a stable or longterm kernel series, in which case you want to + check out its latest codebase and try a revert there. + + If a revert succeeds, build and test another kernel to check if reverting + resolved your regression. + +With that the process is complete. Now report the regression as described by +Documentation/admin-guide/reporting-issues.rst. + +Bisecting linux-next +-------------------- + +If you face a problem only happening in linux-next, bisect between the +linux-next branches 'stable' and 'master'. The following commands will start +the process for a linux-next tree you added as a remote called 'next':: + + git bisect start + git bisect good next/stable + git bisect bad next/master + +The 'stable' branch refers to the state of linux-mainline that the current +linux-next release (found in the 'master' branch) is based on -- the former +thus should be free of any problems that show up in -next, but not in Linus' +tree. + +This will bisect across a wide range of changes, some of which you might have +used in earlier linux-next releases without problems. Sadly there is no simple +way to avoid checking them: bisecting from one linux-next release to a later +one (say between 'next-20241020' and 'next-20241021') is impossible, as they +share no common history. + +Additional reading material +--------------------------- + +* The `man page for 'git bisect' <https://git-scm.com/docs/git-bisect>`_ and + `fighting regressions with 'git bisect' <https://git-scm.com/docs/git-bisect-lk2009.html>`_ + in the Git documentation. +* `Working with git bisect <https://nathanchance.dev/posts/working-with-git-bisect/>`_ + from kernel developer Nathan Chancellor. +* `Using Git bisect to figure out when brokenness was introduced <http://webchick.net/node/99>`_. +* `Fully automated bisecting with 'git bisect run' <https://lwn.net/Articles/317154>`_. + +.. + end-of-content +.. + This document is maintained by Thorsten Leemhuis <linux@leemhuis.info>. If + you spot a typo or small mistake, feel free to let him know directly and + he'll fix it. You are free to do the same in a mostly informal way if you + want to contribute changes to the text -- but for copyright reasons please CC + linux-doc@vger.kernel.org and 'sign-off' your contribution as + Documentation/process/submitting-patches.rst explains in the section 'Sign + your work - the Developer's Certificate of Origin'. +.. + This text is available under GPL-2.0+ or CC-BY-4.0, as stated at the top + of the file. If you want to distribute this text under CC-BY-4.0 only, + please use 'The Linux kernel development community' for author attribution + and link this as source: + https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/plain/Documentation/admin-guide/bug-bisect.rst + +.. + Note: Only the content of this RST file as found in the Linux kernel sources + is available under CC-BY-4.0, as versions of this text that were processed + (for example by the kernel's build system) might contain content taken from + files which use a more restrictive license. diff --git a/Documentation/admin-guide/bug-hunting.rst b/Documentation/admin-guide/bug-hunting.rst index 95299b08c405..30858757c9f2 100644 --- a/Documentation/admin-guide/bug-hunting.rst +++ b/Documentation/admin-guide/bug-hunting.rst @@ -196,7 +196,7 @@ will see the assembler code for the routine shown, but if your kernel has debug symbols the C code will also be available. (Debug symbols can be enabled in the kernel hacking menu of the menu configuration.) For example:: - $ objdump -r -S -l --disassemble net/dccp/ipv4.o + $ objdump -r -S -l --disassemble net/ipv4/tcp.o .. note:: @@ -244,14 +244,14 @@ Reporting the bug Once you find where the bug happened, by inspecting its location, you could either try to fix it yourself or report it upstream. -In order to report it upstream, you should identify the mailing list -used for the development of the affected code. This can be done by using -the ``get_maintainer.pl`` script. +In order to report it upstream, you should identify the bug tracker, if any, or +mailing list used for the development of the affected code. This can be done by +using the ``get_maintainer.pl`` script. For example, if you find a bug at the gspca's sonixj.c file, you can get its maintainers with:: - $ ./scripts/get_maintainer.pl -f drivers/media/usb/gspca/sonixj.c + $ ./scripts/get_maintainer.pl --bug -f drivers/media/usb/gspca/sonixj.c Hans Verkuil <hverkuil@xs4all.nl> (odd fixer:GSPCA USB WEBCAM DRIVER,commit_signer:1/1=100%) Mauro Carvalho Chehab <mchehab@kernel.org> (maintainer:MEDIA INPUT INFRASTRUCTURE (V4L/DVB),commit_signer:1/1=100%) Tejun Heo <tj@kernel.org> (commit_signer:1/1=100%) @@ -267,11 +267,12 @@ Please notice that it will point to: - The driver maintainer (Hans Verkuil); - The subsystem maintainer (Mauro Carvalho Chehab); - The driver and/or subsystem mailing list (linux-media@vger.kernel.org); -- the Linux Kernel mailing list (linux-kernel@vger.kernel.org). +- The Linux Kernel mailing list (linux-kernel@vger.kernel.org); +- The bug reporting URIs for the driver/subsystem (none in the above example). -Usually, the fastest way to have your bug fixed is to report it to mailing -list used for the development of the code (linux-media ML) copying the -driver maintainer (Hans). +If the listing contains bug reporting URIs at the end, please prefer them over +email. Otherwise, please report bugs to the mailing list used for the +development of the code (linux-media ML) copying the driver maintainer (Hans). If you are totally stumped as to whom to send the report, and ``get_maintainer.pl`` didn't provide you anything useful, send it to @@ -367,12 +368,3 @@ processed by ``klogd``:: Aug 29 09:51:01 blizard kernel: Call Trace: [oops:_oops_ioctl+48/80] [_sys_ioctl+254/272] [_system_call+82/128] Aug 29 09:51:01 blizard kernel: Code: c7 00 05 00 00 00 eb 08 90 90 90 90 90 90 90 90 89 ec 5d c3 ---------------------------------------------------------------------------- - -:: - - Dr. G.W. Wettstein Oncology Research Div. Computing Facility - Roger Maris Cancer Center INTERNET: greg@wind.rmcc.com - 820 4th St. N. - Fargo, ND 58122 - Phone: 701-234-7556 diff --git a/Documentation/admin-guide/cgroup-v1/cgroups.rst b/Documentation/admin-guide/cgroup-v1/cgroups.rst index 9343148ee993..463f98453323 100644 --- a/Documentation/admin-guide/cgroup-v1/cgroups.rst +++ b/Documentation/admin-guide/cgroup-v1/cgroups.rst @@ -13,7 +13,7 @@ Portions Copyright (c) 2004-2006 Silicon Graphics, Inc. Modified by Paul Jackson <pj@sgi.com> -Modified by Christoph Lameter <cl@linux.com> +Modified by Christoph Lameter <cl@gentwo.org> .. CONTENTS: @@ -570,7 +570,7 @@ visible to cgroup_for_each_child/descendant_*() iterators. The subsystem may choose to fail creation by returning -errno. This callback can be used to implement reliable state sharing and propagation along the hierarchy. See the comment on -cgroup_for_each_descendant_pre() for details. +cgroup_for_each_live_descendant_pre() for details. ``void css_offline(struct cgroup *cgrp);`` (cgroup_mutex held by caller) diff --git a/Documentation/admin-guide/cgroup-v1/cpusets.rst b/Documentation/admin-guide/cgroup-v1/cpusets.rst index 7d3415eea05d..c7909e5ac136 100644 --- a/Documentation/admin-guide/cgroup-v1/cpusets.rst +++ b/Documentation/admin-guide/cgroup-v1/cpusets.rst @@ -10,7 +10,7 @@ Written by Simon.Derr@bull.net - Portions Copyright (c) 2004-2006 Silicon Graphics, Inc. - Modified by Paul Jackson <pj@sgi.com> -- Modified by Christoph Lameter <cl@linux.com> +- Modified by Christoph Lameter <cl@gentwo.org> - Modified by Paul Menage <menage@google.com> - Modified by Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com> @@ -568,7 +568,7 @@ on the next tick. For some applications in special situation, waiting The 'cpuset.sched_relax_domain_level' file allows you to request changing this searching range as you like. This file takes int value which -indicates size of searching range in levels ideally as follows, +indicates size of searching range in levels approximately as follows, otherwise initial value -1 that indicates the cpuset has no request. ====== =========================================================== @@ -581,6 +581,11 @@ otherwise initial value -1 that indicates the cpuset has no request. 5 search system wide [on NUMA system] ====== =========================================================== +Not all levels can be present and values can change depending on the +system architecture and kernel configuration. Check +/sys/kernel/debug/sched/domains/cpu*/domain*/ for system-specific +details. + The system default is architecture dependent. The system default can be changed using the relax_domain_level= boot parameter. diff --git a/Documentation/admin-guide/cgroup-v1/freezer-subsystem.rst b/Documentation/admin-guide/cgroup-v1/freezer-subsystem.rst index 582d3427de3f..a964aff373b1 100644 --- a/Documentation/admin-guide/cgroup-v1/freezer-subsystem.rst +++ b/Documentation/admin-guide/cgroup-v1/freezer-subsystem.rst @@ -125,3 +125,7 @@ to unfreeze all tasks in the container:: This is the basic mechanism which should do the right thing for user space task in a simple scenario. + +This freezer implementation is affected by shortcomings (see commit +76f969e8948d8 ("cgroup: cgroup v2 freezer")) and cgroup v2 freezer is +recommended. diff --git a/Documentation/admin-guide/cgroup-v1/memcg_test.rst b/Documentation/admin-guide/cgroup-v1/memcg_test.rst index 1f128458ddea..9f8e27355cba 100644 --- a/Documentation/admin-guide/cgroup-v1/memcg_test.rst +++ b/Documentation/admin-guide/cgroup-v1/memcg_test.rst @@ -102,7 +102,7 @@ Under below explanation, we assume CONFIG_SWAP=y. The logic is very clear. (About migration, see below) Note: - __remove_from_page_cache() is called by remove_from_page_cache() + __filemap_remove_folio() is called by filemap_remove_folio() and __remove_mapping(). 6. Shmem(tmpfs) Page Cache diff --git a/Documentation/admin-guide/cgroup-v1/memory.rst b/Documentation/admin-guide/cgroup-v1/memory.rst index ca7d9402f6be..d6b1db8cc7eb 100644 --- a/Documentation/admin-guide/cgroup-v1/memory.rst +++ b/Documentation/admin-guide/cgroup-v1/memory.rst @@ -78,18 +78,23 @@ Brief summary of control files. memory.memsw.max_usage_in_bytes show max memory+Swap usage recorded memory.soft_limit_in_bytes set/show soft limit of memory usage This knob is not available on CONFIG_PREEMPT_RT systems. + This knob is deprecated and shouldn't be + used. memory.stat show various statistics memory.use_hierarchy set/show hierarchical account enabled This knob is deprecated and shouldn't be used. memory.force_empty trigger forced page reclaim memory.pressure_level set memory pressure notifications + This knob is deprecated and shouldn't be + used. memory.swappiness set/show swappiness parameter of vmscan (See sysctl's vm.swappiness) - memory.move_charge_at_immigrate set/show controls of moving charges + Per memcg knob does not exist in cgroup v2. + memory.move_charge_at_immigrate This knob is deprecated. + memory.oom_control set/show oom controls. This knob is deprecated and shouldn't be used. - memory.oom_control set/show oom controls. memory.numa_stat show the number of memory usage per numa node memory.kmem.limit_in_bytes Deprecated knob to set and read the kernel @@ -105,10 +110,18 @@ Brief summary of control files. memory.kmem.max_usage_in_bytes show max kernel memory usage recorded memory.kmem.tcp.limit_in_bytes set/show hard limit for tcp buf memory + This knob is deprecated and shouldn't be + used. memory.kmem.tcp.usage_in_bytes show current tcp buf memory allocation + This knob is deprecated and shouldn't be + used. memory.kmem.tcp.failcnt show the number of tcp buf memory usage hits limits + This knob is deprecated and shouldn't be + used. memory.kmem.tcp.max_usage_in_bytes show max tcp buf memory usage recorded + This knob is deprecated and shouldn't be + used. ==================================== ========================================== 1. History @@ -229,10 +242,6 @@ behind this approach is that a cgroup that aggressively uses a shared page will eventually get charged for it (once it is uncharged from the cgroup that brought it in -- this will happen on memory pressure). -But see :ref:`section 8.2 <cgroup-v1-memory-movable-charges>` when moving a -task to another cgroup, its pages may be recharged to the new cgroup, if -move_charge_at_immigrate has been chosen. - 2.4 Swap Extension -------------------------------------- @@ -300,14 +309,14 @@ When oom event notifier is registered, event will be delivered. Lock order is as follows:: - Page lock (PG_locked bit of page->flags) + folio_lock mm->page_table_lock or split pte_lock folio_memcg_lock (memcg->move_lock) mapping->i_pages lock lruvec->lru_lock. Per-node-per-memcgroup LRU (cgroup's private LRU) is guarded by -lruvec->lru_lock; PG_lru bit of page->flags is cleared before +lruvec->lru_lock; the folio LRU flag is cleared before isolating a page from its LRU under lruvec->lru_lock. .. _cgroup-v1-memory-kernel-extension: @@ -601,6 +610,10 @@ memory.stat file includes following statistics: 'rss + mapped_file" will give you resident set size of cgroup. + Note that some kernel configurations might account complete larger + allocations (e.g., THP) towards 'rss' and 'mapped_file', even if + only some, but not all that memory is mapped. + (Note: file and shmem may be shared among other cgroups. In that case, mapped_file is accounted only when the memory cgroup is owner of page cache.) @@ -693,8 +706,10 @@ For compatibility reasons writing 1 to memory.use_hierarchy will always pass:: # echo 1 > memory.use_hierarchy -7. Soft limits -============== +7. Soft limits (DEPRECATED) +=========================== + +THIS IS DEPRECATED! Soft limits allow for greater sharing of memory. The idea behind soft limits is to allow control groups to use as much of the memory as needed, provided @@ -740,78 +755,8 @@ If we want to change this to 1G, we can at any time use:: THIS IS DEPRECATED! -It's expensive and unreliable! It's better practice to launch workload -tasks directly from inside their target cgroup. Use dedicated workload -cgroups to allow fine-grained policy adjustments without having to -move physical pages between control domains. - -Users can move charges associated with a task along with task migration, that -is, uncharge task's pages from the old cgroup and charge them to the new cgroup. -This feature is not supported in !CONFIG_MMU environments because of lack of -page tables. - -8.1 Interface -------------- - -This feature is disabled by default. It can be enabled (and disabled again) by -writing to memory.move_charge_at_immigrate of the destination cgroup. - -If you want to enable it:: - - # echo (some positive value) > memory.move_charge_at_immigrate - -.. note:: - Each bits of move_charge_at_immigrate has its own meaning about what type - of charges should be moved. See :ref:`section 8.2 - <cgroup-v1-memory-movable-charges>` for details. - -.. note:: - Charges are moved only when you move mm->owner, in other words, - a leader of a thread group. - -.. note:: - If we cannot find enough space for the task in the destination cgroup, we - try to make space by reclaiming memory. Task migration may fail if we - cannot make enough space. - -.. note:: - It can take several seconds if you move charges much. - -And if you want disable it again:: - - # echo 0 > memory.move_charge_at_immigrate - -.. _cgroup-v1-memory-movable-charges: - -8.2 Type of charges which can be moved --------------------------------------- - -Each bit in move_charge_at_immigrate has its own meaning about what type of -charges should be moved. But in any case, it must be noted that an account of -a page or a swap can be moved only when it is charged to the task's current -(old) memory cgroup. - -+---+--------------------------------------------------------------------------+ -|bit| what type of charges would be moved ? | -+===+==========================================================================+ -| 0 | A charge of an anonymous page (or swap of it) used by the target task. | -| | You must enable Swap Extension (see 2.4) to enable move of swap charges. | -+---+--------------------------------------------------------------------------+ -| 1 | A charge of file pages (normal file, tmpfs file (e.g. ipc shared memory) | -| | and swaps of tmpfs file) mmapped by the target task. Unlike the case of | -| | anonymous pages, file pages (and swaps) in the range mmapped by the task | -| | will be moved even if the task hasn't done page fault, i.e. they might | -| | not be the task's "RSS", but other task's "RSS" that maps the same file. | -| | And mapcount of the page is ignored (the page can be moved even if | -| | page_mapcount(page) > 1). You must enable Swap Extension (see 2.4) to | -| | enable move of swap charges. | -+---+--------------------------------------------------------------------------+ - -8.3 TODO --------- - -- All of moving charge operations are done under cgroup_mutex. It's not good - behavior to hold the mutex too long, so we may need some trick. +Reading memory.move_charge_at_immigrate will always return 0 and writing +to it will always return -EINVAL. 9. Memory thresholds ==================== @@ -834,8 +779,10 @@ It's applicable for root and non-root cgroup. .. _cgroup-v1-memory-oom-control: -10. OOM Control -=============== +10. OOM Control (DEPRECATED) +============================ + +THIS IS DEPRECATED! memory.oom_control file is for OOM notification and other controls. @@ -882,8 +829,10 @@ At reading, current status of OOM is shown. The number of processes belonging to this cgroup killed by any kind of OOM killer. -11. Memory Pressure -=================== +11. Memory Pressure (DEPRECATED) +================================ + +THIS IS DEPRECATED! The pressure level notifications can be used to monitor the memory allocation cost; based on the pressure, applications can implement diff --git a/Documentation/admin-guide/cgroup-v1/pids.rst b/Documentation/admin-guide/cgroup-v1/pids.rst index 6acebd9e72c8..0f9f9a7b1f6c 100644 --- a/Documentation/admin-guide/cgroup-v1/pids.rst +++ b/Documentation/admin-guide/cgroup-v1/pids.rst @@ -36,7 +36,8 @@ superset of parent/child/pids.current. The pids.events file contains event counters: - - max: Number of times fork failed because limit was hit. + - max: Number of times fork failed in the cgroup because limit was hit in + self or ancestors. Example ------- diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst index 17e6e9565156..bd98ea3175ec 100644 --- a/Documentation/admin-guide/cgroup-v2.rst +++ b/Documentation/admin-guide/cgroup-v2.rst @@ -64,13 +64,14 @@ v1 is available under :ref:`Documentation/admin-guide/cgroup-v1/index.rst <cgrou 5-6. Device 5-7. RDMA 5-7-1. RDMA Interface Files - 5-8. HugeTLB - 5.8-1. HugeTLB Interface Files - 5-9. Misc - 5.9-1 Miscellaneous cgroup Interface Files - 5.9-2 Migration and Ownership - 5-10. Others - 5-10-1. perf_event + 5-8. DMEM + 5-9. HugeTLB + 5.9-1. HugeTLB Interface Files + 5-10. Misc + 5.10-1 Miscellaneous cgroup Interface Files + 5.10-2 Migration and Ownership + 5-11. Others + 5-11-1. perf_event 5-N. Non-normative information 5-N-1. CPU controller root cgroup process behaviour 5-N-2. IO controller root cgroup process behaviour @@ -239,6 +240,13 @@ cgroup v2 currently supports the following mount options. will not be tracked by the memory controller (even if cgroup v2 is remounted later on). + pids_localevents + The option restores v1-like behavior of pids.events:max, that is only + local (inside cgroup proper) fork failures are counted. Without this + option pids.events.max represents any pids.max enforcemnt across + cgroup's subtree. + + Organizing Processes and Threads -------------------------------- @@ -526,10 +534,12 @@ cgroup namespace on namespace creation. Because the resource control interface files in a given directory control the distribution of the parent's resources, the delegatee shouldn't be allowed to write to them. For the first method, this is -achieved by not granting access to these files. For the second, the -kernel rejects writes to all files other than "cgroup.procs" and -"cgroup.subtree_control" on a namespace root from inside the -namespace. +achieved by not granting access to these files. For the second, files +outside the namespace should be hidden from the delegatee by the means +of at least mount namespacing, and the kernel rejects writes to all +files on a namespace root from inside the cgroup namespace, except for +those files listed in "/sys/kernel/cgroup/delegate" (including +"cgroup.procs", "cgroup.threads", "cgroup.subtree_control", etc.). The end results are equivalent for both delegation types. Once delegated, the user can build sub-hierarchy under the directory, @@ -974,6 +984,14 @@ All cgroup core files are prefixed with "cgroup." A dying cgroup can consume system resources not exceeding limits, which were active at the moment of cgroup deletion. + nr_subsys_<cgroup_subsys> + Total number of live cgroup subsystems (e.g memory + cgroup) at and beneath the current cgroup. + + nr_dying_subsys_<cgroup_subsys> + Total number of dying cgroup subsystems (e.g. memory + cgroup) at and beneath the current cgroup. + cgroup.freeze A read-write single value file which exists on non-root cgroups. Allowed values are "0" and "1". The default is "0". @@ -1058,30 +1076,53 @@ cpufreq governor about the minimum desired frequency which should always be provided by a CPU, as well as the maximum desired frequency, which should not be exceeded by a CPU. -WARNING: cgroup2 doesn't yet support control of realtime processes and -the cpu controller can only be enabled when all RT processes are in -the root cgroup. Be aware that system management software may already -have placed RT processes into nonroot cgroups during the system boot -process, and these processes may need to be moved to the root cgroup -before the cpu controller can be enabled. +WARNING: cgroup2 cpu controller doesn't yet support the (bandwidth) control of +realtime processes. For a kernel built with the CONFIG_RT_GROUP_SCHED option +enabled for group scheduling of realtime processes, the cpu controller can only +be enabled when all RT processes are in the root cgroup. Be aware that system +management software may already have placed RT processes into non-root cgroups +during the system boot process, and these processes may need to be moved to the +root cgroup before the cpu controller can be enabled with a +CONFIG_RT_GROUP_SCHED enabled kernel. + +With CONFIG_RT_GROUP_SCHED disabled, this limitation does not apply and some of +the interface files either affect realtime processes or account for them. See +the following section for details. Only the cpu controller is affected by +CONFIG_RT_GROUP_SCHED. Other controllers can be used for the resource control of +realtime processes irrespective of CONFIG_RT_GROUP_SCHED. CPU Interface Files ~~~~~~~~~~~~~~~~~~~ -All time durations are in microseconds. +The interaction of a process with the cpu controller depends on its scheduling +policy and the underlying scheduler. From the point of view of the cpu controller, +processes can be categorized as follows: + +* Processes under the fair-class scheduler +* Processes under a BPF scheduler with the ``cgroup_set_weight`` callback +* Everything else: ``SCHED_{FIFO,RR,DEADLINE}`` and processes under a BPF scheduler + without the ``cgroup_set_weight`` callback + +For details on when a process is under the fair-class scheduler or a BPF scheduler, +check out :ref:`Documentation/scheduler/sched-ext.rst <sched-ext>`. + +For each of the following interface files, the above categories +will be referred to. All time durations are in microseconds. cpu.stat A read-only flat-keyed file. This file exists whether the controller is enabled or not. - It always reports the following three stats: + It always reports the following three stats, which account for all the + processes in the cgroup: - usage_usec - user_usec - system_usec - and the following five when the controller is enabled: + and the following five when the controller is enabled, which account for + only the processes under the fair-class scheduler: - nr_periods - nr_throttled @@ -1099,6 +1140,10 @@ All time durations are in microseconds. If the cgroup has been configured to be SCHED_IDLE (cpu.idle = 1), then the weight will show as a 0. + This file affects only processes under the fair-class scheduler and a BPF + scheduler with the ``cgroup_set_weight`` callback depending on what the + callback actually does. + cpu.weight.nice A read-write single value file which exists on non-root cgroups. The default is "0". @@ -1111,6 +1156,10 @@ All time durations are in microseconds. granularity is coarser for the nice values, the read value is the closest approximation of the current weight. + This file affects only processes under the fair-class scheduler and a BPF + scheduler with the ``cgroup_set_weight`` callback depending on what the + callback actually does. + cpu.max A read-write two value file which exists on non-root cgroups. The default is "max 100000". @@ -1123,43 +1172,55 @@ All time durations are in microseconds. $PERIOD duration. "max" for $MAX indicates no limit. If only one number is written, $MAX is updated. + This file affects only processes under the fair-class scheduler. + cpu.max.burst A read-write single value file which exists on non-root cgroups. The default is "0". The burst in the range [0, $MAX]. + This file affects only processes under the fair-class scheduler. + cpu.pressure A read-write nested-keyed file. Shows pressure stall information for CPU. See :ref:`Documentation/accounting/psi.rst <psi>` for details. + This file accounts for all the processes in the cgroup. + cpu.uclamp.min - A read-write single value file which exists on non-root cgroups. - The default is "0", i.e. no utilization boosting. + A read-write single value file which exists on non-root cgroups. + The default is "0", i.e. no utilization boosting. - The requested minimum utilization (protection) as a percentage - rational number, e.g. 12.34 for 12.34%. + The requested minimum utilization (protection) as a percentage + rational number, e.g. 12.34 for 12.34%. - This interface allows reading and setting minimum utilization clamp - values similar to the sched_setattr(2). This minimum utilization - value is used to clamp the task specific minimum utilization clamp. + This interface allows reading and setting minimum utilization clamp + values similar to the sched_setattr(2). This minimum utilization + value is used to clamp the task specific minimum utilization clamp, + including those of realtime processes. - The requested minimum utilization (protection) is always capped by - the current value for the maximum utilization (limit), i.e. - `cpu.uclamp.max`. + The requested minimum utilization (protection) is always capped by + the current value for the maximum utilization (limit), i.e. + `cpu.uclamp.max`. + + This file affects all the processes in the cgroup. cpu.uclamp.max - A read-write single value file which exists on non-root cgroups. - The default is "max". i.e. no utilization capping + A read-write single value file which exists on non-root cgroups. + The default is "max". i.e. no utilization capping - The requested maximum utilization (limit) as a percentage rational - number, e.g. 98.76 for 98.76%. + The requested maximum utilization (limit) as a percentage rational + number, e.g. 98.76 for 98.76%. - This interface allows reading and setting maximum utilization clamp - values similar to the sched_setattr(2). This maximum utilization - value is used to clamp the task specific maximum utilization clamp. + This interface allows reading and setting maximum utilization clamp + values similar to the sched_setattr(2). This maximum utilization + value is used to clamp the task specific maximum utilization clamp, + including those of realtime processes. + + This file affects all the processes in the cgroup. cpu.idle A read-write single value file which exists on non-root cgroups. @@ -1171,7 +1232,7 @@ All time durations are in microseconds. own relative priorities, but the cgroup itself will be treated as very low priority relative to its peers. - + This file affects only processes under the fair-class scheduler. Memory ------ @@ -1273,6 +1334,18 @@ PAGE_SIZE multiple when read back. monitors the limited cgroup to alleviate heavy reclaim pressure. + If memory.high is opened with O_NONBLOCK then the synchronous + reclaim is bypassed. This is useful for admin processes that + need to dynamically adjust the job's memory limits without + expending their own CPU resources on memory reclamation. The + job will trigger the reclaim and/or get throttled on its + next charge request. + + Please note that with O_NONBLOCK, there is a chance that the + target memory cgroup may take indefinite amount of time to + reduce usage below the limit due to delayed charge request or + busy-hitting its memory to slow down reclaim. + memory.max A read-write single value file which exists on non-root cgroups. The default is "max". @@ -1290,23 +1363,28 @@ PAGE_SIZE multiple when read back. Caller could retry them differently, return into userspace as -ENOMEM or silently ignore in cases like disk readahead. + If memory.max is opened with O_NONBLOCK, then the synchronous + reclaim and oom-kill are bypassed. This is useful for admin + processes that need to dynamically adjust the job's memory limits + without expending their own CPU resources on memory reclamation. + The job will trigger the reclaim and/or oom-kill on its next + charge request. + + Please note that with O_NONBLOCK, there is a chance that the + target memory cgroup may take indefinite amount of time to + reduce usage below the limit due to delayed charge request or + busy-hitting its memory to slow down reclaim. + memory.reclaim A write-only nested-keyed file which exists for all cgroups. This is a simple interface to trigger memory reclaim in the target cgroup. - This file accepts a single key, the number of bytes to reclaim. - No nested keys are currently supported. - Example:: echo "1G" > memory.reclaim - The interface can be later extended with nested keys to - configure the reclaim behavior. For example, specify the - type of memory to reclaim from (anon, file, ..). - Please note that the kernel can over or under reclaim from the target cgroup. If less bytes are reclaimed than the specified amount, -EAGAIN is returned. @@ -1318,12 +1396,29 @@ PAGE_SIZE multiple when read back. This means that the networking layer will not adapt based on reclaim induced by memory.reclaim. +The following nested keys are defined. + + ========== ================================ + swappiness Swappiness value to reclaim with + ========== ================================ + + Specifying a swappiness value instructs the kernel to perform + the reclaim with that swappiness value. Note that this has the + same semantics as vm.swappiness applied to memcg reclaim with + all the existing limitations and potential future extensions. + + The valid range for swappiness is [0-200, max], setting + swappiness=max exclusively reclaims anonymous memory. + memory.peak - A read-only single value file which exists on non-root - cgroups. + A read-write single value file which exists on non-root cgroups. + + The max memory usage recorded for the cgroup and its descendants since + either the creation of the cgroup or the most recent reset for that FD. - The max memory usage recorded for the cgroup and its - descendants since the creation of the cgroup. + A write of any non-empty string to this file resets it to the + current memory usage for subsequent reads through the same + file descriptor. memory.oom.group A read-write single value file which exists on non-root @@ -1412,7 +1507,10 @@ PAGE_SIZE multiple when read back. anon Amount of memory used in anonymous mappings such as - brk(), sbrk(), and mmap(MAP_ANONYMOUS) + brk(), sbrk(), and mmap(MAP_ANONYMOUS). Note that + some kernel configurations might account complete larger + allocations (e.g., THP) if only some, but not all the + memory of such an allocation is mapped anymore. file Amount of memory used to cache filesystem data, @@ -1432,7 +1530,7 @@ PAGE_SIZE multiple when read back. sec_pagetables Amount of memory allocated for secondary page tables, this currently includes KVM mmu allocations on x86 - and arm64. + and arm64 and IOMMU page tables. percpu (npn) Amount of memory used for storing per-cpu kernel @@ -1455,7 +1553,10 @@ PAGE_SIZE multiple when read back. Amount of application memory swapped out to zswap. file_mapped - Amount of cached filesystem data mapped with mmap() + Amount of cached filesystem data mapped with mmap(). Note + that some kernel configurations might account complete + larger allocations (e.g., THP) if only some, but not + not all the memory of such an allocation is mapped. file_dirty Amount of cached filesystem data that was modified but @@ -1527,6 +1628,12 @@ PAGE_SIZE multiple when read back. workingset_nodereclaim Number of times a shadow node has been reclaimed + pswpin (npn) + Number of pages swapped into memory + + pswpout (npn) + Number of pages swapped out of memory + pgscan (npn) Amount of scanned pages (in an inactive LRU list) @@ -1542,6 +1649,9 @@ PAGE_SIZE multiple when read back. pgscan_khugepaged (npn) Amount of scanned pages by khugepaged (in an inactive LRU list) + pgscan_proactive (npn) + Amount of scanned pages proactively (in an inactive LRU list) + pgsteal_kswapd (npn) Amount of reclaimed pages by kswapd @@ -1551,6 +1661,9 @@ PAGE_SIZE multiple when read back. pgsteal_khugepaged (npn) Amount of reclaimed pages by khugepaged + pgsteal_proactive (npn) + Amount of reclaimed pages proactively + pgfault (npn) Total number of page faults incurred @@ -1572,6 +1685,24 @@ PAGE_SIZE multiple when read back. pglazyfreed (npn) Amount of reclaimed lazyfree pages + swpin_zero + Number of pages swapped into memory and filled with zero, where I/O + was optimized out because the page content was detected to be zero + during swapout. + + swpout_zero + Number of zero-filled pages swapped out with I/O skipped due to the + content being detected as zero. + + zswpin + Number of pages moved in to memory from zswap. + + zswpout + Number of pages moved out of memory to zswap. + + zswpwb + Number of pages written from zswap to swap. + thp_fault_alloc (npn) Number of transparent hugepages which were allocated to satisfy a page fault. This counter is not present when CONFIG_TRANSPARENT_HUGEPAGE @@ -1591,6 +1722,33 @@ PAGE_SIZE multiple when read back. Usually because failed to allocate some continuous swap space for the huge page. + numa_pages_migrated (npn) + Number of pages migrated by NUMA balancing. + + numa_pte_updates (npn) + Number of pages whose page table entries are modified by + NUMA balancing to produce NUMA hinting faults on access. + + numa_hint_faults (npn) + Number of NUMA hinting faults. + + pgdemote_kswapd + Number of pages demoted by kswapd. + + pgdemote_direct + Number of pages demoted directly. + + pgdemote_khugepaged + Number of pages demoted by khugepaged. + + pgdemote_proactive + Number of pages demoted by proactively. + + hugetlb + Amount of memory used by hugetlb pages. This metric only shows + up if hugetlb usage is accounted for in memory.current (i.e. + cgroup is mounted with the memory_hugetlb_accounting option). + memory.numa_stat A read-only nested-keyed file which exists on non-root cgroups. @@ -1640,11 +1798,14 @@ PAGE_SIZE multiple when read back. Healthy workloads are not expected to reach this limit. memory.swap.peak - A read-only single value file which exists on non-root - cgroups. + A read-write single value file which exists on non-root cgroups. + + The max swap usage recorded for the cgroup and its descendants since + the creation of the cgroup or the most recent reset for that FD. - The max swap usage recorded for the cgroup and its - descendants since the creation of the cgroup. + A write of any non-empty string to this file resets it to the + current memory usage for subsequent reads through the same + file descriptor. memory.swap.max A read-write single value file which exists on non-root @@ -1694,9 +1855,10 @@ PAGE_SIZE multiple when read back. entries fault back in or are written out to disk. memory.zswap.writeback - A read-write single value file. The default value is "1". The - initial value of the root cgroup is 1, and when a new cgroup is - created, it inherits the current value of its parent. + A read-write single value file. The default value is "1". + Note that this setting is hierarchical, i.e. the writeback would be + implicitly disabled for child cgroups if the upper hierarchy + does so. When this is set to 0, all swapping attempts to swapping devices are disabled. This included both zswap writebacks, and swapping due @@ -1707,6 +1869,8 @@ PAGE_SIZE multiple when read back. Note that this is subtly different from setting memory.swap.max to 0, as it still allows for pages to be written to the zswap pool. + This setting has no effect if zswap is disabled, and swapping + is allowed unless memory.swap.max is set to 0. memory.pressure A read-only nested-keyed file. @@ -2181,11 +2345,31 @@ PID Interface Files Hard limit of number of processes. pids.current - A read-only single value file which exists on all cgroups. + A read-only single value file which exists on non-root cgroups. The number of processes currently in the cgroup and its descendants. + pids.peak + A read-only single value file which exists on non-root cgroups. + + The maximum value that the number of processes in the cgroup and its + descendants has ever reached. + + pids.events + A read-only flat-keyed file which exists on non-root cgroups. Unless + specified otherwise, a value change in this file generates a file + modified event. The following entries are defined. + + max + The number of times the cgroup's total number of processes hit the pids.max + limit (see also pids_localevents). + + pids.events.local + Similar to pids.events but the fields in the file are local + to the cgroup i.e. not hierarchical. The file modified event + generated on this file reflects only the local events. + Organisational operations are not blocked by cgroup policies, so it is possible to have pids.current > pids.max. This can be done by either setting the limit to be smaller than pids.current, or attaching enough @@ -2320,8 +2504,12 @@ Cpuset Interface Files is always a subset of it. Users can manually set it to a value that is different from - "cpuset.cpus". The only constraint in setting it is that the - list of CPUs must be exclusive with respect to its sibling. + "cpuset.cpus". One constraint in setting it is that the list of + CPUs must be exclusive with respect to "cpuset.cpus.exclusive" + of its sibling. If "cpuset.cpus.exclusive" of a sibling cgroup + isn't set, its "cpuset.cpus" value, if set, cannot be a subset + of it to leave at least one CPU available when the exclusive + CPUs are taken away. For a parent cgroup, any one of its exclusive CPUs can only be distributed to at most one of its child cgroups. Having an @@ -2337,8 +2525,8 @@ Cpuset Interface Files cpuset-enabled cgroups. This file shows the effective set of exclusive CPUs that - can be used to create a partition root. The content of this - file will always be a subset of "cpuset.cpus" and its parent's + can be used to create a partition root. The content + of this file will always be a subset of its parent's "cpuset.cpus.exclusive.effective" if its parent is not the root cgroup. It will also be a subset of "cpuset.cpus.exclusive" if it is set. If "cpuset.cpus.exclusive" is not set, it is @@ -2527,6 +2715,49 @@ RDMA Interface Files mlx4_0 hca_handle=1 hca_object=20 ocrdma1 hca_handle=1 hca_object=23 +DMEM +---- + +The "dmem" controller regulates the distribution and accounting of +device memory regions. Because each memory region may have its own page size, +which does not have to be equal to the system page size, the units are always bytes. + +DMEM Interface Files +~~~~~~~~~~~~~~~~~~~~ + + dmem.max, dmem.min, dmem.low + A readwrite nested-keyed file that exists for all the cgroups + except root that describes current configured resource limit + for a region. + + An example for xe follows:: + + drm/0000:03:00.0/vram0 1073741824 + drm/0000:03:00.0/stolen max + + The semantics are the same as for the memory cgroup controller, and are + calculated in the same way. + + dmem.capacity + A read-only file that describes maximum region capacity. + It only exists on the root cgroup. Not all memory can be + allocated by cgroups, as the kernel reserves some for + internal use. + + An example for xe follows:: + + drm/0000:03:00.0/vram0 8514437120 + drm/0000:03:00.0/stolen 67108864 + + dmem.current + A read-only file that describes current resource usage. + It exists for all the cgroup except root. + + An example for xe follows:: + + drm/0000:03:00.0/vram0 12550144 + drm/0000:03:00.0/stolen 8650752 + HugeTLB ------- @@ -2599,6 +2830,15 @@ Miscellaneous controller provides 3 interface files. If two misc resources (res_ res_a 3 res_b 0 + misc.peak + A read-only flat-keyed file shown in all cgroups. It shows the + historical maximum usage of the resources in the cgroup and its + children.:: + + $ cat misc.peak + res_a 10 + res_b 8 + misc.max A read-write flat-keyed file shown in the non root cgroups. Allowed maximum usage of the resources in the cgroup and its children.:: @@ -2628,6 +2868,11 @@ Miscellaneous controller provides 3 interface files. If two misc resources (res_ The number of times the cgroup's resource usage was about to go over the max boundary. + misc.events.local + Similar to misc.events but the fields in the file are local to the + cgroup i.e. not hierarchical. The file modified event generated on + this file reflects only the local events. + Migration and Ownership ~~~~~~~~~~~~~~~~~~~~~~~ @@ -2836,7 +3081,7 @@ Filesystem Support for Writeback -------------------------------- A filesystem can support cgroup writeback by updating -address_space_operations->writepage[s]() to annotate bio's using the +address_space_operations->writepages() to annotate bio's using the following two functions. wbc_init_bio(@wbc, @bio) @@ -2846,7 +3091,7 @@ following two functions. a queue (device) has been associated with the bio and before submission. - wbc_account_cgroup_owner(@wbc, @page, @bytes) + wbc_account_cgroup_owner(@wbc, @folio, @bytes) Should be called for each data segment being written out. While this function doesn't care exactly when it's called during the writeback session, it's the easiest and most @@ -2878,8 +3123,8 @@ Deprecated v1 Core Features - "cgroup.clone_children" is removed. -- /proc/cgroups is meaningless for v2. Use "cgroup.controllers" file - at the root instead. +- /proc/cgroups is meaningless for v2. Use "cgroup.controllers" or + "cgroup.stat" files at the root instead. Issues with v1 and Rationales for v2 diff --git a/Documentation/admin-guide/cifs/usage.rst b/Documentation/admin-guide/cifs/usage.rst index aa8290a29dc8..c09674a75a9e 100644 --- a/Documentation/admin-guide/cifs/usage.rst +++ b/Documentation/admin-guide/cifs/usage.rst @@ -723,40 +723,26 @@ Configuration pseudo-files: ======================= ======================================================= SecurityFlags Flags which control security negotiation and also packet signing. Authentication (may/must) - flags (e.g. for NTLM and/or NTLMv2) may be combined with + flags (e.g. for NTLMv2) may be combined with the signing flags. Specifying two different password hashing mechanisms (as "must use") on the other hand does not make much sense. Default flags are:: - 0x07007 - - (NTLM, NTLMv2 and packet signing allowed). The maximum - allowable flags if you want to allow mounts to servers - using weaker password hashes is 0x37037 (lanman, - plaintext, ntlm, ntlmv2, signing allowed). Some - SecurityFlags require the corresponding menuconfig - options to be enabled. Enabling plaintext - authentication currently requires also enabling - lanman authentication in the security flags - because the cifs module only supports sending - laintext passwords using the older lanman dialect - form of the session setup SMB. (e.g. for authentication - using plain text passwords, set the SecurityFlags - to 0x30030):: + 0x00C5 + + (NTLMv2 and packet signing allowed). Some SecurityFlags + may require enabling a corresponding menuconfig option. may use packet signing 0x00001 must use packet signing 0x01001 - may use NTLM (most common password hash) 0x00002 - must use NTLM 0x02002 may use NTLMv2 0x00004 must use NTLMv2 0x04004 - may use Kerberos security 0x00008 - must use Kerberos 0x08008 - may use lanman (weak) password hash 0x00010 - must use lanman password hash 0x10010 - may use plaintext passwords 0x00020 - must use plaintext passwords 0x20020 - (reserved for future packet encryption) 0x00040 + may use Kerberos security (krb5) 0x00008 + must use Kerberos 0x08008 + may use NTLMSSP 0x00080 + must use NTLMSSP 0x80080 + seal (packet encryption) 0x00040 + must seal 0x40040 cifsFYI If set to non-zero value, additional debug information will be logged to the system error log. This field diff --git a/Documentation/admin-guide/device-mapper/delay.rst b/Documentation/admin-guide/device-mapper/delay.rst index 917ba8c33359..4d667228e744 100644 --- a/Documentation/admin-guide/device-mapper/delay.rst +++ b/Documentation/admin-guide/device-mapper/delay.rst @@ -3,29 +3,52 @@ dm-delay ======== Device-Mapper's "delay" target delays reads and/or writes -and maps them to different devices. +and/or flushs and optionally maps them to different devices. -Parameters:: +Arguments:: <device> <offset> <delay> [<write_device> <write_offset> <write_delay> [<flush_device> <flush_offset> <flush_delay>]] -With separate write parameters, the first set is only used for reads. +Table line has to either have 3, 6 or 9 arguments: + +3: apply offset and delay to read, write and flush operations on device + +6: apply offset and delay to device, also apply write_offset and write_delay + to write and flush operations on optionally different write_device with + optionally different sector offset + +9: same as 6 arguments plus define flush_offset and flush_delay explicitely + on/with optionally different flush_device/flush_offset. + Offsets are specified in sectors. + Delays are specified in milliseconds. + Example scripts =============== :: - #!/bin/sh - # Create device delaying rw operation for 500ms - echo "0 `blockdev --getsz $1` delay $1 0 500" | dmsetup create delayed + # + # Create mapped device named "delayed" delaying read, write and flush operations for 500ms. + # + dmsetup create delayed --table "0 `blockdev --getsz $1` delay $1 0 500" :: + #!/bin/sh + # + # Create mapped device delaying write and flush operations for 400ms and + # splitting reads to device $1 but writes and flushs to different device $2 + # to different offsets of 2048 and 4096 sectors respectively. + # + dmsetup create delayed --table "0 `blockdev --getsz $1` delay $1 2048 0 $2 4096 400" +:: #!/bin/sh - # Create device delaying only write operation for 500ms and - # splitting reads and writes to different devices $1 $2 - echo "0 `blockdev --getsz $1` delay $1 0 0 $2 0 500" | dmsetup create delayed + # + # Create mapped device delaying reads for 50ms, writes for 100ms and flushs for 333ms + # onto the same backing device at offset 0 sectors. + # + dmsetup create delayed --table "0 `blockdev --getsz $1` delay $1 0 50 $2 0 100 $1 0 333" diff --git a/Documentation/admin-guide/device-mapper/dm-crypt.rst b/Documentation/admin-guide/device-mapper/dm-crypt.rst index aa2d04d95df6..4467f6d4b632 100644 --- a/Documentation/admin-guide/device-mapper/dm-crypt.rst +++ b/Documentation/admin-guide/device-mapper/dm-crypt.rst @@ -113,6 +113,11 @@ same_cpu_crypt The default is to use an unbound workqueue so that encryption work is automatically balanced between available CPUs. +high_priority + Set dm-crypt workqueues and the writer thread to high priority. This + improves throughput and latency of dm-crypt while degrading general + responsiveness of the system. + submit_from_crypt_cpus Disable offloading writes to a separate thread after encryption. There are some situations where offloading write bios from the @@ -141,6 +146,11 @@ integrity:<bytes>:<type> integrity for the encrypted device. The additional space is then used for storing authentication tag (and persistent IV if needed). +integrity_key_size:<bytes> + Optionally set the integrity key size if it differs from the digest size. + It allows the use of wrapped key algorithms where the key size is + independent of the cryptographic key size. + sector_size:<bytes> Use <bytes> as the encryption unit instead of 512 bytes sectors. This option can be in range 512 - 4096 bytes and must be power of two. @@ -155,6 +165,27 @@ iv_large_sectors The <iv_offset> must be multiple of <sector_size> (in 512 bytes units) if this flag is specified. +integrity_key_size:<bytes> + Use an integrity key of <bytes> size instead of using an integrity key size + of the digest size of the used HMAC algorithm. + + +Module parameters:: + max_read_size + Maximum size of read requests. When a request larger than this size + is received, dm-crypt will split the request. The splitting improves + concurrency (the split requests could be encrypted in parallel by multiple + cores), but it also causes overhead. The user should tune this parameters to + fit the actual workload. + + max_write_size + Maximum size of write requests. When a request larger than this size + is received, dm-crypt will split the request. The splitting improves + concurrency (the split requests could be encrypted in parallel by multiple + cores), but it also causes overhead. The user should tune this parameters to + fit the actual workload. + + Example scripts =============== LUKS (Linux Unified Key Setup) is now the preferred way to set up disk diff --git a/Documentation/admin-guide/device-mapper/dm-integrity.rst b/Documentation/admin-guide/device-mapper/dm-integrity.rst index d8a5f14d0e3c..c2e18ecc065c 100644 --- a/Documentation/admin-guide/device-mapper/dm-integrity.rst +++ b/Documentation/admin-guide/device-mapper/dm-integrity.rst @@ -92,6 +92,11 @@ Target arguments: allowed. This mode is useful for data recovery if the device cannot be activated in any of the other standard modes. + I - inline mode - in this mode, dm-integrity will store integrity + data directly in the underlying device sectors. + The underlying device must have an integrity profile that + allows storing user integrity data and provides enough + space for the selected integrity tag. 5. the number of additional arguments diff --git a/Documentation/admin-guide/device-mapper/vdo.rst b/Documentation/admin-guide/device-mapper/vdo.rst index 7e1ecafdf91e..a14e6d3e787c 100644 --- a/Documentation/admin-guide/device-mapper/vdo.rst +++ b/Documentation/admin-guide/device-mapper/vdo.rst @@ -241,6 +241,7 @@ Messages All vdo devices accept messages in the form: :: + dmsetup message <target-name> 0 <message-name> <message-parameters> The messages are: @@ -250,7 +251,12 @@ The messages are: by the vdostats userspace program to interpret the output buffer. - dump: + config: + Outputs useful vdo configuration information. Mostly used + by users who want to recreate a similar VDO volume and + want to know the creation configuration used. + + dump: Dumps many internal structures to the system log. This is not always safe to run, so it should only be used to debug a hung vdo. Optional parameters to specify structures to diff --git a/Documentation/admin-guide/device-mapper/verity.rst b/Documentation/admin-guide/device-mapper/verity.rst index a65c1602cb23..8c3f1f967a3c 100644 --- a/Documentation/admin-guide/device-mapper/verity.rst +++ b/Documentation/admin-guide/device-mapper/verity.rst @@ -87,6 +87,15 @@ panic_on_corruption Panic the device when a corrupted block is discovered. This option is not compatible with ignore_corruption and restart_on_corruption. +restart_on_error + Restart the system when an I/O error is detected. + This option can be combined with the restart_on_corruption option. + +panic_on_error + Panic the device when an I/O error is detected. This option is + not compatible with the restart_on_error option but can be combined + with the panic_on_corruption option. + ignore_zero_blocks Do not verify blocks that are expected to contain zeroes and always return zeroes instead. This may be useful if the partition contains unused blocks @@ -142,8 +151,15 @@ root_hash_sig_key_desc <key_description> already in the secondary trusted keyring. try_verify_in_tasklet - If verity hashes are in cache, verify data blocks in kernel tasklet instead - of workqueue. This option can reduce IO latency. + If verity hashes are in cache and the IO size does not exceed the limit, + verify data blocks in bottom half instead of workqueue. This option can + reduce IO latency. The size limits can be configured via + /sys/module/dm_verity/parameters/use_bh_bytes. The four parameters + correspond to limits for IOPRIO_CLASS_NONE, IOPRIO_CLASS_RT, + IOPRIO_CLASS_BE and IOPRIO_CLASS_IDLE in turn. + For example: + <none>,<rt>,<be>,<idle> + 4096,4096,4096,4096 Theory of operation =================== diff --git a/Documentation/admin-guide/dynamic-debug-howto.rst b/Documentation/admin-guide/dynamic-debug-howto.rst index 0e9b48daf690..7c036590cd07 100644 --- a/Documentation/admin-guide/dynamic-debug-howto.rst +++ b/Documentation/admin-guide/dynamic-debug-howto.rst @@ -26,6 +26,11 @@ Dynamic debug provides: - format string - class name (as known/declared by each module) +NOTE: To actually get the debug-print output on the console, you may +need to adjust the kernel ``loglevel=``, or use ``ignore_loglevel``. +Read about these kernel parameters in +Documentation/admin-guide/kernel-parameters.rst. + Viewing Dynamic Debug Behaviour =============================== diff --git a/Documentation/admin-guide/ext4.rst b/Documentation/admin-guide/ext4.rst index 5740d85439ff..b857eb6ca1b6 100644 --- a/Documentation/admin-guide/ext4.rst +++ b/Documentation/admin-guide/ext4.rst @@ -212,16 +212,6 @@ When mounting an ext4 filesystem, the following option are accepted: that ext4's inode table readahead algorithm will pre-read into the buffer cache. The default value is 32 blocks. - nouser_xattr - Disables Extended User Attributes. See the attr(5) manual page for - more information about extended attributes. - - noacl - This option disables POSIX Access Control List support. If ACL support - is enabled in the kernel configuration (CONFIG_EXT4_FS_POSIX_ACL), ACL - is enabled by default on mount. See the acl(5) manual page for more - information about acl. - bsddf (*) Make 'df' act like BSD. @@ -248,11 +238,10 @@ When mounting an ext4 filesystem, the following option are accepted: configured using tune2fs) data_err=ignore(*) - Just print an error message if an error occurs in a file data buffer in - ordered mode. + Just print an error message if an error occurs in a file data buffer. + data_err=abort - Abort the journal if an error occurs in a file data buffer in ordered - mode. + Abort the journal if an error occurs in a file data buffer. grpid | bsdgroups New objects have the group ID of their parent. diff --git a/Documentation/admin-guide/gpio/gpio-aggregator.rst b/Documentation/admin-guide/gpio/gpio-aggregator.rst index 5cd1e7221756..8374a9df9105 100644 --- a/Documentation/admin-guide/gpio/gpio-aggregator.rst +++ b/Documentation/admin-guide/gpio/gpio-aggregator.rst @@ -69,6 +69,113 @@ write-only attribute files in sysfs. $ echo gpio-aggregator.0 > delete_device +Aggregating GPIOs using Configfs +-------------------------------- + +**Group:** ``/config/gpio-aggregator`` + + This is the root directory of the gpio-aggregator configfs tree. + +**Group:** ``/config/gpio-aggregator/<example-name>`` + + This directory represents a GPIO aggregator device. You can assign any + name to ``<example-name>`` (e.g. ``agg0``), except names starting with + ``_sysfs`` prefix, which are reserved for auto-generated configfs + entries corresponding to devices created via Sysfs. + +**Attribute:** ``/config/gpio-aggregator/<example-name>/live`` + + The ``live`` attribute allows to trigger the actual creation of the device + once it's fully configured. Accepted values are: + + * ``1``, ``yes``, ``true`` : enable the virtual device + * ``0``, ``no``, ``false`` : disable the virtual device + +**Attribute:** ``/config/gpio-aggregator/<example-name>/dev_name`` + + The read-only ``dev_name`` attribute exposes the name of the device as it + will appear in the system on the platform bus (e.g. ``gpio-aggregator.0``). + This is useful for identifying a character device for the newly created + aggregator. If it's ``gpio-aggregator.0``, + ``/sys/devices/platform/gpio-aggregator.0/gpiochipX`` path tells you that the + GPIO device id is ``X``. + +You must create subdirectories for each virtual line you want to +instantiate, named exactly as ``line0``, ``line1``, ..., ``lineY``, when +you want to instantiate ``Y+1`` (Y >= 0) lines. Configure all lines before +activating the device by setting ``live`` to 1. + +**Group:** ``/config/gpio-aggregator/<example-name>/<lineY>/`` + + This directory represents a GPIO line to include in the aggregator. + +**Attribute:** ``/config/gpio-aggregator/<example-name>/<lineY>/key`` + +**Attribute:** ``/config/gpio-aggregator/<example-name>/<lineY>/offset`` + + The default values after creating the ``<lineY>`` directory are: + + * ``key`` : <empty> + * ``offset`` : -1 + + ``key`` must always be explicitly configured, while ``offset`` depends. + Two configuration patterns exist for each ``<lineY>``: + + (a). For lookup by GPIO line name: + + * Set ``key`` to the line name. + * Ensure ``offset`` remains -1 (the default). + + (b). For lookup by GPIO chip name and the line offset within the chip: + + * Set ``key`` to the chip name. + * Set ``offset`` to the line offset (0 <= ``offset`` < 65535). + +**Attribute:** ``/config/gpio-aggregator/<example-name>/<lineY>/name`` + + The ``name`` attribute sets a custom name for lineY. If left unset, the + line will remain unnamed. + +Once the configuration is done, the ``'live'`` attribute must be set to 1 +in order to instantiate the aggregator device. It can be set back to 0 to +destroy the virtual device. The module will synchronously wait for the new +aggregator device to be successfully probed and if this doesn't happen, writing +to ``'live'`` will result in an error. This is a different behaviour from the +case when you create it using sysfs ``new_device`` interface. + +.. note:: + + For aggregators created via Sysfs, the configfs entries are + auto-generated and appear as ``/config/gpio-aggregator/_sysfs.<N>/``. You + cannot add or remove line directories with mkdir(2)/rmdir(2). To modify + lines, you must use the "delete_device" interface to tear down the + existing device and reconfigure it from scratch. However, you can still + toggle the aggregator with the ``live`` attribute and adjust the + ``key``, ``offset``, and ``name`` attributes for each line when ``live`` + is set to 0 by hand (i.e. it's not waiting for deferred probe). + +Sample configuration commands +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +.. code-block:: sh + + # Create a directory for an aggregator device + $ mkdir /sys/kernel/config/gpio-aggregator/agg0 + + # Configure each line + $ mkdir /sys/kernel/config/gpio-aggregator/agg0/line0 + $ echo gpiochip0 > /sys/kernel/config/gpio-aggregator/agg0/line0/key + $ echo 6 > /sys/kernel/config/gpio-aggregator/agg0/line0/offset + $ echo test0 > /sys/kernel/config/gpio-aggregator/agg0/line0/name + $ mkdir /sys/kernel/config/gpio-aggregator/agg0/line1 + $ echo gpiochip0 > /sys/kernel/config/gpio-aggregator/agg0/line1/key + $ echo 7 > /sys/kernel/config/gpio-aggregator/agg0/line1/offset + $ echo test1 > /sys/kernel/config/gpio-aggregator/agg0/line1/name + + # Activate the aggregator device + $ echo 1 > /sys/kernel/config/gpio-aggregator/agg0/live + + Generic GPIO Driver ------------------- diff --git a/Documentation/admin-guide/gpio/gpio-sim.rst b/Documentation/admin-guide/gpio/gpio-sim.rst index 1cc5567a4bbe..35d49ccd49e0 100644 --- a/Documentation/admin-guide/gpio/gpio-sim.rst +++ b/Documentation/admin-guide/gpio/gpio-sim.rst @@ -71,7 +71,7 @@ specific lines. The name of those subdirectories must take the form of: ``'line<offset>'`` (e.g. ``'line0'``, ``'line20'``, etc.) as the name will be used by the module to assign the config to the specific line at given offset. -Once the confiuration is complete, the ``'live'`` attribute must be set to 1 in +Once the configuration is complete, the ``'live'`` attribute must be set to 1 in order to instantiate the chip. It can be set back to 0 to destroy the simulated chip. The module will synchronously wait for the new simulated device to be successfully probed and if this doesn't happen, writing to ``'live'`` will diff --git a/Documentation/admin-guide/gpio/gpio-virtuser.rst b/Documentation/admin-guide/gpio/gpio-virtuser.rst new file mode 100644 index 000000000000..7e7c0df51640 --- /dev/null +++ b/Documentation/admin-guide/gpio/gpio-virtuser.rst @@ -0,0 +1,177 @@ +.. SPDX-License-Identifier: GPL-2.0-only + +Virtual GPIO Consumer +===================== + +The virtual GPIO Consumer module allows users to instantiate virtual devices +that request GPIOs and then control their behavior over debugfs. Virtual +consumer devices can be instantiated from device-tree or over configfs. + +A virtual consumer uses the driver-facing GPIO APIs and allows to cover it with +automated tests driven by user-space. The GPIOs are requested using +``gpiod_get_array()`` and so we support multiple GPIOs per connector ID. + +Creating GPIO consumers +----------------------- + +The gpio-consumer module registers a configfs subsystem called +``'gpio-virtuser'``. For details of the configfs filesystem, please refer to +the configfs documentation. + +The user can create a hierarchy of configfs groups and items as well as modify +values of exposed attributes. Once the consumer is instantiated, this hierarchy +will be translated to appropriate device properties. The general structure is: + +**Group:** ``/config/gpio-virtuser`` + +This is the top directory of the gpio-consumer configfs tree. + +**Group:** ``/config/gpio-consumer/example-name`` + +**Attribute:** ``/config/gpio-consumer/example-name/live`` + +**Attribute:** ``/config/gpio-consumer/example-name/dev_name`` + +This is a directory representing a GPIO consumer device. + +The read-only ``dev_name`` attribute exposes the name of the device as it will +appear in the system on the platform bus. This is useful for locating the +associated debugfs directory under +``/sys/kernel/debug/gpio-virtuser/$dev_name``. + +The ``'live'`` attribute allows to trigger the actual creation of the device +once it's fully configured. The accepted values are: ``'1'`` to enable the +virtual device and ``'0'`` to disable and tear it down. + +Creating GPIO lookup tables +--------------------------- + +Users can create a number of configfs groups under the device group: + +**Group:** ``/config/gpio-consumer/example-name/con_id`` + +The ``'con_id'`` directory represents a single GPIO lookup and its value maps +to the ``'con_id'`` argument of the ``gpiod_get()`` function. For example: +``con_id`` == ``'reset'`` maps to the ``reset-gpios`` device property. + +Users can assign a number of GPIOs to each lookup. Each GPIO is a sub-directory +with a user-defined name under the ``'con_id'`` group. + +**Attribute:** ``/config/gpio-consumer/example-name/con_id/0/key`` + +**Attribute:** ``/config/gpio-consumer/example-name/con_id/0/offset`` + +**Attribute:** ``/config/gpio-consumer/example-name/con_id/0/drive`` + +**Attribute:** ``/config/gpio-consumer/example-name/con_id/0/pull`` + +**Attribute:** ``/config/gpio-consumer/example-name/con_id/0/active_low`` + +**Attribute:** ``/config/gpio-consumer/example-name/con_id/0/transitory`` + +This is a group describing a single GPIO in the ``con_id-gpios`` property. + +For virtual consumers created using configfs we use machine lookup tables so +this group can be considered as a mapping between the filesystem and the fields +of a single entry in ``'struct gpiod_lookup'``. + +The ``'key'`` attribute represents either the name of the chip this GPIO +belongs to or the GPIO line name. This depends on the value of the ``'offset'`` +attribute: if its value is >= 0, then ``'key'`` represents the label of the +chip to lookup while ``'offset'`` represents the offset of the line in that +chip. If ``'offset'`` is < 0, then ``'key'`` represents the name of the line. + +The remaining attributes map to the ``'flags'`` field of the GPIO lookup +struct. The first two take string values as arguments: + +**``'drive'``:** ``'push-pull'``, ``'open-drain'``, ``'open-source'`` +**``'pull'``:** ``'pull-up'``, ``'pull-down'``, ``'pull-disabled'``, ``'as-is'`` + +``'active_low'`` and ``'transitory'`` are boolean attributes. + +Activating GPIO consumers +------------------------- + +Once the configuration is complete, the ``'live'`` attribute must be set to 1 in +order to instantiate the consumer. It can be set back to 0 to destroy the +virtual device. The module will synchronously wait for the new simulated device +to be successfully probed and if this doesn't happen, writing to ``'live'`` will +result in an error. + +Device-tree +----------- + +Virtual GPIO consumers can also be defined in device-tree. The compatible string +must be: ``"gpio-virtuser"`` with at least one property following the +standardized GPIO pattern. + +An example device-tree code defining a virtual GPIO consumer: + +.. code-block :: none + + gpio-virt-consumer { + compatible = "gpio-virtuser"; + + foo-gpios = <&gpio0 5 GPIO_ACTIVE_LOW>, <&gpio1 2 0>; + bar-gpios = <&gpio0 6 0>; + }; + +Controlling virtual GPIO consumers +---------------------------------- + +Once active, the device will export debugfs attributes for controlling GPIO +arrays as well as each requested GPIO line separately. Let's consider the +following device property: ``foo-gpios = <&gpio0 0 0>, <&gpio0 4 0>;``. + +The following debugfs attribute groups will be created: + +**Group:** ``/sys/kernel/debug/gpio-virtuser/$dev_name/gpiod:foo/`` + +This is the group that will contain the attributes for the entire GPIO array. + +**Attribute:** ``/sys/kernel/debug/gpio-virtuser/$dev_name/gpiod:foo/values`` + +**Attribute:** ``/sys/kernel/debug/gpio-virtuser/$dev_name/gpiod:foo/values_atomic`` + +Both attributes allow to read and set arrays of GPIO values. User must pass +exactly the number of values that the array contains in the form of a string +containing zeroes and ones representing inactive and active GPIO states +respectively. In this example: ``echo 11 > values``. + +The ``values_atomic`` attribute works the same as ``values`` but the kernel +will execute the GPIO driver callbacks in interrupt context. + +**Group:** ``/sys/kernel/debug/gpio-virtuser/$dev_name/gpiod:foo:$index/`` + +This is a group that represents a single GPIO with ``$index`` being its offset +in the array. + +**Attribute:** ``/sys/kernel/debug/gpio-virtuser/$dev_name/gpiod:foo:$index/consumer`` + +Allows to set and read the consumer label of the GPIO line. + +**Attribute:** ``/sys/kernel/debug/gpio-virtuser/$dev_name/gpiod:foo:$index/debounce`` + +Allows to set and read the debounce period of the GPIO line. + +**Attribute:** ``/sys/kernel/debug/gpio-virtuser/$dev_name/gpiod:foo:$index/direction`` + +**Attribute:** ``/sys/kernel/debug/gpio-virtuser/$dev_name/gpiod:foo:$index/direction_atomic`` + +These two attributes allow to set the direction of the GPIO line. They accept +"input" and "output" as values. The atomic variant executes the driver callback +in interrupt context. + +**Attribute:** ``/sys/kernel/debug/gpio-virtuser/$dev_name/gpiod:foo:$index/interrupts`` + +If the line is requested in input mode, writing ``1`` to this attribute will +make the module listen for edge interrupts on the GPIO. Writing ``0`` disables +the monitoring. Reading this attribute returns the current number of registered +interrupts (both edges). + +**Attribute:** ``/sys/kernel/debug/gpio-virtuser/$dev_name/gpiod:foo:$index/value`` + +**Attribute:** ``/sys/kernel/debug/gpio-virtuser/$dev_name/gpiod:foo:$index/value_atomic`` + +Both attributes allow to read and set values of individual requested GPIO lines. +They accept the following values: ``1`` and ``0``. diff --git a/Documentation/admin-guide/gpio/index.rst b/Documentation/admin-guide/gpio/index.rst index 460afd29617e..712f379731cb 100644 --- a/Documentation/admin-guide/gpio/index.rst +++ b/Documentation/admin-guide/gpio/index.rst @@ -10,6 +10,7 @@ GPIO Character Device Userspace API <../../userspace-api/gpio/chardev> gpio-aggregator gpio-sim + gpio-virtuser Obsolete APIs <obsolete> .. only:: subproject and html diff --git a/Documentation/admin-guide/highuid.rst b/Documentation/admin-guide/highuid.rst deleted file mode 100644 index 6ee70465c0ea..000000000000 --- a/Documentation/admin-guide/highuid.rst +++ /dev/null @@ -1,80 +0,0 @@ -=================================================== -Notes on the change from 16-bit UIDs to 32-bit UIDs -=================================================== - -:Author: Chris Wing <wingc@umich.edu> -:Last updated: January 11, 2000 - -- kernel code MUST take into account __kernel_uid_t and __kernel_uid32_t - when communicating between user and kernel space in an ioctl or data - structure. - -- kernel code should use uid_t and gid_t in kernel-private structures and - code. - -What's left to be done for 32-bit UIDs on all Linux architectures: - -- Disk quotas have an interesting limitation that is not related to the - maximum UID/GID. They are limited by the maximum file size on the - underlying filesystem, because quota records are written at offsets - corresponding to the UID in question. - Further investigation is needed to see if the quota system can cope - properly with huge UIDs. If it can deal with 64-bit file offsets on all - architectures, this should not be a problem. - -- Decide whether or not to keep backwards compatibility with the system - accounting file, or if we should break it as the comments suggest - (currently, the old 16-bit UID and GID are still written to disk, and - part of the former pad space is used to store separate 32-bit UID and - GID) - -- Need to validate that OS emulation calls the 16-bit UID - compatibility syscalls, if the OS being emulated used 16-bit UIDs, or - uses the 32-bit UID system calls properly otherwise. - - This affects at least: - - - iBCS on Intel - - - sparc32 emulation on sparc64 - (need to support whatever new 32-bit UID system calls are added to - sparc32) - -- Validate that all filesystems behave properly. - - At present, 32-bit UIDs _should_ work for: - - - ext2 - - ufs - - isofs - - nfs - - coda - - udf - - Ioctl() fixups have been made for: - - - ncpfs - - smbfs - - Filesystems with simple fixups to prevent 16-bit UID wraparound: - - - minix - - sysv - - qnx4 - - Other filesystems have not been checked yet. - -- The ncpfs and smpfs filesystems cannot presently use 32-bit UIDs in - all ioctl()s. Some new ioctl()s have been added with 32-bit UIDs, but - more are needed. (as well as new user<->kernel data structures) - -- The ELF core dump format only supports 16-bit UIDs on arm, i386, m68k, - sh, and sparc32. Fixing this is probably not that important, but would - require adding a new ELF section. - -- The ioctl()s used to control the in-kernel NFS server only support - 16-bit UIDs on arm, i386, m68k, sh, and sparc32. - -- make sure that the UID mapping feature of AX25 networking works properly - (it should be safe because it's always used a 32-bit integer to - communicate between user and kernel) diff --git a/Documentation/admin-guide/hw-vuln/core-scheduling.rst b/Documentation/admin-guide/hw-vuln/core-scheduling.rst index cf1eeefdfc32..a92e10ec402e 100644 --- a/Documentation/admin-guide/hw-vuln/core-scheduling.rst +++ b/Documentation/admin-guide/hw-vuln/core-scheduling.rst @@ -67,8 +67,8 @@ arg4: will be performed for all tasks in the task group of ``pid``. arg5: - userspace pointer to an unsigned long for storing the cookie returned by - ``PR_SCHED_CORE_GET`` command. Should be 0 for all other commands. + userspace pointer to an unsigned long long for storing the cookie returned + by ``PR_SCHED_CORE_GET`` command. Should be 0 for all other commands. In order for a process to push a cookie to, or pull a cookie from a process, it is required to have the ptrace access mode: `PTRACE_MODE_READ_REALCREDS` to the diff --git a/Documentation/admin-guide/hw-vuln/index.rst b/Documentation/admin-guide/hw-vuln/index.rst index ff0b440ef2dc..09890a8f3ee9 100644 --- a/Documentation/admin-guide/hw-vuln/index.rst +++ b/Documentation/admin-guide/hw-vuln/index.rst @@ -22,3 +22,6 @@ are configurable at compile, boot or run time. srso gather_data_sampling reg-file-data-sampling + rsb + old_microcode + indirect-target-selection diff --git a/Documentation/admin-guide/hw-vuln/indirect-target-selection.rst b/Documentation/admin-guide/hw-vuln/indirect-target-selection.rst new file mode 100644 index 000000000000..d9ca64108d23 --- /dev/null +++ b/Documentation/admin-guide/hw-vuln/indirect-target-selection.rst @@ -0,0 +1,168 @@ +.. SPDX-License-Identifier: GPL-2.0 + +Indirect Target Selection (ITS) +=============================== + +ITS is a vulnerability in some Intel CPUs that support Enhanced IBRS and were +released before Alder Lake. ITS may allow an attacker to control the prediction +of indirect branches and RETs located in the lower half of a cacheline. + +ITS is assigned CVE-2024-28956 with a CVSS score of 4.7 (Medium). + +Scope of Impact +--------------- +- **eIBRS Guest/Host Isolation**: Indirect branches in KVM/kernel may still be + predicted with unintended target corresponding to a branch in the guest. + +- **Intra-Mode BTI**: In-kernel training such as through cBPF or other native + gadgets. + +- **Indirect Branch Prediction Barrier (IBPB)**: After an IBPB, indirect + branches may still be predicted with targets corresponding to direct branches + executed prior to the IBPB. This is fixed by the IPU 2025.1 microcode, which + should be available via distro updates. Alternatively microcode can be + obtained from Intel's github repository [#f1]_. + +Affected CPUs +------------- +Below is the list of ITS affected CPUs [#f2]_ [#f3]_: + + ======================== ============ ==================== =============== + Common name Family_Model eIBRS Intra-mode BTI + Guest/Host Isolation + ======================== ============ ==================== =============== + SKYLAKE_X (step >= 6) 06_55H Affected Affected + ICELAKE_X 06_6AH Not affected Affected + ICELAKE_D 06_6CH Not affected Affected + ICELAKE_L 06_7EH Not affected Affected + TIGERLAKE_L 06_8CH Not affected Affected + TIGERLAKE 06_8DH Not affected Affected + KABYLAKE_L (step >= 12) 06_8EH Affected Affected + KABYLAKE (step >= 13) 06_9EH Affected Affected + COMETLAKE 06_A5H Affected Affected + COMETLAKE_L 06_A6H Affected Affected + ROCKETLAKE 06_A7H Not affected Affected + ======================== ============ ==================== =============== + +- All affected CPUs enumerate Enhanced IBRS feature. +- IBPB isolation is affected on all ITS affected CPUs, and need a microcode + update for mitigation. +- None of the affected CPUs enumerate BHI_CTRL which was introduced in Golden + Cove (Alder Lake and Sapphire Rapids). This can help guests to determine the + host's affected status. +- Intel Atom CPUs are not affected by ITS. + +Mitigation +---------- +As only the indirect branches and RETs that have their last byte of instruction +in the lower half of the cacheline are vulnerable to ITS, the basic idea behind +the mitigation is to not allow indirect branches in the lower half. + +This is achieved by relying on existing retpoline support in the kernel, and in +compilers. ITS-vulnerable retpoline sites are runtime patched to point to newly +added ITS-safe thunks. These safe thunks consists of indirect branch in the +second half of the cacheline. Not all retpoline sites are patched to thunks, if +a retpoline site is evaluated to be ITS-safe, it is replaced with an inline +indirect branch. + +Dynamic thunks +~~~~~~~~~~~~~~ +From a dynamically allocated pool of safe-thunks, each vulnerable site is +replaced with a new thunk, such that they get a unique address. This could +improve the branch prediction accuracy. Also, it is a defense-in-depth measure +against aliasing. + +Note, for simplicity, indirect branches in eBPF programs are always replaced +with a jump to a static thunk in __x86_indirect_its_thunk_array. If required, +in future this can be changed to use dynamic thunks. + +All vulnerable RETs are replaced with a static thunk, they do not use dynamic +thunks. This is because RETs get their prediction from RSB mostly that does not +depend on source address. RETs that underflow RSB may benefit from dynamic +thunks. But, RETs significantly outnumber indirect branches, and any benefit +from a unique source address could be outweighed by the increased icache +footprint and iTLB pressure. + +Retpoline +~~~~~~~~~ +Retpoline sequence also mitigates ITS-unsafe indirect branches. For this +reason, when retpoline is enabled, ITS mitigation only relocates the RETs to +safe thunks. Unless user requested the RSB-stuffing mitigation. + +RSB Stuffing +~~~~~~~~~~~~ +RSB-stuffing via Call Depth Tracking is a mitigation for Retbleed RSB-underflow +attacks. And it also mitigates RETs that are vulnerable to ITS. + +Mitigation in guests +^^^^^^^^^^^^^^^^^^^^ +All guests deploy ITS mitigation by default, irrespective of eIBRS enumeration +and Family/Model of the guest. This is because eIBRS feature could be hidden +from a guest. One exception to this is when a guest enumerates BHI_DIS_S, which +indicates that the guest is running on an unaffected host. + +To prevent guests from unnecessarily deploying the mitigation on unaffected +platforms, Intel has defined ITS_NO bit(62) in MSR IA32_ARCH_CAPABILITIES. When +a guest sees this bit set, it should not enumerate the ITS bug. Note, this bit +is not set by any hardware, but is **intended for VMMs to synthesize** it for +guests as per the host's affected status. + +Mitigation options +^^^^^^^^^^^^^^^^^^ +The ITS mitigation can be controlled using the "indirect_target_selection" +kernel parameter. The available options are: + + ======== =================================================================== + on (default) Deploy the "Aligned branch/return thunks" mitigation. + If spectre_v2 mitigation enables retpoline, aligned-thunks are only + deployed for the affected RET instructions. Retpoline mitigates + indirect branches. + + off Disable ITS mitigation. + + vmexit Equivalent to "=on" if the CPU is affected by guest/host isolation + part of ITS. Otherwise, mitigation is not deployed. This option is + useful when host userspace is not in the threat model, and only + attacks from guest to host are considered. + + stuff Deploy RSB-fill mitigation when retpoline is also deployed. + Otherwise, deploy the default mitigation. When retpoline mitigation + is enabled, RSB-stuffing via Call-Depth-Tracking also mitigates + ITS. + + force Force the ITS bug and deploy the default mitigation. + ======== =================================================================== + +Sysfs reporting +--------------- + +The sysfs file showing ITS mitigation status is: + + /sys/devices/system/cpu/vulnerabilities/indirect_target_selection + +Note, microcode mitigation status is not reported in this file. + +The possible values in this file are: + +.. list-table:: + + * - Not affected + - The processor is not vulnerable. + * - Vulnerable + - System is vulnerable and no mitigation has been applied. + * - Vulnerable, KVM: Not affected + - System is vulnerable to intra-mode BTI, but not affected by eIBRS + guest/host isolation. + * - Mitigation: Aligned branch/return thunks + - The mitigation is enabled, affected indirect branches and RETs are + relocated to safe thunks. + * - Mitigation: Retpolines, Stuffing RSB + - The mitigation is enabled using retpoline and RSB stuffing. + +References +---------- +.. [#f1] Microcode repository - https://github.com/intel/Intel-Linux-Processor-Microcode-Data-Files + +.. [#f2] Affected Processors list - https://www.intel.com/content/www/us/en/developer/topic-technology/software-security-guidance/processors-affected-consolidated-product-cpu-model.html + +.. [#f3] Affected Processors list (machine readable) - https://github.com/intel/Intel-affected-processor-list diff --git a/Documentation/admin-guide/hw-vuln/old_microcode.rst b/Documentation/admin-guide/hw-vuln/old_microcode.rst new file mode 100644 index 000000000000..6ded8f86b8d0 --- /dev/null +++ b/Documentation/admin-guide/hw-vuln/old_microcode.rst @@ -0,0 +1,21 @@ +.. SPDX-License-Identifier: GPL-2.0 + +============= +Old Microcode +============= + +The kernel keeps a table of released microcode. Systems that had +microcode older than this at boot will say "Vulnerable". This means +that the system was vulnerable to some known CPU issue. It could be +security or functional, the kernel does not know or care. + +You should update the CPU microcode to mitigate any exposure. This is +usually accomplished by updating the files in +/lib/firmware/intel-ucode/ via normal distribution updates. Intel also +distributes these files in a github repo: + + https://github.com/intel/Intel-Linux-Processor-Microcode-Data-Files.git + +Just like all the other hardware vulnerabilities, exposure is +determined at boot. Runtime microcode updates do not change the status +of this vulnerability. diff --git a/Documentation/admin-guide/hw-vuln/reg-file-data-sampling.rst b/Documentation/admin-guide/hw-vuln/reg-file-data-sampling.rst index 0585d02b9a6c..ad15417d39f9 100644 --- a/Documentation/admin-guide/hw-vuln/reg-file-data-sampling.rst +++ b/Documentation/admin-guide/hw-vuln/reg-file-data-sampling.rst @@ -29,14 +29,6 @@ Below is the list of affected Intel processors [#f1]_: RAPTORLAKE_S 06_BFH =================== ============ -As an exception to this table, Intel Xeon E family parts ALDERLAKE(06_97H) and -RAPTORLAKE(06_B7H) codenamed Catlow are not affected. They are reported as -vulnerable in Linux because they share the same family/model with an affected -part. Unlike their affected counterparts, they do not enumerate RFDS_CLEAR or -CPUID.HYBRID. This information could be used to distinguish between the -affected and unaffected parts, but it is deemed not worth adding complexity as -the reporting is fixed automatically when these parts enumerate RFDS_NO. - Mitigation ========== Intel released a microcode update that enables software to clear sensitive diff --git a/Documentation/admin-guide/hw-vuln/rsb.rst b/Documentation/admin-guide/hw-vuln/rsb.rst new file mode 100644 index 000000000000..21dbf9cf25f8 --- /dev/null +++ b/Documentation/admin-guide/hw-vuln/rsb.rst @@ -0,0 +1,268 @@ +.. SPDX-License-Identifier: GPL-2.0 + +======================= +RSB-related mitigations +======================= + +.. warning:: + Please keep this document up-to-date, otherwise you will be + volunteered to update it and convert it to a very long comment in + bugs.c! + +Since 2018 there have been many Spectre CVEs related to the Return Stack +Buffer (RSB) (sometimes referred to as the Return Address Stack (RAS) or +Return Address Predictor (RAP) on AMD). + +Information about these CVEs and how to mitigate them is scattered +amongst a myriad of microarchitecture-specific documents. + +This document attempts to consolidate all the relevant information in +once place and clarify the reasoning behind the current RSB-related +mitigations. It's meant to be as concise as possible, focused only on +the current kernel mitigations: what are the RSB-related attack vectors +and how are they currently being mitigated? + +It's *not* meant to describe how the RSB mechanism operates or how the +exploits work. More details about those can be found in the references +below. + +Rather, this is basically a glorified comment, but too long to actually +be one. So when the next CVE comes along, a kernel developer can +quickly refer to this as a refresher to see what we're actually doing +and why. + +At a high level, there are two classes of RSB attacks: RSB poisoning +(Intel and AMD) and RSB underflow (Intel only). They must each be +considered individually for each attack vector (and microarchitecture +where applicable). + +---- + +RSB poisoning (Intel and AMD) +============================= + +SpectreRSB +~~~~~~~~~~ + +RSB poisoning is a technique used by SpectreRSB [#spectre-rsb]_ where +an attacker poisons an RSB entry to cause a victim's return instruction +to speculate to an attacker-controlled address. This can happen when +there are unbalanced CALLs/RETs after a context switch or VMEXIT. + +* All attack vectors can potentially be mitigated by flushing out any + poisoned RSB entries using an RSB filling sequence + [#intel-rsb-filling]_ [#amd-rsb-filling]_ when transitioning between + untrusted and trusted domains. But this has a performance impact and + should be avoided whenever possible. + + .. DANGER:: + **FIXME**: Currently we're flushing 32 entries. However, some CPU + models have more than 32 entries. The loop count needs to be + increased for those. More detailed information is needed about RSB + sizes. + +* On context switch, the user->user mitigation requires ensuring the + RSB gets filled or cleared whenever IBPB gets written [#cond-ibpb]_ + during a context switch: + + * AMD: + On Zen 4+, IBPB (or SBPB [#amd-sbpb]_ if used) clears the RSB. + This is indicated by IBPB_RET in CPUID [#amd-ibpb-rsb]_. + + On Zen < 4, the RSB filling sequence [#amd-rsb-filling]_ must be + always be done in addition to IBPB [#amd-ibpb-no-rsb]_. This is + indicated by X86_BUG_IBPB_NO_RET. + + * Intel: + IBPB always clears the RSB: + + "Software that executed before the IBPB command cannot control + the predicted targets of indirect branches executed after the + command on the same logical processor. The term indirect branch + in this context includes near return instructions, so these + predicted targets may come from the RSB." [#intel-ibpb-rsb]_ + +* On context switch, user->kernel attacks are prevented by SMEP. User + space can only insert user space addresses into the RSB. Even + non-canonical addresses can't be inserted due to the page gap at the + end of the user canonical address space reserved by TASK_SIZE_MAX. + A SMEP #PF at instruction fetch prevents the kernel from speculatively + executing user space. + + * AMD: + "Finally, branches that are predicted as 'ret' instructions get + their predicted targets from the Return Address Predictor (RAP). + AMD recommends software use a RAP stuffing sequence (mitigation + V2-3 in [2]) and/or Supervisor Mode Execution Protection (SMEP) + to ensure that the addresses in the RAP are safe for + speculation. Collectively, we refer to these mitigations as "RAP + Protection"." [#amd-smep-rsb]_ + + * Intel: + "On processors with enhanced IBRS, an RSB overwrite sequence may + not suffice to prevent the predicted target of a near return + from using an RSB entry created in a less privileged predictor + mode. Software can prevent this by enabling SMEP (for + transitions from user mode to supervisor mode) and by having + IA32_SPEC_CTRL.IBRS set during VM exits." [#intel-smep-rsb]_ + +* On VMEXIT, guest->host attacks are mitigated by eIBRS (and PBRSB + mitigation if needed): + + * AMD: + "When Automatic IBRS is enabled, the internal return address + stack used for return address predictions is cleared on VMEXIT." + [#amd-eibrs-vmexit]_ + + * Intel: + "On processors with enhanced IBRS, an RSB overwrite sequence may + not suffice to prevent the predicted target of a near return + from using an RSB entry created in a less privileged predictor + mode. Software can prevent this by enabling SMEP (for + transitions from user mode to supervisor mode) and by having + IA32_SPEC_CTRL.IBRS set during VM exits. Processors with + enhanced IBRS still support the usage model where IBRS is set + only in the OS/VMM for OSes that enable SMEP. To do this, such + processors will ensure that guest behavior cannot control the + RSB after a VM exit once IBRS is set, even if IBRS was not set + at the time of the VM exit." [#intel-eibrs-vmexit]_ + + Note that some Intel CPUs are susceptible to Post-barrier Return + Stack Buffer Predictions (PBRSB) [#intel-pbrsb]_, where the last + CALL from the guest can be used to predict the first unbalanced RET. + In this case the PBRSB mitigation is needed in addition to eIBRS. + +AMD RETBleed / SRSO / Branch Type Confusion +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +On AMD, poisoned RSB entries can also be created by the AMD RETBleed +variant [#retbleed-paper]_ [#amd-btc]_ or by Speculative Return Stack +Overflow [#amd-srso]_ (Inception [#inception-paper]_). The kernel +protects itself by replacing every RET in the kernel with a branch to a +single safe RET. + +---- + +RSB underflow (Intel only) +========================== + +RSB Alternate (RSBA) ("Intel Retbleed") +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +Some Intel Skylake-generation CPUs are susceptible to the Intel variant +of RETBleed [#retbleed-paper]_ (Return Stack Buffer Underflow +[#intel-rsbu]_). If a RET is executed when the RSB buffer is empty due +to mismatched CALLs/RETs or returning from a deep call stack, the branch +predictor can fall back to using the Branch Target Buffer (BTB). If a +user forces a BTB collision then the RET can speculatively branch to a +user-controlled address. + +* Note that RSB filling doesn't fully mitigate this issue. If there + are enough unbalanced RETs, the RSB may still underflow and fall back + to using a poisoned BTB entry. + +* On context switch, user->user underflow attacks are mitigated by the + conditional IBPB [#cond-ibpb]_ on context switch which effectively + clears the BTB: + + * "The indirect branch predictor barrier (IBPB) is an indirect branch + control mechanism that establishes a barrier, preventing software + that executed before the barrier from controlling the predicted + targets of indirect branches executed after the barrier on the same + logical processor." [#intel-ibpb-btb]_ + +* On context switch and VMEXIT, user->kernel and guest->host RSB + underflows are mitigated by IBRS or eIBRS: + + * "Enabling IBRS (including enhanced IBRS) will mitigate the "RSBU" + attack demonstrated by the researchers. As previously documented, + Intel recommends the use of enhanced IBRS, where supported. This + includes any processor that enumerates RRSBA but not RRSBA_DIS_S." + [#intel-rsbu]_ + + However, note that eIBRS and IBRS do not mitigate intra-mode attacks. + Like RRSBA below, this is mitigated by clearing the BHB on kernel + entry. + + As an alternative to classic IBRS, call depth tracking (combined with + retpolines) can be used to track kernel returns and fill the RSB when + it gets close to being empty. + +Restricted RSB Alternate (RRSBA) +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +Some newer Intel CPUs have Restricted RSB Alternate (RRSBA) behavior, +which, similar to RSBA described above, also falls back to using the BTB +on RSB underflow. The only difference is that the predicted targets are +restricted to the current domain when eIBRS is enabled: + +* "Restricted RSB Alternate (RRSBA) behavior allows alternate branch + predictors to be used by near RET instructions when the RSB is + empty. When eIBRS is enabled, the predicted targets of these + alternate predictors are restricted to those belonging to the + indirect branch predictor entries of the current prediction domain. + [#intel-eibrs-rrsba]_ + +When a CPU with RRSBA is vulnerable to Branch History Injection +[#bhi-paper]_ [#intel-bhi]_, an RSB underflow could be used for an +intra-mode BTI attack. This is mitigated by clearing the BHB on +kernel entry. + +However if the kernel uses retpolines instead of eIBRS, it needs to +disable RRSBA: + +* "Where software is using retpoline as a mitigation for BHI or + intra-mode BTI, and the processor both enumerates RRSBA and + enumerates RRSBA_DIS controls, it should disable this behavior." + [#intel-retpoline-rrsba]_ + +---- + +References +========== + +.. [#spectre-rsb] `Spectre Returns! Speculation Attacks using the Return Stack Buffer <https://arxiv.org/pdf/1807.07940.pdf>`_ + +.. [#intel-rsb-filling] "Empty RSB Mitigation on Skylake-generation" in `Retpoline: A Branch Target Injection Mitigation <https://www.intel.com/content/www/us/en/developer/articles/technical/software-security-guidance/technical-documentation/retpoline-branch-target-injection-mitigation.html#inpage-nav-5-1>`_ + +.. [#amd-rsb-filling] "Mitigation V2-3" in `Software Techniques for Managing Speculation <https://www.amd.com/content/dam/amd/en/documents/processor-tech-docs/programmer-references/software-techniques-for-managing-speculation.pdf>`_ + +.. [#cond-ibpb] Whether IBPB is written depends on whether the prev and/or next task is protected from Spectre attacks. It typically requires opting in per task or system-wide. For more details see the documentation for the ``spectre_v2_user`` cmdline option in Documentation/admin-guide/kernel-parameters.txt. + +.. [#amd-sbpb] IBPB without flushing of branch type predictions. Only exists for AMD. + +.. [#amd-ibpb-rsb] "Function 8000_0008h -- Processor Capacity Parameters and Extended Feature Identification" in `AMD64 Architecture Programmer's Manual Volume 3: General-Purpose and System Instructions <https://www.amd.com/content/dam/amd/en/documents/processor-tech-docs/programmer-references/24594.pdf>`_. SBPB behaves the same way according to `this email <https://lore.kernel.org/5175b163a3736ca5fd01cedf406735636c99a>`_. + +.. [#amd-ibpb-no-rsb] `Spectre Attacks: Exploiting Speculative Execution <https://comsec.ethz.ch/wp-content/files/ibpb_sp25.pdf>`_ + +.. [#intel-ibpb-rsb] "Introduction" in `Post-barrier Return Stack Buffer Predictions / CVE-2022-26373 / INTEL-SA-00706 <https://www.intel.com/content/www/us/en/developer/articles/technical/software-security-guidance/advisory-guidance/post-barrier-return-stack-buffer-predictions.html>`_ + +.. [#amd-smep-rsb] "Existing Mitigations" in `Technical Guidance for Mitigating Branch Type Confusion <https://www.amd.com/content/dam/amd/en/documents/resources/technical-guidance-for-mitigating-branch-type-confusion.pdf>`_ + +.. [#intel-smep-rsb] "Enhanced IBRS" in `Indirect Branch Restricted Speculation <https://www.intel.com/content/www/us/en/developer/articles/technical/software-security-guidance/technical-documentation/indirect-branch-restricted-speculation.html>`_ + +.. [#amd-eibrs-vmexit] "Extended Feature Enable Register (EFER)" in `AMD64 Architecture Programmer's Manual Volume 2: System Programming <https://www.amd.com/content/dam/amd/en/documents/processor-tech-docs/programmer-references/24593.pdf>`_ + +.. [#intel-eibrs-vmexit] "Enhanced IBRS" in `Indirect Branch Restricted Speculation <https://www.intel.com/content/www/us/en/developer/articles/technical/software-security-guidance/technical-documentation/indirect-branch-restricted-speculation.html>`_ + +.. [#intel-pbrsb] `Post-barrier Return Stack Buffer Predictions / CVE-2022-26373 / INTEL-SA-00706 <https://www.intel.com/content/www/us/en/developer/articles/technical/software-security-guidance/advisory-guidance/post-barrier-return-stack-buffer-predictions.html>`_ + +.. [#retbleed-paper] `RETBleed: Arbitrary Speculative Code Execution with Return Instruction <https://comsec.ethz.ch/wp-content/files/retbleed_sec22.pdf>`_ + +.. [#amd-btc] `Technical Guidance for Mitigating Branch Type Confusion <https://www.amd.com/content/dam/amd/en/documents/resources/technical-guidance-for-mitigating-branch-type-confusion.pdf>`_ + +.. [#amd-srso] `Technical Update Regarding Speculative Return Stack Overflow <https://www.amd.com/content/dam/amd/en/documents/corporate/cr/speculative-return-stack-overflow-whitepaper.pdf>`_ + +.. [#inception-paper] `Inception: Exposing New Attack Surfaces with Training in Transient Execution <https://comsec.ethz.ch/wp-content/files/inception_sec23.pdf>`_ + +.. [#intel-rsbu] `Return Stack Buffer Underflow / Return Stack Buffer Underflow / CVE-2022-29901, CVE-2022-28693 / INTEL-SA-00702 <https://www.intel.com/content/www/us/en/developer/articles/technical/software-security-guidance/advisory-guidance/return-stack-buffer-underflow.html>`_ + +.. [#intel-ibpb-btb] `Indirect Branch Predictor Barrier' <https://www.intel.com/content/www/us/en/developer/articles/technical/software-security-guidance/technical-documentation/indirect-branch-predictor-barrier.html>`_ + +.. [#intel-eibrs-rrsba] "Guidance for RSBU" in `Return Stack Buffer Underflow / Return Stack Buffer Underflow / CVE-2022-29901, CVE-2022-28693 / INTEL-SA-00702 <https://www.intel.com/content/www/us/en/developer/articles/technical/software-security-guidance/advisory-guidance/return-stack-buffer-underflow.html>`_ + +.. [#bhi-paper] `Branch History Injection: On the Effectiveness of Hardware Mitigations Against Cross-Privilege Spectre-v2 Attacks <http://download.vusec.net/papers/bhi-spectre-bhb_sec22.pdf>`_ + +.. [#intel-bhi] `Branch History Injection and Intra-mode Branch Target Injection / CVE-2022-0001, CVE-2022-0002 / INTEL-SA-00598 <https://www.intel.com/content/www/us/en/developer/articles/technical/software-security-guidance/technical-documentation/branch-history-injection.html>`_ + +.. [#intel-retpoline-rrsba] "Retpoline" in `Branch History Injection and Intra-mode Branch Target Injection / CVE-2022-0001, CVE-2022-0002 / INTEL-SA-00598 <https://www.intel.com/content/www/us/en/developer/articles/technical/software-security-guidance/technical-documentation/branch-history-injection.html>`_ diff --git a/Documentation/admin-guide/hw-vuln/spectre.rst b/Documentation/admin-guide/hw-vuln/spectre.rst index 25a04cda4c2c..132e0bc6007e 100644 --- a/Documentation/admin-guide/hw-vuln/spectre.rst +++ b/Documentation/admin-guide/hw-vuln/spectre.rst @@ -592,85 +592,19 @@ Spectre variant 2 Mitigation control on the kernel command line --------------------------------------------- -Spectre variant 2 mitigation can be disabled or force enabled at the -kernel command line. +In general the kernel selects reasonable default mitigations for the +current CPU. - nospectre_v1 +Spectre default mitigations can be disabled or changed at the kernel +command line with the following options: - [X86,PPC] Disable mitigations for Spectre Variant 1 - (bounds check bypass). With this option data leaks are - possible in the system. + - nospectre_v1 + - nospectre_v2 + - spectre_v2={option} + - spectre_v2_user={option} + - spectre_bhi={option} - nospectre_v2 - - [X86] Disable all mitigations for the Spectre variant 2 - (indirect branch prediction) vulnerability. System may - allow data leaks with this option, which is equivalent - to spectre_v2=off. - - - spectre_v2= - - [X86] Control mitigation of Spectre variant 2 - (indirect branch speculation) vulnerability. - The default operation protects the kernel from - user space attacks. - - on - unconditionally enable, implies - spectre_v2_user=on - off - unconditionally disable, implies - spectre_v2_user=off - auto - kernel detects whether your CPU model is - vulnerable - - Selecting 'on' will, and 'auto' may, choose a - mitigation method at run time according to the - CPU, the available microcode, the setting of the - CONFIG_MITIGATION_RETPOLINE configuration option, - and the compiler with which the kernel was built. - - Selecting 'on' will also enable the mitigation - against user space to user space task attacks. - - Selecting 'off' will disable both the kernel and - the user space protections. - - Specific mitigations can also be selected manually: - - retpoline auto pick between generic,lfence - retpoline,generic Retpolines - retpoline,lfence LFENCE; indirect branch - retpoline,amd alias for retpoline,lfence - eibrs Enhanced/Auto IBRS - eibrs,retpoline Enhanced/Auto IBRS + Retpolines - eibrs,lfence Enhanced/Auto IBRS + LFENCE - ibrs use IBRS to protect kernel - - Not specifying this option is equivalent to - spectre_v2=auto. - - In general the kernel by default selects - reasonable mitigations for the current CPU. To - disable Spectre variant 2 mitigations, boot with - spectre_v2=off. Spectre variant 1 mitigations - cannot be disabled. - - spectre_bhi= - - [X86] Control mitigation of Branch History Injection - (BHI) vulnerability. This setting affects the deployment - of the HW BHI control and the SW BHB clearing sequence. - - on - (default) Enable the HW or SW mitigation as - needed. - off - Disable the mitigation. - -For spectre_v2_user see Documentation/admin-guide/kernel-parameters.txt +For more details on the available options, refer to Documentation/admin-guide/kernel-parameters.txt Mitigation selection guide -------------------------- diff --git a/Documentation/admin-guide/hw-vuln/srso.rst b/Documentation/admin-guide/hw-vuln/srso.rst index e715bfc09879..66af95251a3d 100644 --- a/Documentation/admin-guide/hw-vuln/srso.rst +++ b/Documentation/admin-guide/hw-vuln/srso.rst @@ -104,7 +104,20 @@ The possible values in this file are: (spec_rstack_overflow=ibpb-vmexit) + * 'Mitigation: Reduced Speculation': + This mitigation gets automatically enabled when the above one "IBPB on + VMEXIT" has been selected and the CPU supports the BpSpecReduce bit. + + It gets automatically enabled on machines which have the + SRSO_USER_KERNEL_NO=1 CPUID bit. In that case, the code logic is to switch + to the above =ibpb-vmexit mitigation because the user/kernel boundary is + not affected anymore and thus "safe RET" is not needed. + + After enabling the IBPB on VMEXIT mitigation option, the BpSpecReduce bit + is detected (functionality present on all such machines) and that + practically overrides IBPB on VMEXIT as it has a lot less performance + impact and takes care of the guest->host attack vector too. In order to exploit vulnerability, an attacker needs to: @@ -135,7 +148,7 @@ and does not want to suffer the performance impact, one can always disable the mitigation with spec_rstack_overflow=off. Similarly, 'Mitigation: IBPB' is another full mitigation type employing -an indrect branch prediction barrier after having applied the required +an indirect branch prediction barrier after having applied the required microcode patch for one's system. This mitigation comes also at a performance cost. @@ -158,3 +171,72 @@ poisoned BTB entry and using that safe one for all function returns. In older Zen1 and Zen2, this is accomplished using a reinterpretation technique similar to Retbleed one: srso_untrain_ret() and srso_safe_ret(). + +Checking the safe RET mitigation actually works +----------------------------------------------- + +In case one wants to validate whether the SRSO safe RET mitigation works +on a kernel, one could use two performance counters + +* PMC_0xc8 - Count of RET/RET lw retired +* PMC_0xc9 - Count of RET/RET lw retired mispredicted + +and compare the number of RETs retired properly vs those retired +mispredicted, in kernel mode. Another way of specifying those events +is:: + + # perf list ex_ret_near_ret + + List of pre-defined events (to be used in -e or -M): + + core: + ex_ret_near_ret + [Retired Near Returns] + ex_ret_near_ret_mispred + [Retired Near Returns Mispredicted] + +Either the command using the event mnemonics:: + + # perf stat -e ex_ret_near_ret:k -e ex_ret_near_ret_mispred:k sleep 10s + +or using the raw PMC numbers:: + + # perf stat -e cpu/event=0xc8,umask=0/k -e cpu/event=0xc9,umask=0/k sleep 10s + +should give the same amount. I.e., every RET retired should be +mispredicted:: + + [root@brent: ~/kernel/linux/tools/perf> ./perf stat -e cpu/event=0xc8,umask=0/k -e cpu/event=0xc9,umask=0/k sleep 10s + + Performance counter stats for 'sleep 10s': + + 137,167 cpu/event=0xc8,umask=0/k + 137,173 cpu/event=0xc9,umask=0/k + + 10.004110303 seconds time elapsed + + 0.000000000 seconds user + 0.004462000 seconds sys + +vs the case when the mitigation is disabled (spec_rstack_overflow=off) +or not functioning properly, showing usually a lot smaller number of +mispredicted retired RETs vs the overall count of retired RETs during +a workload:: + + [root@brent: ~/kernel/linux/tools/perf> ./perf stat -e cpu/event=0xc8,umask=0/k -e cpu/event=0xc9,umask=0/k sleep 10s + + Performance counter stats for 'sleep 10s': + + 201,627 cpu/event=0xc8,umask=0/k + 4,074 cpu/event=0xc9,umask=0/k + + 10.003267252 seconds time elapsed + + 0.002729000 seconds user + 0.000000000 seconds sys + +Also, there is a selftest which performs the above, go to +tools/testing/selftests/x86/ and do:: + + make srso + ./srso diff --git a/Documentation/admin-guide/index.rst b/Documentation/admin-guide/index.rst index 32ea52f1d150..259d79fbeb94 100644 --- a/Documentation/admin-guide/index.rst +++ b/Documentation/admin-guide/index.rst @@ -7,6 +7,9 @@ added to the kernel over time. There is, as yet, little overall order or organization here — this material was not written to be a single, coherent document! With luck things will improve quickly over time. +General guides to kernel administration +--------------------------------------- + This initial section contains overall information, including the README file describing the kernel as a whole, documentation on kernel parameters, etc. @@ -15,19 +18,44 @@ etc. :maxdepth: 1 README - kernel-parameters devices - sysctl/index - abi features -This section describes CPU vulnerabilities and their mitigations. +A big part of the kernel's administrative interface is the /proc and sysfs +virtual filesystems; these documents describe how to interact with tem + +.. toctree:: + :maxdepth: 1 + + sysfs-rules + sysctl/index + cputopology + abi + +Security-related documentation: .. toctree:: :maxdepth: 1 hw-vuln/index + LSM/index + perf-security + +Booting the kernel +------------------ + +.. toctree:: + :maxdepth: 1 + + bootconfig + kernel-parameters + efi-stub + initrd + + +Tracking down and identifying problems +-------------------------------------- Here is a set of documents aimed at users who are trying to track down problems and bugs in particular. @@ -48,95 +76,119 @@ problems and bugs in particular. kdump/index perf/index pstore-blk + clearing-warn-once + kernel-per-CPU-kthreads + lockup-watchdogs + RAS/index + sysrq -This is the beginning of a section with information of interest to -application developers. Documents covering various aspects of the kernel -ABI will be found here. + +Core-kernel subsystems +---------------------- + +These documents describe core-kernel administration interfaces that are +likely to be of interest on almost any system. .. toctree:: :maxdepth: 1 - sysfs-rules + cgroup-v2 + cgroup-v1/index + cpu-load + mm/index + module-signing + namespaces/index + numastat + pm/index + syscall-user-dispatch -This is the beginning of a section with information of interest to -application developers and system integrators doing analysis of the -Linux kernel for safety critical applications. Documents supporting -analysis of kernel interactions with applications, and key kernel -subsystems expectations will be found here. +Support for non-native binary formats. Note that some of these +documents are ... old ... .. toctree:: :maxdepth: 1 - workload-tracing + binfmt-misc + java + mono -The rest of this manual consists of various unordered guides on how to -configure specific aspects of kernel behavior to your liking. + +Block-layer and filesystem administration +----------------------------------------- .. toctree:: :maxdepth: 1 - acpi/index - aoe/index - auxdisplay/index bcache binderfs - binfmt-misc blockdev/index - bootconfig - braille-console - btmrvl - cgroup-v1/index - cgroup-v2 cifs/index - clearing-warn-once - cpu-load - cputopology - dell_rbu device-mapper/index - edid - efi-stub ext4 filesystem-monitoring nfs/index - gpio/index - highuid - hw_random - initrd iostats - java jfs - kernel-per-CPU-kthreads + md + ufs + xfs + +Device-specific guides +---------------------- + +How to configure your hardware within your Linux system. + +.. toctree:: + :maxdepth: 1 + + acpi/index + aoe/index + auxdisplay/index + braille-console + btmrvl + dell_rbu + edid + gpio/index + hw_random laptops/index lcd-panel-cgram - ldm - lockup-watchdogs - LSM/index - md media/index - mm/index - module-signing - mono - namespaces/index - numastat + nvme-multipath parport - perf-security - pm/index - pmf pnp rapidio - RAS/index rtc serial-console svga - syscall-user-dispatch - sysrq thermal/index thunderbolt - ufs - unicode vga-softcursor video-output - xfs + +Workload analysis +----------------- + +This is the beginning of a section with information of interest to +application developers and system integrators doing analysis of the +Linux kernel for safety critical applications. Documents supporting +analysis of kernel interactions with applications, and key kernel +subsystems expectations will be found here. + +.. toctree:: + :maxdepth: 1 + + workload-tracing + +Everything else +--------------- + +A few hard-to-categorize and generally obsolete documents. + +.. toctree:: + :maxdepth: 1 + + ldm + unicode .. only:: subproject and html diff --git a/Documentation/admin-guide/iostats.rst b/Documentation/admin-guide/iostats.rst index 609a3201fd4e..9453196ade51 100644 --- a/Documentation/admin-guide/iostats.rst +++ b/Documentation/admin-guide/iostats.rst @@ -2,62 +2,39 @@ I/O statistics fields ===================== -Since 2.4.20 (and some versions before, with patches), and 2.5.45, -more extensive disk statistics have been introduced to help measure disk -activity. Tools such as ``sar`` and ``iostat`` typically interpret these and do -the work for you, but in case you are interested in creating your own -tools, the fields are explained here. - -In 2.4 now, the information is found as additional fields in -``/proc/partitions``. In 2.6 and upper, the same information is found in two -places: one is in the file ``/proc/diskstats``, and the other is within -the sysfs file system, which must be mounted in order to obtain -the information. Throughout this document we'll assume that sysfs -is mounted on ``/sys``, although of course it may be mounted anywhere. -Both ``/proc/diskstats`` and sysfs use the same source for the information -and so should not differ. - -Here are examples of these different formats:: - - 2.4: - 3 0 39082680 hda 446216 784926 9550688 4382310 424847 312726 5922052 19310380 0 3376340 23705160 - 3 1 9221278 hda1 35486 0 35496 38030 0 0 0 0 0 38030 38030 - - 2.6+ sysfs: - 446216 784926 9550688 4382310 424847 312726 5922052 19310380 0 3376340 23705160 - 35486 38030 38030 38030 - - 2.6+ diskstats: - 3 0 hda 446216 784926 9550688 4382310 424847 312726 5922052 19310380 0 3376340 23705160 - 3 1 hda1 35486 38030 38030 38030 - - 4.18+ diskstats: - 3 0 hda 446216 784926 9550688 4382310 424847 312726 5922052 19310380 0 3376340 23705160 0 0 0 0 - -On 2.4 you might execute ``grep 'hda ' /proc/partitions``. On 2.6+, you have -a choice of ``cat /sys/block/hda/stat`` or ``grep 'hda ' /proc/diskstats``. - -The advantage of one over the other is that the sysfs choice works well -if you are watching a known, small set of disks. ``/proc/diskstats`` may -be a better choice if you are watching a large number of disks because -you'll avoid the overhead of 50, 100, or 500 or more opens/closes with -each snapshot of your disk statistics. - -In 2.4, the statistics fields are those after the device name. In -the above example, the first field of statistics would be 446216. -By contrast, in 2.6+ if you look at ``/sys/block/hda/stat``, you'll -find just the 15 fields, beginning with 446216. If you look at -``/proc/diskstats``, the 15 fields will be preceded by the major and -minor device numbers, and device name. Each of these formats provides -15 fields of statistics, each meaning exactly the same things. -All fields except field 9 are cumulative since boot. Field 9 should -go to zero as I/Os complete; all others only increase (unless they -overflow and wrap). Wrapping might eventually occur on a very busy -or long-lived system; so applications should be prepared to deal with -it. Regarding wrapping, the types of the fields are either unsigned -int (32 bit) or unsigned long (32-bit or 64-bit, depending on your -machine) as noted per-field below. Unless your observations are very -spread in time, these fields should not wrap twice before you notice it. +The kernel exposes disk statistics via ``/proc/diskstats`` and +``/sys/block/<device>/stat``. These stats are usually accessed via tools +such as ``sar`` and ``iostat``. + +Here are examples using a disk with two partitions:: + + /proc/diskstats: + 259 0 nvme0n1 255999 814 12369153 47919 996852 81 36123024 425995 0 301795 580470 0 0 0 0 60602 106555 + 259 1 nvme0n1p1 492 813 17572 96 848 81 108288 210 0 76 307 0 0 0 0 0 0 + 259 2 nvme0n1p2 255401 1 12343477 47799 996004 0 36014736 425784 0 344336 473584 0 0 0 0 0 0 + + /sys/block/nvme0n1/stat: + 255999 814 12369153 47919 996858 81 36123056 426009 0 301809 580491 0 0 0 0 60605 106562 + + /sys/block/nvme0n1/nvme0n1p1/stat: + 492 813 17572 96 848 81 108288 210 0 76 307 0 0 0 0 0 0 + +Both files contain the same 17 statistics. ``/sys/block/<device>/stat`` +contains the fields for ``<device>``. In ``/proc/diskstats`` the fields +are prefixed with the major and minor device numbers and the device +name. In the example above, the first stat value for ``nvme0n1`` is +255999 in both files. + +The sysfs ``stat`` file is efficient for monitoring a small, known set +of disks. If you're tracking a large number of devices, +``/proc/diskstats`` is often the better choice since it avoids the +overhead of opening and closing multiple files for each snapshot. + +All fields are cumulative, monotonic counters, except for field 9, which +resets to zero as I/Os complete. The remaining fields reset at boot, on +device reattachment or reinitialization, or when the underlying counter +overflows. Applications reading these counters should detect and handle +resets when comparing stat snapshots. Each set of stats only applies to the indicated device; if you want system-wide stats you'll have to find all the devices and sum them all up. diff --git a/Documentation/admin-guide/kdump/kdump.rst b/Documentation/admin-guide/kdump/kdump.rst index 0302a93b1d40..20fabdf6567e 100644 --- a/Documentation/admin-guide/kdump/kdump.rst +++ b/Documentation/admin-guide/kdump/kdump.rst @@ -136,10 +136,6 @@ System kernel config options CONFIG_KEXEC_CORE=y - Subsequently, CRASH_CORE is selected by KEXEC_CORE:: - - CONFIG_CRASH_CORE=y - 2) Enable "sysfs file system support" in "Filesystem" -> "Pseudo filesystems." This is usually enabled by default:: @@ -168,6 +164,10 @@ Dump-capture kernel config options (Arch Independent) CONFIG_CRASH_DUMP=y + And this will select VMCORE_INFO and CRASH_RESERVE:: + CONFIG_VMCORE_INFO=y + CONFIG_CRASH_RESERVE=y + 2) Enable "/proc/vmcore support" under "Filesystems" -> "Pseudo filesystems":: CONFIG_PROC_VMCORE=y @@ -180,10 +180,6 @@ Dump-capture kernel config options (Arch Dependent, i386 and x86_64) 1) On i386, enable high memory support under "Processor type and features":: - CONFIG_HIGHMEM64G=y - - or:: - CONFIG_HIGHMEM4G 2) With CONFIG_SMP=y, usually nr_cpus=1 need specified on the kernel @@ -551,6 +547,38 @@ from within add_taint() whenever the value set in this bitmask matches with the bit flag being set by add_taint(). This will cause a kdump to occur at the add_taint()->panic() call. +Write the dump file to encrypted disk volume +============================================ + +CONFIG_CRASH_DM_CRYPT can be enabled to support saving the dump file to an +encrypted disk volume (only x86_64 supported for now). User space can interact +with /sys/kernel/config/crash_dm_crypt_keys for setup, + +1. Tell the first kernel what logon keys are needed to unlock the disk volumes, + # Add key #1 + mkdir /sys/kernel/config/crash_dm_crypt_keys/7d26b7b4-e342-4d2d-b660-7426b0996720 + # Add key #1's description + echo cryptsetup:7d26b7b4-e342-4d2d-b660-7426b0996720 > /sys/kernel/config/crash_dm_crypt_keys/description + + # how many keys do we have now? + cat /sys/kernel/config/crash_dm_crypt_keys/count + 1 + + # Add key #2 in the same way + + # how many keys do we have now? + cat /sys/kernel/config/crash_dm_crypt_keys/count + 2 + + # To support CPU/memory hot-plugging, re-use keys already saved to reserved + # memory + echo true > /sys/kernel/config/crash_dm_crypt_key/reuse + +2. Load the dump-capture kernel + +3. After the dump-capture kerne get booted, restore the keys to user keyring + echo yes > /sys/kernel/crash_dm_crypt_keys/restore + Contact ======= diff --git a/Documentation/admin-guide/kdump/vmcoreinfo.rst b/Documentation/admin-guide/kdump/vmcoreinfo.rst index 0f714fc945ac..8cf4614385b7 100644 --- a/Documentation/admin-guide/kdump/vmcoreinfo.rst +++ b/Documentation/admin-guide/kdump/vmcoreinfo.rst @@ -331,8 +331,8 @@ PG_lru|PG_private|PG_swapcache|PG_swapbacked|PG_slab|PG_hwpoision|PG_head_mask|P Page attributes. These flags are used to filter various unnecessary for dumping pages. -PAGE_BUDDY_MAPCOUNT_VALUE(~PG_buddy)|PAGE_OFFLINE_MAPCOUNT_VALUE(~PG_offline) ------------------------------------------------------------------------------ +PAGE_BUDDY_MAPCOUNT_VALUE(~PG_buddy)|PAGE_OFFLINE_MAPCOUNT_VALUE(~PG_offline)|PAGE_OFFLINE_MAPCOUNT_VALUE(~PG_unaccepted) +------------------------------------------------------------------------------------------------------------------------- More page attributes. These flags are used to filter various unnecessary for dumping pages. diff --git a/Documentation/admin-guide/kernel-parameters.rst b/Documentation/admin-guide/kernel-parameters.rst index e8bdf5e86a9b..39d0e7ff0965 100644 --- a/Documentation/admin-guide/kernel-parameters.rst +++ b/Documentation/admin-guide/kernel-parameters.rst @@ -27,6 +27,16 @@ kernel command line (/proc/cmdline) and collects module parameters when it loads a module, so the kernel command line can be used for loadable modules too. +This document may not be entirely up to date and comprehensive. The command +"modinfo -p ${modulename}" shows a current list of all parameters of a loadable +module. Loadable modules, after being loaded into the running kernel, also +reveal their parameters in /sys/module/${modulename}/parameters/. Some of these +parameters may be changed at runtime by the command +``echo -n ${value} > /sys/module/${modulename}/parameters/${parm}``. + +Special handling +---------------- + Hyphens (dashes) and underscores are equivalent in parameter names, so:: log_buf_len=1M print-fatal-signals=1 @@ -39,8 +49,8 @@ Double-quotes can be used to protect spaces in values, e.g.:: param="spaces in here" -cpu lists: ----------- +cpu lists +~~~~~~~~~ Some kernel parameters take a list of CPUs as a value, e.g. isolcpus, nohz_full, irqaffinity, rcu_nocbs. The format of this list is: @@ -82,12 +92,17 @@ so that "nohz_full=all" is the equivalent of "nohz_full=0-N". The semantics of "N" and "all" is supported on a level of bitmaps and holds for all users of bitmap_parselist(). -This document may not be entirely up to date and comprehensive. The command -"modinfo -p ${modulename}" shows a current list of all parameters of a loadable -module. Loadable modules, after being loaded into the running kernel, also -reveal their parameters in /sys/module/${modulename}/parameters/. Some of these -parameters may be changed at runtime by the command -``echo -n ${value} > /sys/module/${modulename}/parameters/${parm}``. +Metric suffixes +~~~~~~~~~~~~~~~ + +The [KMG] suffix is commonly described after a number of kernel +parameter values. 'K', 'M', 'G', 'T', 'P', and 'E' suffixes are allowed. +These letters represent the _binary_ multipliers 'Kilo', 'Mega', 'Giga', +'Tera', 'Peta', and 'Exa', equaling 2^10, 2^20, 2^30, 2^40, 2^50, and +2^60 bytes respectively. Such letter suffixes can also be entirely omitted. + +Kernel Build Options +-------------------- The parameters listed below are only valid if certain kernel build options were enabled and if respective hardware is present. This list should be kept @@ -118,7 +133,6 @@ is applicable:: HIBERNATION HIBERNATION is enabled. HW Appropriate hardware is enabled. HYPER_V HYPERV support is enabled. - IA-64 IA-64 architecture is enabled. IMA Integrity measurement architecture is enabled. IP_PNP IP DHCP, BOOTP, or RARP is enabled. IPV6 IPv6 support is enabled. @@ -160,6 +174,7 @@ is applicable:: SCSI Appropriate SCSI support is enabled. A lot of drivers have their options described inside the Documentation/scsi/ sub-directory. + SDW SoundWire support is enabled. SECURITY Different security models are enabled. SELINUX SELinux support is enabled. SERIAL Serial support is enabled. @@ -179,8 +194,6 @@ is applicable:: WDT Watchdog support is enabled. X86-32 X86-32, aka i386 architecture is enabled. X86-64 X86-64 architecture is enabled. - More X86-64 boot options can be found in - Documentation/arch/x86/x86_64/boot-options.rst. X86 Either 32-bit or 64-bit x86 (same as X86-32+X86-64) X86_UV SGI UV support is enabled. XEN Xen support is enabled @@ -198,7 +211,6 @@ Do not modify the syntax of boot loader parameters without extreme need or coordination with <Documentation/arch/x86/boot.rst>. There are also arch-specific kernel-parameters not documented here. -See for example <Documentation/arch/x86/x86_64/boot-options.rst>. Note that ALL kernel parameters listed below are CASE SENSITIVE, and that a trailing = on the name of any parameter states that that parameter will @@ -212,10 +224,5 @@ a fixed number of characters. This limit depends on the architecture and is between 256 and 4096 characters. It is defined in the file ./include/uapi/asm-generic/setup.h as COMMAND_LINE_SIZE. -Finally, the [KMG] suffix is commonly described after a number of kernel -parameter values. These 'K', 'M', and 'G' letters represent the _binary_ -multipliers 'Kilo', 'Mega', and 'Giga', equaling 2^10, 2^20, and 2^30 -bytes respectively. Such letter suffixes can also be entirely omitted: - .. include:: kernel-parameters.txt :literal: diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt index 213d0719e2b7..a3ea40b22fb9 100644 --- a/Documentation/admin-guide/kernel-parameters.txt +++ b/Documentation/admin-guide/kernel-parameters.txt @@ -12,7 +12,7 @@ acpi= [HW,ACPI,X86,ARM64,RISCV64,EARLY] Advanced Configuration and Power Interface Format: { force | on | off | strict | noirq | rsdt | - copy_dsdt } + copy_dsdt | nospcr } force -- enable ACPI if default was off on -- enable ACPI but allow fallback to DT [arm64,riscv64] off -- disable ACPI if default was on @@ -21,8 +21,16 @@ strictly ACPI specification compliant. rsdt -- prefer RSDT over (default) XSDT copy_dsdt -- copy DSDT to memory - For ARM64 and RISCV64, ONLY "acpi=off", "acpi=on" or - "acpi=force" are available + nocmcff -- Disable firmware first mode for corrected + errors. This disables parsing the HEST CMC error + source to check if firmware has set the FF flag. This + may result in duplicate corrected error reports. + nospcr -- disable console in ACPI SPCR table as + default _serial_ console on ARM64 + For ARM64, ONLY "acpi=off", "acpi=on", "acpi=force" or + "acpi=nospcr" are available + For RISCV64, ONLY "acpi=off", "acpi=on" or "acpi=force" + are available See also Documentation/power/runtime_pm.rst, pci=noacpi @@ -329,12 +337,17 @@ allowed anymore to lift isolation requirements as needed. This option does not override iommu=pt - force_enable - Force enable the IOMMU on platforms known - to be buggy with IOMMU enabled. Use this - option with care. - pgtbl_v1 - Use v1 page table for DMA-API (Default). - pgtbl_v2 - Use v2 page table for DMA-API. - irtcachedis - Disable Interrupt Remapping Table (IRT) caching. + force_enable - Force enable the IOMMU on platforms known + to be buggy with IOMMU enabled. Use this + option with care. + pgtbl_v1 - Use v1 page table for DMA-API (Default). + pgtbl_v2 - Use v2 page table for DMA-API. + irtcachedis - Disable Interrupt Remapping Table (IRT) caching. + nohugepages - Limit page-sizes used for v1 page-tables + to 4 KiB. + v2_pgsizes_only - Limit page-sizes used for v1 page-tables + to 4KiB/2Mib/1GiB. + amd_iommu_dump= [HW,X86-64] Enable AMD IOMMU driver option to dump the ACPI table @@ -396,15 +409,13 @@ not play well with APC CPU idle - disable it if you have APC and your system crashes randomly. + apic [APIC,X86-64] Use IO-APIC. Default. + apic= [APIC,X86,EARLY] Advanced Programmable Interrupt Controller Change the output verbosity while booting Format: { quiet (default) | verbose | debug } Change the amount of debugging information output when initialising the APIC and IO-APIC components. - For X86-32, this can also be used to specify an APIC - driver name. - Format: apic=driver_name - Examples: apic=bigsmp apic_extnmi= [APIC,X86,EARLY] External NMI delivery setting Format: { bsp (default) | all | none } @@ -415,6 +426,10 @@ useful so that a dump capture kernel won't be shot down by NMI + apicpmtimer Do APIC timer calibration using the pmtimer. Implies + apicmaintimer. Useful when your PIT timer is totally + broken. + autoconf= [IPV6] See Documentation/networking/ipv6.rst. @@ -431,9 +446,15 @@ arcrimi= [HW,NET] ARCnet - "RIM I" (entirely mem-mapped) cards Format: <io>,<irq>,<nodeID> + arm64.no32bit_el0 [ARM64] Unconditionally disable the execution of + 32 bit applications. + arm64.nobti [ARM64] Unconditionally disable Branch Target Identification support + arm64.nogcs [ARM64] Unconditionally disable Guarded Control Stack + support + arm64.nomops [ARM64] Unconditionally disable Memory Copy and Memory Set instructions support @@ -510,6 +531,18 @@ Format: <io>,<irq>,<mode> See header of drivers/net/hamradio/baycom_ser_hdx.c. + bdev_allow_write_mounted= + Format: <bool> + Control the ability to open a mounted block device + for writing, i.e., allow / disallow writes that bypass + the FS. This was implemented as a means to prevent + fuzzers from crashing the kernel by overwriting the + metadata underneath a mounted FS without its awareness. + This also prevents destructive formatting of mounted + filesystems by naive storage tooling that don't use + O_EXCL. Default is Y and can be changed through the + Kconfig option CONFIG_BLK_DEV_WRITE_MOUNTED. + bert_disable [ACPI] Disable BERT OS support on buggy BIOSes. @@ -785,6 +818,25 @@ Documentation/networking/netconsole.rst for an alternative. + <DEVNAME>:<n>.<n>[,options] + Use the specified serial port on the serial core bus. + The addressing uses DEVNAME of the physical serial port + device, followed by the serial core controller instance, + and the serial port instance. The options are the same + as documented for the ttyS addressing above. + + The mapping of the serial ports to the tty instances + can be viewed with: + + $ ls -d /sys/bus/serial-base/devices/*:*.*/tty/* + /sys/bus/serial-base/devices/00:04:0.0/tty/ttyS0 + + In the above example, the console can be addressed with + console=00:04:0.0. Note that a console addressed this + way will only get added when the related device driver + is ready. The use of an earlycon parameter in addition to + the console may be desired for console output early on. + uart[8250],io,<addr>[,options] uart[8250],mmio,<addr>[,options] uart[8250],mmio16,<addr>[,options] @@ -875,12 +927,16 @@ the parameter has no effect. crash_kexec_post_notifiers - Run kdump after running panic-notifiers and dumping - kmsg. This only for the users who doubt kdump always - succeeds in any situation. - Note that this also increases risks of kdump failure, - because some panic notifiers can make the crashed - kernel more unstable. + Only jump to kdump kernel after running the panic + notifiers and dumping kmsg. This option increases + the risks of a kdump failure, since some panic + notifiers can make the crashed kernel more unstable. + In configurations where kdump may not be reliable, + running the panic notifiers could allow collecting + more data on dmesg, like stack traces from other CPUS + or extra data dumped by panic_print. Note that some + configurations enable this option unconditionally, + like Hyper-V, PowerPC (fadump) and AMD SEV-SNP. crashkernel=size[KMG][@offset[KMG]] [KNL,EARLY] Using kexec, Linux can switch to a 'crash kernel' @@ -1351,7 +1407,8 @@ earlyprintk=serial[,0x...[,baudrate]] earlyprintk=ttySn[,baudrate] earlyprintk=dbgp[debugController#] - earlyprintk=pciserial[,force],bus:device.function[,baudrate] + earlyprintk=mmio32,membase[,{nocfg|baudrate}] + earlyprintk=pciserial[,force],bus:device.function[,{nocfg|baudrate}] earlyprintk=xdbc[xhciController#] earlyprintk=bios @@ -1359,6 +1416,9 @@ the normal console is initialized. It is not enabled by default because it has some cosmetic problems. + Use "nocfg" to skip UART configuration, assume + BIOS/firmware has configured UART correctly. + Append ",keep" to not disable it when the real console takes over. @@ -1428,27 +1488,6 @@ you are really sure that your UEFI does sane gc and fulfills the spec otherwise your board may brick. - efi_fake_mem= nn[KMG]@ss[KMG]:aa[,nn[KMG]@ss[KMG]:aa,..] [EFI,X86,EARLY] - Add arbitrary attribute to specific memory range by - updating original EFI memory map. - Region of memory which aa attribute is added to is - from ss to ss+nn. - - If efi_fake_mem=2G@4G:0x10000,2G@0x10a0000000:0x10000 - is specified, EFI_MEMORY_MORE_RELIABLE(0x10000) - attribute is added to range 0x100000000-0x180000000 and - 0x10a0000000-0x1120000000. - - If efi_fake_mem=8G@9G:0x40000 is specified, the - EFI_MEMORY_SP(0x40000) attribute is added to - range 0x240000000-0x43fffffff. - - Using this parameter you can do debugging of EFI memmap - related features. For example, you can do debugging of - Address Range Mirroring feature even if your box - doesn't support it, or mark specific memory as - "soft reserved". - efivar_ssdt= [EFI; X86] Name of an EFI variable that contains an SSDT that is to be dynamically loaded by Linux. If there are multiple variables with the same name but with different @@ -1524,6 +1563,7 @@ failslab= fail_usercopy= fail_page_alloc= + fail_skb_realloc= fail_make_request=[KNL] General fault injection mechanism. Format: <interval>,<probability>,<space>,<times> @@ -1696,6 +1736,8 @@ off: Disable GDS mitigation. + gbpages [X86] Use GB pages for kernel direct mappings. + gcov_persist= [GCOV] When non-zero (default), profiling data for kernel modules is saved and remains accessible via debugfs, even when the module is unloaded/reloaded. @@ -1743,7 +1785,9 @@ allocation boundaries as a proactive defense against bounds-checking flaws in the kernel's copy_to_user()/copy_from_user() interface. - on Perform hardened usercopy checks (default). + The default is determined by + CONFIG_HARDENED_USERCOPY_DEFAULT_ON. + on Perform hardened usercopy checks. off Disable hardened usercopy checks. hardlockup_all_cpu_backtrace= @@ -1756,8 +1800,6 @@ for 64-bit NUMA, off otherwise. Format: 0 | 1 (for off | on) - hcl= [IA-64] SGI's Hardware Graph compatibility layer - hd= [EIDE] (E)IDE hard drive subsystem geometry Format: <cyl>,<head>,<sect> @@ -1786,6 +1828,13 @@ lz4: Select LZ4 compression algorithm to compress/decompress hibernation image. + hibernate.pm_test_delay= + [HIBERNATION] + Sets the number of seconds to remain in a hibernation test + mode before resuming the system (see + /sys/power/pm_test). Only available when CONFIG_PM_DEBUG + is set. Default value is 5. + highmem=nn[KMG] [KNL,BOOT,EARLY] forces the highmem zone to have an exact size of <nn>. This works even on boxes that have no highmem otherwise. This also works to reduce highmem @@ -1821,7 +1870,7 @@ hpet_mmap= [X86, HPET_MMAP] Allow userspace to mmap HPET registers. Default set by CONFIG_HPET_MMAP_DEFAULT. - hugepages= [HW] Number of HugeTLB pages to allocate at boot. + hugepages= [HW,EARLY] Number of HugeTLB pages to allocate at boot. If this follows hugepagesz (below), it specifies the number of pages of hugepagesz to be allocated. If this is the first HugeTLB parameter on the command @@ -1833,15 +1882,24 @@ <node>:<integer>[,<node>:<integer>] hugepagesz= - [HW] The size of the HugeTLB pages. This is used in - conjunction with hugepages (above) to allocate huge - pages of a specific size at boot. The pair - hugepagesz=X hugepages=Y can be specified once for - each supported huge page size. Huge page sizes are - architecture dependent. See also + [HW,EARLY] The size of the HugeTLB pages. This is + used in conjunction with hugepages (above) to + allocate huge pages of a specific size at boot. The + pair hugepagesz=X hugepages=Y can be specified once + for each supported huge page size. Huge page sizes + are architecture dependent. See also Documentation/admin-guide/mm/hugetlbpage.rst. Format: size[KMG] + hugepage_alloc_threads= + [HW] The number of threads that should be used to + allocate hugepages during boot. This option can be + used to improve system bootup time when allocating + a large amount of huge pages. + The default value is 25% of the available hardware threads. + + Note that this parameter only applies to non-gigantic huge pages. + hugetlb_cma= [HW,CMA,EARLY] The size of a CMA area used for allocation of gigantic hugepages. Or using node format, the size of a CMA area per node can be specified. @@ -1852,6 +1910,13 @@ hugepages using the CMA allocator. If enabled, the boot-time allocation of gigantic hugepages is skipped. + hugetlb_cma_only= + [HW,CMA,EARLY] When allocating new HugeTLB pages, only + try to allocate from the CMA areas. + + This option does nothing if hugetlb_cma= is not also + specified. + hugetlb_free_vmemmap= [KNL] Requires CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP enabled. @@ -1893,12 +1958,40 @@ which allow the hypervisor to 'idle' the guest on lock contention. + hw_protection= [HW] + Format: reboot | shutdown + + Hardware protection action taken on critical events like + overtemperature or imminent voltage loss. + i2c_bus= [HW] Override the default board specific I2C bus speed or register an additional I2C bus that is not registered from board initialization code. Format: <bus_id>,<clkrate> + i2c_touchscreen_props= [HW,ACPI,X86] + Set device-properties for ACPI-enumerated I2C-attached + touchscreen, to e.g. fix coordinates of upside-down + mounted touchscreens. If you need this option please + submit a drivers/platform/x86/touchscreen_dmi.c patch + adding a DMI quirk for this. + + Format: + <ACPI_HW_ID>:<prop_name>=<val>[:prop_name=val][:...] + Where <val> is one of: + Omit "=<val>" entirely Set a boolean device-property + Unsigned number Set a u32 device-property + Anything else Set a string device-property + + Examples (split over multiple lines): + i2c_touchscreen_props=GDIX1001:touchscreen-inverted-x: + touchscreen-inverted-y + + i2c_touchscreen_props=MSSL1680:touchscreen-size-x=1920: + touchscreen-size-y=1080:touchscreen-inverted-y: + firmware-name=gsl1680-vendor-model.fw:silead,home-button + i8042.debug [HW] Toggle i8042 debug mode i8042.unmask_kbd_data [HW] Enable printing of interrupt data from the KBD port @@ -1958,12 +2051,21 @@ idle= [X86,EARLY] Format: idle=poll, idle=halt, idle=nomwait - Poll forces a polling idle loop that can slightly - improve the performance of waking up a idle CPU, but - will use a lot of power and make the system run hot. - Not recommended. + + idle=poll: Don't do power saving in the idle loop + using HLT, but poll for rescheduling event. This will + make the CPUs eat a lot more power, but may be useful + to get slightly better performance in multiprocessor + benchmarks. It also makes some profiling using + performance counters more accurate. Please note that + on systems with MONITOR/MWAIT support (like Intel + EM64T CPUs) this option has no performance advantage + over the normal idle loop. It may also interact badly + with hyperthreading. + idle=halt: Halt is forced to be used for CPU idle. In such case C2/C3 won't be used again. + idle=nomwait: Disable mwait for CPU C-states idxd.sva= [HW] @@ -1978,7 +2080,7 @@ for the device. By default it is set to false (0). ieee754= [MIPS] Select IEEE Std 754 conformance mode - Format: { strict | legacy | 2008 | relaxed } + Format: { strict | legacy | 2008 | relaxed | emulated } Default: strict Choose which programs will be accepted for execution @@ -1998,6 +2100,8 @@ by the FPU relaxed accept any binaries regardless of whether supported by the FPU + emulated accept any binaries but enable FPU emulator + if binary mode is unsupported by the FPU. The FPU emulator is always able to support both NaN encodings, so if no FPU hardware is present or it has @@ -2105,6 +2209,23 @@ different crypto accelerators. This option can be used to achieve best performance for particular HW. + indirect_target_selection= [X86,Intel] Mitigation control for Indirect + Target Selection(ITS) bug in Intel CPUs. Updated + microcode is also required for a fix in IBPB. + + on: Enable mitigation (default). + off: Disable mitigation. + force: Force the ITS bug and deploy default + mitigation. + vmexit: Only deploy mitigation if CPU is affected by + guest/host isolation part of ITS. + stuff: Deploy RSB-fill mitigation when retpoline is + also deployed. Otherwise, deploy the default + mitigation. + + For details see: + Documentation/admin-guide/hw-vuln/indirect-target-selection.rst + init= [KNL] Format: <full_path> Run specified binary instead of /sbin/init as init @@ -2243,6 +2364,9 @@ per_cpu_perf_limits Allow per-logical-CPU P-State performance control limits using cpufreq sysfs interface + no_cas + Do not enable capacity-aware scheduling (CAS) on + hybrid systems intremap= [X86-64,Intel-IOMMU,EARLY] on enable Interrupt Remapping (default) @@ -2251,26 +2375,81 @@ no_x2apic_optout BIOS x2APIC opt-out request will be ignored nopost disable Interrupt Posting + posted_msi + enable MSIs delivered as posted interrupts iomem= Disable strict checking of access to MMIO memory strict regions from userspace. relaxed iommu= [X86,EARLY] + off + Don't initialize and use any kind of IOMMU. + force + Force the use of the hardware IOMMU even when + it is not actually needed (e.g. because < 3 GB + memory). + noforce + Don't force hardware IOMMU usage when it is not + needed. (default). + biomerge panic nopanic merge nomerge + soft - pt [X86] - nopt [X86] - nobypass [PPC/POWERNV] + Use software bounce buffering (SWIOTLB) (default for + Intel machines). This can be used to prevent the usage + of an available hardware IOMMU. + + [X86] + pt + [X86] + nopt + [PPC/POWERNV] + nobypass Disable IOMMU bypass, using IOMMU for PCI devices. + [X86] + AMD Gart HW IOMMU-specific options: + + <size> + Set the size of the remapping area in bytes. + + allowed + Overwrite iommu off workarounds for specific chipsets + + fullflush + Flush IOMMU on each allocation (default). + + nofullflush + Don't use IOMMU fullflush. + + memaper[=<order>] + Allocate an own aperture over RAM with size + 32MB<<order. (default: order=1, i.e. 64MB) + + merge + Do scatter-gather (SG) merging. Implies "force" + (experimental). + + nomerge + Don't do scatter-gather (SG) merging. + + noaperture + Ask the IOMMU not to touch the aperture for AGP. + + noagp + Don't initialize the AGP driver and use full aperture. + + panic + Always panic when IOMMU overflows. + iommu.forcedac= [ARM64,X86,EARLY] Control IOVA allocation for PCI devices. Format: { "0" | "1" } 0 - Try to allocate a 32-bit DMA address first, before @@ -2321,6 +2500,18 @@ ipcmni_extend [KNL,EARLY] Extend the maximum number of unique System V IPC identifiers from 32,768 to 16,777,216. + ipe.enforce= [IPE] + Format: <bool> + Determine whether IPE starts in permissive (0) or + enforce (1) mode. The default is enforce. + + ipe.success_audit= + [IPE] + Format: <bool> + Start IPE with success auditing enabled, emitting + an audit event when a binary is allowed. The default + is 0. + irqaffinity= [SMP] Set the default irq affinity mask The argument is a cpu list, as described above. @@ -2366,7 +2557,9 @@ specified in the flag list (default: domain): nohz - Disable the tick when a single task runs. + Disable the tick when a single task runs as well as + disabling other kernel noises like having RCU callbacks + offloaded. This is equivalent to the nohz_full parameter. A residual 1Hz tick is offloaded to workqueues, which you need to affine to housekeeping through the global @@ -2492,7 +2685,7 @@ keepinitrd [HW,ARM] See retain_initrd. - kernelcore= [KNL,X86,IA-64,PPC,EARLY] + kernelcore= [KNL,X86,PPC,EARLY] Format: nn[KMGTPE] | nn% | "mirror" This parameter specifies the amount of memory usable by the kernel for non-movable allocations. The requested @@ -2556,6 +2749,31 @@ kgdbwait [KGDB,EARLY] Stop kernel execution and enter the kernel debugger at the earliest opportunity. + kho= [KEXEC,EARLY] + Format: { "0" | "1" | "off" | "on" | "y" | "n" } + Enables or disables Kexec HandOver. + "0" | "off" | "n" - kexec handover is disabled + "1" | "on" | "y" - kexec handover is enabled + + kho_scratch= [KEXEC,EARLY] + Format: ll[KMG],mm[KMG],nn[KMG] | nn% + Defines the size of the KHO scratch region. The KHO + scratch regions are physically contiguous memory + ranges that can only be used for non-kernel + allocations. That way, even when memory is heavily + fragmented with handed over memory, the kexeced + kernel will always have enough contiguous ranges to + bootstrap itself. + + It is possible to specify the exact amount of + memory in the form of "ll[KMG],mm[KMG],nn[KMG]" + where the first parameter defines the size of a low + memory scratch area, the second parameter defines + the size of a global scratch area and the third + parameter defines the size of additional per-node + scratch areas. The form "nn%" defines scale factor + (in percents) of memory that was used during boot. + kmac= [MIPS] Korina ethernet MAC address. Configure the RouterBoard 532 series on-chip Ethernet adapter MAC address. @@ -2619,6 +2837,23 @@ Default is Y (on). + kvm.enable_virt_at_load=[KVM,ARM64,LOONGARCH,MIPS,RISCV,X86] + If enabled, KVM will enable virtualization in hardware + when KVM is loaded, and disable virtualization when KVM + is unloaded (if KVM is built as a module). + + If disabled, KVM will dynamically enable and disable + virtualization on-demand when creating and destroying + VMs, i.e. on the 0=>1 and 1=>0 transitions of the + number of VMs. + + Enabling virtualization at module load avoids potential + latency for creation of the 0=>1 VM, as KVM serializes + virtualization enabling across all online CPUs. The + "cost" of enabling virtualization when KVM is loaded, + is that doing so may interfere with using out-of-tree + hypervisors that want to "own" virtualization hardware. + kvm.enable_vmware_backdoor=[KVM] Support VMware backdoor PV interface. Default is false (don't support). @@ -2665,17 +2900,21 @@ nvhe: Standard nVHE-based mode, without support for protected guests. - protected: nVHE-based mode with support for guests whose - state is kept private from the host. + protected: Mode with support for guests whose state is + kept private from the host, using VHE or + nVHE depending on HW support. nested: VHE-based mode with support for nested - virtualization. Requires at least ARMv8.3 - hardware. + virtualization. Requires at least ARMv8.4 + hardware (with FEAT_NV2). Defaults to VHE/nVHE based on hardware support. Setting mode to "protected" will disable kexec and hibernation - for the host. "nested" is experimental and should be - used with extreme caution. + for the host. To force nVHE on VHE hardware, add + "arm64_sw.hvhe=0 id_aa64mmfr1.vh=0" to the + command-line. + "nested" is experimental and should be used with + extreme caution. kvm-arm.vgic_v3_group0_trap= [KVM,ARM,EARLY] Trap guest accesses to GICv3 group-0 @@ -2693,6 +2932,24 @@ [KVM,ARM,EARLY] Allow use of GICv4 for direct injection of LPIs. + kvm-arm.wfe_trap_policy= + [KVM,ARM] Control when to set WFE instruction trap for + KVM VMs. Traps are allowed but not guaranteed by the + CPU architecture. + + trap: set WFE instruction trap + + notrap: clear WFE instruction trap + + kvm-arm.wfi_trap_policy= + [KVM,ARM] Control when to set WFI instruction trap for + KVM VMs. Traps are allowed but not guaranteed by the + CPU architecture. + + trap: set WFI instruction trap + + notrap: clear WFI instruction trap + kvm_cma_resv_ratio=n [PPC,EARLY] Reserves given percentage from system memory area for contiguous memory allocation for KVM hash pagetable @@ -2935,6 +3192,8 @@ * max_sec_lba48: Set or clear transfer size limit to 65535 sectors. + * external: Mark port as external (hotplug-capable). + * [no]lpm: Enable or disable link power management. * [no]setxfer: Indicate if transfer speed mode setting @@ -3132,26 +3391,16 @@ unlikely, in the extreme case this might damage your hardware. - ltpc= [NET] - Format: <io>,<irq>,<dma> - lsm.debug [SECURITY] Enable LSM initialization debugging output. lsm=lsm1,...,lsmN [SECURITY] Choose order of LSM initialization. This overrides CONFIG_LSM, and the "security=" parameter. - machvec= [IA-64] Force the use of a particular machine-vector - (machvec) in a generic kernel. - Example: machvec=hpzx1 - machtype= [Loongson] Share the same kernel image file between different yeeloong laptops. Example: machtype=lemote-yeeloong-2f-7inch - max_addr=nn[KMG] [KNL,BOOT,IA-64] All physical memory greater - than or equal to this physical address is ignored. - maxcpus= [SMP,EARLY] Maximum number of processors that an SMP kernel will bring up during bootup. maxcpus=n : n >= 0 limits the kernel to bring up 'n' processors. Surely after @@ -3168,9 +3417,77 @@ devices can be requested on-demand with the /dev/loop-control interface. - mce [X86-32] Machine Check Exception + mce= [X86-{32,64}] + + Please see Documentation/arch/x86/x86_64/machinecheck.rst for sysfs runtime tunables. + + off + disable machine check + + no_cmci + disable CMCI(Corrected Machine Check Interrupt) that + Intel processor supports. Usually this disablement is + not recommended, but it might be handy if your + hardware is misbehaving. + + Note that you'll get more problems without CMCI than + with due to the shared banks, i.e. you might get + duplicated error logs. + + dont_log_ce + don't make logs for corrected errors. All events + reported as corrected are silently cleared by OS. This + option will be useful if you have no interest in any + of corrected errors. + + ignore_ce + disable features for corrected errors, e.g. + polling timer and CMCI. All events reported as + corrected are not cleared by OS and remained in its + error banks. + + Usually this disablement is not recommended, however + if there is an agent checking/clearing corrected + errors (e.g. BIOS or hardware monitoring + applications), conflicting with OS's error handling, + and you cannot deactivate the agent, then this option + will be a help. + + no_lmce + do not opt-in to Local MCE delivery. Use legacy method + to broadcast MCEs. + + bootlog + enable logging of machine checks left over from + booting. Disabled by default on AMD Fam10h and older + because some BIOS leave bogus ones. + + If your BIOS doesn't do that it's a good idea to + enable though to make sure you log even machine check + events that result in a reboot. On Intel systems it is + enabled by default. + + nobootlog + disable boot machine check logging. + + monarchtimeout (number) + sets the time in us to wait for other CPUs on machine + checks. 0 to disable. + + bios_cmci_threshold + don't overwrite the bios-set CMCI threshold. This boot + option prevents Linux from overwriting the CMCI + threshold set by the bios. Without this option, Linux + always sets the CMCI threshold to 1. Enabling this may + make memory predictive failure analysis less effective + if the bios sets thresholds for memory errors since we + will not see details for all errors. + + recovery + force-enable recoverable machine check code paths + + Everything else is in sysfs now. - mce=option [X86-64] See Documentation/arch/x86/x86_64/boot-options.rst md= [HW] RAID subsystems devices and level See Documentation/admin-guide/md.rst. @@ -3260,8 +3577,8 @@ [KNL] Set the initial state for the memory hotplug onlining policy. If not specified, the default value is set according to the - CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE kernel config - option. + CONFIG_MHP_DEFAULT_ONLINE_TYPE kernel config + options. See Documentation/admin-guide/mm/memory-hotplug.rst. memmap=exactmap [KNL,X86,EARLY] Enable setting of an exact @@ -3377,10 +3694,6 @@ deep - Suspend-To-RAM or equivalent (if supported) See Documentation/admin-guide/pm/sleep-states.rst. - mfgpt_irq= [IA-32] Specify the IRQ to use for the - Multi-Function General Purpose Timers on AMD Geode - platforms. - mfgptfix [X86-32] Fix MFGPT timers on AMD Geode platforms when the BIOS has incorrectly applied a workaround. TinyBIOS version 0.98 is known to be affected, 0.99 fixes the @@ -3393,9 +3706,6 @@ Enable or disable the microcode minimal revision enforcement for the runtime microcode loader. - min_addr=nn[KMG] [KNL,BOOT,IA-64] All physical memory below this - physical address is ignored. - mini2440= [ARM,HW,KNL] Format:[0..2][b][c][t] Default: "0tb" @@ -3432,6 +3742,7 @@ expose users to several CPU vulnerabilities. Equivalent to: if nokaslr then kpti=0 [ARM64] gather_data_sampling=off [X86] + indirect_target_selection=off [X86] kvm.nx_huge_pages=off [X86] l1tf=off [X86] mds=off [X86] @@ -3560,7 +3871,7 @@ mousedev.yres= [MOUSE] Vertical screen resolution, used for devices reporting absolute coordinates, such as tablets - movablecore= [KNL,X86,IA-64,PPC,EARLY] + movablecore= [KNL,X86,PPC,EARLY] Format: nn[KMGTPE] | nn% This parameter is the complement to kernelcore=, it specifies the amount of memory used for migratable @@ -3586,11 +3897,6 @@ mtdparts= [MTD] See drivers/mtd/parsers/cmdlinepart.c - mtdset= [ARM] - ARM/S3C2412 JIVE boot control - - See arch/arm/mach-s3c/mach-jive.c - mtouchusb.raw_coordinates= [HW] Make the MicroTouch USB driver use raw coordinates ('y', default) or cooked coordinates ('n') @@ -3776,10 +4082,12 @@ Format: [state][,regs][,debounce][,die] nmi_watchdog= [KNL,BUGS=X86] Debugging features for SMP kernels - Format: [panic,][nopanic,][num] + Format: [panic,][nopanic,][rNNN,][num] Valid num: 0 or 1 0 - turn hardlockup detector in nmi_watchdog off 1 - turn hardlockup detector in nmi_watchdog on + rNNN - configure the watchdog with raw perf event 0xNNN + When panic is specified, panic when an NMI watchdog timeout occurs (or 'nopanic' to not panic on an NMI watchdog, if CONFIG_BOOTPARAM_HARDLOCKUP_PANIC is set) @@ -3803,12 +4111,11 @@ noalign [KNL,ARM] - noaltinstr [S390,EARLY] Disables alternative instructions - patching (CPU alternatives feature). - noapic [SMP,APIC,EARLY] Tells the kernel to not make use of any IOAPICs that may be present in the system. + noapictimer [APIC,X86] Don't set up the APIC timer + noautogroup Disable scheduler automatic task group creation. nocache [ARM,EARLY] @@ -3837,8 +4144,6 @@ no_entry_flush [PPC,EARLY] Don't flush the L1-D cache when entering the kernel. - noexec [IA-64] - noexec32 [X86-64] This affects only 32-bit executables. noexec32=on: enable non-executable mappings (default) @@ -3858,12 +4163,7 @@ register save and restore. The kernel will only save legacy floating-point registers on task switch. - nohalt [IA-64] Tells the kernel not to use the power saving - function PAL_HALT_LIGHT when idle. This increases - power-consumption. On the positive side, it reduces - interrupt wake-up latency, which may improve performance - in certain environments such as networked servers or - real-time systems. + nogbpages [X86] Do not use GB pages for kernel direct mappings. no_hash_pointers [KNL,EARLY] @@ -3882,7 +4182,7 @@ nohibernate [HIBERNATION] Disable hibernation and resume. - nohlt [ARM,ARM64,MICROBLAZE,MIPS,PPC,SH] Forces the kernel to + nohlt [ARM,ARM64,MICROBLAZE,MIPS,PPC,RISCV,SH] Forces the kernel to busy wait in do_idle() and not use the arch_cpu_idle() implementation; requires CONFIG_GENERIC_IDLE_POLL_SETUP to be effective. This is useful on platforms where the @@ -3891,6 +4191,8 @@ the impact of the sleep instructions. This is also useful when using JTAG debugger. + nohpet [X86] Don't use the HPET timer. + nohugeiomap [KNL,X86,PPC,ARM64,EARLY] Disable kernel huge I/O mappings. nohugevmalloc [KNL,X86,PPC,ARM64,EARLY] Disable kernel huge vmalloc mappings. @@ -3919,8 +4221,6 @@ remapping. [Deprecated - use intremap=off] - nointroute [IA-64] - noinvpcid [X86,EARLY] Disable the INVPCID cpu feature. noiotrap [SH] Disables trapped I/O port accesses. @@ -3930,8 +4230,6 @@ noisapnp [ISAPNP] Disables ISA PnP code. - nojitter [IA-64] Disables jitter checking for ITC timers. - nokaslr [KNL,EARLY] When CONFIG_RANDOMIZE_BASE is set, this disables kernel and module base offset ASLR (Address Space @@ -3946,8 +4244,6 @@ nolapic_timer [X86-32,APIC,EARLY] Do not use the local APIC timer. - nomca [IA-64] Disable machine check abort handling - nomce [X86-32] Disable Machine Check Exception nomfgpt [X86-32] Disable Multi-Function General Purpose @@ -3999,8 +4295,6 @@ noresume [SWSUSP] Disables resume and restores original swap space. - nosbagart [IA-64] - no-scroll [VGA] Disables scrollback. This is required for the Braillex ib80-piezo Braille reader made by F.H. Papenmeier (Germany). @@ -4018,10 +4312,10 @@ nosmp [SMP,EARLY] Tells an SMP kernel to act as a UP kernel, and disable the IO APIC. legacy for "maxcpus=0". - nosmt [KNL,MIPS,PPC,S390,EARLY] Disable symmetric multithreading (SMT). + nosmt [KNL,MIPS,PPC,EARLY] Disable symmetric multithreading (SMT). Equivalent to smt=1. - [KNL,X86,PPC] Disable symmetric multithreading (SMT). + [KNL,X86,PPC,S390] Disable symmetric multithreading (SMT). nosmt=force: Force disable SMT, cannot be undone via the sysfs control file. @@ -4044,14 +4338,16 @@ prediction) vulnerability. System may allow data leaks with this option. - no-steal-acc [X86,PV_OPS,ARM64,PPC/PSERIES,RISCV,EARLY] Disable - paravirtualized steal time accounting. steal time is - computed, but won't influence scheduler behaviour + no-steal-acc [X86,PV_OPS,ARM64,PPC/PSERIES,RISCV,LOONGARCH,EARLY] + Disable paravirtualized steal time accounting. steal time + is computed, but won't influence scheduler behaviour nosync [HW,M68K] Disables sync negotiation for all devices. - no_timer_check [X86,APIC] Disables the code which tests for - broken timer IRQ sources. + no_timer_check [X86,APIC] Disables the code which tests for broken + timer IRQ sources, i.e., the IO-APIC timer. This can + work around problems with incorrect timer + initialization on some boards. no_uaccess_flush [PPC,EARLY] Don't flush the L1-D cache after accessing user data. @@ -4101,19 +4397,6 @@ parameter, xsave area per process might occupy more memory on xsaves enabled systems. - nps_mtm_hs_ctr= [KNL,ARC] - This parameter sets the maximum duration, in - cycles, each HW thread of the CTOP can run - without interruptions, before HW switches it. - The actual maximum duration is 16 times this - parameter's value. - Format: integer between 1 and 255 - Default: 255 - - nptcg= [IA-64] Override max number of concurrent global TLB - purges which is reported from either PAL_VM_SUMMARY or - SAL PALO. - nr_cpus= [SMP,EARLY] Maximum number of processors that an SMP kernel could support. nr_cpus=n : n >= 1 limits the kernel to support 'n' processors. It could be larger than the @@ -4129,6 +4412,26 @@ Disable NUMA, Only set up a single NUMA node spanning all memory. + numa=fake=<size>[MG] + [KNL, ARM64, RISCV, X86, EARLY] + If given as a memory unit, fills all system RAM with + nodes of size interleaved over physical nodes. + + numa=fake=<N> + [KNL, ARM64, RISCV, X86, EARLY] + If given as an integer, fills all system RAM with N + fake nodes interleaved over physical nodes. + + numa=fake=<N>U + [KNL, ARM64, RISCV, X86, EARLY] + If given as an integer followed by 'U', it will + divide each physical node into N emulated nodes. + + numa=noacpi [X86] Don't parse the SRAT table for NUMA setup + + numa=nohmat [X86] Don't parse the HMAT table for NUMA setup, or + soft-reserved memory partitioning. + numa_balancing= [KNL,ARM64,PPC,RISCV,S390,X86] Enable or disable automatic NUMA balancing. Allowed values are enable and disable @@ -4173,13 +4476,11 @@ page_alloc.shuffle= [KNL] Boolean flag to control whether the page allocator - should randomize its free lists. The randomization may - be automatically enabled if the kernel detects it is - running on a platform with a direct-mapped memory-side - cache, and this parameter can be used to - override/disable that behavior. The state of the flag - can be read from sysfs at: + should randomize its free lists. This parameter can be + used to enable/disable page randomization. The state of + the flag can be read from sysfs at: /sys/module/page_alloc/parameters/shuffle. + This parameter is only available if CONFIG_SHUFFLE_PAGE_ALLOCATOR=y. page_owner= [KNL,EARLY] Boot-time page_owner enabling option. Storage of the information about who allocated @@ -4589,14 +4890,51 @@ bridges without forcing it upstream. Note: this removes isolation between devices and may put more devices in an IOMMU group. + config_acs= + Format: + <ACS flags>@<pci_dev>[; ...] + Specify one or more PCI devices (in the format + specified above) optionally prepended with flags + and separated by semicolons. The respective + capabilities will be enabled, disabled or + unchanged based on what is specified in + flags. + + ACS Flags is defined as follows: + bit-0 : ACS Source Validation + bit-1 : ACS Translation Blocking + bit-2 : ACS P2P Request Redirect + bit-3 : ACS P2P Completion Redirect + bit-4 : ACS Upstream Forwarding + bit-5 : ACS P2P Egress Control + bit-6 : ACS Direct Translated P2P + Each bit can be marked as: + '0' – force disabled + '1' – force enabled + 'x' – unchanged + For example, + pci=config_acs=10x@pci:0:0 + would configure all devices that support + ACS to enable P2P Request Redirect, disable + Translation Blocking, and leave Source + Validation unchanged from whatever power-up + or firmware set it to. + + Note: this may remove isolation between devices + and may put more devices in an IOMMU group. force_floating [S390] Force usage of floating interrupts. nomio [S390] Do not use MIO instructions. norid [S390] ignore the RID field and force use of one PCI domain per PCI function + notph [PCIE] If the PCIE_TPH kernel config parameter + is enabled, this kernel boot option can be used + to disable PCIe TLP Processing Hints support + system-wide. - pcie_aspm= [PCIE] Forcibly enable or disable PCIe Active State Power + pcie_aspm= [PCIE] Forcibly enable or ignore PCIe Active State Power Management. - off Disable ASPM. + off Don't touch ASPM configuration at all. Leave any + configuration done by firmware unchanged. force Enable ASPM even on devices that claim not to support it. WARNING: Forcing ASPM on may cause system lockups. @@ -4721,7 +5059,14 @@ none - Limited to cond_resched() calls voluntary - Limited to cond_resched() and might_sleep() calls full - Any section that isn't explicitly preempt disabled - can be preempted anytime. + can be preempted anytime. Tasks will also yield + contended spinlocks (if the critical section isn't + explicitly preempt disabled beyond the lock itself). + lazy - Scheduler controlled. Similar to full but instead + of preempting the task immediately, the task gets + one HZ tick time to yield itself before the + preemption will be forced. One preemption is when the + task returns to user space. print-fatal-signals= [KNL] debug: print fatal signals @@ -4751,6 +5096,14 @@ Format: <bool> default: 0 (auto_verbose is enabled) + printk.debug_non_panic_cpus= + Allows storing messages from non-panic CPUs into + the printk log buffer during panic(). They are + flushed to consoles by the panic-CPU on + a best-effort basis. + Format: <bool> (1/Y/y=enable, 0/N/n=disable) + Default: disabled + printk.devkmsg={on,off,ratelimit} Control writing to /dev/kmsg. on - unlimited logging to /dev/kmsg from userspace @@ -4761,6 +5114,16 @@ printk.time= Show timing data prefixed to each printk message line Format: <bool> (1/Y/y=enable, 0/N/n=disable) + proc_mem.force_override= [KNL] + Format: {always | ptrace | never} + Traditionally /proc/pid/mem allows memory permissions to be + overridden without restrictions. This option may be set to + restrict that. Can be one of: + - 'always': traditional behavior always allows mem overrides. + - 'ptrace': only allow mem overrides for active ptracers. + - 'never': never allow mem overrides. + If not specified, default is the CONFIG_PROC_MEM_* choice. + processor.max_cstate= [HW,ACPI] Limit processor to maximum C-state max_cstate=9 overrides any DMI blacklist limit. @@ -4771,11 +5134,9 @@ profile= [KNL] Enable kernel profiling via /proc/profile Format: [<profiletype>,]<number> - Param: <profiletype>: "schedule", "sleep", or "kvm" + Param: <profiletype>: "schedule" or "kvm" [defaults to kernel profiling] Param: "schedule" - profile schedule points. - Param: "sleep" - profile D-state sleeping (millisecs). - Requires CONFIG_SCHEDSTATS Param: "kvm" - profile VM exits. Param: <number> - step/bucket size as a power of 2 for statistical time based profiling. @@ -4784,7 +5145,9 @@ prot_virt= [S390] enable hosting protected virtual machines isolated from the hypervisor (if hardware supports - that). + that). If enabled, the default kernel base address + might be overridden even when Kernel Address Space + Layout Randomization is disabled. Format: <bool> psi= [KNL] Enable or disable pressure stall information @@ -4908,6 +5271,10 @@ Set maximum number of finished RCU callbacks to process in one batch. + rcutree.csd_lock_suppress_rcu_stall= [KNL] + Do only a one-line RCU CPU stall warning when + there is an ongoing too-long CSD-lock wait. + rcutree.do_rcu_barrier= [KNL] Request a call to rcu_barrier(). This is throttled so that userspace tests can safely @@ -4985,6 +5352,14 @@ the ->nocb_bypass queue. The definition of "too many" is supplied by this kernel boot parameter. + rcutree.nohz_full_patience_delay= [KNL] + On callback-offloaded (rcu_nocbs) CPUs, avoid + disturbing RCU unless the grace period has + reached the specified age in milliseconds. + Defaults to zero. Large values will be capped + at five seconds. All values will be rounded down + to the nearest value representable by jiffies. + rcutree.qhimark= [KNL] Set threshold of queued RCU callbacks beyond which batch limiting is disabled. @@ -5095,6 +5470,20 @@ delay, memory pressure or callback list growing too big. + rcutree.rcu_normal_wake_from_gp= [KNL] + Reduces a latency of synchronize_rcu() call. This approach + maintains its own track of synchronize_rcu() callers, so it + does not interact with regular callbacks because it does not + use a call_rcu[_hurry]() path. Please note, this is for a + normal grace period. + + How to enable it: + + echo 1 > /sys/module/rcutree/parameters/rcu_normal_wake_from_gp + or pass a boot parameter "rcutree.rcu_normal_wake_from_gp=1" + + Default is 0. + rcuscale.gp_async= [KNL] Measure performance of asynchronous grace-period primitives such as call_rcu(). @@ -5226,7 +5615,42 @@ rcutorture.gp_cond= [KNL] Use conditional/asynchronous update-side - primitives, if available. + normal-grace-period primitives, if available. + + rcutorture.gp_cond_exp= [KNL] + Use conditional/asynchronous update-side + expedited-grace-period primitives, if available. + + rcutorture.gp_cond_full= [KNL] + Use conditional/asynchronous update-side + normal-grace-period primitives that also take + concurrent expedited grace periods into account, + if available. + + rcutorture.gp_cond_exp_full= [KNL] + Use conditional/asynchronous update-side + expedited-grace-period primitives that also take + concurrent normal grace periods into account, + if available. + + rcutorture.gp_cond_wi= [KNL] + Nominal wait interval for normal conditional + grace periods (specified by rcutorture's + gp_cond and gp_cond_full module parameters), + in microseconds. The actual wait interval will + be randomly selected to nanosecond granularity up + to this wait interval. Defaults to 16 jiffies, + for example, 16,000 microseconds on a system + with HZ=1000. + + rcutorture.gp_cond_wi_exp= [KNL] + Nominal wait interval for expedited conditional + grace periods (specified by rcutorture's + gp_cond_exp and gp_cond_exp_full module + parameters), in microseconds. The actual wait + interval will be randomly selected to nanosecond + granularity up to this wait interval. Defaults to + 128 microseconds. rcutorture.gp_exp= [KNL] Use expedited update-side primitives, if available. @@ -5235,6 +5659,43 @@ Use normal (non-expedited) asynchronous update-side primitives, if available. + rcutorture.gp_poll= [KNL] + Use polled update-side normal-grace-period + primitives, if available. + + rcutorture.gp_poll_exp= [KNL] + Use polled update-side expedited-grace-period + primitives, if available. + + rcutorture.gp_poll_full= [KNL] + Use polled update-side normal-grace-period + primitives that also take concurrent expedited + grace periods into account, if available. + + rcutorture.gp_poll_exp_full= [KNL] + Use polled update-side expedited-grace-period + primitives that also take concurrent normal + grace periods into account, if available. + + rcutorture.gp_poll_wi= [KNL] + Nominal wait interval for normal conditional + grace periods (specified by rcutorture's + gp_poll and gp_poll_full module parameters), + in microseconds. The actual wait interval will + be randomly selected to nanosecond granularity up + to this wait interval. Defaults to 16 jiffies, + for example, 16,000 microseconds on a system + with HZ=1000. + + rcutorture.gp_poll_wi_exp= [KNL] + Nominal wait interval for expedited conditional + grace periods (specified by rcutorture's + gp_poll_exp and gp_poll_exp_full module + parameters), in microseconds. The actual wait + interval will be randomly selected to nanosecond + granularity up to this wait interval. Defaults to + 128 microseconds. + rcutorture.gp_sync= [KNL] Use normal (non-expedited) synchronous update-side primitives, if available. If all @@ -5243,6 +5704,31 @@ are zero, rcutorture acts as if is interpreted they are all non-zero. + rcutorture.gpwrap_lag= [KNL] + Enable grace-period wrap lag testing. Setting + to false prevents the gpwrap lag test from + running. Default is true. + + rcutorture.gpwrap_lag_gps= [KNL] + Set the value for grace-period wrap lag during + active lag testing periods. This controls how many + grace periods differences we tolerate between + rdp and rnp's gp_seq before setting overflow flag. + The default is always set to 8. + + rcutorture.gpwrap_lag_cycle_mins= [KNL] + Set the total cycle duration for gpwrap lag + testing in minutes. This is the total time for + one complete cycle of active and inactive + testing periods. Default is 30 minutes. + + rcutorture.gpwrap_lag_active_mins= [KNL] + Set the duration for which gpwrap lag is active + within each cycle, in minutes. During this time, + the grace-period wrap lag will be set to the + value specified by gpwrap_lag_gps. Default is + 5 minutes. + rcutorture.irqreader= [KNL] Run RCU readers from irq handlers, or, more accurately, from a timer handler. Not all RCU @@ -5288,10 +5774,21 @@ Set time (jiffies) between CPU-hotplug operations, or zero to disable CPU-hotplug testing. - rcutorture.read_exit= [KNL] - Set the number of read-then-exit kthreads used - to test the interaction of RCU updaters and - task-exit processing. + rcutorture.preempt_duration= [KNL] + Set duration (in milliseconds) of preemptions + by a high-priority FIFO real-time task. Set to + zero (the default) to disable. The CPUs to + preempt are selected randomly from the set that + are online at a given point in time. Races with + CPUs going offline are ignored, with that attempt + at preemption skipped. + + rcutorture.preempt_interval= [KNL] + Set interval (in milliseconds, defaulting to one + second) between preemptions by a high-priority + FIFO real-time task. This delay is mediated + by an hrtimer and is further fuzzed to avoid + inadvertent synchronizations. rcutorture.read_exit_burst= [KNL] The number of times in a given read-then-exit @@ -5302,6 +5799,14 @@ The delay, in seconds, between successive read-then-exit testing episodes. + rcutorture.reader_flavor= [KNL] + A bit mask indicating which readers to use. + If there is more than one bit set, the readers + are entered from low-order bit up, and are + exited in the opposite order. For SRCU, the + 0x1 bit is normal readers, 0x2 NMI-safe readers, + and 0x4 light-weight readers. + rcutorture.shuffle_interval= [KNL] Set task-shuffle interval (s). Shuffling tasks allows some CPUs to go into dyntick-idle mode @@ -5333,7 +5838,13 @@ Time to wait (s) after boot before inducing stall. rcutorture.stall_cpu_irqsoff= [KNL] - Disable interrupts while stalling if set. + Disable interrupts while stalling if set, but only + on the first stall in the set. + + rcutorture.stall_cpu_repeat= [KNL] + Number of times to repeat the stall sequence, + so that rcutorture.stall_cpu_repeat=3 will result + in four stall sequences. rcutorture.stall_gp_kthread= [KNL] Duration (s) of forced sleep within RCU @@ -5359,6 +5870,11 @@ rcutorture.test_boost_duration= [KNL] Duration (s) of each individual boost test. + rcutorture.test_boost_holdoff= [KNL] + Holdoff time (s) from start of test to the start + of RCU priority-boost testing. Defaults to zero, + that is, no holdoff. + rcutorture.test_boost_interval= [KNL] Interval (s) between each boost test. @@ -5521,14 +6037,6 @@ of zero will disable batching. Batching is always disabled for synchronize_rcu_tasks(). - rcupdate.rcu_tasks_rude_lazy_ms= [KNL] - Set timeout in milliseconds RCU Tasks - Rude asynchronous callback batching for - call_rcu_tasks_rude(). A negative value - will take the default. A value of zero will - disable batching. Batching is always disabled - for synchronize_rcu_tasks_rude(). - rcupdate.rcu_tasks_trace_lazy_ms= [KNL] Set timeout in milliseconds RCU Tasks Trace asynchronous callback batching for @@ -5573,6 +6081,55 @@ reboot_cpu is s[mp]#### with #### being the processor to be used for rebooting. + acpi + Use the ACPI RESET_REG in the FADT. If ACPI is not + configured or the ACPI reset does not work, the reboot + path attempts the reset using the keyboard controller. + + bios + Use the CPU reboot vector for warm reset + + cold + Set the cold reboot flag + + default + There are some built-in platform specific "quirks" + - you may see: "reboot: <name> series board detected. + Selecting <type> for reboots." In the case where you + think the quirk is in error (e.g. you have newer BIOS, + or newer board) using this option will ignore the + built-in quirk table, and use the generic default + reboot actions. + + efi + Use efi reset_system runtime service. If EFI is not + configured or the EFI reset does not work, the reboot + path attempts the reset using the keyboard controller. + + force + Don't stop other CPUs on reboot. This can make reboot + more reliable in some cases. + + kbd + Use the keyboard controller. cold reset (default) + + pci + Use a write to the PCI config space register 0xcf9 to + trigger reboot. + + triple + Force a triple fault (init) + + warm + Don't set the cold reboot flag + + Using warm reset will be much faster especially on big + memory systems because the BIOS will not go through + the memory check. Disadvantage is that not all + hardware will be completely reinitialized on reboot so + there may be boot problems on some systems. + + refscale.holdoff= [KNL] Set test-start holdoff period. The purpose of this parameter is to delay the start of the @@ -5641,6 +6198,28 @@ them. If <base> is less than 0x10000, the region is assumed to be I/O ports; otherwise it is memory. + reserve_mem= [RAM] + Format: nn[KMG]:<align>:<label> + Reserve physical memory and label it with a name that + other subsystems can use to access it. This is typically + used for systems that do not wipe the RAM, and this command + line will try to reserve the same physical memory on + soft reboots. Note, it is not guaranteed to be the same + location. For example, if anything about the system changes + or if booting a different kernel. It can also fail if KASLR + places the kernel at the location of where the RAM reservation + was from a previous boot, the new reservation will be at a + different location. + Any subsystem using this feature must add a way to verify + that the contents of the physical memory is from a previous + boot, as there may be cases where the memory will not be + located at the same location. + + The format is size:align:label for example, to request + 12 megabytes of 4096 alignment for ramoops: + + reserve_mem=12M:4096:oops ramoops.mem_name=oops + reservetop= [X86-32,EARLY] Format: nn[KMG] Reserves a hole at the top of the kernel virtual @@ -5719,9 +6298,6 @@ 2 The "airplane mode" button toggles between everything blocked and everything unblocked. - rhash_entries= [KNL,NET] - Set number of hash buckets for route cache - ring3mwait=disable [KNL] Disable ring 3 MONITOR/MWAIT feature on supported CPUs. @@ -5749,7 +6325,7 @@ port and the regular usb controller gets disabled. root= [KNL] Root filesystem - Usually this a a block device specifier of some kind, + Usually this is a block device specifier of some kind, see the early_lookup_bdev comment in block/early-lookup.c for details. Alternatively this can be "ram" for the legacy initial @@ -5776,6 +6352,11 @@ Memory area to be used by remote processor image, managed by CMA. + rt_group_sched= [KNL] Enable or disable SCHED_RR/FIFO group scheduling + when CONFIG_RT_GROUP_SCHED=y. Defaults to + !CONFIG_RT_GROUP_SCHED_DEFAULT_DISABLED. + Format: <bool> + rw [KNL] Mount root device read-write on boot S [KNL] Run init in single mode @@ -5811,6 +6392,7 @@ but is useful for debugging and performance tuning. sched_thermal_decay_shift= + [Deprecated] [KNL, SMP] Set a decay shift for scheduler thermal pressure signal. Thermal pressure signal follows the default decay period of other scheduler pelt @@ -5918,6 +6500,10 @@ non-zero "wait" parameter. See weight_single and weight_many. + sdw_mclk_divider=[SDW] + Specify the MCLK divider for Intel SoundWire buses in + case the BIOS does not provide the clock rate properly. + skew_tick= [KNL,EARLY] Offset the periodic timer tick per cpu to mitigate xtime_lock contention on larger systems, and/or RCU lock contention on all systems with CONFIG_MAXSMP set. @@ -5940,7 +6526,16 @@ serialnumber [BUGS=X86-32] - sev=option[,option...] [X86-64] See Documentation/arch/x86/x86_64/boot-options.rst + sev=option[,option...] [X86-64] + + debug + Enable debug messages. + + nosnp + Do not enable SEV-SNP (applies to host/hypervisor + only). Setting 'nosnp' avoids the RMP check overhead + in memory accesses when users do not want to run + SEV-SNP guests. shapers= [NET] Maximal number of shapers. @@ -5954,9 +6549,6 @@ apic=verbose is specified. Example: apic=debug show_lapic=all - simeth= [IA-64] - simscsi= - slab_debug[=options[,slabs][;[options[,slabs]]...] [MM] Enabling slab_debug allows one to determine the culprit if slab objects become corrupted. Enabling @@ -6008,6 +6600,16 @@ For more information see Documentation/mm/slub.rst. (slub_nomerge legacy name also accepted for now) + slab_strict_numa [MM] + Support memory policies on a per object level + in the slab allocator. The default is for memory + policies to be applied at the folio level when + a new folio is needed or a partial folio is + retrieved from the lists. Increases overhead + in the slab fastpaths but gains more accurate + NUMA kernel object placement which helps with slow + interconnects in NUMA systems. + slram= [HW,MTD] smart2= [HW] @@ -6072,9 +6674,15 @@ deployment of the HW BHI control and the SW BHB clearing sequence. - on - (default) Enable the HW or SW mitigation - as needed. - off - Disable the mitigation. + on - (default) Enable the HW or SW mitigation as + needed. This protects the kernel from + both syscalls and VMs. + vmexit - On systems which don't have the HW mitigation + available, enable the SW mitigation on vmexit + ONLY. On such systems, the host kernel is + protected from VM-originated BHI attacks, but + may still be vulnerable to syscall attacks. + off - Disable the mitigation. spectre_v2= [X86,EARLY] Control mitigation of Spectre variant 2 (indirect branch speculation) vulnerability. @@ -6096,6 +6704,8 @@ Selecting 'on' will also enable the mitigation against user space to user space task attacks. + Selecting specific mitigation does not force enable + user mitigations. Selecting 'off' will disable both the kernel and the user space protections. @@ -6218,11 +6828,6 @@ Not specifying this option is equivalent to spec_store_bypass_disable=auto. - spia_io_base= [HW,MTD] - spia_fio_base= - spia_pedr= - spia_peddr= - split_lock_detect= [X86] Enable split lock detection or bus lock detection @@ -6478,7 +7083,7 @@ This parameter controls use of the Protected Execution Facility on pSeries. - swiotlb= [ARM,IA-64,PPC,MIPS,X86,EARLY] + swiotlb= [ARM,PPC,MIPS,X86,S390,EARLY] Format: { <int> [,<int>] | force | noforce } <int> -- Number of I/O TLB slabs <int> -- Second integer after comma. Number of swiotlb @@ -6547,10 +7152,29 @@ <deci-seconds>: poll all this frequency 0: no polling (default) + thp_anon= [KNL] + Format: <size>[KMG],<size>[KMG]:<state>;<size>[KMG]-<size>[KMG]:<state> + state is one of "always", "madvise", "never" or "inherit". + Control the default behavior of the system with respect + to anonymous transparent hugepages. + Can be used multiple times for multiple anon THP sizes. + See Documentation/admin-guide/mm/transhuge.rst for more + details. + threadirqs [KNL,EARLY] Force threading of all interrupt handlers except those marked explicitly IRQF_NO_THREAD. + thp_shmem= [KNL] + Format: <size>[KMG],<size>[KMG]:<policy>;<size>[KMG]-<size>[KMG]:<policy> + Control the default policy of each hugepage size for the + internal shmem mount. <policy> is one of policies available + for the shmem mount ("always", "inherit", "never", "within_size", + and "advise"). + It can be used multiple times for multiple shmem THP sizes. + See Documentation/admin-guide/mm/transhuge.rst for more + details. + topology= [S390,EARLY] Format: {off | on} Specify if the kernel should make use of the cpu @@ -6559,12 +7183,6 @@ e.g. base its process migration decisions on it. Default is on. - topology_updates= [KNL, PPC, NUMA] - Format: {off} - Specify if the kernel should ignore (off) - topology updates sent by the hypervisor to this - LPAR. - torture.disable_onoff_at_boot= [KNL] Prevent the CPU-hotplug component of torturing until after init has spawned. @@ -6584,7 +7202,14 @@ torture.verbose_sleep_duration= [KNL] Duration of each verbose-printk() sleep in jiffies. - tp720= [HW,PS2] + tpm.disable_pcr_integrity= [HW,TPM] + Do not protect PCR registers from unintended physical + access, or interposers in the bus by the means of + having an integrity protected session wrapped around + TPM2_PCR_Extend command. Consider this in a situation + where TPM is heavily utilized by IMA, thus protection + causing a major performance hit, and the space where + machines are deployed is by other means guarded. tpm_suspend_pcr=[HW,TPM] Format: integer pcr id @@ -6664,6 +7289,14 @@ comma-separated list of trace events to enable. See also Documentation/trace/events.rst + To enable modules, use :mod: keyword: + + trace_event=:mod:<module> + + The value before :mod: will only enable specific events + that are part of the module. See the above mentioned + document for more information. + trace_instance=[instance-info] [FTRACE] Create a ring buffer instance early in boot up. This will be listed in: @@ -6684,6 +7317,59 @@ the same thing would happen if it was left off). The irq_handler_entry event, and all events under the "initcall" system. + Flags can be added to the instance to modify its behavior when it is + created. The flags are separated by '^'. + + The available flags are: + + traceoff - Have the tracing instance tracing disabled after it is created. + traceprintk - Have trace_printk() write into this trace instance + (note, "printk" and "trace_printk" can also be used) + + trace_instance=foo^traceoff^traceprintk,sched,irq + + The flags must come before the defined events. + + If memory has been reserved (see memmap for x86), the instance + can use that memory: + + memmap=12M$0x284500000 trace_instance=boot_map@0x284500000:12M + + The above will create a "boot_map" instance that uses the physical + memory at 0x284500000 that is 12Megs. The per CPU buffers of that + instance will be split up accordingly. + + Alternatively, the memory can be reserved by the reserve_mem option: + + reserve_mem=12M:4096:trace trace_instance=boot_map@trace + + This will reserve 12 megabytes at boot up with a 4096 byte alignment + and place the ring buffer in this memory. Note that due to KASLR, the + memory may not be the same location each time, which will not preserve + the buffer content. + + Also note that the layout of the ring buffer data may change between + kernel versions where the validator will fail and reset the ring buffer + if the layout is not the same as the previous kernel. + + If the ring buffer is used for persistent bootups and has events enabled, + it is recommend to disable tracing so that events from a previous boot do not + mix with events of the current boot (unless you are debugging a random crash + at boot up). + + reserve_mem=12M:4096:trace trace_instance=boot_map^traceoff^traceprintk@trace,sched,irq + + Note, saving the trace buffer across reboots does require that the system + is set up to not wipe memory. For instance, CONFIG_RESET_ATTACK_MITIGATION + can force a memory reset on boot which will clear any trace that was stored. + This is just one of many ways that can clear memory. Make sure your system + keeps the content of memory across reboots before relying on this option. + + NB: Both the mapped address and size must be page aligned for the architecture. + + See also Documentation/trace/debugging.rst + + trace_options=[option-list] [FTRACE] Enable or disable tracer options at boot. The option-list is a comma delimited list of options @@ -6719,6 +7405,15 @@ See also "Event triggers" in Documentation/trace/events.rst + traceoff_after_boot + [FTRACE] Sometimes tracing is used to debug issues + during the boot process. Since the trace buffer has a + limited amount of storage, it may be prudent to + disable tracing after the boot is finished, otherwise + the critical information may be overwritten. With this + option, the main tracing buffer will be turned off at + the end of the boot process. + traceoff_on_warning [FTRACE] enable this option to disable tracing when a warning is hit. This turns off "tracing_on". Tracing can @@ -6740,6 +7435,20 @@ See Documentation/admin-guide/mm/transhuge.rst for more details. + transparent_hugepage_shmem= [KNL] + Format: [always|within_size|advise|never|deny|force] + Can be used to control the hugepage allocation policy for + the internal shmem mount. + See Documentation/admin-guide/mm/transhuge.rst + for more details. + + transparent_hugepage_tmpfs= [KNL] + Format: [always|within_size|advise|never] + Can be used to control the default hugepage allocation policy + for the tmpfs mount. + See Documentation/admin-guide/mm/transhuge.rst + for more details. + trusted.source= [KEYS] Format: <string> This parameter identifies the trust source as a backend @@ -6748,6 +7457,7 @@ - "tpm" - "tee" - "caam" + - "dcp" If not specified then it defaults to iterating through the trust source list starting with TPM and assigns the first trust source as a backend which is initialized @@ -6763,6 +7473,18 @@ If not specified, "default" is used. In this case, the RNG's choice is left to each individual trust source. + trusted.dcp_use_otp_key + This is intended to be used in combination with + trusted.source=dcp and will select the DCP OTP key + instead of the DCP UNIQUE key blob encryption. + + trusted.dcp_skip_zk_test + This is intended to be used in combination with + trusted.source=dcp and will disable the check if the + blob key is all zeros. This is helpful for situations where + having this key zero'ed is acceptable. E.g. in testing + scenarios. + tsc= Disable clocksource stability checks for TSC. Format: <string> [x86] reliable: mark tsc clocksource as reliable, this @@ -6890,6 +7612,22 @@ Note that genuine overcurrent events won't be reported either. + unaligned_scalar_speed= + [RISCV] + Format: {slow | fast | unsupported} + Allow skipping scalar unaligned access speed tests. This + is useful for testing alternative code paths and to skip + the tests in environments where they run too slowly. All + CPUs must have the same scalar unaligned access speed. + + unaligned_vector_speed= + [RISCV] + Format: {slow | fast | unsupported} + Allow skipping vector unaligned access speed tests. This + is useful for testing alternative code paths and to skip + the tests in environments where they run too slowly. All + CPUs must have the same vector unaligned access speed. + unknown_nmi_panic [X86] Cause panic on unknown NMI. @@ -7016,6 +7754,9 @@ usb-storage.delay_use= [UMS] The delay in seconds before a new device is scanned for Logical Units (default 1). + Optionally the delay in milliseconds if the value has + suffix with "ms". + Example: delay_use=2567ms usb-storage.quirks= [UMS] A list of quirks entries to supplement or @@ -7082,13 +7823,6 @@ 16 - SIGBUS faults Example: user_debug=31 - userpte= - [X86,EARLY] Flags controlling user PTE allocations. - - nohigh = do not allocate PTE pages in - HIGHMEM regardless of setting - of CONFIG_HIGHPTE. - vdso= [X86,SH,SPARC] On X86_32, this is an alias for vdso32=. Otherwise: @@ -7109,9 +7843,6 @@ Try vdso32=0 if you encounter an error that says: dl_main: Assertion `(void *) ph->p_vaddr == _rtld_local._dl_sysinfo_dso' failed! - vector= [IA-64,SMP] - vector=percpu: enable percpu vector domain - video= [FB,EARLY] Frame buffer configuration See Documentation/fb/modedb.rst. @@ -7162,9 +7893,12 @@ vmalloc=nn[KMG] [KNL,BOOT,EARLY] Forces the vmalloc area to have an exact size of <nn>. This can be used to increase - the minimum size (128MB on x86). It can also be - used to decrease the size and leave more room - for directly mapped kernel RAM. + the minimum size (128MB on x86, arm32 platforms). + It can also be used to decrease the size and leave more room + for directly mapped kernel RAM. Note that this parameter does + not exist on many other platforms (including arm64, alpha, + loongarch, arc, csky, hexagon, microblaze, mips, nios2, openrisc, + parisc, m64k, powerpc, riscv, sh, um, xtensa, s390, sparc). vmcp_cma=nn[MG] [KNL,S390,EARLY] Sets the memory size reserved for contiguous memory @@ -7206,7 +7940,7 @@ vt.cur_default= [VT] Default cursor shape. Format: 0xCCBBAA, where AA, BB, and CC are the same as the parameters of the <Esc>[?A;B;Cc escape sequence; - see VGA-softcursor.txt. Default: 2 = underline. + see vga-softcursor.rst. Default: 2 = underline. vt.default_blu= [VT] Format: <blue0>,<blue1>,<blue2>,...,<blue15> @@ -7277,6 +8011,13 @@ it can be updated at runtime by writing to the corresponding sysfs file. + workqueue.panic_on_stall=<uint> + Panic when workqueue stall is detected by + CONFIG_WQ_WATCHDOG. It sets the number times of the + stall to trigger panic. + + The default is 0, which disables the panic on stall. + workqueue.cpu_intensive_thresh_us= Per-cpu work items which run for longer than this threshold are automatically considered CPU intensive @@ -7323,7 +8064,7 @@ This can be changed after boot by writing to the matching /sys/module/workqueue/parameters file. All workqueues with the "default" affinity scope will be - updated accordignly. + updated accordingly. workqueue.debug_force_rr_cpu Workqueue used to implicitly guarantee that work @@ -7369,17 +8110,18 @@ Crash from Xen panic notifier, without executing late panic() code such as dumping handler. + xen_mc_debug [X86,XEN,EARLY] + Enable multicall debugging when running as a Xen PV guest. + Enabling this feature will reduce performance a little + bit, so it should only be enabled for obtaining extended + debug data in case of multicall errors. + xen_msr_safe= [X86,XEN,EARLY] Format: <bool> Select whether to always use non-faulting (safe) MSR access functions when running as Xen PV guest. The default value is controlled by CONFIG_XEN_PV_MSR_SAFE. - xen_nopvspin [X86,XEN,EARLY] - Disables the qspinlock slowpath using Xen PV optimizations. - This parameter is obsoleted by "nopvspin" parameter, which - has equivalent effect for XEN platform. - xen_nopv [X86] Disables the PV optimizations forcing the HVM guest to run as generic HVM guest with no PV drivers. @@ -7467,4 +8209,3 @@ memory, and other data can't be written using xmon commands. off xmon is disabled. - diff --git a/Documentation/admin-guide/kernel-per-CPU-kthreads.rst b/Documentation/admin-guide/kernel-per-CPU-kthreads.rst index b6aeae3327ce..ee9a6c94f383 100644 --- a/Documentation/admin-guide/kernel-per-CPU-kthreads.rst +++ b/Documentation/admin-guide/kernel-per-CPU-kthreads.rst @@ -278,12 +278,7 @@ To reduce its OS jitter, do any of the following: due to the rtas_event_scan() function. WARNING: Please check your CPU specifications to make sure that this is safe on your particular system. - e. If running on Cell Processor, build your kernel with - CBE_CPUFREQ_SPU_GOVERNOR=n to avoid OS jitter from - spu_gov_work(). - WARNING: Please check your CPU specifications to - make sure that this is safe on your particular system. - f. If running on PowerMAC, build your kernel with + e. If running on PowerMAC, build your kernel with CONFIG_PMAC_RACKMETER=n to disable the CPU-meter, avoiding OS jitter from rackmeter_do_timer(). @@ -315,7 +310,7 @@ To reduce its OS jitter, do at least one of the following: to do. Name: - rcuop/%d and rcuos/%d + rcuop/%d, rcuos/%d, and rcuog/%d Purpose: Offload RCU callbacks from the corresponding CPU. diff --git a/Documentation/admin-guide/laptops/alienware-wmi.rst b/Documentation/admin-guide/laptops/alienware-wmi.rst new file mode 100644 index 000000000000..27a32a8057da --- /dev/null +++ b/Documentation/admin-guide/laptops/alienware-wmi.rst @@ -0,0 +1,127 @@ +.. SPDX-License-Identifier: GPL-2.0-or-later + +==================== +Alienware WMI Driver +==================== + +Kurt Borja <kuurtb@gmail.com> + +This is a driver for the "WMAX" WMI device, which is found in most Dell gaming +laptops and controls various special features. + +Before the launch of M-Series laptops (~2018), the "WMAX" device controlled +basic RGB lighting, deep sleep mode, HDMI mode and amplifier status. + +Later, this device was completely repurpused. Now it mostly deals with thermal +profiles, sensor monitoring and overclocking. This interface is named "AWCC" and +is known to be used by the AWCC OEM application to control these features. + +The alienware-wmi driver controls both interfaces. + +AWCC Interface +============== + +WMI device documentation: Documentation/wmi/devices/alienware-wmi.rst + +Supported devices +----------------- + +- Alienware M-Series laptops +- Alienware X-Series laptops +- Alienware Aurora Desktops +- Dell G-Series laptops + +If you believe your device supports the AWCC interface and you don't have any of +the features described in this document, try the following alienware-wmi module +parameters: + +- ``force_platform_profile=1``: Forces probing for platform profile support +- ``force_hwmon=1``: Forces probing for HWMON support + +If the module loads successfully with these parameters, consider submitting a +patch adding your model to the ``awcc_dmi_table`` located in +``drivers/platform/x86/dell/alienware-wmi-wmax.c`` or contacting the maintainer +for further guidance. + +Status +------ + +The following features are currently supported: + +- :ref:`Platform Profile <platform-profile>`: + + - Thermal profile control + + - G-Mode toggling + +- :ref:`HWMON <hwmon>`: + + - Sensor monitoring + + - Manual fan control + +.. _platform-profile: + +Platform Profile +---------------- + +The AWCC interface exposes various firmware defined thermal profiles. These are +exposed to user-space through the Platform Profile class interface. Refer to +:ref:`sysfs-class-platform-profile <abi_file_testing_sysfs_class_platform_profile>` +for more information. + +The name of the platform-profile class device exported by this driver is +"alienware-wmi" and it's path can be found with: + +:: + + grep -l "alienware-wmi" /sys/class/platform-profile/platform-profile-*/name | sed 's|/[^/]*$||' + +If the device supports G-Mode, it is also toggled when selecting the +``performance`` profile. + +.. note:: + You may set the ``force_gmode`` module parameter to always try to toggle this + feature, without checking if your model supports it. + +.. _hwmon: + +HWMON +----- + +The AWCC interface also supports sensor monitoring and manual fan control. Both +of these features are exposed to user-space through the HWMON interface. + +The name of the hwmon class device exported by this driver is "alienware_wmi" +and it's path can be found with: + +:: + + grep -l "alienware_wmi" /sys/class/hwmon/hwmon*/name | sed 's|/[^/]*$||' + +Sensor monitoring is done through the standard HWMON interface. Refer to +:ref:`sysfs-class-hwmon <abi_file_testing_sysfs_class_hwmon>` for more +information. + +Manual fan control on the other hand, is not exposed directly by the AWCC +interface. Instead it let's us control a fan `boost` value. This `boost` value +has the following aproximate behavior over the fan pwm: + +:: + + pwm = pwm_base + (fan_boost / 255) * (pwm_max - pwm_base) + +Due to the above behavior, the fan `boost` control is exposed to user-space +through the following, custom hwmon sysfs attribute: + +=============================== ======= ======================================= +Name Perm Description +=============================== ======= ======================================= +fan[1-4]_boost RW Fan boost value. + + Integer value between 0 and 255 +=============================== ======= ======================================= + +.. note:: + In some devices, manual fan control only works reliably if the ``custom`` + platform profile is selected. diff --git a/Documentation/admin-guide/laptops/index.rst b/Documentation/admin-guide/laptops/index.rst index cd9a1c2695fd..db842b629303 100644 --- a/Documentation/admin-guide/laptops/index.rst +++ b/Documentation/admin-guide/laptops/index.rst @@ -7,10 +7,12 @@ Laptop Drivers .. toctree:: :maxdepth: 1 + alienware-wmi asus-laptop disk-shock-protection laptop-mode lg-laptop + samsung-galaxybook sony-laptop sonypi thinkpad-acpi diff --git a/Documentation/admin-guide/laptops/samsung-galaxybook.rst b/Documentation/admin-guide/laptops/samsung-galaxybook.rst new file mode 100644 index 000000000000..752b8f1a4a74 --- /dev/null +++ b/Documentation/admin-guide/laptops/samsung-galaxybook.rst @@ -0,0 +1,174 @@ +.. SPDX-License-Identifier: GPL-2.0-or-later + +========================== +Samsung Galaxy Book Driver +========================== + +Joshua Grisham <josh@joshuagrisham.com> + +This is a Linux x86 platform driver for Samsung Galaxy Book series notebook +devices which utilizes Samsung's ``SCAI`` ACPI device in order to control +extra features and receive various notifications. + +Supported devices +================= + +Any device with one of the supported ACPI device IDs should be supported. This +covers most of the "Samsung Galaxy Book" series notebooks that are currently +available as of this writing, and could include other Samsung notebook devices +as well. + +Status +====== + +The following features are currently supported: + +- :ref:`Keyboard backlight <keyboard-backlight>` control +- :ref:`Performance mode <performance-mode>` control implemented using the + platform profile interface +- :ref:`Battery charge control end threshold + <battery-charge-control-end-threshold>` (stop charging battery at given + percentage value) implemented as a battery hook +- :ref:`Firmware Attributes <firmware-attributes>` to allow control of various + device settings +- :ref:`Handling of Fn hotkeys <keyboard-hotkey-actions>` for various actions +- :ref:`Handling of ACPI notifications and hotkeys + <acpi-notifications-and-hotkey-actions>` + +Because different models of these devices can vary in their features, there is +logic built within the driver which attempts to test each implemented feature +for a valid response before enabling its support (registering additional devices +or extensions, adding sysfs attributes, etc). Therefore, it can be important to +note that not all features may be supported for your particular device. + +The following features might be possible to implement but will require +additional investigation and are therefore not supported at this time: + +- "Dolby Atmos" mode for the speakers +- "Outdoor Mode" for increasing screen brightness on models with ``SAM0427`` +- "Silent Mode" on models with ``SAM0427`` + +.. _keyboard-backlight: + +Keyboard backlight +================== + +A new LED class named ``samsung-galaxybook::kbd_backlight`` is created which +will then expose the device using the standard sysfs-based LED interface at +``/sys/class/leds/samsung-galaxybook::kbd_backlight``. Brightness can be +controlled by writing the desired value to the ``brightness`` sysfs attribute or +with any other desired userspace utility. + +.. note:: + Most of these devices have an ambient light sensor which also turns + off the keyboard backlight under well-lit conditions. This behavior does not + seem possible to control at this time, but can be good to be aware of. + +.. _performance-mode: + +Performance mode +================ + +This driver implements the +Documentation/userspace-api/sysfs-platform_profile.rst interface for working +with the "performance mode" function of the Samsung ACPI device. + +Mapping of each Samsung "performance mode" to its respective platform profile is +performed dynamically by the driver, as not all models support all of the same +performance modes. Your device might have one or more of the following mappings: + +- "Silent" maps to ``low-power`` +- "Quiet" maps to ``quiet`` +- "Optimized" maps to ``balanced`` +- "High performance" maps to ``performance`` + +The result of the mapping can be printed in the kernel log when the module is +loaded. Supported profiles can also be retrieved from +``/sys/firmware/acpi/platform_profile_choices``, while +``/sys/firmware/acpi/platform_profile`` can be used to read or write the +currently selected profile. + +The ``balanced`` platform profile will be set during module load if no profile +has been previously set. + +.. _battery-charge-control-end-threshold: + +Battery charge control end threshold +==================================== + +This platform driver will add the ability to set the battery's charge control +end threshold, but does not have the ability to set a start threshold. + +This feature is typically called "Battery Saver" by the various Samsung +applications in Windows, but in Linux we have implemented the standardized +"charge control threshold" sysfs interface on the battery device to allow for +controlling this functionality from the userspace. + +The sysfs attribute +``/sys/class/power_supply/BAT1/charge_control_end_threshold`` can be used to +read or set the desired charge end threshold. + +If you wish to maintain interoperability with the Samsung Settings application +in Windows, then you should set the value to 100 to represent "off", or enable +the feature using only one of the following values: 50, 60, 70, 80, or 90. +Otherwise, the driver will accept any value between 1 and 100 as the percentage +that you wish the battery to stop charging at. + +.. note:: + Some devices have been observed as automatically "turning off" the charge + control end threshold if an input value of less than 30 is given. + +.. _firmware-attributes: + +Firmware Attributes +=================== + +The following enumeration-typed firmware attributes are set up by this driver +and should be accessible under +``/sys/class/firmware-attributes/samsung-galaxybook/attributes/`` if your device +supports them: + +- ``power_on_lid_open`` (device should power on when the lid is opened) +- ``usb_charging`` (USB ports can deliver power to connected devices even when + the device is powered off or in a low sleep state) +- ``block_recording`` (blocks access to camera and microphone) + +All of these attributes are simple boolean-like enumeration values which use 0 +to represent "off" and 1 to represent "on". Use the ``current_value`` attribute +to get or change the setting on the device. + +Note that when ``block_recording`` is updated, the input device "Samsung Galaxy +Book Lens Cover" will receive a ``SW_CAMERA_LENS_COVER`` switch event which +reflects the current state. + +.. _keyboard-hotkey-actions: + +Keyboard hotkey actions (i8042 filter) +====================================== + +The i8042 filter will swallow the keyboard events for the Fn+F9 hotkey (Multi- +level keyboard backlight toggle) and Fn+F10 hotkey (Block recording toggle) +and instead execute their actions within the driver itself. + +Fn+F9 will cycle through the brightness levels of the keyboard backlight. A +notification will be sent using ``led_classdev_notify_brightness_hw_changed`` +so that the userspace can be aware of the change. This mimics the behavior of +other existing devices where the brightness level is cycled internally by the +embedded controller and then reported via a notification. + +Fn+F10 will toggle the value of the "block recording" setting, which blocks +or allows usage of the built-in camera and microphone (and generates the same +Lens Cover switch event mentioned above). + +.. _acpi-notifications-and-hotkey-actions: + +ACPI notifications and hotkey actions +===================================== + +ACPI notifications will generate ACPI netlink events under the device class +``samsung-galaxybook`` and bus ID matching the Samsung ACPI device ID found on +your device. The events can be received using userspace tools such as +``acpi_listen`` and ``acpid``. + +The Fn+F11 Performance mode hotkey will be handled by the driver; each keypress +will cycle to the next available platform profile. diff --git a/Documentation/admin-guide/laptops/thinkpad-acpi.rst b/Documentation/admin-guide/laptops/thinkpad-acpi.rst index 7f674a6cfa8a..4ab0fef7d440 100644 --- a/Documentation/admin-guide/laptops/thinkpad-acpi.rst +++ b/Documentation/admin-guide/laptops/thinkpad-acpi.rst @@ -445,8 +445,10 @@ event code Key Notes 0x1008 0x07 FN+F8 IBM: toggle screen expand Lenovo: configure UltraNav, or toggle screen expand. - On newer platforms (2024+) - replaced by 0x131f (see below) + On 2024 platforms replaced by + 0x131f (see below) and on newer + platforms (2025 +) keycode is + replaced by 0x1401 (see below). 0x1009 0x08 FN+F9 - @@ -506,9 +508,11 @@ event code Key Notes 0x1019 0x18 unknown -0x131f ... FN+F8 Platform Mode change. +0x131f ... FN+F8 Platform Mode change (2024 systems). Implemented in driver. +0x1401 ... FN+F8 Platform Mode change (2025 + systems). + Implemented in driver. ... ... ... 0x1020 0x1F unknown diff --git a/Documentation/admin-guide/media/building.rst b/Documentation/admin-guide/media/building.rst index a06473429916..7a413ba07f93 100644 --- a/Documentation/admin-guide/media/building.rst +++ b/Documentation/admin-guide/media/building.rst @@ -15,7 +15,7 @@ Please notice, however, that, if: you should use the main media development tree ``master`` branch: - https://git.linuxtv.org/media_tree.git/ + https://git.linuxtv.org/media.git/ In this case, you may find some useful information at the `LinuxTv wiki pages <https://linuxtv.org/wiki>`_: diff --git a/Documentation/admin-guide/media/c3-isp.dot b/Documentation/admin-guide/media/c3-isp.dot new file mode 100644 index 000000000000..42dc931ee84a --- /dev/null +++ b/Documentation/admin-guide/media/c3-isp.dot @@ -0,0 +1,26 @@ +digraph board { + rankdir=TB + n00000001 [label="{{<port0> 0 | <port1> 1} | c3-isp-core\n/dev/v4l-subdev0 | {<port2> 2 | <port3> 3 | <port4> 4 | <port5> 5}}", shape=Mrecord, style=filled, fillcolor=green] + n00000001:port3 -> n00000008:port0 + n00000001:port4 -> n0000000b:port0 + n00000001:port5 -> n0000000e:port0 + n00000001:port2 -> n00000027 + n00000008 [label="{{<port0> 0} | c3-isp-resizer0\n/dev/v4l-subdev1 | {<port1> 1}}", shape=Mrecord, style=filled, fillcolor=green] + n00000008:port1 -> n00000016 [style=bold] + n0000000b [label="{{<port0> 0} | c3-isp-resizer1\n/dev/v4l-subdev2 | {<port1> 1}}", shape=Mrecord, style=filled, fillcolor=green] + n0000000b:port1 -> n0000001a [style=bold] + n0000000e [label="{{<port0> 0} | c3-isp-resizer2\n/dev/v4l-subdev3 | {<port1> 1}}", shape=Mrecord, style=filled, fillcolor=green] + n0000000e:port1 -> n00000023 [style=bold] + n00000011 [label="{{<port0> 0} | c3-mipi-adapter\n/dev/v4l-subdev4 | {<port1> 1}}", shape=Mrecord, style=filled, fillcolor=green] + n00000011:port1 -> n00000001:port0 [style=bold] + n00000016 [label="c3-isp-cap0\n/dev/video0", shape=box, style=filled, fillcolor=yellow] + n0000001a [label="c3-isp-cap1\n/dev/video1", shape=box, style=filled, fillcolor=yellow] + n0000001e [label="{{<port0> 0} | c3-mipi-csi2\n/dev/v4l-subdev5 | {<port1> 1}}", shape=Mrecord, style=filled, fillcolor=green] + n0000001e:port1 -> n00000011:port0 [style=bold] + n00000023 [label="c3-isp-cap2\n/dev/video2", shape=box, style=filled, fillcolor=yellow] + n00000027 [label="c3-isp-stats\n/dev/video3", shape=box, style=filled, fillcolor=yellow] + n0000002b [label="c3-isp-params\n/dev/video4", shape=box, style=filled, fillcolor=yellow] + n0000002b -> n00000001:port1 + n0000003f [label="{{} | imx290 2-001a\n/dev/v4l-subdev6 | {<port0> 0}}", shape=Mrecord, style=filled, fillcolor=green] + n0000003f:port0 -> n0000001e:port0 [style=bold] +} diff --git a/Documentation/admin-guide/media/c3-isp.rst b/Documentation/admin-guide/media/c3-isp.rst new file mode 100644 index 000000000000..ac508b8c6831 --- /dev/null +++ b/Documentation/admin-guide/media/c3-isp.rst @@ -0,0 +1,101 @@ +.. SPDX-License-Identifier: (GPL-2.0-only OR MIT) + +.. include:: <isonum.txt> + +================================================= +Amlogic C3 Image Signal Processing (C3ISP) driver +================================================= + +Introduction +============ + +This file documents the Amlogic C3ISP driver located under +drivers/media/platform/amlogic/c3/isp. + +The current version of the driver supports the C3ISP found on +Amlogic C308L processor. + +The driver implements V4L2, Media controller and V4L2 subdev interfaces. +Camera sensor using V4L2 subdev interface in the kernel is supported. + +The driver has been tested on AW419-C308L-Socket platform. + +Amlogic C3 ISP +============== + +The Camera hardware found on C308L processors and supported by +the driver consists of: + +- 1 MIPI-CSI-2 module: handles the physical layer of the MIPI CSI-2 receiver and + receives data from the connected camera sensor. +- 1 MIPI-ADAPTER module: organizes MIPI data to meet ISP input requirements and + send MIPI data to ISP. +- 1 ISP (Image Signal Processing) module: contains a pipeline of image processing + hardware blocks. The ISP pipeline contains three resizers at the end each of + them connected to a DMA interface which writes the output data to memory. + +A high-level functional view of the C3 ISP is presented below.:: + + +----------+ +-------+ + | Resizer |--->| WRMIF | + +---------+ +------------+ +--------------+ +-------+ |----------+ +-------+ + | Sensor |--->| MIPI CSI-2 |--->| MIPI ADAPTER |--->| ISP |---|----------+ +-------+ + +---------+ +------------+ +--------------+ +-------+ | Resizer |--->| WRMIF | + +----------+ +-------+ + |----------+ +-------+ + | Resizer |--->| WRMIF | + +----------+ +-------+ + +Driver architecture and design +============================== + +With the goal to model the hardware links between the modules and to expose a +clean, logical and usable interface, the driver registers the following V4L2 +sub-devices: + +- 1 `c3-mipi-csi2` sub-device - the MIPI CSI-2 receiver +- 1 `c3-mipi-adapter` sub-device - the MIPI adapter +- 1 `c3-isp-core` sub-device - the ISP core +- 3 `c3-isp-resizer` sub-devices - the ISP resizers + +The `c3-isp-core` sub-device is linked to 2 video device nodes for statistics +capture and parameters programming: + +- the `c3-isp-stats` capture video device node for statistics capture +- the `c3-isp-params` output video device for parameters programming + +Each `c3-isp-resizer` sub-device is linked to a capture video device node where +frames are captured from: + +- `c3-isp-resizer0` is linked to the `c3-isp-cap0` capture video device +- `c3-isp-resizer1` is linked to the `c3-isp-cap1` capture video device +- `c3-isp-resizer2` is linked to the `c3-isp-cap2` capture video device + +The media controller pipeline graph is as follows (with connected a +IMX290 camera sensor): + +.. _isp_topology_graph: + +.. kernel-figure:: c3-isp.dot + :alt: c3-isp.dot + :align: center + + Media pipeline topology + +Implementation +============== + +Runtime configuration of the ISP hardware is performed on the `c3-isp-params` +video device node using the :ref:`V4L2_META_FMT_C3ISP_PARAMS +<v4l2-meta-fmt-c3isp-params>` as data format. The buffer structure is defined by +:c:type:`c3_isp_params_cfg`. + +Statistics are captured from the `c3-isp-stats` video device node using the +:ref:`V4L2_META_FMT_C3ISP_STATS <v4l2-meta-fmt-c3isp-stats>` data format. + +The final picture size and format is configured using the V4L2 video +capture interface on the `c3-isp-cap[0, 2]` video device nodes. + +The Amlogic C3 ISP is supported by `libcamera <https://libcamera.org>`_ with a +dedicated pipeline handler and algorithms that perform run-time image correction +and enhancement. diff --git a/Documentation/admin-guide/media/cec.rst b/Documentation/admin-guide/media/cec.rst index 6b30e355cf23..b2e7a300494a 100644 --- a/Documentation/admin-guide/media/cec.rst +++ b/Documentation/admin-guide/media/cec.rst @@ -42,10 +42,14 @@ dongles): ``persistent_config``: by default this is off, but when set to 1 the driver will store the current settings to the device's internal eeprom and restore it the next time the device is connected to the USB port. + - RainShadow Tech. Note: this driver does not support the persistent_config module option of the Pulse-Eight driver. The hardware supports it, but I have no plans to add this feature. But I accept patches :-) +- Extron DA HD 4K PLUS HDMI Distribution Amplifier. See + :ref:`extron_da_hd_4k_plus` for more information. + Miscellaneous: - vivid: emulates a CEC receiver and CEC transmitter. @@ -378,3 +382,86 @@ it later using ``--analyze-pin``. You can also use this as a full-fledged CEC device by configuring it using ``cec-ctl --tv -p0.0.0.0`` or ``cec-ctl --playback -p1.0.0.0``. + +.. _extron_da_hd_4k_plus: + +Extron DA HD 4K PLUS CEC Adapter driver +======================================= + +This driver is for the Extron DA HD 4K PLUS series of HDMI Distribution +Amplifiers: https://www.extron.com/product/dahd4kplusseries + +The 2, 4 and 6 port models are supported. + +Firmware version 1.02.0001 or higher is required. + +Note that older Extron hardware revisions have a problem with the CEC voltage, +which may mean that CEC will not work. This is fixed in hardware revisions +E34814 and up. + +The CEC support has two modes: the first is a manual mode where userspace has +to manually control CEC for the HDMI Input and all HDMI Outputs. While this gives +full control, it is also complicated. + +The second mode is an automatic mode, which is selected if the module option +``vendor_id`` is set. In that case the driver controls CEC and CEC messages +received in the input will be distributed to the outputs. It is still possible +to use the /dev/cecX devices to talk to the connected devices directly, but it is +the driver that configures everything and deals with things like Hotplug Detect +changes. + +The driver also takes care of the EDIDs: /dev/videoX devices are created to +read the EDIDs and (for the HDMI Input port) to set the EDID. + +By default userspace is responsible to set the EDID for the HDMI Input +according to the EDIDs of the connected displays. But if the ``manufacturer_name`` +module option is set, then the driver will take care of setting the EDID +of the HDMI Input based on the supported resolutions of the connected displays. +Currently the driver only supports resolutions 1080p60 and 4kp60: if all connected +displays support 4kp60, then it will advertise 4kp60 on the HDMI input, otherwise +it will fall back to an EDID that just reports 1080p60. + +The status of the Extron is reported in ``/sys/kernel/debug/cec/cecX/status``. + +The extron-da-hd-4k-plus driver implements the following module options: + +``debug`` +--------- + +If set to 1, then all serial port traffic is shown. + +``vendor_id`` +------------- + +The CEC Vendor ID to report to connected displays. + +If set, then the driver will take care of distributing CEC messages received +on the input to the HDMI outputs. This is done for the following CEC messages: + +- <Standby> +- <Image View On> and <Text View On> +- <Give Device Power Status> +- <Set System Audio Mode> +- <Request Current Latency> + +If not set, then userspace is responsible for this, and it will have to +configure the CEC devices for HDMI Input and the HDMI Outputs manually. + +``manufacturer_name`` +--------------------- + +A three character manufacturer name that is used in the EDID for the HDMI +Input. If not set, then userspace is responsible for configuring an EDID. +If set, then the driver will update the EDID automatically based on the +resolutions supported by the connected displays, and it will not be possible +anymore to manually set the EDID for the HDMI Input. + +``hpd_never_low`` +----------------- + +If set, then the Hotplug Detect pin of the HDMI Input will always be high, +even if nothing is connected to the HDMI Outputs. If not set (the default) +then the Hotplug Detect pin of the HDMI input will go low if all the detected +Hotplug Detect pins of the HDMI Outputs are also low. + +This option may be changed dynamically. diff --git a/Documentation/admin-guide/media/em28xx-cardlist.rst b/Documentation/admin-guide/media/em28xx-cardlist.rst index ace65718ea22..7dac07986d91 100644 --- a/Documentation/admin-guide/media/em28xx-cardlist.rst +++ b/Documentation/admin-guide/media/em28xx-cardlist.rst @@ -438,3 +438,11 @@ EM28xx cards list - MyGica iGrabber - em2860 - 1f4d:1abe + * - 106 + - Hauppauge USB QuadHD ATSC + - em28274 + - 2040:846d + * - 107 + - MyGica UTV3 Analog USB2.0 TV Box + - em2860 + - eb1a:2860 diff --git a/Documentation/admin-guide/media/index.rst b/Documentation/admin-guide/media/index.rst index be7e0e4482ca..b11737ae6c04 100644 --- a/Documentation/admin-guide/media/index.rst +++ b/Documentation/admin-guide/media/index.rst @@ -20,6 +20,11 @@ Documentation/driver-api/media/index.rst - for driver development information and Kernel APIs used by media devices; +Documentation/process/debugging/media_specific_debugging_guide.rst + + - for advice about essential tools and techniques to debug drivers on this + subsystem + .. toctree:: :caption: Table of Contents :maxdepth: 2 diff --git a/Documentation/admin-guide/media/ipu3.rst b/Documentation/admin-guide/media/ipu3.rst index 83b3cd03b35c..9c190942932e 100644 --- a/Documentation/admin-guide/media/ipu3.rst +++ b/Documentation/admin-guide/media/ipu3.rst @@ -98,7 +98,7 @@ frames in packed raw Bayer format to IPU3 CSI2 receiver. # and that ov5670 sensor is connected to i2c bus 10 with address 0x36 export SDEV=$(media-ctl -d $MDEV -e "ov5670 10-0036") - # Establish the link for the media devices using media-ctl [#f3]_ + # Establish the link for the media devices using media-ctl media-ctl -d $MDEV -l "ov5670:0 -> ipu3-csi2 0:0[1]" # Set the format for the media devices @@ -589,12 +589,8 @@ preserved. References ========== -.. [#f5] drivers/staging/media/ipu3/include/uapi/intel-ipu3.h - .. [#f1] https://github.com/intel/nvt .. [#f2] http://git.ideasonboard.org/yavta.git -.. [#f3] http://git.ideasonboard.org/?p=media-ctl.git;a=summary - .. [#f4] ImgU limitation requires an additional 16x16 for all input resolutions diff --git a/Documentation/admin-guide/media/ipu6-isys.rst b/Documentation/admin-guide/media/ipu6-isys.rst new file mode 100644 index 000000000000..d05086824a74 --- /dev/null +++ b/Documentation/admin-guide/media/ipu6-isys.rst @@ -0,0 +1,161 @@ +.. SPDX-License-Identifier: GPL-2.0 + +.. include:: <isonum.txt> + +======================================================== +Intel Image Processing Unit 6 (IPU6) Input System driver +======================================================== + +Copyright |copy| 2023--2024 Intel Corporation + +Introduction +============ + +This file documents the Intel IPU6 (6th generation Image Processing Unit) +Input System (MIPI CSI2 receiver) drivers located under +drivers/media/pci/intel/ipu6. + +The Intel IPU6 can be found in certain Intel SoCs but not in all SKUs: + +* Tiger Lake +* Jasper Lake +* Alder Lake +* Raptor Lake +* Meteor Lake + +Intel IPU6 is made up of two components - Input System (ISYS) and Processing +System (PSYS). + +The Input System mainly works as MIPI CSI-2 receiver which receives and +processes the image data from the sensors and outputs the frames to memory. + +There are 2 driver modules - intel-ipu6 and intel-ipu6-isys. intel-ipu6 is an +IPU6 common driver which does PCI configuration, firmware loading and parsing, +firmware authentication, DMA mapping and IPU-MMU (internal Memory mapping Unit) +configuration. intel_ipu6_isys implements V4L2, Media Controller and V4L2 +sub-device interfaces. The IPU6 ISYS driver supports camera sensors connected +to the IPU6 ISYS through V4L2 sub-device sensor drivers. + +.. Note:: See Documentation/driver-api/media/drivers/ipu6.rst for more + information about the IPU6 hardware. + +Input system driver +=================== + +The Input System driver mainly configures CSI-2 D-PHY, constructs the firmware +stream configuration, sends commands to firmware, gets response from hardware +and firmware and then returns buffers to user. The ISYS is represented as +several V4L2 sub-devices as well as video nodes. + +.. kernel-figure:: ipu6_isys_graph.svg + :alt: ipu6 isys media graph with multiple streams support + + IPU6 ISYS media graph with multiple streams support + +The graph has been produced using the following command: + +.. code-block:: none + + fdp -Gsplines=true -Tsvg < dot > dot.svg + +Capturing frames with IPU6 ISYS +------------------------------- + +IPU6 ISYS is used to capture frames from the camera sensors connected to the +CSI2 ports. The supported input formats of ISYS are listed in table below: + +.. tabularcolumns:: |p{0.8cm}|p{4.0cm}|p{4.0cm}| + +.. flat-table:: + :header-rows: 1 + + * - IPU6 ISYS supported input formats + + * - RGB565, RGB888 + + * - UYVY8, YUYV8 + + * - RAW8, RAW10, RAW12 + +.. _ipu6_isys_capture_examples: + +Examples +~~~~~~~~ + +Here is an example of IPU6 ISYS raw capture on Dell XPS 9315 laptop. On this +machine, ov01a10 sensor is connected to IPU ISYS CSI-2 port 2, which can +generate images at sBGGR10 with resolution 1280x800. + +Using the media controller APIs, we can configure ov01a10 sensor by +media-ctl [#f1]_ and yavta [#f2]_ to transmit frames to IPU6 ISYS. + +.. code-block:: none + + # Example 1 capture frame from ov01a10 camera sensor + # This example assumes /dev/media0 as the IPU ISYS media device + export MDEV=/dev/media0 + + # Establish the link for the media devices using media-ctl + media-ctl -d $MDEV -l "\"ov01a10 3-0036\":0 -> \"Intel IPU6 CSI2 2\":0[1]" + + # Set the format for the media devices + media-ctl -d $MDEV -V "ov01a10:0 [fmt:SBGGR10/1280x800]" + media-ctl -d $MDEV -V "Intel IPU6 CSI2 2:0 [fmt:SBGGR10/1280x800]" + media-ctl -d $MDEV -V "Intel IPU6 CSI2 2:1 [fmt:SBGGR10/1280x800]" + +Once the media pipeline is configured, desired sensor specific settings +(such as exposure and gain settings) can be set, using the yavta tool. + +e.g + +.. code-block:: none + + # and that ov01a10 sensor is connected to i2c bus 3 with address 0x36 + export SDEV=$(media-ctl -d $MDEV -e "ov01a10 3-0036") + + yavta -w 0x009e0903 400 $SDEV + yavta -w 0x009e0913 1000 $SDEV + yavta -w 0x009e0911 2000 $SDEV + +Once the desired sensor settings are set, frame captures can be done as below. + +e.g + +.. code-block:: none + + yavta --data-prefix -u -c10 -n5 -I -s 1280x800 --file=/tmp/frame-#.bin \ + -f SBGGR10 $(media-ctl -d $MDEV -e "Intel IPU6 ISYS Capture 0") + +With the above command, 10 frames are captured at 1280x800 resolution with +sBGGR10 format. The captured frames are available as /tmp/frame-#.bin files. + +Here is another example of IPU6 ISYS RAW and metadata capture from camera +sensor ov2740 on Lenovo X1 Yoga laptop. + +.. code-block:: none + + media-ctl -l "\"ov2740 14-0036\":0 -> \"Intel IPU6 CSI2 1\":0[1]" + media-ctl -l "\"Intel IPU6 CSI2 1\":1 -> \"Intel IPU6 ISYS Capture 0\":0[1]" + media-ctl -l "\"Intel IPU6 CSI2 1\":2 -> \"Intel IPU6 ISYS Capture 1\":0[1]" + + # set routing + media-ctl -R "\"Intel IPU6 CSI2 1\" [0/0->1/0[1],0/1->2/1[1]]" + + media-ctl -V "\"Intel IPU6 CSI2 1\":0/0 [fmt:SGRBG10/1932x1092]" + media-ctl -V "\"Intel IPU6 CSI2 1\":0/1 [fmt:GENERIC_8/97x1]" + media-ctl -V "\"Intel IPU6 CSI2 1\":1/0 [fmt:SGRBG10/1932x1092]" + media-ctl -V "\"Intel IPU6 CSI2 1\":2/1 [fmt:GENERIC_8/97x1]" + + CAPTURE_DEV=$(media-ctl -e "Intel IPU6 ISYS Capture 0") + ./yavta --data-prefix -c100 -n5 -I -s1932x1092 --file=/tmp/frame-#.bin \ + -f SGRBG10 ${CAPTURE_DEV} + + CAPTURE_META=$(media-ctl -e "Intel IPU6 ISYS Capture 1") + ./yavta --data-prefix -c100 -n5 -I -s97x1 -B meta-capture \ + --file=/tmp/meta-#.bin -f GENERIC_8 ${CAPTURE_META} + +References +========== + +.. [#f1] https://git.ideasonboard.org/media-ctl.git +.. [#f2] https://git.ideasonboard.org/yavta.git diff --git a/Documentation/admin-guide/media/ipu6_isys_graph.svg b/Documentation/admin-guide/media/ipu6_isys_graph.svg new file mode 100644 index 000000000000..c8539ef320d2 --- /dev/null +++ b/Documentation/admin-guide/media/ipu6_isys_graph.svg @@ -0,0 +1,548 @@ +<?xml version="1.0" encoding="UTF-8" standalone="no"?> +<!DOCTYPE svg PUBLIC "-//W3C//DTD SVG 1.1//EN" + "http://www.w3.org/Graphics/SVG/1.1/DTD/svg11.dtd"> +<!-- Generated by graphviz version 2.43.0 (0) + --> +<!-- Title: board Pages: 1 --> +<svg width="1703pt" height="1473pt" + viewBox="0.00 0.00 1703.00 1473.00" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink"> +<g id="graph0" class="graph" transform="scale(1 1) rotate(0) translate(4 1469)"> +<title>board</title> +<polygon fill="white" stroke="transparent" points="-4,4 -4,-1469 1699,-1469 1699,4 -4,4"/> +<!-- n00000001 --> +<g id="node1" class="node"> +<title>n00000001</title> +<polygon fill="yellow" stroke="black" points="832.99,-750.08 629.99,-750.08 629.99,-712.08 832.99,-712.08 832.99,-750.08"/> +<text text-anchor="middle" x="731.49" y="-734.88" font-family="Times,serif" font-size="14.00">Intel IPU6 ISYS Capture 0</text> +<text text-anchor="middle" x="731.49" y="-719.88" font-family="Times,serif" font-size="14.00">/dev/video0</text> +</g> +<!-- n00000005 --> +<g id="node2" class="node"> +<title>n00000005</title> +<polygon fill="yellow" stroke="black" points="1396.59,-771.88 1193.59,-771.88 1193.59,-733.88 1396.59,-733.88 1396.59,-771.88"/> +<text text-anchor="middle" x="1295.09" y="-756.68" font-family="Times,serif" font-size="14.00">Intel IPU6 ISYS Capture 1</text> +<text text-anchor="middle" x="1295.09" y="-741.68" font-family="Times,serif" font-size="14.00">/dev/video1</text> +</g> +<!-- n00000009 --> +<g id="node3" class="node"> +<title>n00000009</title> +<polygon fill="yellow" stroke="black" points="1118.52,-690.47 915.52,-690.47 915.52,-652.47 1118.52,-652.47 1118.52,-690.47"/> +<text text-anchor="middle" x="1017.02" y="-675.27" font-family="Times,serif" font-size="14.00">Intel IPU6 ISYS Capture 2</text> +<text text-anchor="middle" x="1017.02" y="-660.27" font-family="Times,serif" font-size="14.00">/dev/video2</text> +</g> +<!-- n0000000d --> +<g id="node4" class="node"> +<title>n0000000d</title> +<polygon fill="yellow" stroke="black" points="1105.89,-838.84 902.89,-838.84 902.89,-800.84 1105.89,-800.84 1105.89,-838.84"/> +<text text-anchor="middle" x="1004.39" y="-823.64" font-family="Times,serif" font-size="14.00">Intel IPU6 ISYS Capture 3</text> +<text text-anchor="middle" x="1004.39" y="-808.64" font-family="Times,serif" font-size="14.00">/dev/video3</text> +</g> +<!-- n00000011 --> +<g id="node5" class="node"> +<title>n00000011</title> +<polygon fill="yellow" stroke="black" points="1279.22,-992.95 1076.22,-992.95 1076.22,-954.95 1279.22,-954.95 1279.22,-992.95"/> +<text text-anchor="middle" x="1177.72" y="-977.75" font-family="Times,serif" font-size="14.00">Intel IPU6 ISYS Capture 4</text> +<text text-anchor="middle" x="1177.72" y="-962.75" font-family="Times,serif" font-size="14.00">/dev/video4</text> +</g> +<!-- n00000015 --> +<g id="node6" class="node"> +<title>n00000015</title> +<polygon fill="yellow" stroke="black" points="939.18,-984.91 736.18,-984.91 736.18,-946.91 939.18,-946.91 939.18,-984.91"/> +<text text-anchor="middle" x="837.68" y="-969.71" font-family="Times,serif" font-size="14.00">Intel IPU6 ISYS Capture 5</text> +<text text-anchor="middle" x="837.68" y="-954.71" font-family="Times,serif" font-size="14.00">/dev/video5</text> +</g> +<!-- n00000019 --> +<g id="node7" class="node"> +<title>n00000019</title> +<polygon fill="yellow" stroke="black" points="957.87,-527.99 754.87,-527.99 754.87,-489.99 957.87,-489.99 957.87,-527.99"/> +<text text-anchor="middle" x="856.37" y="-512.79" font-family="Times,serif" font-size="14.00">Intel IPU6 ISYS Capture 6</text> +<text text-anchor="middle" x="856.37" y="-497.79" font-family="Times,serif" font-size="14.00">/dev/video6</text> +</g> +<!-- n0000001d --> +<g id="node8" class="node"> +<title>n0000001d</title> +<polygon fill="yellow" stroke="black" points="1291.02,-542.15 1088.02,-542.15 1088.02,-504.15 1291.02,-504.15 1291.02,-542.15"/> +<text text-anchor="middle" x="1189.52" y="-526.95" font-family="Times,serif" font-size="14.00">Intel IPU6 ISYS Capture 7</text> +<text text-anchor="middle" x="1189.52" y="-511.95" font-family="Times,serif" font-size="14.00">/dev/video7</text> +</g> +<!-- n00000021 --> +<g id="node9" class="node"> +<title>n00000021</title> +<polygon fill="yellow" stroke="black" points="202.74,-611.46 -0.26,-611.46 -0.26,-573.46 202.74,-573.46 202.74,-611.46"/> +<text text-anchor="middle" x="101.24" y="-596.26" font-family="Times,serif" font-size="14.00">Intel IPU6 ISYS Capture 8</text> +<text text-anchor="middle" x="101.24" y="-581.26" font-family="Times,serif" font-size="14.00">/dev/video8</text> +</g> +<!-- n00000025 --> +<g id="node10" class="node"> +<title>n00000025</title> +<polygon fill="yellow" stroke="black" points="764.86,-637.89 561.86,-637.89 561.86,-599.89 764.86,-599.89 764.86,-637.89"/> +<text text-anchor="middle" x="663.36" y="-622.69" font-family="Times,serif" font-size="14.00">Intel IPU6 ISYS Capture 9</text> +<text text-anchor="middle" x="663.36" y="-607.69" font-family="Times,serif" font-size="14.00">/dev/video9</text> +</g> +<!-- n00000029 --> +<g id="node11" class="node"> +<title>n00000029</title> +<polygon fill="yellow" stroke="black" points="358.62,-519.5 146.62,-519.5 146.62,-481.5 358.62,-481.5 358.62,-519.5"/> +<text text-anchor="middle" x="252.62" y="-504.3" font-family="Times,serif" font-size="14.00">Intel IPU6 ISYS Capture 10</text> +<text text-anchor="middle" x="252.62" y="-489.3" font-family="Times,serif" font-size="14.00">/dev/video10</text> +</g> +<!-- n0000002d --> +<g id="node12" class="node"> +<title>n0000002d</title> +<polygon fill="yellow" stroke="black" points="481.4,-662.59 269.4,-662.59 269.4,-624.59 481.4,-624.59 481.4,-662.59"/> +<text text-anchor="middle" x="375.4" y="-647.39" font-family="Times,serif" font-size="14.00">Intel IPU6 ISYS Capture 11</text> +<text text-anchor="middle" x="375.4" y="-632.39" font-family="Times,serif" font-size="14.00">/dev/video11</text> +</g> +<!-- n00000031 --> +<g id="node13" class="node"> +<title>n00000031</title> +<polygon fill="yellow" stroke="black" points="637.17,-837.47 425.17,-837.47 425.17,-799.47 637.17,-799.47 637.17,-837.47"/> +<text text-anchor="middle" x="531.17" y="-822.27" font-family="Times,serif" font-size="14.00">Intel IPU6 ISYS Capture 12</text> +<text text-anchor="middle" x="531.17" y="-807.27" font-family="Times,serif" font-size="14.00">/dev/video12</text> +</g> +<!-- n00000035 --> +<g id="node14" class="node"> +<title>n00000035</title> +<polygon fill="yellow" stroke="black" points="337.75,-833.67 125.75,-833.67 125.75,-795.67 337.75,-795.67 337.75,-833.67"/> +<text text-anchor="middle" x="231.75" y="-818.47" font-family="Times,serif" font-size="14.00">Intel IPU6 ISYS Capture 13</text> +<text text-anchor="middle" x="231.75" y="-803.47" font-family="Times,serif" font-size="14.00">/dev/video13</text> +</g> +<!-- n00000039 --> +<g id="node15" class="node"> +<title>n00000039</title> +<polygon fill="yellow" stroke="black" points="393.07,-317.96 181.07,-317.96 181.07,-279.96 393.07,-279.96 393.07,-317.96"/> +<text text-anchor="middle" x="287.07" y="-302.76" font-family="Times,serif" font-size="14.00">Intel IPU6 ISYS Capture 14</text> +<text text-anchor="middle" x="287.07" y="-287.76" font-family="Times,serif" font-size="14.00">/dev/video14</text> +</g> +<!-- n0000003d --> +<g id="node16" class="node"> +<title>n0000003d</title> +<polygon fill="yellow" stroke="black" points="701.46,-391.04 489.46,-391.04 489.46,-353.04 701.46,-353.04 701.46,-391.04"/> +<text text-anchor="middle" x="595.46" y="-375.84" font-family="Times,serif" font-size="14.00">Intel IPU6 ISYS Capture 15</text> +<text text-anchor="middle" x="595.46" y="-360.84" font-family="Times,serif" font-size="14.00">/dev/video15</text> +</g> +<!-- n00000041 --> +<g id="node17" class="node"> +<title>n00000041</title> +<polygon fill="yellow" stroke="black" points="212.45,-1228.8 0.45,-1228.8 0.45,-1190.8 212.45,-1190.8 212.45,-1228.8"/> +<text text-anchor="middle" x="106.45" y="-1213.6" font-family="Times,serif" font-size="14.00">Intel IPU6 ISYS Capture 16</text> +<text text-anchor="middle" x="106.45" y="-1198.6" font-family="Times,serif" font-size="14.00">/dev/video16</text> +</g> +<!-- n00000045 --> +<g id="node18" class="node"> +<title>n00000045</title> +<polygon fill="yellow" stroke="black" points="784.86,-1252.38 572.86,-1252.38 572.86,-1214.38 784.86,-1214.38 784.86,-1252.38"/> +<text text-anchor="middle" x="678.86" y="-1237.18" font-family="Times,serif" font-size="14.00">Intel IPU6 ISYS Capture 17</text> +<text text-anchor="middle" x="678.86" y="-1222.18" font-family="Times,serif" font-size="14.00">/dev/video17</text> +</g> +<!-- n00000049 --> +<g id="node19" class="node"> +<title>n00000049</title> +<polygon fill="yellow" stroke="black" points="503.14,-1169.96 291.14,-1169.96 291.14,-1131.96 503.14,-1131.96 503.14,-1169.96"/> +<text text-anchor="middle" x="397.14" y="-1154.76" font-family="Times,serif" font-size="14.00">Intel IPU6 ISYS Capture 18</text> +<text text-anchor="middle" x="397.14" y="-1139.76" font-family="Times,serif" font-size="14.00">/dev/video18</text> +</g> +<!-- n0000004d --> +<g id="node20" class="node"> +<title>n0000004d</title> +<polygon fill="yellow" stroke="black" points="492.62,-1319.4 280.62,-1319.4 280.62,-1281.4 492.62,-1281.4 492.62,-1319.4"/> +<text text-anchor="middle" x="386.62" y="-1304.2" font-family="Times,serif" font-size="14.00">Intel IPU6 ISYS Capture 19</text> +<text text-anchor="middle" x="386.62" y="-1289.2" font-family="Times,serif" font-size="14.00">/dev/video19</text> +</g> +<!-- n00000051 --> +<g id="node21" class="node"> +<title>n00000051</title> +<polygon fill="yellow" stroke="black" points="680.74,-1464.66 468.74,-1464.66 468.74,-1426.66 680.74,-1426.66 680.74,-1464.66"/> +<text text-anchor="middle" x="574.74" y="-1449.46" font-family="Times,serif" font-size="14.00">Intel IPU6 ISYS Capture 20</text> +<text text-anchor="middle" x="574.74" y="-1434.46" font-family="Times,serif" font-size="14.00">/dev/video20</text> +</g> +<!-- n00000055 --> +<g id="node22" class="node"> +<title>n00000055</title> +<polygon fill="yellow" stroke="black" points="302.42,-1452.56 90.42,-1452.56 90.42,-1414.56 302.42,-1414.56 302.42,-1452.56"/> +<text text-anchor="middle" x="196.42" y="-1437.36" font-family="Times,serif" font-size="14.00">Intel IPU6 ISYS Capture 21</text> +<text text-anchor="middle" x="196.42" y="-1422.36" font-family="Times,serif" font-size="14.00">/dev/video21</text> +</g> +<!-- n00000059 --> +<g id="node23" class="node"> +<title>n00000059</title> +<polygon fill="yellow" stroke="black" points="319.89,-1018.32 107.89,-1018.32 107.89,-980.32 319.89,-980.32 319.89,-1018.32"/> +<text text-anchor="middle" x="213.89" y="-1003.12" font-family="Times,serif" font-size="14.00">Intel IPU6 ISYS Capture 22</text> +<text text-anchor="middle" x="213.89" y="-988.12" font-family="Times,serif" font-size="14.00">/dev/video22</text> +</g> +<!-- n0000005d --> +<g id="node24" class="node"> +<title>n0000005d</title> +<polygon fill="yellow" stroke="black" points="692.62,-1031.39 480.62,-1031.39 480.62,-993.39 692.62,-993.39 692.62,-1031.39"/> +<text text-anchor="middle" x="586.62" y="-1016.19" font-family="Times,serif" font-size="14.00">Intel IPU6 ISYS Capture 23</text> +<text text-anchor="middle" x="586.62" y="-1001.19" font-family="Times,serif" font-size="14.00">/dev/video23</text> +</g> +<!-- n00000061 --> +<g id="node25" class="node"> +<title>n00000061</title> +<polygon fill="yellow" stroke="black" points="1122.45,-248.8 910.45,-248.8 910.45,-210.8 1122.45,-210.8 1122.45,-248.8"/> +<text text-anchor="middle" x="1016.45" y="-233.6" font-family="Times,serif" font-size="14.00">Intel IPU6 ISYS Capture 24</text> +<text text-anchor="middle" x="1016.45" y="-218.6" font-family="Times,serif" font-size="14.00">/dev/video24</text> +</g> +<!-- n00000065 --> +<g id="node26" class="node"> +<title>n00000065</title> +<polygon fill="yellow" stroke="black" points="1694.86,-272.38 1482.86,-272.38 1482.86,-234.38 1694.86,-234.38 1694.86,-272.38"/> +<text text-anchor="middle" x="1588.86" y="-257.18" font-family="Times,serif" font-size="14.00">Intel IPU6 ISYS Capture 25</text> +<text text-anchor="middle" x="1588.86" y="-242.18" font-family="Times,serif" font-size="14.00">/dev/video25</text> +</g> +<!-- n00000069 --> +<g id="node27" class="node"> +<title>n00000069</title> +<polygon fill="yellow" stroke="black" points="1413.14,-189.96 1201.14,-189.96 1201.14,-151.96 1413.14,-151.96 1413.14,-189.96"/> +<text text-anchor="middle" x="1307.14" y="-174.76" font-family="Times,serif" font-size="14.00">Intel IPU6 ISYS Capture 26</text> +<text text-anchor="middle" x="1307.14" y="-159.76" font-family="Times,serif" font-size="14.00">/dev/video26</text> +</g> +<!-- n0000006d --> +<g id="node28" class="node"> +<title>n0000006d</title> +<polygon fill="yellow" stroke="black" points="1402.62,-339.4 1190.62,-339.4 1190.62,-301.4 1402.62,-301.4 1402.62,-339.4"/> +<text text-anchor="middle" x="1296.62" y="-324.2" font-family="Times,serif" font-size="14.00">Intel IPU6 ISYS Capture 27</text> +<text text-anchor="middle" x="1296.62" y="-309.2" font-family="Times,serif" font-size="14.00">/dev/video27</text> +</g> +<!-- n00000071 --> +<g id="node29" class="node"> +<title>n00000071</title> +<polygon fill="yellow" stroke="black" points="1590.74,-484.66 1378.74,-484.66 1378.74,-446.66 1590.74,-446.66 1590.74,-484.66"/> +<text text-anchor="middle" x="1484.74" y="-469.46" font-family="Times,serif" font-size="14.00">Intel IPU6 ISYS Capture 28</text> +<text text-anchor="middle" x="1484.74" y="-454.46" font-family="Times,serif" font-size="14.00">/dev/video28</text> +</g> +<!-- n00000075 --> +<g id="node30" class="node"> +<title>n00000075</title> +<polygon fill="yellow" stroke="black" points="1212.42,-472.56 1000.42,-472.56 1000.42,-434.56 1212.42,-434.56 1212.42,-472.56"/> +<text text-anchor="middle" x="1106.42" y="-457.36" font-family="Times,serif" font-size="14.00">Intel IPU6 ISYS Capture 29</text> +<text text-anchor="middle" x="1106.42" y="-442.36" font-family="Times,serif" font-size="14.00">/dev/video29</text> +</g> +<!-- n00000079 --> +<g id="node31" class="node"> +<title>n00000079</title> +<polygon fill="yellow" stroke="black" points="1229.89,-38.32 1017.89,-38.32 1017.89,-0.32 1229.89,-0.32 1229.89,-38.32"/> +<text text-anchor="middle" x="1123.89" y="-23.12" font-family="Times,serif" font-size="14.00">Intel IPU6 ISYS Capture 30</text> +<text text-anchor="middle" x="1123.89" y="-8.12" font-family="Times,serif" font-size="14.00">/dev/video30</text> +</g> +<!-- n0000007d --> +<g id="node32" class="node"> +<title>n0000007d</title> +<polygon fill="yellow" stroke="black" points="1602.62,-51.39 1390.62,-51.39 1390.62,-13.39 1602.62,-13.39 1602.62,-51.39"/> +<text text-anchor="middle" x="1496.62" y="-36.19" font-family="Times,serif" font-size="14.00">Intel IPU6 ISYS Capture 31</text> +<text text-anchor="middle" x="1496.62" y="-21.19" font-family="Times,serif" font-size="14.00">/dev/video31</text> +</g> +<!-- n00000081 --> +<g id="node33" class="node"> +<title>n00000081</title> +<path fill="green" stroke="black" d="M924.28,-700.28C924.28,-700.28 1108.28,-700.28 1108.28,-700.28 1114.28,-700.28 1120.28,-706.28 1120.28,-712.28 1120.28,-712.28 1120.28,-772.28 1120.28,-772.28 1120.28,-778.28 1114.28,-784.28 1108.28,-784.28 1108.28,-784.28 924.28,-784.28 924.28,-784.28 918.28,-784.28 912.28,-778.28 912.28,-772.28 912.28,-772.28 912.28,-712.28 912.28,-712.28 912.28,-706.28 918.28,-700.28 924.28,-700.28"/> +<text text-anchor="middle" x="1016.28" y="-769.08" font-family="Times,serif" font-size="14.00">0</text> +<polyline fill="none" stroke="black" points="912.28,-761.28 1120.28,-761.28 "/> +<text text-anchor="middle" x="1016.28" y="-746.08" font-family="Times,serif" font-size="14.00">Intel IPU6 CSI2 0</text> +<text text-anchor="middle" x="1016.28" y="-731.08" font-family="Times,serif" font-size="14.00">/dev/v4l-subdev0</text> +<polyline fill="none" stroke="black" points="912.28,-723.28 1120.28,-723.28 "/> +<text text-anchor="middle" x="925.28" y="-708.08" font-family="Times,serif" font-size="14.00">1</text> +<polyline fill="none" stroke="black" points="938.28,-700.28 938.28,-723.28 "/> +<text text-anchor="middle" x="951.28" y="-708.08" font-family="Times,serif" font-size="14.00">2</text> +<polyline fill="none" stroke="black" points="964.28,-700.28 964.28,-723.28 "/> +<text text-anchor="middle" x="977.28" y="-708.08" font-family="Times,serif" font-size="14.00">3</text> +<polyline fill="none" stroke="black" points="990.28,-700.28 990.28,-723.28 "/> +<text text-anchor="middle" x="1003.28" y="-708.08" font-family="Times,serif" font-size="14.00">4</text> +<polyline fill="none" stroke="black" points="1016.28,-700.28 1016.28,-723.28 "/> +<text text-anchor="middle" x="1029.28" y="-708.08" font-family="Times,serif" font-size="14.00">5</text> +<polyline fill="none" stroke="black" points="1042.28,-700.28 1042.28,-723.28 "/> +<text text-anchor="middle" x="1055.28" y="-708.08" font-family="Times,serif" font-size="14.00">6</text> +<polyline fill="none" stroke="black" points="1068.28,-700.28 1068.28,-723.28 "/> +<text text-anchor="middle" x="1081.28" y="-708.08" font-family="Times,serif" font-size="14.00">7</text> +<polyline fill="none" stroke="black" points="1094.28,-700.28 1094.28,-723.28 "/> +<text text-anchor="middle" x="1107.28" y="-708.08" font-family="Times,serif" font-size="14.00">8</text> +</g> +<!-- n00000081->n00000001 --> +<g id="edge1" class="edge"> +<title>n00000081:port1->n00000001</title> +<path fill="none" stroke="black" stroke-dasharray="5,2" d="M912.28,-711.28C912.28,-711.28 880.33,-714.78 843.28,-718.84"/> +<polygon fill="black" stroke="black" points="842.81,-715.37 833.25,-719.94 843.57,-722.33 842.81,-715.37"/> +</g> +<!-- n00000081->n00000005 --> +<g id="edge2" class="edge"> +<title>n00000081:port2->n00000005</title> +<path fill="none" stroke="black" stroke-dasharray="5,2" d="M951.38,-700.28C951.38,-700.28 1086.18,-688.61 1123.48,-697.08 1155.93,-704.45 1158.99,-719.67 1190.39,-730.68 1190.49,-730.71 1190.59,-730.75 1190.69,-730.78"/> +<polygon fill="black" stroke="black" points="1189.45,-734.06 1200.05,-733.86 1191.64,-727.41 1189.45,-734.06"/> +</g> +<!-- n00000081->n00000009 --> +<g id="edge3" class="edge"> +<title>n00000081:port3->n00000009</title> +<path fill="none" stroke="black" stroke-dasharray="5,2" d="M977.28,-700.28C977.28,-700.28 979.31,-698.81 982.45,-696.54"/> +<polygon fill="black" stroke="black" points="984.7,-699.23 990.74,-690.53 980.59,-693.56 984.7,-699.23"/> +</g> +<!-- n00000081->n0000000d --> +<g id="edge4" class="edge"> +<title>n00000081:port4->n0000000d</title> +<path fill="none" stroke="black" stroke-dasharray="5,2" d="M1003.38,-700.26C1003.38,-700.26 916.62,-689.8 909.08,-697.08 880.2,-725.01 885.68,-754.82 909.08,-787.48 910.88,-789.99 918.96,-793.59 929.7,-797.47"/> +<polygon fill="black" stroke="black" points="928.69,-800.82 939.28,-800.79 930.98,-794.21 928.69,-800.82"/> +</g> +<!-- n00000081->n00000011 --> +<g id="edge5" class="edge"> +<title>n00000081:port5->n00000011</title> +<path fill="none" stroke="black" stroke-dasharray="5,2" d="M1029.19,-700.26C1029.19,-700.26 1115.28,-690.56 1123.48,-697.08 1198.37,-756.64 1190.55,-886.51 1182.64,-944.71"/> +<polygon fill="black" stroke="black" points="1179.16,-944.31 1181.18,-954.71 1186.09,-945.32 1179.16,-944.31"/> +</g> +<!-- n00000081->n00000015 --> +<g id="edge6" class="edge"> +<title>n00000081:port6->n00000015</title> +<path fill="none" stroke="black" stroke-dasharray="5,2" d="M1055.18,-700.28C1055.18,-700.28 915.57,-692.2 909.08,-697.08 834.02,-753.51 831.79,-879.34 835.06,-936.56"/> +<polygon fill="black" stroke="black" points="831.58,-936.99 835.74,-946.73 838.56,-936.52 831.58,-936.99"/> +</g> +<!-- n00000081->n00000019 --> +<g id="edge7" class="edge"> +<title>n00000081:port7->n00000019</title> +<path fill="none" stroke="black" stroke-dasharray="5,2" d="M1081.28,-700.28C1081.28,-700.28 916.04,-696.54 912.32,-693.67 864.52,-656.73 856.3,-580.22 855.62,-538.2"/> +<polygon fill="black" stroke="black" points="859.11,-538.05 855.59,-528.06 852.11,-538.07 859.11,-538.05"/> +</g> +<!-- n00000081->n0000001d --> +<g id="edge8" class="edge"> +<title>n00000081:port8->n0000001d</title> +<path fill="none" stroke="black" stroke-dasharray="5,2" d="M1107.28,-700.28C1107.28,-700.28 1119.29,-696.23 1121.72,-693.67 1159.76,-653.62 1177.38,-589.6 1184.78,-552.46"/> +<polygon fill="black" stroke="black" points="1188.29,-552.76 1186.69,-542.29 1181.41,-551.47 1188.29,-552.76"/> +</g> +<!-- n0000008b --> +<g id="node34" class="node"> +<title>n0000008b</title> +<path fill="green" stroke="black" d="M293.1,-532.08C293.1,-532.08 477.1,-532.08 477.1,-532.08 483.1,-532.08 489.1,-538.08 489.1,-544.08 489.1,-544.08 489.1,-604.08 489.1,-604.08 489.1,-610.08 483.1,-616.08 477.1,-616.08 477.1,-616.08 293.1,-616.08 293.1,-616.08 287.1,-616.08 281.1,-610.08 281.1,-604.08 281.1,-604.08 281.1,-544.08 281.1,-544.08 281.1,-538.08 287.1,-532.08 293.1,-532.08"/> +<text text-anchor="middle" x="385.1" y="-600.88" font-family="Times,serif" font-size="14.00">0</text> +<polyline fill="none" stroke="black" points="281.1,-593.08 489.1,-593.08 "/> +<text text-anchor="middle" x="385.1" y="-577.88" font-family="Times,serif" font-size="14.00">Intel IPU6 CSI2 1</text> +<text text-anchor="middle" x="385.1" y="-562.88" font-family="Times,serif" font-size="14.00">/dev/v4l-subdev1</text> +<polyline fill="none" stroke="black" points="281.1,-555.08 489.1,-555.08 "/> +<text text-anchor="middle" x="294.1" y="-539.88" font-family="Times,serif" font-size="14.00">1</text> +<polyline fill="none" stroke="black" points="307.1,-532.08 307.1,-555.08 "/> +<text text-anchor="middle" x="320.1" y="-539.88" font-family="Times,serif" font-size="14.00">2</text> +<polyline fill="none" stroke="black" points="333.1,-532.08 333.1,-555.08 "/> +<text text-anchor="middle" x="346.1" y="-539.88" font-family="Times,serif" font-size="14.00">3</text> +<polyline fill="none" stroke="black" points="359.1,-532.08 359.1,-555.08 "/> +<text text-anchor="middle" x="372.1" y="-539.88" font-family="Times,serif" font-size="14.00">4</text> +<polyline fill="none" stroke="black" points="385.1,-532.08 385.1,-555.08 "/> +<text text-anchor="middle" x="398.1" y="-539.88" font-family="Times,serif" font-size="14.00">5</text> +<polyline fill="none" stroke="black" points="411.1,-532.08 411.1,-555.08 "/> +<text text-anchor="middle" x="424.1" y="-539.88" font-family="Times,serif" font-size="14.00">6</text> +<polyline fill="none" stroke="black" points="437.1,-532.08 437.1,-555.08 "/> +<text text-anchor="middle" x="450.1" y="-539.88" font-family="Times,serif" font-size="14.00">7</text> +<polyline fill="none" stroke="black" points="463.1,-532.08 463.1,-555.08 "/> +<text text-anchor="middle" x="476.1" y="-539.88" font-family="Times,serif" font-size="14.00">8</text> +</g> +<!-- n0000008b->n00000021 --> +<g id="edge9" class="edge"> +<title>n0000008b:port1->n00000021</title> +<path fill="none" stroke="black" d="M281.1,-543.08C281.1,-543.08 240.1,-560.51 205.94,-570.26 205.35,-570.43 204.77,-570.59 204.18,-570.76"/> +<polygon fill="black" stroke="black" points="203.2,-567.39 194.47,-573.39 205.03,-574.15 203.2,-567.39"/> +</g> +<!-- n0000008b->n00000025 --> +<g id="edge10" class="edge"> +<title>n0000008b:port2->n00000025</title> +<path fill="none" stroke="black" d="M320.2,-532.07C320.2,-532.07 456.9,-514.37 492.3,-528.88 528.42,-543.68 522.86,-571.78 556.11,-594.53"/> +<polygon fill="black" stroke="black" points="554.54,-597.67 564.9,-599.88 558.18,-591.69 554.54,-597.67"/> +</g> +<!-- n0000008b->n00000029 --> +<g id="edge11" class="edge"> +<title>n0000008b:port3->n00000029</title> +<path fill="none" stroke="black" stroke-dasharray="5,2" d="M346.1,-532.08C346.1,-532.08 333.93,-527.96 318.37,-522.71"/> +<polygon fill="black" stroke="black" points="319.48,-519.39 308.88,-519.5 317.24,-526.02 319.48,-519.39"/> +</g> +<!-- n0000008b->n0000002d --> +<g id="edge12" class="edge"> +<title>n0000008b:port4->n0000002d</title> +<path fill="none" stroke="black" stroke-dasharray="5,2" d="M372.19,-532.05C372.19,-532.05 292.97,-514.3 277.9,-528.88 249.01,-556.8 253.16,-587.62 277.9,-619.28 278.34,-619.85 280.33,-620.69 283.45,-621.71"/> +<polygon fill="black" stroke="black" points="282.71,-625.14 293.29,-624.58 284.67,-618.42 282.71,-625.14"/> +</g> +<!-- n0000008b->n00000031 --> +<g id="edge13" class="edge"> +<title>n0000008b:port5->n00000031</title> +<path fill="none" stroke="black" stroke-dasharray="5,2" d="M398,-532.05C398,-532.05 476.28,-515.34 492.3,-528.88 568.49,-593.29 550.55,-729.67 538.14,-789.41"/> +<polygon fill="black" stroke="black" points="534.69,-788.79 535.99,-799.31 541.53,-790.28 534.69,-788.79"/> +</g> +<!-- n0000008b->n00000035 --> +<g id="edge14" class="edge"> +<title>n0000008b:port6->n00000035</title> +<path fill="none" stroke="black" stroke-dasharray="5,2" d="M424,-532.07C424,-532.07 290.37,-518.48 277.9,-528.88 202.27,-591.86 215.34,-725.69 225.66,-785.15"/> +<polygon fill="black" stroke="black" points="222.29,-786.14 227.54,-795.35 229.17,-784.88 222.29,-786.14"/> +</g> +<!-- n0000008b->n00000039 --> +<g id="edge15" class="edge"> +<title>n0000008b:port7->n00000039</title> +<path fill="none" stroke="black" stroke-dasharray="5,2" d="M450.1,-532.08C450.1,-532.08 395.22,-528.13 383.45,-518.65 375.46,-512.21 322.64,-385.46 298.76,-327.47"/> +<polygon fill="black" stroke="black" points="301.96,-326.05 294.92,-318.14 295.49,-328.72 301.96,-326.05"/> +</g> +<!-- n0000008b->n0000003d --> +<g id="edge16" class="edge"> +<title>n0000008b:port8->n0000003d</title> +<path fill="none" stroke="black" stroke-dasharray="5,2" d="M476.1,-532.08C476.1,-532.08 522.37,-522.39 526.85,-518.65 563.15,-488.33 581.38,-434.52 589.6,-401.2"/> +<polygon fill="black" stroke="black" points="593.08,-401.69 591.93,-391.16 586.26,-400.11 593.08,-401.69"/> +</g> +<!-- n00000095 --> +<g id="node35" class="node"> +<title>n00000095</title> +<path fill="green" stroke="black" d="M301.38,-1180.11C301.38,-1180.11 485.38,-1180.11 485.38,-1180.11 491.38,-1180.11 497.38,-1186.11 497.38,-1192.11 497.38,-1192.11 497.38,-1252.11 497.38,-1252.11 497.38,-1258.11 491.38,-1264.11 485.38,-1264.11 485.38,-1264.11 301.38,-1264.11 301.38,-1264.11 295.38,-1264.11 289.38,-1258.11 289.38,-1252.11 289.38,-1252.11 289.38,-1192.11 289.38,-1192.11 289.38,-1186.11 295.38,-1180.11 301.38,-1180.11"/> +<text text-anchor="middle" x="393.38" y="-1248.91" font-family="Times,serif" font-size="14.00">0</text> +<polyline fill="none" stroke="black" points="289.38,-1241.11 497.38,-1241.11 "/> +<text text-anchor="middle" x="393.38" y="-1225.91" font-family="Times,serif" font-size="14.00">Intel IPU6 CSI2 2</text> +<text text-anchor="middle" x="393.38" y="-1210.91" font-family="Times,serif" font-size="14.00">/dev/v4l-subdev2</text> +<polyline fill="none" stroke="black" points="289.38,-1203.11 497.38,-1203.11 "/> +<text text-anchor="middle" x="302.38" y="-1187.91" font-family="Times,serif" font-size="14.00">1</text> +<polyline fill="none" stroke="black" points="315.38,-1180.11 315.38,-1203.11 "/> +<text text-anchor="middle" x="328.38" y="-1187.91" font-family="Times,serif" font-size="14.00">2</text> +<polyline fill="none" stroke="black" points="341.38,-1180.11 341.38,-1203.11 "/> +<text text-anchor="middle" x="354.38" y="-1187.91" font-family="Times,serif" font-size="14.00">3</text> +<polyline fill="none" stroke="black" points="367.38,-1180.11 367.38,-1203.11 "/> +<text text-anchor="middle" x="380.38" y="-1187.91" font-family="Times,serif" font-size="14.00">4</text> +<polyline fill="none" stroke="black" points="393.38,-1180.11 393.38,-1203.11 "/> +<text text-anchor="middle" x="406.38" y="-1187.91" font-family="Times,serif" font-size="14.00">5</text> +<polyline fill="none" stroke="black" points="419.38,-1180.11 419.38,-1203.11 "/> +<text text-anchor="middle" x="432.38" y="-1187.91" font-family="Times,serif" font-size="14.00">6</text> +<polyline fill="none" stroke="black" points="445.38,-1180.11 445.38,-1203.11 "/> +<text text-anchor="middle" x="458.38" y="-1187.91" font-family="Times,serif" font-size="14.00">7</text> +<polyline fill="none" stroke="black" points="471.38,-1180.11 471.38,-1203.11 "/> +<text text-anchor="middle" x="484.38" y="-1187.91" font-family="Times,serif" font-size="14.00">8</text> +</g> +<!-- n00000095->n00000041 --> +<g id="edge17" class="edge"> +<title>n00000095:port1->n00000041</title> +<path fill="none" stroke="black" stroke-dasharray="5,2" d="M289.38,-1191.11C289.38,-1191.11 258.94,-1194.22 222.89,-1197.91"/> +<polygon fill="black" stroke="black" points="222.19,-1194.46 212.6,-1198.96 222.9,-1201.42 222.19,-1194.46"/> +</g> +<!-- n00000095->n00000045 --> +<g id="edge18" class="edge"> +<title>n00000095:port2->n00000045</title> +<path fill="none" stroke="black" stroke-dasharray="5,2" d="M328.48,-1180.11C328.48,-1180.11 463.26,-1168.53 500.58,-1176.91 534.02,-1184.43 537.24,-1200.06 569.66,-1211.18 569.76,-1211.22 569.86,-1211.25 569.96,-1211.29"/> +<polygon fill="black" stroke="black" points="568.86,-1214.61 579.45,-1214.34 571,-1207.95 568.86,-1214.61"/> +</g> +<!-- n00000095->n00000049 --> +<g id="edge19" class="edge"> +<title>n00000095:port3->n00000049</title> +<path fill="none" stroke="black" stroke-dasharray="5,2" d="M354.38,-1180.11C354.38,-1180.11 356.8,-1178.46 360.49,-1175.94"/> +<polygon fill="black" stroke="black" points="362.56,-1178.77 368.86,-1170.24 358.62,-1172.98 362.56,-1178.77"/> +</g> +<!-- n00000095->n0000004d --> +<g id="edge20" class="edge"> +<title>n00000095:port4->n0000004d</title> +<path fill="none" stroke="black" stroke-dasharray="5,2" d="M380.47,-1180.09C380.47,-1180.09 293.71,-1169.63 286.18,-1176.91 257.29,-1204.84 262.63,-1234.76 286.18,-1267.31 288.16,-1270.05 297.33,-1273.96 309.38,-1278.13"/> +<polygon fill="black" stroke="black" points="308.49,-1281.53 319.09,-1281.36 310.7,-1274.88 308.49,-1281.53"/> +</g> +<!-- n00000095->n00000051 --> +<g id="edge21" class="edge"> +<title>n00000095:port5->n00000051</title> +<path fill="none" stroke="black" stroke-dasharray="5,2" d="M406.28,-1180.09C406.28,-1180.09 492.13,-1170.7 500.58,-1176.91 576.41,-1232.66 579.83,-1358.79 577.09,-1416.2"/> +<polygon fill="black" stroke="black" points="573.59,-1416.23 576.51,-1426.41 580.58,-1416.63 573.59,-1416.23"/> +</g> +<!-- n00000095->n00000055 --> +<g id="edge22" class="edge"> +<title>n00000095:port6->n00000055</title> +<path fill="none" stroke="black" stroke-dasharray="5,2" d="M432.28,-1180.11C432.28,-1180.11 292.85,-1172.29 286.18,-1176.91 211.26,-1228.86 198.3,-1348.49 196.45,-1404.12"/> +<polygon fill="black" stroke="black" points="192.94,-1404.28 196.21,-1414.36 199.94,-1404.44 192.94,-1404.28"/> +</g> +<!-- n00000095->n00000059 --> +<g id="edge23" class="edge"> +<title>n00000095:port7->n00000059</title> +<path fill="none" stroke="black" stroke-dasharray="5,2" d="M458.38,-1180.11C458.38,-1180.11 291.84,-1175.85 287.94,-1173.16 239.87,-1139.96 222.85,-1068.83 216.94,-1028.6"/> +<polygon fill="black" stroke="black" points="220.39,-1028.06 215.6,-1018.61 213.46,-1028.98 220.39,-1028.06"/> +</g> +<!-- n00000095->n0000005d --> +<g id="edge24" class="edge"> +<title>n00000095:port8->n0000005d</title> +<path fill="none" stroke="black" stroke-dasharray="5,2" d="M484.38,-1180.11C484.38,-1180.11 502.45,-1176.49 506.34,-1173.16 547.25,-1138.2 569.47,-1077.38 579.62,-1041.41"/> +<polygon fill="black" stroke="black" points="583.06,-1042.09 582.28,-1031.53 576.3,-1040.27 583.06,-1042.09"/> +</g> +<!-- n0000009f --> +<g id="node36" class="node"> +<title>n0000009f</title> +<path fill="green" stroke="black" d="M1211.38,-200.11C1211.38,-200.11 1395.38,-200.11 1395.38,-200.11 1401.38,-200.11 1407.38,-206.11 1407.38,-212.11 1407.38,-212.11 1407.38,-272.11 1407.38,-272.11 1407.38,-278.11 1401.38,-284.11 1395.38,-284.11 1395.38,-284.11 1211.38,-284.11 1211.38,-284.11 1205.38,-284.11 1199.38,-278.11 1199.38,-272.11 1199.38,-272.11 1199.38,-212.11 1199.38,-212.11 1199.38,-206.11 1205.38,-200.11 1211.38,-200.11"/> +<text text-anchor="middle" x="1303.38" y="-268.91" font-family="Times,serif" font-size="14.00">0</text> +<polyline fill="none" stroke="black" points="1199.38,-261.11 1407.38,-261.11 "/> +<text text-anchor="middle" x="1303.38" y="-245.91" font-family="Times,serif" font-size="14.00">Intel IPU6 CSI2 3</text> +<text text-anchor="middle" x="1303.38" y="-230.91" font-family="Times,serif" font-size="14.00">/dev/v4l-subdev3</text> +<polyline fill="none" stroke="black" points="1199.38,-223.11 1407.38,-223.11 "/> +<text text-anchor="middle" x="1212.38" y="-207.91" font-family="Times,serif" font-size="14.00">1</text> +<polyline fill="none" stroke="black" points="1225.38,-200.11 1225.38,-223.11 "/> +<text text-anchor="middle" x="1238.38" y="-207.91" font-family="Times,serif" font-size="14.00">2</text> +<polyline fill="none" stroke="black" points="1251.38,-200.11 1251.38,-223.11 "/> +<text text-anchor="middle" x="1264.38" y="-207.91" font-family="Times,serif" font-size="14.00">3</text> +<polyline fill="none" stroke="black" points="1277.38,-200.11 1277.38,-223.11 "/> +<text text-anchor="middle" x="1290.38" y="-207.91" font-family="Times,serif" font-size="14.00">4</text> +<polyline fill="none" stroke="black" points="1303.38,-200.11 1303.38,-223.11 "/> +<text text-anchor="middle" x="1316.38" y="-207.91" font-family="Times,serif" font-size="14.00">5</text> +<polyline fill="none" stroke="black" points="1329.38,-200.11 1329.38,-223.11 "/> +<text text-anchor="middle" x="1342.38" y="-207.91" font-family="Times,serif" font-size="14.00">6</text> +<polyline fill="none" stroke="black" points="1355.38,-200.11 1355.38,-223.11 "/> +<text text-anchor="middle" x="1368.38" y="-207.91" font-family="Times,serif" font-size="14.00">7</text> +<polyline fill="none" stroke="black" points="1381.38,-200.11 1381.38,-223.11 "/> +<text text-anchor="middle" x="1394.38" y="-207.91" font-family="Times,serif" font-size="14.00">8</text> +</g> +<!-- n0000009f->n00000061 --> +<g id="edge25" class="edge"> +<title>n0000009f:port1->n00000061</title> +<path fill="none" stroke="black" stroke-dasharray="5,2" d="M1199.38,-211.11C1199.38,-211.11 1168.94,-214.22 1132.89,-217.91"/> +<polygon fill="black" stroke="black" points="1132.19,-214.46 1122.6,-218.96 1132.9,-221.42 1132.19,-214.46"/> +</g> +<!-- n0000009f->n00000065 --> +<g id="edge26" class="edge"> +<title>n0000009f:port2->n00000065</title> +<path fill="none" stroke="black" stroke-dasharray="5,2" d="M1238.48,-200.11C1238.48,-200.11 1373.26,-188.53 1410.58,-196.91 1444.02,-204.43 1447.24,-220.06 1479.66,-231.18 1479.76,-231.22 1479.86,-231.25 1479.96,-231.29"/> +<polygon fill="black" stroke="black" points="1478.86,-234.61 1489.45,-234.34 1481,-227.95 1478.86,-234.61"/> +</g> +<!-- n0000009f->n00000069 --> +<g id="edge27" class="edge"> +<title>n0000009f:port3->n00000069</title> +<path fill="none" stroke="black" stroke-dasharray="5,2" d="M1264.38,-200.11C1264.38,-200.11 1266.8,-198.46 1270.49,-195.94"/> +<polygon fill="black" stroke="black" points="1272.56,-198.77 1278.86,-190.24 1268.62,-192.98 1272.56,-198.77"/> +</g> +<!-- n0000009f->n0000006d --> +<g id="edge28" class="edge"> +<title>n0000009f:port4->n0000006d</title> +<path fill="none" stroke="black" stroke-dasharray="5,2" d="M1290.47,-200.09C1290.47,-200.09 1203.71,-189.63 1196.18,-196.91 1167.29,-224.84 1172.63,-254.76 1196.18,-287.31 1198.16,-290.05 1207.33,-293.96 1219.38,-298.13"/> +<polygon fill="black" stroke="black" points="1218.49,-301.53 1229.09,-301.36 1220.7,-294.88 1218.49,-301.53"/> +</g> +<!-- n0000009f->n00000071 --> +<g id="edge29" class="edge"> +<title>n0000009f:port5->n00000071</title> +<path fill="none" stroke="black" stroke-dasharray="5,2" d="M1316.28,-200.09C1316.28,-200.09 1402.13,-190.7 1410.58,-196.91 1486.41,-252.66 1489.83,-378.79 1487.09,-436.2"/> +<polygon fill="black" stroke="black" points="1483.59,-436.23 1486.51,-446.41 1490.58,-436.63 1483.59,-436.23"/> +</g> +<!-- n0000009f->n00000075 --> +<g id="edge30" class="edge"> +<title>n0000009f:port6->n00000075</title> +<path fill="none" stroke="black" stroke-dasharray="5,2" d="M1342.28,-200.11C1342.28,-200.11 1202.85,-192.29 1196.18,-196.91 1121.26,-248.86 1108.3,-368.49 1106.45,-424.12"/> +<polygon fill="black" stroke="black" points="1102.94,-424.28 1106.21,-434.36 1109.94,-424.44 1102.94,-424.28"/> +</g> +<!-- n0000009f->n00000079 --> +<g id="edge31" class="edge"> +<title>n0000009f:port7->n00000079</title> +<path fill="none" stroke="black" stroke-dasharray="5,2" d="M1368.38,-200.11C1368.38,-200.11 1201.84,-195.85 1197.94,-193.16 1149.87,-159.96 1132.85,-88.83 1126.94,-48.6"/> +<polygon fill="black" stroke="black" points="1130.39,-48.06 1125.6,-38.61 1123.46,-48.98 1130.39,-48.06"/> +</g> +<!-- n0000009f->n0000007d --> +<g id="edge32" class="edge"> +<title>n0000009f:port8->n0000007d</title> +<path fill="none" stroke="black" stroke-dasharray="5,2" d="M1394.38,-200.11C1394.38,-200.11 1412.45,-196.49 1416.34,-193.16 1457.25,-158.2 1479.47,-97.38 1489.62,-61.41"/> +<polygon fill="black" stroke="black" points="1493.06,-62.09 1492.28,-51.53 1486.3,-60.27 1493.06,-62.09"/> +</g> +<!-- n000000e9 --> +<g id="node37" class="node"> +<title>n000000e9</title> +<path fill="green" stroke="black" d="M398.65,-431.45C398.65,-431.45 511.65,-431.45 511.65,-431.45 517.65,-431.45 523.65,-437.45 523.65,-443.45 523.65,-443.45 523.65,-503.45 523.65,-503.45 523.65,-509.45 517.65,-515.45 511.65,-515.45 511.65,-515.45 398.65,-515.45 398.65,-515.45 392.65,-515.45 386.65,-509.45 386.65,-503.45 386.65,-503.45 386.65,-443.45 386.65,-443.45 386.65,-437.45 392.65,-431.45 398.65,-431.45"/> +<text text-anchor="middle" x="420.65" y="-500.25" font-family="Times,serif" font-size="14.00">1</text> +<polyline fill="none" stroke="black" points="454.65,-492.45 454.65,-515.45 "/> +<text text-anchor="middle" x="489.15" y="-500.25" font-family="Times,serif" font-size="14.00">2</text> +<polyline fill="none" stroke="black" points="386.65,-492.45 523.65,-492.45 "/> +<text text-anchor="middle" x="455.15" y="-477.25" font-family="Times,serif" font-size="14.00">ov2740 4-0036</text> +<text text-anchor="middle" x="455.15" y="-462.25" font-family="Times,serif" font-size="14.00">/dev/v4l-subdev4</text> +<polyline fill="none" stroke="black" points="386.65,-454.45 523.65,-454.45 "/> +<text text-anchor="middle" x="455.15" y="-439.25" font-family="Times,serif" font-size="14.00">0</text> +</g> +<!-- n000000e9->n0000008b --> +<g id="edge33" class="edge"> +<title>n000000e9:port0->n0000008b:port0</title> +<path fill="none" stroke="black" stroke-width="2" d="M386.14,-442.55C386.14,-442.55 361.11,-493.23 383.45,-518.65 391.47,-527.78 484.31,-519.72 492.3,-528.88 508.64,-547.6 499.26,-579.87 493.12,-595.68"/> +<polygon fill="black" stroke="black" stroke-width="2" points="489.86,-594.41 489.11,-604.98 496.29,-597.19 489.86,-594.41"/> +</g> +</g> +</svg> diff --git a/Documentation/admin-guide/media/mgb4.rst b/Documentation/admin-guide/media/mgb4.rst index 2977f74d7e26..5ac69b833a7a 100644 --- a/Documentation/admin-guide/media/mgb4.rst +++ b/Documentation/admin-guide/media/mgb4.rst @@ -1,8 +1,19 @@ .. SPDX-License-Identifier: GPL-2.0 -==================== -mgb4 sysfs interface -==================== +.. include:: <isonum.txt> + +The mgb4 driver +=============== + +Copyright |copy| 2023 - 2025 Digiteq Automotive + author: Martin Tůma <martin.tuma@digiteqautomotive.com> + +This is a v4l2 device driver for the Digiteq Automotive FrameGrabber 4, a PCIe +card capable of capturing and generating FPD-Link III and GMSL2/3 video streams +as used in the automotive industry. + +sysfs interface +--------------- The mgb4 driver provides a sysfs interface, that is used to configure video stream related parameters (some of them must be set properly before the v4l2 @@ -12,16 +23,17 @@ There are two types of parameters - global / PCI card related, found under ``/sys/class/video4linux/videoX/device`` and module specific found under ``/sys/class/video4linux/videoX``. - Global (PCI card) parameters -============================ +~~~~~~~~~~~~~~~~~~~~~~~~~~~~ **module_type** (R): Module type. | 0 - No module present | 1 - FPDL3 - | 2 - GMSL + | 2 - GMSL (one serializer, two daisy chained deserializers) + | 3 - GMSL (one serializer, two deserializers) + | 4 - GMSL (two deserializers with two daisy chain outputs) **module_version** (R): Module version number. Zero in case of a missing module. @@ -42,9 +54,8 @@ Global (PCI card) parameters where each component is a 8b number. - Common FPDL3/GMSL input parameters -================================== +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ **input_id** (R): Input number ID, zero based. @@ -190,9 +201,8 @@ Common FPDL3/GMSL input parameters *Note: This parameter can not be changed while the input v4l2 device is open.* - Common FPDL3/GMSL output parameters -=================================== +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ **output_id** (R): Output number ID, zero based. @@ -228,8 +238,13 @@ Common FPDL3/GMSL output parameters open.* **frame_rate** (RW): - Output video frame rate in frames per second. The default frame rate is - 60Hz. + Output video signal frame rate limit in frames per second. Due to + the limited output pixel clock steps, the card can not always generate + a frame rate perfectly matching the value required by the connected display. + Using this parameter one can limit the frame rate by "crippling" the signal + so that the lines are not equal (the porches of the last line differ) but + the signal appears like having the exact frame rate to the connected display. + The default frame rate limit is 60Hz. **hsync_polarity** (RW): HSYNC signal polarity. @@ -254,37 +269,36 @@ Common FPDL3/GMSL output parameters and there is a non-linear stepping between two consecutive allowed frequencies. The driver finds the nearest allowed frequency to the given value and sets it. When reading this property, you get the exact - frequency set by the driver. The default frequency is 70000kHz. + frequency set by the driver. The default frequency is 61150kHz. *Note: This parameter can not be changed while the output v4l2 device is open.* **hsync_width** (RW): - Width of the HSYNC signal in pixels. The default value is 16. + Width of the HSYNC signal in pixels. The default value is 40. **vsync_width** (RW): - Width of the VSYNC signal in video lines. The default value is 2. + Width of the VSYNC signal in video lines. The default value is 20. **hback_porch** (RW): Number of PCLK pulses between deassertion of the HSYNC signal and the first - valid pixel in the video line (marked by DE=1). The default value is 32. + valid pixel in the video line (marked by DE=1). The default value is 50. **hfront_porch** (RW): Number of PCLK pulses between the end of the last valid pixel in the video line (marked by DE=1) and assertion of the HSYNC signal. The default value - is 32. + is 50. **vback_porch** (RW): Number of video lines between deassertion of the VSYNC signal and the video - line with the first valid pixel (marked by DE=1). The default value is 2. + line with the first valid pixel (marked by DE=1). The default value is 31. **vfront_porch** (RW): Number of video lines between the end of the last valid pixel line (marked - by DE=1) and assertion of the VSYNC signal. The default value is 2. - + by DE=1) and assertion of the VSYNC signal. The default value is 30. FPDL3 specific input parameters -=============================== +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ **fpdl3_input_width** (RW): Number of deserializer input lines. @@ -294,7 +308,7 @@ FPDL3 specific input parameters | 2 - dual FPDL3 specific output parameters -================================ +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ **fpdl3_output_width** (RW): Number of serializer output lines. @@ -304,7 +318,7 @@ FPDL3 specific output parameters | 2 - dual GMSL specific input parameters -============================== +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ **gmsl_mode** (RW): GMSL speed mode. @@ -328,10 +342,8 @@ GMSL specific input parameters | 0 - disabled | 1 - enabled (default) - -==================== -mgb4 mtd partitions -==================== +MTD partitions +-------------- The mgb4 driver creates a MTD device with two partitions: - mgb4-fw.X - FPGA firmware. @@ -344,9 +356,8 @@ also have a third partition named *mgb4-flash* available in the system. This partition represents the whole, unpartitioned, card's FLASH memory and one should not fiddle with it... -==================== -mgb4 iio (triggers) -==================== +IIO (triggers) +-------------- The mgb4 driver creates an Industrial I/O (IIO) device that provides trigger and signal level status capability. The following scan elements are available: diff --git a/Documentation/admin-guide/media/omap4_camera.rst b/Documentation/admin-guide/media/omap4_camera.rst deleted file mode 100644 index 2ada9b1e6897..000000000000 --- a/Documentation/admin-guide/media/omap4_camera.rst +++ /dev/null @@ -1,62 +0,0 @@ -.. SPDX-License-Identifier: GPL-2.0 - -OMAP4 ISS Driver -================ - -Author: Sergio Aguirre <sergio.a.aguirre@gmail.com> - -Copyright (C) 2012, Texas Instruments - -Introduction ------------- - -The OMAP44XX family of chips contains the Imaging SubSystem (a.k.a. ISS), -Which contains several components that can be categorized in 3 big groups: - -- Interfaces (2 Interfaces: CSI2-A & CSI2-B/CCP2) -- ISP (Image Signal Processor) -- SIMCOP (Still Image Coprocessor) - -For more information, please look in [#f1]_ for latest version of: -"OMAP4430 Multimedia Device Silicon Revision 2.x" - -As of Revision AB, the ISS is described in detail in section 8. - -This driver is supporting **only** the CSI2-A/B interfaces for now. - -It makes use of the Media Controller framework [#f2]_, and inherited most of the -code from OMAP3 ISP driver (found under drivers/media/platform/ti/omap3isp/\*), -except that it doesn't need an IOMMU now for ISS buffers memory mapping. - -Supports usage of MMAP buffers only (for now). - -Tested platforms ----------------- - -- OMAP4430SDP, w/ ES2.1 GP & SEVM4430-CAM-V1-0 (Contains IMX060 & OV5640, in - which only the last one is supported, outputting YUV422 frames). - -- TI Blaze MDP, w/ OMAP4430 ES2.2 EMU (Contains 1 IMX060 & 2 OV5650 sensors, in - which only the OV5650 are supported, outputting RAW10 frames). - -- PandaBoard, Rev. A2, w/ OMAP4430 ES2.1 GP & OV adapter board, tested with - following sensors: - * OV5640 - * OV5650 - -- Tested on mainline kernel: - - http://git.kernel.org/?p=linux/kernel/git/torvalds/linux.git;a=summary - - Tag: v3.3 (commit c16fa4f2ad19908a47c63d8fa436a1178438c7e7) - -File list ---------- -drivers/staging/media/omap4iss/ -include/linux/platform_data/media/omap4iss.h - -References ----------- - -.. [#f1] http://focus.ti.com/general/docs/wtbu/wtbudocumentcenter.tsp?navigationId=12037&templateId=6123#62 -.. [#f2] http://lwn.net/Articles/420485/ diff --git a/Documentation/admin-guide/media/pci-cardlist.rst b/Documentation/admin-guide/media/pci-cardlist.rst index 7d8e3c8987db..239879634ea5 100644 --- a/Documentation/admin-guide/media/pci-cardlist.rst +++ b/Documentation/admin-guide/media/pci-cardlist.rst @@ -86,7 +86,6 @@ saa7134 Philips SAA7134 saa7164 NXP SAA7164 smipcie SMI PCIe DVBSky cards solo6x10 Bluecherry / Softlogic 6x10 capture cards (MPEG-4/H.264) -sta2x11_vip STA2X11 VIP Video For Linux tw5864 Techwell TW5864 video/audio grabber and encoder tw686x Intersil/Techwell TW686x tw68 Techwell tw68x Video For Linux diff --git a/Documentation/admin-guide/media/raspberrypi-pisp-be.dot b/Documentation/admin-guide/media/raspberrypi-pisp-be.dot new file mode 100644 index 000000000000..55671dc1d443 --- /dev/null +++ b/Documentation/admin-guide/media/raspberrypi-pisp-be.dot @@ -0,0 +1,20 @@ +digraph board { + rankdir=TB + n00000001 [label="{{<port0> 0 | <port1> 1 | <port2> 2 | <port7> 7} | pispbe\n | {<port3> 3 | <port4> 4 | <port5> 5 | <port6> 6}}", shape=Mrecord, style=filled, fillcolor=green] + n00000001:port3 -> n0000001c [style=bold] + n00000001:port4 -> n00000022 [style=bold] + n00000001:port5 -> n00000028 [style=bold] + n00000001:port6 -> n0000002e [style=bold] + n0000000a [label="pispbe-input\n/dev/video0", shape=box, style=filled, fillcolor=yellow] + n0000000a -> n00000001:port0 [style=bold] + n00000010 [label="pispbe-tdn_input\n/dev/video1", shape=box, style=filled, fillcolor=yellow] + n00000010 -> n00000001:port1 [style=bold] + n00000016 [label="pispbe-stitch_input\n/dev/video2", shape=box, style=filled, fillcolor=yellow] + n00000016 -> n00000001:port2 [style=bold] + n0000001c [label="pispbe-output0\n/dev/video3", shape=box, style=filled, fillcolor=yellow] + n00000022 [label="pispbe-output1\n/dev/video4", shape=box, style=filled, fillcolor=yellow] + n00000028 [label="pispbe-tdn_output\n/dev/video5", shape=box, style=filled, fillcolor=yellow] + n0000002e [label="pispbe-stitch_output\n/dev/video6", shape=box, style=filled, fillcolor=yellow] + n00000034 [label="pispbe-config\n/dev/video7", shape=box, style=filled, fillcolor=yellow] + n00000034 -> n00000001:port7 [style=bold] +} diff --git a/Documentation/admin-guide/media/raspberrypi-pisp-be.rst b/Documentation/admin-guide/media/raspberrypi-pisp-be.rst new file mode 100644 index 000000000000..0fcf46f26276 --- /dev/null +++ b/Documentation/admin-guide/media/raspberrypi-pisp-be.rst @@ -0,0 +1,109 @@ +.. SPDX-License-Identifier: GPL-2.0 + +========================================================= +Raspberry Pi PiSP Back End Memory-to-Memory ISP (pisp-be) +========================================================= + +The PiSP Back End +================= + +The PiSP Back End is a memory-to-memory Image Signal Processor (ISP) which reads +image data from DRAM memory and performs image processing as specified by the +application through the parameters in a configuration buffer, before writing +pixel data back to memory through two distinct output channels. + +The ISP registers and programming model are documented in the `Raspberry Pi +Image Signal Processor (PiSP) Specification document`_ + +The PiSP Back End ISP processes images in tiles. The handling of image +tessellation and the computation of low-level configuration parameters is +realized by a free software library called `libpisp +<https://github.com/raspberrypi/libpisp>`_. + +The full image processing pipeline, which involves capturing RAW Bayer data from +an image sensor through a MIPI CSI-2 compatible capture interface, storing them +in DRAM memory and processing them in the PiSP Back End to obtain images usable +by an application is implemented in `libcamera <https://libcamera.org>`_ as +part of the Raspberry Pi platform support. + +The pisp-be driver +================== + +The Raspberry Pi PiSP Back End (pisp-be) driver is located under +drivers/media/platform/raspberrypi/pisp-be. It uses the `V4L2 API` to register +a number of video capture and output devices, the `V4L2 subdev API` to register +a subdevice for the ISP that connects the video devices in a single media graph +realized using the `Media Controller (MC) API`. + +The media topology registered by the `pisp-be` driver is represented below: + +.. _pips-be-topology: + +.. kernel-figure:: raspberrypi-pisp-be.dot + :alt: Diagram of the default media pipeline topology + :align: center + + +The media graph registers the following video device nodes: + +- pispbe-input: output device for images to be submitted to the ISP for + processing. +- pispbe-tdn_input: output device for temporal denoise. +- pispbe-stitch_input: output device for image stitching (HDR). +- pispbe-output0: first capture device for processed images. +- pispbe-output1: second capture device for processed images. +- pispbe-tdn_output: capture device for temporal denoise. +- pispbe-stitch_output: capture device for image stitching (HDR). +- pispbe-config: output device for ISP configuration parameters. + +pispbe-input +------------ + +Images to be processed by the ISP are queued to the `pispbe-input` output device +node. For a list of image formats supported as input to the ISP refer to the +`Raspberry Pi Image Signal Processor (PiSP) Specification document`_. + +pispbe-tdn_input, pispbe-tdn_output +----------------------------------- + +The `pispbe-tdn_input` output video device receives images to be processed by +the temporal denoise block which are captured from the `pispbe-tdn_output` +capture video device. Userspace is responsible for maintaining queues on both +devices, and ensuring that buffers completed on the output are queued to the +input. + +pispbe-stitch_input, pispbe-stitch_output +----------------------------------------- + +To realize HDR (high dynamic range) image processing the image stitching and +tonemapping blocks are used. The `pispbe-stitch_output` writes images to memory +and the `pispbe-stitch_input` receives the previously written frame to process +it along with the current input image. Userspace is responsible for maintaining +queues on both devices, and ensuring that buffers completed on the output are +queued to the input. + +pispbe-output0, pispbe-output1 +------------------------------ + +The two capture devices write to memory the pixel data as processed by the ISP. + +pispbe-config +------------- + +The `pispbe-config` output video devices receives a buffer of configuration +parameters that define the desired image processing to be performed by the ISP. + +The format of the ISP configuration parameter is defined by +:c:type:`pisp_be_tiles_config` C structure and the meaning of each parameter is +described in the `Raspberry Pi Image Signal Processor (PiSP) Specification +document`_. + +ISP configuration +================= + +The ISP configuration is described solely by the content of the parameters +buffer. The only parameter that userspace needs to configure using the V4L2 API +is the image format on the output and capture video devices for validation of +the content of the parameters buffer. + +.. _Raspberry Pi Image Signal Processor (PiSP) Specification document: https://datasheets.raspberrypi.com/camera/raspberry-pi-image-signal-processor-specification.pdf diff --git a/Documentation/admin-guide/media/raspberrypi-rp1-cfe.dot b/Documentation/admin-guide/media/raspberrypi-rp1-cfe.dot new file mode 100644 index 000000000000..7717f2291049 --- /dev/null +++ b/Documentation/admin-guide/media/raspberrypi-rp1-cfe.dot @@ -0,0 +1,27 @@ +digraph board { + rankdir=TB + n00000001 [label="{{<port0> 0} | csi2\n/dev/v4l-subdev0 | {<port1> 1 | <port2> 2 | <port3> 3 | <port4> 4}}", shape=Mrecord, style=filled, fillcolor=green] + n00000001:port1 -> n00000011 [style=dashed] + n00000001:port1 -> n00000007:port0 + n00000001:port2 -> n00000015 + n00000001:port2 -> n00000007:port0 [style=dashed] + n00000001:port3 -> n00000019 [style=dashed] + n00000001:port3 -> n00000007:port0 [style=dashed] + n00000001:port4 -> n0000001d [style=dashed] + n00000001:port4 -> n00000007:port0 [style=dashed] + n00000007 [label="{{<port0> 0 | <port1> 1} | pisp-fe\n/dev/v4l-subdev1 | {<port2> 2 | <port3> 3 | <port4> 4}}", shape=Mrecord, style=filled, fillcolor=green] + n00000007:port2 -> n00000021 + n00000007:port3 -> n00000025 [style=dashed] + n00000007:port4 -> n00000029 + n0000000d [label="{imx219 6-0010\n/dev/v4l-subdev2 | {<port0> 0}}", shape=Mrecord, style=filled, fillcolor=green] + n0000000d:port0 -> n00000001:port0 [style=bold] + n00000011 [label="rp1-cfe-csi2-ch0\n/dev/video0", shape=box, style=filled, fillcolor=yellow] + n00000015 [label="rp1-cfe-csi2-ch1\n/dev/video1", shape=box, style=filled, fillcolor=yellow] + n00000019 [label="rp1-cfe-csi2-ch2\n/dev/video2", shape=box, style=filled, fillcolor=yellow] + n0000001d [label="rp1-cfe-csi2-ch3\n/dev/video3", shape=box, style=filled, fillcolor=yellow] + n00000021 [label="rp1-cfe-fe-image0\n/dev/video4", shape=box, style=filled, fillcolor=yellow] + n00000025 [label="rp1-cfe-fe-image1\n/dev/video5", shape=box, style=filled, fillcolor=yellow] + n00000029 [label="rp1-cfe-fe-stats\n/dev/video6", shape=box, style=filled, fillcolor=yellow] + n0000002d [label="rp1-cfe-fe-config\n/dev/video7", shape=box, style=filled, fillcolor=yellow] + n0000002d -> n00000007:port1 +} diff --git a/Documentation/admin-guide/media/raspberrypi-rp1-cfe.rst b/Documentation/admin-guide/media/raspberrypi-rp1-cfe.rst new file mode 100644 index 000000000000..668d978a9875 --- /dev/null +++ b/Documentation/admin-guide/media/raspberrypi-rp1-cfe.rst @@ -0,0 +1,78 @@ +.. SPDX-License-Identifier: GPL-2.0 + +============================================ +Raspberry Pi PiSP Camera Front End (rp1-cfe) +============================================ + +The PiSP Camera Front End +========================= + +The PiSP Camera Front End (CFE) is a module which combines a CSI-2 receiver with +a simple ISP, called the Front End (FE). + +The CFE has four DMA engines and can write frames from four separate streams +received from the CSI-2 to the memory. One of those streams can also be routed +directly to the FE, which can do minimal image processing, write two versions +(e.g. non-scaled and downscaled versions) of the received frames to memory and +provide statistics of the received frames. + +The FE registers are documented in the `Raspberry Pi Image Signal Processor +(ISP) Specification document +<https://datasheets.raspberrypi.com/camera/raspberry-pi-image-signal-processor-specification.pdf>`_, +and example code for FE can be found in `libpisp +<https://github.com/raspberrypi/libpisp>`_. + +The rp1-cfe driver +================== + +The Raspberry Pi PiSP Camera Front End (rp1-cfe) driver is located under +drivers/media/platform/raspberrypi/rp1-cfe. It uses the `V4L2 API` to register +a number of video capture and output devices, the `V4L2 subdev API` to register +subdevices for the CSI-2 received and the FE that connects the video devices in +a single media graph realized using the `Media Controller (MC) API`. + +The media topology registered by the `rp1-cfe` driver, in this particular +example connected to an imx219 sensor, is the following one: + +.. _rp1-cfe-topology: + +.. kernel-figure:: raspberrypi-rp1-cfe.dot + :alt: Diagram of an example media pipeline topology + :align: center + +The media graph contains the following video device nodes: + +- rp1-cfe-csi2-ch0: capture device for the first CSI-2 stream +- rp1-cfe-csi2-ch1: capture device for the second CSI-2 stream +- rp1-cfe-csi2-ch2: capture device for the third CSI-2 stream +- rp1-cfe-csi2-ch3: capture device for the fourth CSI-2 stream +- rp1-cfe-fe-image0: capture device for the first FE output +- rp1-cfe-fe-image1: capture device for the second FE output +- rp1-cfe-fe-stats: capture device for the FE statistics +- rp1-cfe-fe-config: output device for FE configuration + +rp1-cfe-csi2-chX +---------------- + +The rp1-cfe-csi2-chX capture devices are normal V4L2 capture devices which +can be used to capture video frames or metadata received from the CSI-2. + +rp1-cfe-fe-image0, rp1-cfe-fe-image1 +------------------------------------ + +The rp1-cfe-fe-image0 and rp1-cfe-fe-image1 capture devices are used to write +the processed frames to memory. + +rp1-cfe-fe-stats +---------------- + +The format of the FE statistics buffer is defined by +:c:type:`pisp_statistics` C structure and the meaning of each parameter is +described in the `PiSP specification` document. + +rp1-cfe-fe-config +----------------- + +The format of the FE configuration buffer is defined by +:c:type:`pisp_fe_config` C structure and the meaning of each parameter is +described in the `PiSP specification` document. diff --git a/Documentation/admin-guide/media/rkisp1.rst b/Documentation/admin-guide/media/rkisp1.rst index 6f14d9561fa5..6c878c71442f 100644 --- a/Documentation/admin-guide/media/rkisp1.rst +++ b/Documentation/admin-guide/media/rkisp1.rst @@ -114,11 +114,18 @@ to be applied to the hardware during a video stream, allowing userspace to dynamically modify values such as black level, cross talk corrections and others. -The buffer format is defined by struct :c:type:`rkisp1_params_cfg`, and -userspace should set +The ISP driver supports two different parameters configuration methods, the +`fixed parameters format` or the `extensible parameters format`. + +When using the `fixed parameters` method the buffer format is defined by struct +:c:type:`rkisp1_params_cfg`, and userspace should set :ref:`V4L2_META_FMT_RK_ISP1_PARAMS <v4l2-meta-fmt-rk-isp1-params>` as the dataformat. +When using the `extensible parameters` method the buffer format is defined by +struct :c:type:`rkisp1_ext_params_cfg`, and userspace should set +:ref:`V4L2_META_FMT_RK_ISP1_EXT_PARAMS <v4l2-meta-fmt-rk-isp1-ext-params>` as +the dataformat. Capturing Video Frames Example ============================== diff --git a/Documentation/admin-guide/media/saa7134.rst b/Documentation/admin-guide/media/saa7134.rst index 51eae7eb5ab7..18d7cbc897db 100644 --- a/Documentation/admin-guide/media/saa7134.rst +++ b/Documentation/admin-guide/media/saa7134.rst @@ -67,7 +67,7 @@ Changes / Fixes Please mail to linux-media AT vger.kernel.org unified diffs against the linux media git tree: - https://git.linuxtv.org/media_tree.git/ + https://git.linuxtv.org/media.git/ This is done by committing a patch at a clone of the git tree and submitting the patch using ``git send-email``. Don't forget to diff --git a/Documentation/admin-guide/media/tuner-cardlist.rst b/Documentation/admin-guide/media/tuner-cardlist.rst index 362617c59c5d..65ecf48ddf24 100644 --- a/Documentation/admin-guide/media/tuner-cardlist.rst +++ b/Documentation/admin-guide/media/tuner-cardlist.rst @@ -97,4 +97,6 @@ Tuner number Card name 89 Sony BTF-PG472Z PAL/SECAM 90 Sony BTF-PK467Z NTSC-M-JP 91 Sony BTF-PB463Z NTSC-M +92 Silicon Labs Si2157 tuner +93 Tena TNF931D-DFDR1 ============ ===================================================== diff --git a/Documentation/admin-guide/media/v4l-drivers.rst b/Documentation/admin-guide/media/v4l-drivers.rst index f4bb2605f07e..3bac5165b134 100644 --- a/Documentation/admin-guide/media/v4l-drivers.rst +++ b/Documentation/admin-guide/media/v4l-drivers.rst @@ -10,20 +10,23 @@ Video4Linux (V4L) driver-specific documentation :maxdepth: 2 bttv + c3-isp cafe_ccic cx88 fimc imx imx7 ipu3 + ipu6-isys ivtv mgb4 omap3isp - omap4_camera philips qcom_camss + raspberrypi-pisp-be rcar-fdp1 rkisp1 + raspberrypi-rp1-cfe saa7134 si470x si4713 diff --git a/Documentation/admin-guide/media/vivid.rst b/Documentation/admin-guide/media/vivid.rst index b6f658c0997e..034ca7c77fb9 100644 --- a/Documentation/admin-guide/media/vivid.rst +++ b/Documentation/admin-guide/media/vivid.rst @@ -302,6 +302,15 @@ all configurable using the following module options: - 0: forbid hints - 1: allow hints +- supports_requests: + + specifies if the device should support the Request API. There are + three possible values, default is 1: + + - 0: no request + - 1: supports requests + - 2: requires requests + Taken together, all these module options allow you to precisely customize the driver behavior and test your application with all sorts of permutations. It is also very suitable to emulate hardware that is not yet available, e.g. @@ -313,13 +322,13 @@ Video Capture This is probably the most frequently used feature. The video capture device can be configured by using the module options num_inputs, input_types and -ccs_cap_mode (see section 1 for more detailed information), but by default -four inputs are configured: a webcam, a TV tuner, an S-Video and an HDMI -input, one input for each input type. Those are described in more detail -below. +ccs_cap_mode (see "Configuring the driver" for more detailed information), +but by default four inputs are configured: a webcam, a TV tuner, an S-Video +and an HDMI input, one input for each input type. Those are described in more +detail below. Special attention has been given to the rate at which new frames become -available. The jitter will be around 1 jiffie (that depends on the HZ +available. The jitter will be around 1 jiffy (that depends on the HZ configuration of your kernel, so usually 1/100, 1/250 or 1/1000 of a second), but the long-term behavior is exactly following the framerate. So a framerate of 59.94 Hz is really different from 60 Hz. If the framerate @@ -434,10 +443,10 @@ Video Output ------------ The video output device can be configured by using the module options -num_outputs, output_types and ccs_out_mode (see section 1 for more detailed -information), but by default two outputs are configured: an S-Video and an -HDMI input, one output for each output type. Those are described in more detail -below. +num_outputs, output_types and ccs_out_mode (see "Configuring the driver" +for more detailed information), but by default two outputs are configured: +an S-Video and an HDMI input, one output for each output type. Those are +described in more detail below. Like with video capture the framerate is also exact in the long term. @@ -1011,11 +1020,6 @@ Digital Video Controls affects the reported colorspace since DVI_D outputs will always use sRGB. -- Display Present: - - sets the presence of a "display" on the HDMI output. This affects - the tx_edid_present, tx_hotplug and tx_rxsense controls. - FM Radio Receiver Controls ~~~~~~~~~~~~~~~~~~~~~~~~~~ @@ -1130,35 +1134,34 @@ Metadata Capture Controls if set, then the generated metadata stream contains Source Clock information. -Video, VBI and RDS Looping --------------------------- -The vivid driver supports looping of video output to video input, VBI output -to VBI input and RDS output to RDS input. For video/VBI looping this emulates -as if a cable was hooked up between the output and input connector. So video -and VBI looping is only supported between S-Video and HDMI inputs and outputs. -VBI is only valid for S-Video as it makes no sense for HDMI. +Video, Sliced VBI and HDMI CEC Looping +-------------------------------------- -Since radio is wireless this looping always happens if the radio receiver -frequency is close to the radio transmitter frequency. In that case the radio -transmitter will 'override' the emulated radio stations. - -Looping is currently supported only between devices created by the same -vivid driver instance. +Video Looping functionality is supported for devices created by the same +vivid driver instance, as well as across multiple instances of the vivid driver. +The vivid driver supports looping of video and Sliced VBI data between an S-Video output +and an S-Video input. It also supports looping of video and HDMI CEC data between an +HDMI output and an HDMI input. +To enable looping, set the 'HDMI/S-Video XXX-N Is Connected To' control(s) to select +whether an input uses the Test Pattern Generator, or is disconnected, or is connected +to an output. An input can be connected to an output from any vivid instance. +The inputs and outputs are numbered XXX-N where XXX is the vivid instance number +(see module option n_devs). If there is only one vivid instance (the default), then +XXX will be 000. And N is the Nth S-Video/HDMI input or output of that instance. +If vivid is loaded without module options, then you can connect the S-Video 000-0 input +to the S-Video 000-0 output, or the HDMI 000-0 input to the HDMI 000-0 output. +This is the equivalent of connecting or disconnecting a cable between an input and an +output in a physical device. -Video and Sliced VBI looping -~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +If an 'HDMI/S-Video XXX-N Is Connected To' control selected an output, then the video +output will be looped to the video input provided that: -The way to enable video/VBI looping is currently fairly crude. A 'Loop Video' -control is available in the "Vivid" control class of the video -capture and VBI capture devices. When checked the video looping will be enabled. -Once enabled any video S-Video or HDMI input will show a static test pattern -until the video output has started. At that time the video output will be -looped to the video input provided that: +- the currently selected input matches the input indicated by the control name. -- the input type matches the output type. So the HDMI input cannot receive - video from the S-Video output. +- in the vivid instance of the output connector, the currently selected output matches + the output indicated by the control's value. - the video resolution of the video input must match that of the video output. So it is not possible to loop a 50 Hz (720x576) S-Video output to a 60 Hz @@ -1185,6 +1188,8 @@ looped to the video input provided that: "DV Timings Signal Mode" for the HDMI input should be configured so that a valid signal is passed to the video input. +If any condition is not valid, then the 'Noise' test pattern is shown. + The framerates do not have to match, although this might change in the future. By default you will see the OSD text superimposed on top of the looped video. @@ -1198,17 +1203,26 @@ and WSS (50 Hz formats) VBI data is looped. Teletext VBI data is not looped. Radio & RDS Looping -~~~~~~~~~~~~~~~~~~~ - -As mentioned in section 6 the radio receiver emulates stations are regular -frequency intervals. Depending on the frequency of the radio receiver a -signal strength value is calculated (this is returned by VIDIOC_G_TUNER). -However, it will also look at the frequency set by the radio transmitter and -if that results in a higher signal strength than the settings of the radio -transmitter will be used as if it was a valid station. This also includes -the RDS data (if any) that the transmitter 'transmits'. This is received -faithfully on the receiver side. Note that when the driver is loaded the -frequencies of the radio receiver and transmitter are not identical, so +------------------- + +The vivid driver supports looping of RDS output to RDS input. + +Since radio is wireless this looping always happens if the radio receiver +frequency is close to the radio transmitter frequency. In that case the radio +transmitter will 'override' the emulated radio stations. + +RDS looping is currently supported only between devices created by the same +vivid driver instance. + +As mentioned in the "Radio Receiver" section, the radio receiver emulates +stations at regular frequency intervals. Depending on the frequency of the +radio receiver a signal strength value is calculated (this is returned by +VIDIOC_G_TUNER). However, it will also look at the frequency set by the radio +transmitter and if that results in a higher signal strength than the settings +of the radio transmitter will be used as if it was a valid station. This also +includes the RDS data (if any) that the transmitter 'transmits'. This is +received faithfully on the receiver side. Note that when the driver is loaded +the frequencies of the radio receiver and transmitter are not identical, so initially no looping takes place. @@ -1218,8 +1232,8 @@ Cropping, Composing, Scaling This driver supports cropping, composing and scaling in any combination. Normally which features are supported can be selected through the Vivid controls, but it is also possible to hardcode it when the module is loaded through the -ccs_cap_mode and ccs_out_mode module options. See section 1 on the details of -these module options. +ccs_cap_mode and ccs_out_mode module options. See "Configuring the driver" on +the details of these module options. This allows you to test your application for all these variations. @@ -1260,7 +1274,8 @@ is set, then the alpha component is only used for the color red and set to The driver has to be configured to support the multiplanar formats. By default the driver instances are single-planar. This can be changed by setting the -multiplanar module option, see section 1 for more details on that option. +multiplanar module option, see "Configuring the driver" for more details on that +option. If the driver instance is using the multiplanar formats/API, then the first single planar format (YUYV) and the multiplanar NV16M and NV61M formats the @@ -1270,74 +1285,6 @@ data_offset to be non-zero, so this is a useful feature for testing applications Video output will also honor any data_offset that the application set. -Capture Overlay ---------------- - -Note: capture overlay support is implemented primarily to test the existing -V4L2 capture overlay API. In practice few if any GPUs support such overlays -anymore, and neither are they generally needed anymore since modern hardware -is so much more capable. By setting flag 0x10000 in the node_types module -option the vivid driver will create a simple framebuffer device that can be -used for testing this API. Whether this API should be used for new drivers is -questionable. - -This driver has support for a destructive capture overlay with bitmap clipping -and list clipping (up to 16 rectangles) capabilities. Overlays are not -supported for multiplanar formats. It also honors the struct v4l2_window field -setting: if it is set to FIELD_TOP or FIELD_BOTTOM and the capture setting is -FIELD_ALTERNATE, then only the top or bottom fields will be copied to the overlay. - -The overlay only works if you are also capturing at that same time. This is a -vivid limitation since it copies from a buffer to the overlay instead of -filling the overlay directly. And if you are not capturing, then no buffers -are available to fill. - -In addition, the pixelformat of the capture format and that of the framebuffer -must be the same for the overlay to work. Otherwise VIDIOC_OVERLAY will return -an error. - -In order to really see what it going on you will need to create two vivid -instances: the first with a framebuffer enabled. You configure the capture -overlay of the second instance to use the framebuffer of the first, then -you start capturing in the second instance. For the first instance you setup -the output overlay for the video output, turn on video looping and capture -to see the blended framebuffer overlay that's being written to by the second -instance. This setup would require the following commands: - -.. code-block:: none - - $ sudo modprobe vivid n_devs=2 node_types=0x10101,0x1 - $ v4l2-ctl -d1 --find-fb - /dev/fb1 is the framebuffer associated with base address 0x12800000 - $ sudo v4l2-ctl -d2 --set-fbuf fb=1 - $ v4l2-ctl -d1 --set-fbuf fb=1 - $ v4l2-ctl -d0 --set-fmt-video=pixelformat='AR15' - $ v4l2-ctl -d1 --set-fmt-video-out=pixelformat='AR15' - $ v4l2-ctl -d2 --set-fmt-video=pixelformat='AR15' - $ v4l2-ctl -d0 -i2 - $ v4l2-ctl -d2 -i2 - $ v4l2-ctl -d2 -c horizontal_movement=4 - $ v4l2-ctl -d1 --overlay=1 - $ v4l2-ctl -d0 -c loop_video=1 - $ v4l2-ctl -d2 --stream-mmap --overlay=1 - -And from another console: - -.. code-block:: none - - $ v4l2-ctl -d1 --stream-out-mmap - -And yet another console: - -.. code-block:: none - - $ qv4l2 - -and start streaming. - -As you can see, this is not for the faint of heart... - - Output Overlay -------------- @@ -1396,7 +1343,7 @@ Some Future Improvements Just as a reminder and in no particular order: - Add a virtual alsa driver to test audio -- Add virtual sub-devices and media controller support +- Add virtual sub-devices - Some support for testing compressed video - Add support to loop raw VBI output to raw VBI input - Add support to loop teletext sliced VBI output to VBI input @@ -1405,12 +1352,10 @@ Just as a reminder and in no particular order: - Add ARGB888 overlay support: better testing of the alpha channel - Improve pixel aspect support in the tpg code by passing a real v4l2_fract - Use per-queue locks and/or per-device locks to improve throughput -- Add support to loop from a specific output to a specific input across - vivid instances - The SDR radio should use the same 'frequencies' for stations as the normal radio receiver, and give back noise if the frequency doesn't match up with a station frequency - Make a thread for the RDS generation, that would help in particular for the "Controls" RDS Rx I/O Mode as the read-only RDS controls could be updated in real-time. -- Changing the EDID should cause hotplug detect emulation to happen. +- Changing the EDID doesn't wait 100 ms before setting the HPD signal. diff --git a/Documentation/admin-guide/mm/cma_debugfs.rst b/Documentation/admin-guide/mm/cma_debugfs.rst index 7367e6294ef6..4120e9cb0cd5 100644 --- a/Documentation/admin-guide/mm/cma_debugfs.rst +++ b/Documentation/admin-guide/mm/cma_debugfs.rst @@ -12,10 +12,16 @@ its CMA name like below: The structure of the files created under that directory is as follows: - - [RO] base_pfn: The base PFN (Page Frame Number) of the zone. + - [RO] base_pfn: The base PFN (Page Frame Number) of the CMA area. + This is the same as ranges/0/base_pfn. - [RO] count: Amount of memory in the CMA area. - [RO] order_per_bit: Order of pages represented by one bit. - - [RO] bitmap: The bitmap of page states in the zone. + - [RO] bitmap: The bitmap of allocated pages in the area. + This is the same as ranges/0/base_pfn. + - [RO] ranges/N/base_pfn: The base PFN of contiguous range N + in the CMA area. + - [RO] ranges/N/bitmap: The bit map of allocated pages in + range N in the CMA area. - [WO] alloc: Allocate N pages from that CMA area. For example:: echo 5 > <debugfs>/cma/<cma_name>/alloc diff --git a/Documentation/admin-guide/mm/damon/index.rst b/Documentation/admin-guide/mm/damon/index.rst index 33d37bb2fb4e..bc7e976120e0 100644 --- a/Documentation/admin-guide/mm/damon/index.rst +++ b/Documentation/admin-guide/mm/damon/index.rst @@ -1,12 +1,11 @@ .. SPDX-License-Identifier: GPL-2.0 -========================== -DAMON: Data Access MONitor -========================== +================================================================ +DAMON: Data Access MONitoring and Access-aware System Operations +================================================================ -:doc:`DAMON </mm/damon/index>` allows light-weight data access monitoring. -Using DAMON, users can analyze the memory access patterns of their systems and -optimize those. +:doc:`DAMON </mm/damon/index>` is a Linux kernel subsystem for efficient data +access monitoring and access-aware system operations. .. toctree:: :maxdepth: 2 diff --git a/Documentation/admin-guide/mm/damon/start.rst b/Documentation/admin-guide/mm/damon/start.rst index 7aa0071ff1c3..ede14b679d02 100644 --- a/Documentation/admin-guide/mm/damon/start.rst +++ b/Documentation/admin-guide/mm/damon/start.rst @@ -7,7 +7,7 @@ Getting Started This document briefly describes how you can use DAMON by demonstrating its default user space tool. Please note that this document describes only a part of its features for brevity. Please refer to the usage `doc -<https://github.com/awslabs/damo/blob/next/USAGE.md>`_ of the tool for more +<https://github.com/damonitor/damo/blob/next/USAGE.md>`_ of the tool for more details. @@ -26,7 +26,7 @@ User Space Tool For the demonstration, we will use the default user space tool for DAMON, called DAMON Operator (DAMO). It is available at -https://github.com/awslabs/damo. The examples below assume that ``damo`` is on +https://github.com/damonitor/damo. The examples below assume that ``damo`` is on your ``$PATH``. It's not mandatory, though. Because DAMO is using the sysfs interface (refer to :doc:`usage` for the @@ -34,18 +34,69 @@ detail) of DAMON, you should ensure :doc:`sysfs </filesystems/sysfs>` is mounted. +Snapshot Data Access Patterns +============================= + +The commands below show the memory access pattern of a program at the moment of +the execution. :: + + $ git clone https://github.com/sjp38/masim; cd masim; make + $ sudo damo start "./masim ./configs/stairs.cfg --quiet" + $ sudo damo report access + heatmap: 641111111000000000000000000000000000000000000000000000[...]33333333333333335557984444[...]7 + # min/max temperatures: -1,840,000,000, 370,010,000, column size: 3.925 MiB + 0 addr 86.182 TiB size 8.000 KiB access 0 % age 14.900 s + 1 addr 86.182 TiB size 8.000 KiB access 60 % age 0 ns + 2 addr 86.182 TiB size 3.422 MiB access 0 % age 4.100 s + 3 addr 86.182 TiB size 2.004 MiB access 95 % age 2.200 s + 4 addr 86.182 TiB size 29.688 MiB access 0 % age 14.100 s + 5 addr 86.182 TiB size 29.516 MiB access 0 % age 16.700 s + 6 addr 86.182 TiB size 29.633 MiB access 0 % age 17.900 s + 7 addr 86.182 TiB size 117.652 MiB access 0 % age 18.400 s + 8 addr 126.990 TiB size 62.332 MiB access 0 % age 9.500 s + 9 addr 126.990 TiB size 13.980 MiB access 0 % age 5.200 s + 10 addr 126.990 TiB size 9.539 MiB access 100 % age 3.700 s + 11 addr 126.990 TiB size 16.098 MiB access 0 % age 6.400 s + 12 addr 127.987 TiB size 132.000 KiB access 0 % age 2.900 s + total size: 314.008 MiB + $ sudo damo stop + +The first command of the above example downloads and builds an artificial +memory access generator program called ``masim``. The second command asks DAMO +to start the program via the given command and make DAMON monitors the newly +started process. The third command retrieves the current snapshot of the +monitored access pattern of the process from DAMON and shows the pattern in a +human readable format. + +The first line of the output shows the relative access temperature (hotness) of +the regions in a single row hetmap format. Each column on the heatmap +represents regions of same size on the monitored virtual address space. The +position of the colun on the row and the number on the column represents the +relative location and access temperature of the region. ``[...]`` means +unmapped huge regions on the virtual address spaces. The second line shows +additional information for better understanding the heatmap. + +Each line of the output from the third line shows which virtual address range +(``addr XX size XX``) of the process is how frequently (``access XX %``) +accessed for how long time (``age XX``). For example, the evelenth region of +~9.5 MiB size is being most frequently accessed for last 3.7 seconds. Finally, +the fourth command stops DAMON. + +Note that DAMON can monitor not only virtual address spaces but multiple types +of address spaces including the physical address space. + + Recording Data Access Patterns ============================== The commands below record the memory access patterns of a program and save the monitoring results to a file. :: - $ git clone https://github.com/sjp38/masim - $ cd masim; make; ./masim ./configs/zigzag.cfg & + $ ./masim ./configs/zigzag.cfg & $ sudo damo record -o damon.data $(pidof masim) -The first two lines of the commands download an artificial memory access -generator program and run it in the background. The generator will repeatedly +The line of the commands run the artificial memory access +generator program again. The generator will repeatedly access two 100 MiB sized memory regions one by one. You can substitute this with your real workload. The last line asks ``damo`` to record the access pattern in the ``damon.data`` file. @@ -57,7 +108,7 @@ Visualizing Recorded Patterns You can visualize the pattern in a heatmap, showing which memory region (x-axis) got accessed when (y-axis) and how frequently (number).:: - $ sudo damo report heats --heatmap stdout + $ sudo damo report heatmap 22222222222222222222222222222222222222211111111111111111111111111111111111111100 44444444444444444444444444444444444444434444444444444444444444444444444444443200 44444444444444444444444444444444444444433444444444444444444444444444444444444200 @@ -122,6 +173,6 @@ Data Access Pattern Aware Memory Management Below command makes every memory region of size >=4K that has not accessed for >=60 seconds in your workload to be swapped out. :: - $ sudo damo schemes --damos_access_rate 0 0 --damos_sz_region 4K max \ - --damos_age 60s max --damos_action pageout \ - <pid of your workload> + $ sudo damo start --damos_access_rate 0 0 --damos_sz_region 4K max \ + --damos_age 60s max --damos_action pageout \ + <pid of your workload> diff --git a/Documentation/admin-guide/mm/damon/usage.rst b/Documentation/admin-guide/mm/damon/usage.rst index 6fce035fdbf5..d960aba72b82 100644 --- a/Documentation/admin-guide/mm/damon/usage.rst +++ b/Documentation/admin-guide/mm/damon/usage.rst @@ -7,31 +7,25 @@ Detailed Usages DAMON provides below interfaces for different users. - *DAMON user space tool.* - `This <https://github.com/awslabs/damo>`_ is for privileged people such as + `This <https://github.com/damonitor/damo>`_ is for privileged people such as system administrators who want a just-working human-friendly interface. Using this, users can use the DAMON’s major features in a human-friendly way. It may not be highly tuned for special cases, though. For more detail, please refer to its `usage document - <https://github.com/awslabs/damo/blob/next/USAGE.md>`_. + <https://github.com/damonitor/damo/blob/next/USAGE.md>`_. - *sysfs interface.* :ref:`This <sysfs_interface>` is for privileged user space programmers who want more optimized use of DAMON. Using this, users can use DAMON’s major features by reading from and writing to special sysfs files. Therefore, you can write and use your personalized DAMON sysfs wrapper programs that reads/writes the sysfs files instead of you. The `DAMON user space tool - <https://github.com/awslabs/damo>`_ is one example of such programs. + <https://github.com/damonitor/damo>`_ is one example of such programs. - *Kernel Space Programming Interface.* :doc:`This </mm/damon/api>` is for kernel space programmers. Using this, users can utilize every feature of DAMON most flexibly and efficiently by writing kernel space DAMON application programs for you. You can even extend DAMON for various address spaces. For detail, please refer to the interface :doc:`document </mm/damon/api>`. -- *debugfs interface. (DEPRECATED!)* - :ref:`This <debugfs_interface>` is almost identical to :ref:`sysfs interface - <sysfs_interface>`. This is deprecated, so users should move to the - :ref:`sysfs interface <sysfs_interface>`. If you depend on this and cannot - move, please report your usecase to damon@lists.linux.dev and - linux-mm@kvack.org. .. _sysfs_interface: @@ -70,6 +64,7 @@ comma (","). │ │ │ │ :ref:`0 <sysfs_context>`/avail_operations,operations │ │ │ │ │ :ref:`monitoring_attrs <sysfs_monitoring_attrs>`/ │ │ │ │ │ │ intervals/sample_us,aggr_us,update_us + │ │ │ │ │ │ │ intervals_goal/access_bp,aggrs,min_sample_us,max_sample_us │ │ │ │ │ │ nr_regions/min,max │ │ │ │ │ :ref:`targets <sysfs_targets>`/nr_targets │ │ │ │ │ │ :ref:`0 <sysfs_target>`/pid_target @@ -78,7 +73,7 @@ comma (","). │ │ │ │ │ │ │ │ ... │ │ │ │ │ │ ... │ │ │ │ │ :ref:`schemes <sysfs_schemes>`/nr_schemes - │ │ │ │ │ │ :ref:`0 <sysfs_scheme>`/action,apply_interval_us + │ │ │ │ │ │ :ref:`0 <sysfs_scheme>`/action,target_nid,apply_interval_us │ │ │ │ │ │ │ :ref:`access_pattern <sysfs_access_pattern>`/ │ │ │ │ │ │ │ │ sz/min,max │ │ │ │ │ │ │ │ nr_accesses/min,max @@ -86,13 +81,13 @@ comma (","). │ │ │ │ │ │ │ :ref:`quotas <sysfs_quotas>`/ms,bytes,reset_interval_ms,effective_bytes │ │ │ │ │ │ │ │ weights/sz_permil,nr_accesses_permil,age_permil │ │ │ │ │ │ │ │ :ref:`goals <sysfs_schemes_quota_goals>`/nr_goals - │ │ │ │ │ │ │ │ │ 0/target_metric,target_value,current_value + │ │ │ │ │ │ │ │ │ 0/target_metric,target_value,current_value,nid │ │ │ │ │ │ │ :ref:`watermarks <sysfs_watermarks>`/metric,interval_us,high,mid,low - │ │ │ │ │ │ │ :ref:`filters <sysfs_filters>`/nr_filters - │ │ │ │ │ │ │ │ 0/type,matching,memcg_id - │ │ │ │ │ │ │ :ref:`stats <sysfs_schemes_stats>`/nr_tried,sz_tried,nr_applied,sz_applied,qt_exceeds + │ │ │ │ │ │ │ :ref:`{core_,ops_,}filters <sysfs_filters>`/nr_filters + │ │ │ │ │ │ │ │ 0/type,matching,allow,memcg_path,addr_start,addr_end,target_idx,min,max + │ │ │ │ │ │ │ :ref:`stats <sysfs_schemes_stats>`/nr_tried,sz_tried,nr_applied,sz_applied,sz_ops_filter_passed,qt_exceeds │ │ │ │ │ │ │ :ref:`tried_regions <sysfs_schemes_tried_regions>`/total_bytes - │ │ │ │ │ │ │ │ 0/start,end,nr_accesses,age + │ │ │ │ │ │ │ │ 0/start,end,nr_accesses,age,sz_filter_passed │ │ │ │ │ │ │ │ ... │ │ │ │ │ │ ... │ │ │ │ ... @@ -138,6 +133,11 @@ Users can write below commands for the kdamond to the ``state`` file. - ``off``: Stop running. - ``commit``: Read the user inputs in the sysfs files except ``state`` file again. +- ``update_tuned_intervals``: Update the contents of ``sample_us`` and + ``aggr_us`` files of the kdamond with the auto-tuning applied ``sampling + interval`` and ``aggregation interval`` for the files. Please refer to + :ref:`intervals_goal section <damon_usage_sysfs_monitoring_intervals_goal>` + for more details. - ``commit_schemes_quota_goals``: Read the DAMON-based operation schemes' :ref:`quota goals <sysfs_schemes_quota_goals>`. - ``update_schemes_stats``: Update the contents of stats files for each @@ -153,7 +153,7 @@ Users can write below commands for the kdamond to the ``state`` file. - ``clear_schemes_tried_regions``: Clear the DAMON-based operating scheme action tried regions directory for each DAMON-based operation scheme of the kdamond. -- ``update_schemes_effective_bytes``: Update the contents of +- ``update_schemes_effective_quotas``: Update the contents of ``effective_bytes`` files for each DAMON-based operation scheme of the kdamond. For more details, refer to :ref:`quotas directory <sysfs_quotas>`. @@ -219,6 +219,25 @@ writing to and rading from the files. For more details about the intervals and monitoring regions range, please refer to the Design document (:doc:`/mm/damon/design`). +.. _damon_usage_sysfs_monitoring_intervals_goal: + +contexts/<N>/monitoring_attrs/intervals/intervals_goal/ +------------------------------------------------------- + +Under the ``intervals`` directory, one directory for automated tuning of +``sample_us`` and ``aggr_us``, namely ``intervals_goal`` directory also exists. +Under the directory, four files for the auto-tuning control, namely +``access_bp``, ``aggrs``, ``min_sample_us`` and ``max_sample_us`` exist. +Please refer to the :ref:`design document of the feature +<damon_design_monitoring_intervals_autotuning>` for the internal of the tuning +mechanism. Reading and writing the four files under ``intervals_goal`` +directory shows and updates the tuning parameters that described in the +:ref:design doc <damon_design_monitoring_intervals_autotuning>` with the same +names. The tuning starts with the user-set ``sample_us`` and ``aggr_us``. The +tuning-applied current values of the two intervals can be read from the +``sample_us`` and ``aggr_us`` files after writing ``update_tuned_intervals`` to +the ``state`` file. + .. _sysfs_targets: contexts/<N>/targets/ @@ -288,15 +307,20 @@ to ``N-1``. Each directory represents each DAMON-based operation scheme. schemes/<N>/ ------------ -In each scheme directory, five directories (``access_pattern``, ``quotas``, -``watermarks``, ``filters``, ``stats``, and ``tried_regions``) and two files -(``action`` and ``apply_interval``) exist. +In each scheme directory, seven directories (``access_pattern``, ``quotas``, +``watermarks``, ``core_filters``, ``ops_filters``, ``filters``, ``stats``, and +``tried_regions``) and three files (``action``, ``target_nid`` and +``apply_interval``) exist. The ``action`` file is for setting and getting the scheme's :ref:`action <damon_design_damos_action>`. The keywords that can be written to and read from the file and their meaning are same to those of the list on :ref:`design doc <damon_design_damos_action>`. +The ``target_nid`` file is for setting the migration target node, which is +only meaningful when the ``action`` is either ``migrate_hot`` or +``migrate_cold``. + The ``apply_interval_us`` file is for setting and getting the scheme's :ref:`apply_interval <damon_design_damos>` in microseconds. @@ -342,7 +366,7 @@ Based on the user-specified :ref:`goal <sysfs_schemes_quota_goals>`, the effective size quota is further adjusted. Reading ``effective_bytes`` returns the current effective size quota. The file is not updated in real time, so users should ask DAMON sysfs interface to update the content of the file for -the stats by writing a special keyword, ``update_schemes_effective_bytes`` to +the stats by writing a special keyword, ``update_schemes_effective_quotas`` to the relevant ``kdamonds/<N>/state`` file. Under ``weights`` directory, three files (``sz_permil``, @@ -366,11 +390,11 @@ number (``N``) to the file creates the number of child directories named ``0`` to ``N-1``. Each directory represents each goal and current achievement. Among the multiple feedback, the best one is used. -Each goal directory contains three files, namely ``target_metric``, -``target_value`` and ``current_value``. Users can set and get the three -parameters for the quota auto-tuning goals that specified on the :ref:`design -doc <damon_design_damos_quotas_auto_tuning>` by writing to and reading from each -of the files. Note that users should further write +Each goal directory contains four files, namely ``target_metric``, +``target_value``, ``current_value`` and ``nid``. Users can set and get the +four parameters for the quota auto-tuning goals that specified on the +:ref:`design doc <damon_design_damos_quotas_auto_tuning>` by writing to and +reading from each of the files. Note that users should further write ``commit_schemes_quota_goals`` to the ``state`` file of the :ref:`kdamond directory <sysfs_kdamond>` to pass the feedback to DAMON. @@ -397,70 +421,84 @@ The ``interval`` should written in microseconds unit. .. _sysfs_filters: -schemes/<N>/filters/ --------------------- +schemes/<N>/{core\_,ops\_,}filters/ +----------------------------------- -The directory for the :ref:`filters <damon_design_damos_filters>` of the given +Directories for :ref:`filters <damon_design_damos_filters>` of the given DAMON-based operation scheme. -In the beginning, this directory has only one file, ``nr_filters``. Writing a +``core_filters`` and ``ops_filters`` directories are for the filters handled by +the DAMON core layer and operations set layer, respectively. ``filters`` +directory can be used for installing filters regardless of their handled +layers. Filters that requested by ``core_filters`` and ``ops_filters`` will be +installed before those of ``filters``. All three directories have same files. + +Use of ``filters`` directory can make expecting evaluation orders of given +filters with the files under directory bit confusing. Users are hence +recommended to use ``core_filters`` and ``ops_filters`` directories. The +``filters`` directory could be deprecated in future. + +In the beginning, the directory has only one file, ``nr_filters``. Writing a number (``N``) to the file creates the number of child directories named ``0`` to ``N-1``. Each directory represents each filter. The filters are evaluated in the numeric order. -Each filter directory contains six files, namely ``type``, ``matcing``, -``memcg_path``, ``addr_start``, ``addr_end``, and ``target_idx``. To ``type`` -file, you can write one of four special keywords: ``anon`` for anonymous pages, -``memcg`` for specific memory cgroup, ``addr`` for specific address range (an -open-ended interval), or ``target`` for specific DAMON monitoring target -filtering. In case of the memory cgroup filtering, you can specify the memory -cgroup of the interest by writing the path of the memory cgroup from the -cgroups mount point to ``memcg_path`` file. In case of the address range -filtering, you can specify the start and end address of the range to -``addr_start`` and ``addr_end`` files, respectively. For the DAMON monitoring -target filtering, you can specify the index of the target between the list of -the DAMON context's monitoring targets list to ``target_idx`` file. You can -write ``Y`` or ``N`` to ``matching`` file to filter out pages that does or does -not match to the type, respectively. Then, the scheme's action will not be -applied to the pages that specified to be filtered out. +Each filter directory contains nine files, namely ``type``, ``matching``, +``allow``, ``memcg_path``, ``addr_start``, ``addr_end``, ``min``, ``max`` +and ``target_idx``. To ``type`` file, you can write the type of the filter. +Refer to :ref:`the design doc <damon_design_damos_filters>` for available type +names, their meaning and on what layer those are handled. + +For ``memcg`` type, you can specify the memory cgroup of the interest by +writing the path of the memory cgroup from the cgroups mount point to +``memcg_path`` file. For ``addr`` type, you can specify the start and end +address of the range (open-ended interval) to ``addr_start`` and ``addr_end`` +files, respectively. For ``hugepage_size`` type, you can specify the minimum +and maximum size of the range (closed interval) to ``min`` and ``max`` files, +respectively. For ``target`` type, you can specify the index of the target +between the list of the DAMON context's monitoring targets list to +``target_idx`` file. + +You can write ``Y`` or ``N`` to ``matching`` file to specify whether the filter +is for memory that matches the ``type``. You can write ``Y`` or ``N`` to +``allow`` file to specify if applying the action to the memory that satisfies +the ``type`` and ``matching`` should be allowed or not. For example, below restricts a DAMOS action to be applied to only non-anonymous pages of all memory cgroups except ``/having_care_already``.:: + # cd ops_filters/0/ # echo 2 > nr_filters - # # filter out anonymous pages + # # disallow anonymous pages echo anon > 0/type echo Y > 0/matching + echo N > 0/allow # # further filter out all cgroups except one at '/having_care_already' echo memcg > 1/type echo /having_care_already > 1/memcg_path - echo N > 1/matching - -Note that ``anon`` and ``memcg`` filters are currently supported only when -``paddr`` :ref:`implementation <sysfs_context>` is being used. + echo Y > 1/matching + echo N > 1/allow -Also, memory regions that are filtered out by ``addr`` or ``target`` filters -are not counted as the scheme has tried to those, while regions that filtered -out by other type filters are counted as the scheme has tried to. The -difference is applied to :ref:`stats <damos_stats>` and -:ref:`tried regions <sysfs_schemes_tried_regions>`. +Refer to the :ref:`DAMOS filters design documentation +<damon_design_damos_filters>` for more details including how multiple filters +of different ``allow`` works, when each of the filters are supported, and +differences on stats. .. _sysfs_schemes_stats: schemes/<N>/stats/ ------------------ -DAMON counts the total number and bytes of regions that each scheme is tried to -be applied, the two numbers for the regions that each scheme is successfully -applied, and the total number of the quota limit exceeds. This statistics can -be used for online analysis or tuning of the schemes. +DAMON counts statistics for each scheme. This statistics can be used for +online analysis or tuning of the schemes. Refer to :ref:`design doc +<damon_design_damos_stat>` for more details about the stats. The statistics can be retrieved by reading the files under ``stats`` directory -(``nr_tried``, ``sz_tried``, ``nr_applied``, ``sz_applied``, and -``qt_exceeds``), respectively. The files are not updated in real time, so you -should ask DAMON sysfs interface to update the content of the files for the -stats by writing a special keyword, ``update_schemes_stats`` to the relevant -``kdamonds/<N>/state`` file. +(``nr_tried``, ``sz_tried``, ``nr_applied``, ``sz_applied``, +``sz_ops_filter_passed``, and ``qt_exceeds``), respectively. The files are not +updated in real time, so you should ask DAMON sysfs interface to update the +content of the files for the stats by writing a special keyword, +``update_schemes_stats`` to the relevant ``kdamonds/<N>/state`` file. .. _sysfs_schemes_tried_regions: @@ -497,10 +535,10 @@ set the ``access pattern`` as their interested pattern that they want to query. tried_regions/<N>/ ------------------ -In each region directory, you will find four files (``start``, ``end``, -``nr_accesses``, and ``age``). Reading the files will show the start and end -addresses, ``nr_accesses``, and ``age`` of the region that corresponding -DAMON-based operation scheme ``action`` has tried to be applied. +In each region directory, you will find five files (``start``, ``end``, +``nr_accesses``, ``age``, and ``sz_filter_passed``). Reading the files will +show the properties of the region that corresponding DAMON-based operation +scheme ``action`` has tried to be applied. Example ~~~~~~~ @@ -539,7 +577,7 @@ memory rate becomes larger than 60%, or lower than 30%". :: # echo 300 > watermarks/low Please note that it's highly recommended to use user space tools like `damo -<https://github.com/awslabs/damo>`_ rather than manually reading and writing +<https://github.com/damonitor/damo>`_ rather than manually reading and writing the files as above. Above is only for an example. .. _tracepoint: @@ -596,306 +634,3 @@ fields are as usual. It shows the index of the DAMON context (``ctx_idx=X``) of the scheme in the list of the contexts of the context's kdamond, the index of the scheme (``scheme_idx=X``) in the list of the schemes of the context, in addition to the output of ``damon_aggregated`` tracepoint. - - -.. _debugfs_interface: - -debugfs Interface (DEPRECATED!) -=============================== - -.. note:: - - THIS IS DEPRECATED! - - DAMON debugfs interface is deprecated, so users should move to the - :ref:`sysfs interface <sysfs_interface>`. If you depend on this and cannot - move, please report your usecase to damon@lists.linux.dev and - linux-mm@kvack.org. - -DAMON exports nine files, ``DEPRECATED``, ``attrs``, ``target_ids``, -``init_regions``, ``schemes``, ``monitor_on_DEPRECATED``, ``kdamond_pid``, -``mk_contexts`` and ``rm_contexts`` under its debugfs directory, -``<debugfs>/damon/``. - - -``DEPRECATED`` is a read-only file for the DAMON debugfs interface deprecation -notice. Reading it returns the deprecation notice, as below:: - - # cat DEPRECATED - DAMON debugfs interface is deprecated, so users should move to DAMON_SYSFS. If you cannot, please report your usecase to damon@lists.linux.dev and linux-mm@kvack.org. - - -Attributes ----------- - -Users can get and set the ``sampling interval``, ``aggregation interval``, -``update interval``, and min/max number of monitoring target regions by -reading from and writing to the ``attrs`` file. To know about the monitoring -attributes in detail, please refer to the :doc:`/mm/damon/design`. For -example, below commands set those values to 5 ms, 100 ms, 1,000 ms, 10 and -1000, and then check it again:: - - # cd <debugfs>/damon - # echo 5000 100000 1000000 10 1000 > attrs - # cat attrs - 5000 100000 1000000 10 1000 - - -Target IDs ----------- - -Some types of address spaces supports multiple monitoring target. For example, -the virtual memory address spaces monitoring can have multiple processes as the -monitoring targets. Users can set the targets by writing relevant id values of -the targets to, and get the ids of the current targets by reading from the -``target_ids`` file. In case of the virtual address spaces monitoring, the -values should be pids of the monitoring target processes. For example, below -commands set processes having pids 42 and 4242 as the monitoring targets and -check it again:: - - # cd <debugfs>/damon - # echo 42 4242 > target_ids - # cat target_ids - 42 4242 - -Users can also monitor the physical memory address space of the system by -writing a special keyword, "``paddr\n``" to the file. Because physical address -space monitoring doesn't support multiple targets, reading the file will show a -fake value, ``42``, as below:: - - # cd <debugfs>/damon - # echo paddr > target_ids - # cat target_ids - 42 - -Note that setting the target ids doesn't start the monitoring. - - -Initial Monitoring Target Regions ---------------------------------- - -In case of the virtual address space monitoring, DAMON automatically sets and -updates the monitoring target regions so that entire memory mappings of target -processes can be covered. However, users can want to limit the monitoring -region to specific address ranges, such as the heap, the stack, or specific -file-mapped area. Or, some users can know the initial access pattern of their -workloads and therefore want to set optimal initial regions for the 'adaptive -regions adjustment'. - -In contrast, DAMON do not automatically sets and updates the monitoring target -regions in case of physical memory monitoring. Therefore, users should set the -monitoring target regions by themselves. - -In such cases, users can explicitly set the initial monitoring target regions -as they want, by writing proper values to the ``init_regions`` file. The input -should be a sequence of three integers separated by white spaces that represent -one region in below form.:: - - <target idx> <start address> <end address> - -The ``target idx`` should be the index of the target in ``target_ids`` file, -starting from ``0``, and the regions should be passed in address order. For -example, below commands will set a couple of address ranges, ``1-100`` and -``100-200`` as the initial monitoring target region of pid 42, which is the -first one (index ``0``) in ``target_ids``, and another couple of address -ranges, ``20-40`` and ``50-100`` as that of pid 4242, which is the second one -(index ``1``) in ``target_ids``.:: - - # cd <debugfs>/damon - # cat target_ids - 42 4242 - # echo "0 1 100 \ - 0 100 200 \ - 1 20 40 \ - 1 50 100" > init_regions - -Note that this sets the initial monitoring target regions only. In case of -virtual memory monitoring, DAMON will automatically updates the boundary of the -regions after one ``update interval``. Therefore, users should set the -``update interval`` large enough in this case, if they don't want the -update. - - -Schemes -------- - -Users can get and set the DAMON-based operation :ref:`schemes -<damon_design_damos>` by reading from and writing to ``schemes`` debugfs file. -Reading the file also shows the statistics of each scheme. To the file, each -of the schemes should be represented in each line in below form:: - - <target access pattern> <action> <quota> <watermarks> - -You can disable schemes by simply writing an empty string to the file. - -Target Access Pattern -~~~~~~~~~~~~~~~~~~~~~ - -The target access :ref:`pattern <damon_design_damos_access_pattern>` of the -scheme. The ``<target access pattern>`` is constructed with three ranges in -below form:: - - min-size max-size min-acc max-acc min-age max-age - -Specifically, bytes for the size of regions (``min-size`` and ``max-size``), -number of monitored accesses per aggregate interval for access frequency -(``min-acc`` and ``max-acc``), number of aggregate intervals for the age of -regions (``min-age`` and ``max-age``) are specified. Note that the ranges are -closed interval. - -Action -~~~~~~ - -The ``<action>`` is a predefined integer for memory management :ref:`actions -<damon_design_damos_action>`. The mapping between the ``<action>`` values and -the memory management actions is as below. For the detailed meaning of the -action and DAMON operations set supporting each action, please refer to the -list on :ref:`design doc <damon_design_damos_action>`. - - - 0: ``willneed`` - - 1: ``cold`` - - 2: ``pageout`` - - 3: ``hugepage`` - - 4: ``nohugepage`` - - 5: ``stat`` - -Quota -~~~~~ - -Users can set the :ref:`quotas <damon_design_damos_quotas>` of the given scheme -via the ``<quota>`` in below form:: - - <ms> <sz> <reset interval> <priority weights> - -This makes DAMON to try to use only up to ``<ms>`` milliseconds for applying -the action to memory regions of the ``target access pattern`` within the -``<reset interval>`` milliseconds, and to apply the action to only up to -``<sz>`` bytes of memory regions within the ``<reset interval>``. Setting both -``<ms>`` and ``<sz>`` zero disables the quota limits. - -For the :ref:`prioritization <damon_design_damos_quotas_prioritization>`, users -can set the weights for the three properties in ``<priority weights>`` in below -form:: - - <size weight> <access frequency weight> <age weight> - -Watermarks -~~~~~~~~~~ - -Users can specify :ref:`watermarks <damon_design_damos_watermarks>` of the -given scheme via ``<watermarks>`` in below form:: - - <metric> <check interval> <high mark> <middle mark> <low mark> - -``<metric>`` is a predefined integer for the metric to be checked. The -supported numbers and their meanings are as below. - - - 0: Ignore the watermarks - - 1: System's free memory rate (per thousand) - -The value of the metric is checked every ``<check interval>`` microseconds. - -If the value is higher than ``<high mark>`` or lower than ``<low mark>``, the -scheme is deactivated. If the value is lower than ``<mid mark>``, the scheme -is activated. - -.. _damos_stats: - -Statistics -~~~~~~~~~~ - -It also counts the total number and bytes of regions that each scheme is tried -to be applied, the two numbers for the regions that each scheme is successfully -applied, and the total number of the quota limit exceeds. This statistics can -be used for online analysis or tuning of the schemes. - -The statistics can be shown by reading the ``schemes`` file. Reading the file -will show each scheme you entered in each line, and the five numbers for the -statistics will be added at the end of each line. - -Example -~~~~~~~ - -Below commands applies a scheme saying "If a memory region of size in [4KiB, -8KiB] is showing accesses per aggregate interval in [0, 5] for aggregate -interval in [10, 20], page out the region. For the paging out, use only up to -10ms per second, and also don't page out more than 1GiB per second. Under the -limitation, page out memory regions having longer age first. Also, check the -free memory rate of the system every 5 seconds, start the monitoring and paging -out when the free memory rate becomes lower than 50%, but stop it if the free -memory rate becomes larger than 60%, or lower than 30%".:: - - # cd <debugfs>/damon - # scheme="4096 8192 0 5 10 20 2" # target access pattern and action - # scheme+=" 10 $((1024*1024*1024)) 1000" # quotas - # scheme+=" 0 0 100" # prioritization weights - # scheme+=" 1 5000000 600 500 300" # watermarks - # echo "$scheme" > schemes - - -Turning On/Off --------------- - -Setting the files as described above doesn't incur effect unless you explicitly -start the monitoring. You can start, stop, and check the current status of the -monitoring by writing to and reading from the ``monitor_on_DEPRECATED`` file. -Writing ``on`` to the file starts the monitoring of the targets with the -attributes. Writing ``off`` to the file stops those. DAMON also stops if -every target process is terminated. Below example commands turn on, off, and -check the status of DAMON:: - - # cd <debugfs>/damon - # echo on > monitor_on_DEPRECATED - # echo off > monitor_on_DEPRECATED - # cat monitor_on_DEPRECATED - off - -Please note that you cannot write to the above-mentioned debugfs files while -the monitoring is turned on. If you write to the files while DAMON is running, -an error code such as ``-EBUSY`` will be returned. - - -Monitoring Thread PID ---------------------- - -DAMON does requested monitoring with a kernel thread called ``kdamond``. You -can get the pid of the thread by reading the ``kdamond_pid`` file. When the -monitoring is turned off, reading the file returns ``none``. :: - - # cd <debugfs>/damon - # cat monitor_on_DEPRECATED - off - # cat kdamond_pid - none - # echo on > monitor_on_DEPRECATED - # cat kdamond_pid - 18594 - - -Using Multiple Monitoring Threads ---------------------------------- - -One ``kdamond`` thread is created for each monitoring context. You can create -and remove monitoring contexts for multiple ``kdamond`` required use case using -the ``mk_contexts`` and ``rm_contexts`` files. - -Writing the name of the new context to the ``mk_contexts`` file creates a -directory of the name on the DAMON debugfs directory. The directory will have -DAMON debugfs files for the context. :: - - # cd <debugfs>/damon - # ls foo - # ls: cannot access 'foo': No such file or directory - # echo foo > mk_contexts - # ls foo - # attrs init_regions kdamond_pid schemes target_ids - -If the context is not needed anymore, you can remove it and the corresponding -directory by putting the name of the context to the ``rm_contexts`` file. :: - - # echo foo > rm_contexts - # ls foo - # ls: cannot access 'foo': No such file or directory - -Note that ``mk_contexts``, ``rm_contexts``, and ``monitor_on_DEPRECATED`` files -are in the root directory only. diff --git a/Documentation/admin-guide/mm/hugetlbpage.rst b/Documentation/admin-guide/mm/hugetlbpage.rst index e4d4b4a8dc97..67a941903fd2 100644 --- a/Documentation/admin-guide/mm/hugetlbpage.rst +++ b/Documentation/admin-guide/mm/hugetlbpage.rst @@ -145,7 +145,17 @@ hugepages It will allocate 1 2M hugepage on node0 and 2 2M hugepages on node1. If the node number is invalid, the parameter will be ignored. +hugepage_alloc_threads + Specify the number of threads that should be used to allocate hugepages + during boot. This parameter can be used to improve system bootup time + when allocating a large amount of huge pages. + The default value is 25% of the available hardware threads. + Example to use 8 allocation threads:: + + hugepage_alloc_threads=8 + + Note that this parameter only applies to non-gigantic huge pages. default_hugepagesz Specify the default huge page size. This parameter can only be specified once on the command line. default_hugepagesz can @@ -376,6 +386,13 @@ Note that the number of overcommit and reserve pages remain global quantities, as we don't know until fault time, when the faulting task's mempolicy is applied, from which node the huge page allocation will be attempted. +The hugetlb may be migrated between the per-node hugepages pool in the following +scenarios: memory offline, memory failure, longterm pinning, syscalls(mbind, +migrate_pages and move_pages), alloc_contig_range() and alloc_contig_pages(). +Now only memory offline, memory failure and syscalls allow fallbacking to allocate +a new hugetlb on a different node if the current node is unable to allocate during +hugetlb migration, that means these 3 cases can break the per-node hugepages pool. + .. _using_huge_pages: Using Huge Pages diff --git a/Documentation/admin-guide/mm/index.rst b/Documentation/admin-guide/mm/index.rst index 1f883abf3f00..2d2f6c222308 100644 --- a/Documentation/admin-guide/mm/index.rst +++ b/Documentation/admin-guide/mm/index.rst @@ -10,7 +10,7 @@ processes address space and many other cool things. Linux memory management is a complex system with many configurable settings. Most of these settings are available via ``/proc`` -filesystem and can be quired and adjusted using ``sysctl``. These APIs +filesystem and can be queried and adjusted using ``sysctl``. These APIs are described in Documentation/admin-guide/sysctl/vm.rst and in `man 5 proc`_. .. _man 5 proc: http://man7.org/linux/man-pages/man5/proc.5.html @@ -42,3 +42,4 @@ the Linux memory management. transhuge userfaultfd zswap + kho diff --git a/Documentation/admin-guide/mm/kho.rst b/Documentation/admin-guide/mm/kho.rst new file mode 100644 index 000000000000..6dc18ed4b886 --- /dev/null +++ b/Documentation/admin-guide/mm/kho.rst @@ -0,0 +1,115 @@ +.. SPDX-License-Identifier: GPL-2.0-or-later + +==================== +Kexec Handover Usage +==================== + +Kexec HandOver (KHO) is a mechanism that allows Linux to preserve memory +regions, which could contain serialized system states, across kexec. + +This document expects that you are familiar with the base KHO +:ref:`concepts <kho-concepts>`. If you have not read +them yet, please do so now. + +Prerequisites +============= + +KHO is available when the kernel is compiled with ``CONFIG_KEXEC_HANDOVER`` +set to y. Every KHO producer may have its own config option that you +need to enable if you would like to preserve their respective state across +kexec. + +To use KHO, please boot the kernel with the ``kho=on`` command line +parameter. You may use ``kho_scratch`` parameter to define size of the +scratch regions. For example ``kho_scratch=16M,512M,256M`` will reserve a +16 MiB low memory scratch area, a 512 MiB global scratch region, and 256 MiB +per NUMA node scratch regions on boot. + +Perform a KHO kexec +=================== + +First, before you perform a KHO kexec, you need to move the system into +the :ref:`KHO finalization phase <kho-finalization-phase>` :: + + $ echo 1 > /sys/kernel/debug/kho/out/finalize + +After this command, the KHO FDT is available in +``/sys/kernel/debug/kho/out/fdt``. Other subsystems may also register +their own preserved sub FDTs under +``/sys/kernel/debug/kho/out/sub_fdts/``. + +Next, load the target payload and kexec into it. It is important that you +use the ``-s`` parameter to use the in-kernel kexec file loader, as user +space kexec tooling currently has no support for KHO with the user space +based file loader :: + + # kexec -l /path/to/bzImage --initrd /path/to/initrd -s + # kexec -e + +The new kernel will boot up and contain some of the previous kernel's state. + +For example, if you used ``reserve_mem`` command line parameter to create +an early memory reservation, the new kernel will have that memory at the +same physical address as the old kernel. + +Abort a KHO exec +================ + +You can move the system out of KHO finalization phase again by calling :: + + $ echo 0 > /sys/kernel/debug/kho/out/active + +After this command, the KHO FDT is no longer available in +``/sys/kernel/debug/kho/out/fdt``. + +debugfs Interfaces +================== + +Currently KHO creates the following debugfs interfaces. Notice that these +interfaces may change in the future. They will be moved to sysfs once KHO is +stabilized. + +``/sys/kernel/debug/kho/out/finalize`` + Kexec HandOver (KHO) allows Linux to transition the state of + compatible drivers into the next kexec'ed kernel. To do so, + device drivers will instruct KHO to preserve memory regions, + which could contain serialized kernel state. + While the state is serialized, they are unable to perform + any modifications to state that was serialized, such as + handed over memory allocations. + + When this file contains "1", the system is in the transition + state. When contains "0", it is not. To switch between the + two states, echo the respective number into this file. + +``/sys/kernel/debug/kho/out/fdt`` + When KHO state tree is finalized, the kernel exposes the + flattened device tree blob that carries its current KHO + state in this file. Kexec user space tooling can use this + as input file for the KHO payload image. + +``/sys/kernel/debug/kho/out/scratch_len`` + Lengths of KHO scratch regions, which are physically contiguous + memory regions that will always stay available for future kexec + allocations. Kexec user space tools can use this file to determine + where it should place its payload images. + +``/sys/kernel/debug/kho/out/scratch_phys`` + Physical locations of KHO scratch regions. Kexec user space tools + can use this file in conjunction to scratch_phys to determine where + it should place its payload images. + +``/sys/kernel/debug/kho/out/sub_fdts/`` + In the KHO finalization phase, KHO producers register their own + FDT blob under this directory. + +``/sys/kernel/debug/kho/in/fdt`` + When the kernel was booted with Kexec HandOver (KHO), + the state tree that carries metadata about the previous + kernel's state is in this file in the format of flattened + device tree. This file may disappear when all consumers of + it finished to interpret their metadata. + +``/sys/kernel/debug/kho/in/sub_fdts/`` + Similar to ``kho/out/sub_fdts/``, but contains sub FDT blobs + of KHO producers passed from the old kernel. diff --git a/Documentation/admin-guide/mm/ksm.rst b/Documentation/admin-guide/mm/ksm.rst index a639cac12477..ad8e7a41f3b5 100644 --- a/Documentation/admin-guide/mm/ksm.rst +++ b/Documentation/admin-guide/mm/ksm.rst @@ -308,7 +308,7 @@ limited by the ``advisor_max_cpu`` parameter. In addition there is also the ``advisor_target_scan_time`` parameter. This parameter sets the target time to scan all the KSM candidate pages. The parameter ``advisor_target_scan_time`` decides how aggressive the scan time advisor scans candidate pages. Lower -values make the scan time advisor to scan more aggresively. This is the most +values make the scan time advisor to scan more aggressively. This is the most important parameter for the configuration of the scan time advisor. The initial value and the maximum value can be changed with diff --git a/Documentation/admin-guide/mm/memory-hotplug.rst b/Documentation/admin-guide/mm/memory-hotplug.rst index 098f14d83e99..33c886f3d198 100644 --- a/Documentation/admin-guide/mm/memory-hotplug.rst +++ b/Documentation/admin-guide/mm/memory-hotplug.rst @@ -280,8 +280,8 @@ The following files are currently defined: blocks; configure auto-onlining. The default value depends on the - CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE kernel configuration - option. + CONFIG_MHP_DEFAULT_ONLINE_TYPE kernel configuration + options. See the ``state`` property of memory blocks for details. ``block_size_bytes`` read-only: the size in bytes of a memory block. @@ -294,8 +294,9 @@ The following files are currently defined: ``crash_hotplug`` read-only: when changes to the system memory map occur due to hot un/plug of memory, this file contains '1' if the kernel updates the kdump capture kernel memory - map itself (via elfcorehdr), or '0' if userspace must update - the kdump capture kernel memory map. + map itself (via elfcorehdr and other relevant kexec + segments), or '0' if userspace must update the kdump + capture kernel memory map. Availability depends on the CONFIG_MEMORY_HOTPLUG kernel configuration option. diff --git a/Documentation/admin-guide/mm/multigen_lru.rst b/Documentation/admin-guide/mm/multigen_lru.rst index 33e068830497..9cb54b4ff5d9 100644 --- a/Documentation/admin-guide/mm/multigen_lru.rst +++ b/Documentation/admin-guide/mm/multigen_lru.rst @@ -151,8 +151,9 @@ generations less than or equal to ``min_gen_nr``. ``min_gen_nr`` should be less than ``max_gen_nr-1``, since ``max_gen_nr`` and ``max_gen_nr-1`` are not fully aged (equivalent to the active list) and therefore cannot be evicted. ``swappiness`` -overrides the default value in ``/proc/sys/vm/swappiness``. -``nr_to_reclaim`` limits the number of pages to evict. +overrides the default value in ``/proc/sys/vm/swappiness`` and the valid +range is [0-200, max], with max being exclusively used for the reclamation +of anonymous memory. ``nr_to_reclaim`` limits the number of pages to evict. A typical use case is that a job scheduler runs this command before it tries to land a new job on a server. If it fails to materialize enough diff --git a/Documentation/admin-guide/mm/pagemap.rst b/Documentation/admin-guide/mm/pagemap.rst index f5f065c67615..e60e9211fd9b 100644 --- a/Documentation/admin-guide/mm/pagemap.rst +++ b/Documentation/admin-guide/mm/pagemap.rst @@ -21,7 +21,8 @@ There are four components to pagemap: * Bit 56 page exclusively mapped (since 4.2) * Bit 57 pte is uffd-wp write-protected (since 5.13) (see Documentation/admin-guide/mm/userfaultfd.rst) - * Bits 58-60 zero + * Bit 58 pte is a guard region (since 6.15) (see madvise (2) man page) + * Bits 59-60 zero * Bit 61 page is file-page or shared-anon (since 3.5) * Bit 62 page swapped * Bit 63 page present @@ -37,12 +38,28 @@ There are four components to pagemap: precisely which pages are mapped (or in swap) and comparing mapped pages between processes. + Traditionally, bit 56 indicates that a page is mapped exactly once and bit + 56 is clear when a page is mapped multiple times, even when mapped in the + same process multiple times. In some kernel configurations, the semantics + for pages part of a larger allocation (e.g., THP) can differ: bit 56 is set + if all pages part of the corresponding large allocation are *certainly* + mapped in the same process, even if the page is mapped multiple times in that + process. Bit 56 is clear when any page page of the larger allocation + is *maybe* mapped in a different process. In some cases, a large allocation + might be treated as "maybe mapped by multiple processes" even though this + is no longer the case. + Efficient users of this interface will use ``/proc/pid/maps`` to determine which areas of memory are actually mapped and llseek to skip over unmapped regions. * ``/proc/kpagecount``. This file contains a 64-bit count of the number of - times each page is mapped, indexed by PFN. + times each page is mapped, indexed by PFN. Some kernel configurations do + not track the precise number of times a page part of a larger allocation + (e.g., THP) is mapped. In these configurations, the average number of + mappings per page in this larger allocation is returned instead. However, + if any page of the large allocation is mapped, the returned value will + be at least 1. The page-types tool in the tools/mm directory can be used to query the number of times a page is mapped. @@ -118,7 +135,7 @@ Short descriptions to the page flags 21 - KSM Identical memory pages dynamically shared between one or more processes. 22 - THP - Contiguous pages which construct transparent hugepages. + Contiguous pages which construct THP of any size and mapped by any granularity. 23 - OFFLINE The page is logically offline. 24 - ZERO_PAGE @@ -173,27 +190,6 @@ LRU related page flags The page-types tool in the tools/mm directory can be used to query the above flags. -Using pagemap to do something useful -==================================== - -The general procedure for using pagemap to find out about a process' memory -usage goes like this: - - 1. Read ``/proc/pid/maps`` to determine which parts of the memory space are - mapped to what. - 2. Select the maps you are interested in -- all of them, or a particular - library, or the stack or the heap, etc. - 3. Open ``/proc/pid/pagemap`` and seek to the pages you would like to examine. - 4. Read a u64 for each page from pagemap. - 5. Open ``/proc/kpagecount`` and/or ``/proc/kpageflags``. For each PFN you - just read, seek to that entry in the file, and read the data you want. - -For example, to find the "unique set size" (USS), which is the amount of -memory that a process is using that is not shared with any other process, -you can go through every map in the process, find the PFNs, look those up -in kpagecount, and tally up the number of pages that are only referenced -once. - Exceptions for Shared Memory ============================ @@ -252,8 +248,9 @@ Following flags about pages are currently supported: - ``PAGE_IS_PRESENT`` - Page is present in the memory - ``PAGE_IS_SWAPPED`` - Page is in swapped - ``PAGE_IS_PFNZERO`` - Page has zero PFN -- ``PAGE_IS_HUGE`` - Page is THP or Hugetlb backed +- ``PAGE_IS_HUGE`` - Page is PMD-mapped THP or Hugetlb backed - ``PAGE_IS_SOFT_DIRTY`` - Page is soft-dirty +- ``PAGE_IS_GUARD`` - Page is a part of a guard region The ``struct pm_scan_arg`` is used as the argument of the IOCTL. diff --git a/Documentation/admin-guide/mm/transhuge.rst b/Documentation/admin-guide/mm/transhuge.rst index 04eb45a2f940..dff8d5985f0f 100644 --- a/Documentation/admin-guide/mm/transhuge.rst +++ b/Documentation/admin-guide/mm/transhuge.rst @@ -202,12 +202,21 @@ PMD-mappable transparent hugepage:: cat /sys/kernel/mm/transparent_hugepage/hpage_pmd_size -khugepaged will be automatically started when one or more hugepage -sizes are enabled (either by directly setting "always" or "madvise", -or by setting "inherit" while the top-level enabled is set to "always" -or "madvise"), and it'll be automatically shutdown when the last -hugepage size is disabled (either by directly setting "never", or by -setting "inherit" while the top-level enabled is set to "never"). +All THPs at fault and collapse time will be added to _deferred_list, +and will therefore be split under memory presure if they are considered +"underused". A THP is underused if the number of zero-filled pages in +the THP is above max_ptes_none (see below). It is possible to disable +this behaviour by writing 0 to shrink_underused, and enable it by writing +1 to it:: + + echo 0 > /sys/kernel/mm/transparent_hugepage/shrink_underused + echo 1 > /sys/kernel/mm/transparent_hugepage/shrink_underused + +khugepaged will be automatically started when PMD-sized THP is enabled +(either of the per-size anon control or the top-level control are set +to "always" or "madvise"), and it'll be automatically shutdown when +PMD-sized THP is disabled (when both the per-size anon control and the +top-level control are "never") Khugepaged controls ------------------- @@ -278,25 +287,92 @@ collapsed, resulting fewer pages being collapsed into THPs, and lower memory access performance. ``max_ptes_shared`` specifies how many pages can be shared across multiple -processes. Exceeding the number would block the collapse:: +processes. khugepaged might treat pages of THPs as shared if any page of +that THP is shared. Exceeding the number would block the collapse:: /sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_shared A higher value may increase memory footprint for some workloads. -Boot parameter -============== - -You can change the sysfs boot time defaults of Transparent Hugepage -Support by passing the parameter ``transparent_hugepage=always`` or -``transparent_hugepage=madvise`` or ``transparent_hugepage=never`` -to the kernel command line. +Boot parameters +=============== + +You can change the sysfs boot time default for the top-level "enabled" +control by passing the parameter ``transparent_hugepage=always`` or +``transparent_hugepage=madvise`` or ``transparent_hugepage=never`` to the +kernel command line. + +Alternatively, each supported anonymous THP size can be controlled by +passing ``thp_anon=<size>[KMG],<size>[KMG]:<state>;<size>[KMG]-<size>[KMG]:<state>``, +where ``<size>`` is the THP size (must be a power of 2 of PAGE_SIZE and +supported anonymous THP) and ``<state>`` is one of ``always``, ``madvise``, +``never`` or ``inherit``. + +For example, the following will set 16K, 32K, 64K THP to ``always``, +set 128K, 512K to ``inherit``, set 256K to ``madvise`` and 1M, 2M +to ``never``:: + + thp_anon=16K-64K:always;128K,512K:inherit;256K:madvise;1M-2M:never + +``thp_anon=`` may be specified multiple times to configure all THP sizes as +required. If ``thp_anon=`` is specified at least once, any anon THP sizes +not explicitly configured on the command line are implicitly set to +``never``. + +``transparent_hugepage`` setting only affects the global toggle. If +``thp_anon`` is not specified, PMD_ORDER THP will default to ``inherit``. +However, if a valid ``thp_anon`` setting is provided by the user, the +PMD_ORDER THP policy will be overridden. If the policy for PMD_ORDER +is not defined within a valid ``thp_anon``, its policy will default to +``never``. + +Similarly to ``transparent_hugepage``, you can control the hugepage +allocation policy for the internal shmem mount by using the kernel parameter +``transparent_hugepage_shmem=<policy>``, where ``<policy>`` is one of the +seven valid policies for shmem (``always``, ``within_size``, ``advise``, +``never``, ``deny``, and ``force``). + +Similarly to ``transparent_hugepage_shmem``, you can control the default +hugepage allocation policy for the tmpfs mount by using the kernel parameter +``transparent_hugepage_tmpfs=<policy>``, where ``<policy>`` is one of the +four valid policies for tmpfs (``always``, ``within_size``, ``advise``, +``never``). The tmpfs mount default policy is ``never``. + +In the same manner as ``thp_anon`` controls each supported anonymous THP +size, ``thp_shmem`` controls each supported shmem THP size. ``thp_shmem`` +has the same format as ``thp_anon``, but also supports the policy +``within_size``. + +``thp_shmem=`` may be specified multiple times to configure all THP sizes +as required. If ``thp_shmem=`` is specified at least once, any shmem THP +sizes not explicitly configured on the command line are implicitly set to +``never``. + +``transparent_hugepage_shmem`` setting only affects the global toggle. If +``thp_shmem`` is not specified, PMD_ORDER hugepage will default to +``inherit``. However, if a valid ``thp_shmem`` setting is provided by the +user, the PMD_ORDER hugepage policy will be overridden. If the policy for +PMD_ORDER is not defined within a valid ``thp_shmem``, its policy will +default to ``never``. Hugepages in tmpfs/shmem ======================== -You can control hugepage allocation policy in tmpfs with mount option -``huge=``. It can have following values: +Traditionally, tmpfs only supported a single huge page size ("PMD"). Today, +it also supports smaller sizes just like anonymous memory, often referred +to as "multi-size THP" (mTHP). Huge pages of any size are commonly +represented in the kernel as "large folios". + +While there is fine control over the huge page sizes to use for the internal +shmem mount (see below), ordinary tmpfs mounts will make use of all available +huge page sizes without any control over the exact sizes, behaving more like +other file systems. + +tmpfs mounts +------------ + +The THP allocation policy for tmpfs mounts can be adjusted using the mount +option: ``huge=``. It can have following values: always Attempt to allocate huge pages every time we need a new page; @@ -306,24 +382,24 @@ never within_size Only allocate huge page if it will be fully within i_size. - Also respect fadvise()/madvise() hints; + Also respect madvise() hints; advise - Only allocate huge pages if requested with fadvise()/madvise(); + Only allocate huge pages if requested with madvise(); -The default policy is ``never``. +Remember, that the kernel may use huge pages of all available sizes, and +that no fine control as for the internal tmpfs mount is available. + +The default policy in the past was ``never``, but it can now be adjusted +using the kernel parameter ``transparent_hugepage_tmpfs=<policy>``. ``mount -o remount,huge= /mountpoint`` works fine after mount: remounting ``huge=never`` will not attempt to break up huge pages at all, just stop more from being allocated. -There's also sysfs knob to control hugepage allocation policy for internal -shmem mount: /sys/kernel/mm/transparent_hugepage/shmem_enabled. The mount -is used for SysV SHM, memfds, shared anonymous mmaps (of /dev/zero or -MAP_ANONYMOUS), GPU drivers' DRM objects, Ashmem. - -In addition to policies listed above, shmem_enabled allows two further -values: +In addition to policies listed above, the sysfs knob +/sys/kernel/mm/transparent_hugepage/shmem_enabled will affect the +allocation policy of tmpfs mounts, when set to the following values: deny For use in emergencies, to force the huge option off from @@ -331,6 +407,42 @@ deny force Force the huge option on for all - very useful for testing; +shmem / internal tmpfs +---------------------- +The mount internal tmpfs mount is used for SysV SHM, memfds, shared anonymous +mmaps (of /dev/zero or MAP_ANONYMOUS), GPU drivers' DRM objects, Ashmem. + +To control the THP allocation policy for this internal tmpfs mount, the +sysfs knob /sys/kernel/mm/transparent_hugepage/shmem_enabled and the knobs +per THP size in +'/sys/kernel/mm/transparent_hugepage/hugepages-<size>kB/shmem_enabled' +can be used. + +The global knob has the same semantics as the ``huge=`` mount options +for tmpfs mounts, except that the different huge page sizes can be controlled +individually, and will only use the setting of the global knob when the +per-size knob is set to 'inherit'. + +The options 'force' and 'deny' are dropped for the individual sizes, which +are rather testing artifacts from the old ages. + +always + Attempt to allocate <size> huge pages every time we need a new page; + +inherit + Inherit the top-level "shmem_enabled" value. By default, PMD-sized hugepages + have enabled="inherit" and all other hugepage sizes have enabled="never"; + +never + Do not allocate <size> huge pages; + +within_size + Only allocate <size> huge page if it will be fully within i_size. + Also respect madvise() hints; + +advise + Only allocate <size> huge pages if requested with madvise(); + Need of application restart =========================== @@ -343,10 +455,6 @@ also applies to the regions registered in khugepaged. Monitoring usage ================ -.. note:: - Currently the below counters only record events relating to - PMD-sized THP. Events relating to other THP sizes are not included. - The number of PMD-sized anonymous transparent huge pages currently used by the system is available by reading the AnonHugePages field in ``/proc/meminfo``. To identify what applications are using PMD-sized anonymous transparent huge @@ -358,7 +466,7 @@ AnonHugePmdMapped). The number of file transparent huge pages mapped to userspace is available by reading ShmemPmdMapped and ShmemHugePages fields in ``/proc/meminfo``. To identify what applications are mapping file transparent huge pages, it -is necessary to read ``/proc/PID/smaps`` and count the FileHugeMapped fields +is necessary to read ``/proc/PID/smaps`` and count the FilePmdMapped fields for each mapping. Note that reading the smaps file is expensive and reading it @@ -369,7 +477,7 @@ monitor how successfully the system is providing huge pages for use. thp_fault_alloc is incremented every time a huge page is successfully - allocated to handle a page fault. + allocated and charged to handle a page fault. thp_collapse_alloc is incremented by khugepaged when it has found @@ -377,7 +485,7 @@ thp_collapse_alloc successfully allocated a new huge page to store the data. thp_fault_fallback - is incremented if a page fault fails to allocate + is incremented if a page fault fails to allocate or charge a huge page and instead falls back to using small pages. thp_fault_fallback_charge @@ -391,20 +499,23 @@ thp_collapse_alloc_failed the allocation. thp_file_alloc - is incremented every time a file huge page is successfully - allocated. + is incremented every time a shmem huge page is successfully + allocated (Note that despite being named after "file", the counter + measures only shmem). thp_file_fallback - is incremented if a file huge page is attempted to be allocated - but fails and instead falls back to using small pages. + is incremented if a shmem huge page is attempted to be allocated + but fails and instead falls back to using small pages. (Note that + despite being named after "file", the counter measures only shmem). thp_file_fallback_charge - is incremented if a file huge page cannot be charged and instead + is incremented if a shmem huge page cannot be charged and instead falls back to using small pages even though the allocation was - successful. + successful. (Note that despite being named after "file", the + counter measures only shmem). thp_file_mapped - is incremented every time a file huge page is mapped into + is incremented every time a file or shmem huge page is mapped into user address space. thp_split_page @@ -423,6 +534,12 @@ thp_deferred_split_page splitting it would free up some memory. Pages on split queue are going to be split under memory pressure. +thp_underused_split_page + is incremented when a huge page on the split queue was split + because it was underused. A THP is underused if the number of + zero pages in the THP is above a certain threshold + (/sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_none). + thp_split_pmd is incremented every time a PMD split into table of PTEs. This can happen, for instance, when application calls mprotect() or @@ -447,6 +564,92 @@ thp_swpout_fallback Usually because failed to allocate some continuous swap space for the huge page. +In /sys/kernel/mm/transparent_hugepage/hugepages-<size>kB/stats, There are +also individual counters for each huge page size, which can be utilized to +monitor the system's effectiveness in providing huge pages for usage. Each +counter has its own corresponding file. + +anon_fault_alloc + is incremented every time a huge page is successfully + allocated and charged to handle a page fault. + +anon_fault_fallback + is incremented if a page fault fails to allocate or charge + a huge page and instead falls back to using huge pages with + lower orders or small pages. + +anon_fault_fallback_charge + is incremented if a page fault fails to charge a huge page and + instead falls back to using huge pages with lower orders or + small pages even though the allocation was successful. + +zswpout + is incremented every time a huge page is swapped out to zswap in one + piece without splitting. + +swpin + is incremented every time a huge page is swapped in from a non-zswap + swap device in one piece. + +swpin_fallback + is incremented if swapin fails to allocate or charge a huge page + and instead falls back to using huge pages with lower orders or + small pages. + +swpin_fallback_charge + is incremented if swapin fails to charge a huge page and instead + falls back to using huge pages with lower orders or small pages + even though the allocation was successful. + +swpout + is incremented every time a huge page is swapped out to a non-zswap + swap device in one piece without splitting. + +swpout_fallback + is incremented if a huge page has to be split before swapout. + Usually because failed to allocate some continuous swap space + for the huge page. + +shmem_alloc + is incremented every time a shmem huge page is successfully + allocated. + +shmem_fallback + is incremented if a shmem huge page is attempted to be allocated + but fails and instead falls back to using small pages. + +shmem_fallback_charge + is incremented if a shmem huge page cannot be charged and instead + falls back to using small pages even though the allocation was + successful. + +split + is incremented every time a huge page is successfully split into + smaller orders. This can happen for a variety of reasons but a + common reason is that a huge page is old and is being reclaimed. + +split_failed + is incremented if kernel fails to split huge + page. This can happen if the page was pinned by somebody. + +split_deferred + is incremented when a huge page is put onto split queue. + This happens when a huge page is partially unmapped and splitting + it would free up some memory. Pages on split queue are going to + be split under memory pressure, if splitting is possible. + +nr_anon + the number of anonymous THP we have in the whole system. These THPs + might be currently entirely mapped or have partially unmapped/unused + subpages. + +nr_anon_partially_mapped + the number of anonymous THP which are likely partially mapped, possibly + wasting memory, and have been queued for deferred memory reclamation. + Note that in corner some cases (e.g., failed migration), we might detect + an anonymous THP as "partially mapped" and count it here, even though it + is not actually partially mapped anymore. + As the system ages, allocating huge pages may be expensive as the system uses memory compaction to copy data around memory to free a huge page for use. There are some counters in ``/proc/vmstat`` to help diff --git a/Documentation/admin-guide/mm/zswap.rst b/Documentation/admin-guide/mm/zswap.rst index 13632671adae..fd3370aa43fe 100644 --- a/Documentation/admin-guide/mm/zswap.rst +++ b/Documentation/admin-guide/mm/zswap.rst @@ -60,15 +60,13 @@ accessed. The compressed memory pool grows on demand and shrinks as compressed pages are freed. The pool is not preallocated. By default, a zpool of type selected in ``CONFIG_ZSWAP_ZPOOL_DEFAULT`` Kconfig option is created, but it can be overridden at boot time by setting the ``zpool`` attribute, -e.g. ``zswap.zpool=zbud``. It can also be changed at runtime using the sysfs +e.g. ``zswap.zpool=zsmalloc``. It can also be changed at runtime using the sysfs ``zpool`` attribute, e.g.:: - echo zbud > /sys/module/zswap/parameters/zpool + echo zsmalloc > /sys/module/zswap/parameters/zpool -The zbud type zpool allocates exactly 1 page to store 2 compressed pages, which -means the compression ratio will always be 2:1 or worse (because of half-full -zbud pages). The zsmalloc type zpool has a more complex compressed page -storage method, and it can achieve greater storage densities. +The zsmalloc type zpool has a complex compressed page storage method, and it +can achieve great storage densities. When a swap page is passed from swapout to zswap, zswap maintains a mapping of the swap entry, a combination of the swap type and swap offset, to the zpool @@ -111,35 +109,6 @@ checked if it is a same-value filled page before compressing it. If true, the compressed length of the page is set to zero and the pattern or same-filled value is stored. -Same-value filled pages identification feature is enabled by default and can be -disabled at boot time by setting the ``same_filled_pages_enabled`` attribute -to 0, e.g. ``zswap.same_filled_pages_enabled=0``. It can also be enabled and -disabled at runtime using the sysfs ``same_filled_pages_enabled`` -attribute, e.g.:: - - echo 1 > /sys/module/zswap/parameters/same_filled_pages_enabled - -When zswap same-filled page identification is disabled at runtime, it will stop -checking for the same-value filled pages during store operation. -In other words, every page will be then considered non-same-value filled. -However, the existing pages which are marked as same-value filled pages remain -stored unchanged in zswap until they are either loaded or invalidated. - -In some circumstances it might be advantageous to make use of just the zswap -ability to efficiently store same-filled pages without enabling the whole -compressed page storage. -In this case the handling of non-same-value pages by zswap (enabled by default) -can be disabled by setting the ``non_same_filled_pages_enabled`` attribute -to 0, e.g. ``zswap.non_same_filled_pages_enabled=0``. -It can also be enabled and disabled at runtime using the sysfs -``non_same_filled_pages_enabled`` attribute, e.g.:: - - echo 1 > /sys/module/zswap/parameters/non_same_filled_pages_enabled - -Disabling both ``zswap.same_filled_pages_enabled`` and -``zswap.non_same_filled_pages_enabled`` effectively disables accepting any new -pages by zswap. - To prevent zswap from shrinking pool when zswap is full and there's a high pressure on swap (this will result in flipping pages in and out zswap pool without any real benefit but with a performance drop for the system), a diff --git a/Documentation/admin-guide/namespaces/resource-control.rst b/Documentation/admin-guide/namespaces/resource-control.rst index 369556e00f0c..553a44803231 100644 --- a/Documentation/admin-guide/namespaces/resource-control.rst +++ b/Documentation/admin-guide/namespaces/resource-control.rst @@ -1,17 +1,17 @@ -=========================== -Namespaces research control -=========================== +==================================== +User namespaces and resource control +==================================== -There are a lot of kinds of objects in the kernel that don't have -individual limits or that have limits that are ineffective when a set -of processes is allowed to switch user ids. With user namespaces -enabled in a kernel for people who don't trust their users or their -users programs to play nice this problems becomes more acute. +The kernel contains many kinds of objects that either don't have +individual limits or that have limits which are ineffective when +a set of processes is allowed to switch their UID. On a system +where the admins don't trust their users or their users' programs, +user namespaces expose the system to potential misuse of resources. -Therefore it is recommended that memory control groups be enabled in -kernels that enable user namespaces, and it is further recommended -that userspace configure memory control groups to limit how much -memory user's they don't trust to play nice can use. +In order to mitigate this, we recommend that admins enable memory +control groups on any system that enables user namespaces. +Furthermore, we recommend that admins configure the memory control +groups to limit the maximum memory usable by any untrusted user. Memory control groups can be configured by installing the libcgroup package present on most distros editing /etc/cgrules.conf, diff --git a/Documentation/admin-guide/nvme-multipath.rst b/Documentation/admin-guide/nvme-multipath.rst new file mode 100644 index 000000000000..97ca1ccef459 --- /dev/null +++ b/Documentation/admin-guide/nvme-multipath.rst @@ -0,0 +1,72 @@ +.. SPDX-License-Identifier: GPL-2.0 + +==================== +Linux NVMe multipath +==================== + +This document describes NVMe multipath and its path selection policies supported +by the Linux NVMe host driver. + + +Introduction +============ + +The NVMe multipath feature in Linux integrates namespaces with the same +identifier into a single block device. Using multipath enhances the reliability +and stability of I/O access while improving bandwidth performance. When a user +sends I/O to this merged block device, the multipath mechanism selects one of +the underlying block devices (paths) according to the configured policy. +Different policies result in different path selections. + + +Policies +======== + +All policies follow the ANA (Asymmetric Namespace Access) mechanism, meaning +that when an optimized path is available, it will be chosen over a non-optimized +one. Current the NVMe multipath policies include numa(default), round-robin and +queue-depth. + +To set the desired policy (e.g., round-robin), use one of the following methods: + 1. echo -n "round-robin" > /sys/module/nvme_core/parameters/iopolicy + 2. or add the "nvme_core.iopolicy=round-robin" to cmdline. + + +NUMA +---- + +The NUMA policy selects the path closest to the NUMA node of the current CPU for +I/O distribution. This policy maintains the nearest paths to each NUMA node +based on network interface connections. + +When to use the NUMA policy: + 1. Multi-core Systems: Optimizes memory access in multi-core and + multi-processor systems, especially under NUMA architecture. + 2. High Affinity Workloads: Binds I/O processing to the CPU to reduce + communication and data transfer delays across nodes. + + +Round-Robin +----------- + +The round-robin policy distributes I/O requests evenly across all paths to +enhance throughput and resource utilization. Each I/O operation is sent to the +next path in sequence. + +When to use the round-robin policy: + 1. Balanced Workloads: Effective for balanced and predictable workloads with + similar I/O size and type. + 2. Homogeneous Path Performance: Utilizes all paths efficiently when + performance characteristics (e.g., latency, bandwidth) are similar. + + +Queue-Depth +----------- + +The queue-depth policy manages I/O requests based on the current queue depth +of each path, selecting the path with the least number of in-flight I/Os. + +When to use the queue-depth policy: + 1. High load with small I/Os: Effectively balances load across paths when + the load is high, and I/O operations consist of small, relatively + fixed-sized requests. diff --git a/Documentation/admin-guide/perf/arm-ni.rst b/Documentation/admin-guide/perf/arm-ni.rst new file mode 100644 index 000000000000..d26a8f697c36 --- /dev/null +++ b/Documentation/admin-guide/perf/arm-ni.rst @@ -0,0 +1,17 @@ +==================================== +Arm Network-on Chip Interconnect PMU +==================================== + +NI-700 and friends implement a distinct PMU for each clock domain within the +interconnect. Correspondingly, the driver exposes multiple PMU devices named +arm_ni_<x>_cd_<y>, where <x> is an (arbitrary) instance identifier and <y> is +the clock domain ID within that particular instance. If multiple NI instances +exist within a system, the PMU devices can be correlated with the underlying +hardware instance via sysfs parentage. + +Each PMU exposes base event aliases for the interface types present in its clock +domain. These require qualifying with the "eventid" and "nodeid" parameters +to specify the event code to count and the interface at which to count it +(per the configured hardware ID as reflected in the xxNI_NODE_INFO register). +The exception is the "cycles" alias for the PMU cycle counter, which is encoded +with the PMU node type and needs no further qualification. diff --git a/Documentation/admin-guide/perf/dwc_pcie_pmu.rst b/Documentation/admin-guide/perf/dwc_pcie_pmu.rst index d47cd229d710..cb376f335f40 100644 --- a/Documentation/admin-guide/perf/dwc_pcie_pmu.rst +++ b/Documentation/admin-guide/perf/dwc_pcie_pmu.rst @@ -46,41 +46,41 @@ Some of the events only exist for specific configurations. DesignWare Cores (DWC) PCIe PMU Driver ======================================= -This driver adds PMU devices for each PCIe Root Port named based on the BDF of +This driver adds PMU devices for each PCIe Root Port named based on the SBDF of the Root Port. For example, - 30:03.0 PCI bridge: Device 1ded:8000 (rev 01) + 0001:30:03.0 PCI bridge: Device 1ded:8000 (rev 01) -the PMU device name for this Root Port is dwc_rootport_3018. +the PMU device name for this Root Port is dwc_rootport_13018. The DWC PCIe PMU driver registers a perf PMU driver, which provides description of available events and configuration options in sysfs, see -/sys/bus/event_source/devices/dwc_rootport_{bdf}. +/sys/bus/event_source/devices/dwc_rootport_{sbdf}. The "format" directory describes format of the config fields of the perf_event_attr structure. The "events" directory provides configuration templates for all documented events. For example, -"Rx_PCIe_TLP_Data_Payload" is an equivalent of "eventid=0x22,type=0x1". +"rx_pcie_tlp_data_payload" is an equivalent of "eventid=0x21,type=0x0". The "perf list" command shall list the available events from sysfs, e.g.:: $# perf list | grep dwc_rootport <...> - dwc_rootport_3018/Rx_PCIe_TLP_Data_Payload/ [Kernel PMU event] + dwc_rootport_13018/Rx_PCIe_TLP_Data_Payload/ [Kernel PMU event] <...> - dwc_rootport_3018/rx_memory_read,lane=?/ [Kernel PMU event] + dwc_rootport_13018/rx_memory_read,lane=?/ [Kernel PMU event] Time Based Analysis Event Usage ------------------------------- Example usage of counting PCIe RX TLP data payload (Units of bytes):: - $# perf stat -a -e dwc_rootport_3018/Rx_PCIe_TLP_Data_Payload/ + $# perf stat -a -e dwc_rootport_13018/Rx_PCIe_TLP_Data_Payload/ The average RX/TX bandwidth can be calculated using the following formula: - PCIe RX Bandwidth = Rx_PCIe_TLP_Data_Payload / Measure_Time_Window - PCIe TX Bandwidth = Tx_PCIe_TLP_Data_Payload / Measure_Time_Window + PCIe RX Bandwidth = rx_pcie_tlp_data_payload / Measure_Time_Window + PCIe TX Bandwidth = tx_pcie_tlp_data_payload / Measure_Time_Window Lane Event Usage ------------------------------- @@ -88,7 +88,7 @@ Lane Event Usage Each lane has the same event set and to avoid generating a list of hundreds of events, the user need to specify the lane ID explicitly, e.g.:: - $# perf stat -a -e dwc_rootport_3018/rx_memory_read,lane=4/ + $# perf stat -a -e dwc_rootport_13018/rx_memory_read,lane=4/ The driver does not support sampling, therefore "perf record" will not work. Per-task (without "-a") perf sessions are not supported. diff --git a/Documentation/admin-guide/perf/hisi-pcie-pmu.rst b/Documentation/admin-guide/perf/hisi-pcie-pmu.rst index 5541ff40e06a..083ca50de896 100644 --- a/Documentation/admin-guide/perf/hisi-pcie-pmu.rst +++ b/Documentation/admin-guide/perf/hisi-pcie-pmu.rst @@ -28,7 +28,9 @@ The "identifier" sysfs file allows users to identify the version of the PMU hardware device. The "bus" sysfs file allows users to get the bus number of Root Ports -monitored by PMU. +monitored by PMU. Furthermore users can get the Root Ports range in +[bdf_min, bdf_max] from "bdf_min" and "bdf_max" sysfs attributes +respectively. Example usage of perf:: diff --git a/Documentation/admin-guide/perf/hisi-pmu.rst b/Documentation/admin-guide/perf/hisi-pmu.rst index e0174d20809a..48992a0b8e94 100644 --- a/Documentation/admin-guide/perf/hisi-pmu.rst +++ b/Documentation/admin-guide/perf/hisi-pmu.rst @@ -20,7 +20,6 @@ interrupt, and the PMU driver shall register perf PMU drivers like L3C, HHA and DDRC etc. The available events and configuration options shall be described in the sysfs, see: -/sys/devices/hisi_sccl{X}_<l3c{Y}/hha{Y}/ddrc{Y}>/, or /sys/bus/event_source/devices/hisi_sccl{X}_<l3c{Y}/hha{Y}/ddrc{Y}>. The "perf list" command shall list the available events from sysfs. @@ -36,7 +35,10 @@ e.g. hisi_sccl1_hha0/rx_operations is RX_OPERATIONS event of HHA index #0 in SCCL ID #1. The driver also provides a "cpumask" sysfs attribute, which shows the CPU core -ID used to count the uncore PMU event. +ID used to count the uncore PMU event. An "associated_cpus" sysfs attribute is +also provided to show the CPUs associated with this PMU. The "cpumask" indicates +the CPUs to open the events, usually as a hint for userspaces tools like perf. +It only contains one associated CPU from the "associated_cpus". Example usage of perf:: diff --git a/Documentation/admin-guide/perf/hns3-pmu.rst b/Documentation/admin-guide/perf/hns3-pmu.rst index 75a40846d47f..1195e570f2d6 100644 --- a/Documentation/admin-guide/perf/hns3-pmu.rst +++ b/Documentation/admin-guide/perf/hns3-pmu.rst @@ -16,7 +16,7 @@ HNS3 PMU driver The HNS3 PMU driver registers a perf PMU with the name of its sicl id.:: - /sys/devices/hns3_pmu_sicl_<sicl_id> + /sys/bus/event_source/devices/hns3_pmu_sicl_<sicl_id> PMU driver provides description of available events, filter modes, format, identifier and cpumask in sysfs. @@ -40,9 +40,9 @@ device. Example usage of checking event code and subevent code:: - $# cat /sys/devices/hns3_pmu_sicl_0/events/dly_tx_normal_to_mac_time + $# cat /sys/bus/event_source/devices/hns3_pmu_sicl_0/events/dly_tx_normal_to_mac_time config=0x00204 - $# cat /sys/devices/hns3_pmu_sicl_0/events/dly_tx_normal_to_mac_packet_num + $# cat /sys/bus/event_source/devices/hns3_pmu_sicl_0/events/dly_tx_normal_to_mac_packet_num config=0x10204 Each performance statistic has a pair of events to get two values to @@ -60,7 +60,7 @@ computation to calculate real performance data is::: Example usage of checking supported filter mode:: - $# cat /sys/devices/hns3_pmu_sicl_0/filtermode/bw_ssu_rpu_byte_num + $# cat /sys/bus/event_source/devices/hns3_pmu_sicl_0/filtermode/bw_ssu_rpu_byte_num filter mode supported: global/port/port-tc/func/func-queue/ Example usage of perf:: diff --git a/Documentation/admin-guide/perf/index.rst b/Documentation/admin-guide/perf/index.rst index 7eb3dcd6f4da..072b510385c4 100644 --- a/Documentation/admin-guide/perf/index.rst +++ b/Documentation/admin-guide/perf/index.rst @@ -14,8 +14,11 @@ Performance monitor support qcom_l2_pmu qcom_l3_pmu starfive_starlink_pmu + mrvl-odyssey-ddr-pmu + mrvl-odyssey-tad-pmu arm-ccn arm-cmn + arm-ni xgene-pmu arm_dsu_pmu thunderx2-pmu @@ -25,3 +28,4 @@ Performance monitor support meson-ddr-pmu cxl ampere_cspmu + mrvl-pem-pmu diff --git a/Documentation/admin-guide/perf/mrvl-odyssey-ddr-pmu.rst b/Documentation/admin-guide/perf/mrvl-odyssey-ddr-pmu.rst new file mode 100644 index 000000000000..2e817593a4d9 --- /dev/null +++ b/Documentation/admin-guide/perf/mrvl-odyssey-ddr-pmu.rst @@ -0,0 +1,80 @@ +=================================================================== +Marvell Odyssey DDR PMU Performance Monitoring Unit (PMU UNCORE) +=================================================================== + +Odyssey DRAM Subsystem supports eight counters for monitoring performance +and software can program those counters to monitor any of the defined +performance events. Supported performance events include those counted +at the interface between the DDR controller and the PHY, interface between +the DDR Controller and the CHI interconnect, or within the DDR Controller. + +Additionally DSS also supports two fixed performance event counters, one +for ddr reads and the other for ddr writes. + +The counter will be operating in either manual or auto mode. + +The PMU driver exposes the available events and format options under sysfs:: + + /sys/bus/event_source/devices/mrvl_ddr_pmu_<>/events/ + /sys/bus/event_source/devices/mrvl_ddr_pmu_<>/format/ + +Examples:: + + $ perf list | grep ddr + mrvl_ddr_pmu_<>/ddr_act_bypass_access/ [Kernel PMU event] + mrvl_ddr_pmu_<>/ddr_bsm_alloc/ [Kernel PMU event] + mrvl_ddr_pmu_<>/ddr_bsm_starvation/ [Kernel PMU event] + mrvl_ddr_pmu_<>/ddr_cam_active_access/ [Kernel PMU event] + mrvl_ddr_pmu_<>/ddr_cam_mwr/ [Kernel PMU event] + mrvl_ddr_pmu_<>/ddr_cam_rd_active_access/ [Kernel PMU event] + mrvl_ddr_pmu_<>/ddr_cam_rd_or_wr_access/ [Kernel PMU event] + mrvl_ddr_pmu_<>/ddr_cam_read/ [Kernel PMU event] + mrvl_ddr_pmu_<>/ddr_cam_wr_access/ [Kernel PMU event] + mrvl_ddr_pmu_<>/ddr_cam_write/ [Kernel PMU event] + mrvl_ddr_pmu_<>/ddr_capar_error/ [Kernel PMU event] + mrvl_ddr_pmu_<>/ddr_crit_ref/ [Kernel PMU event] + mrvl_ddr_pmu_<>/ddr_ddr_reads/ [Kernel PMU event] + mrvl_ddr_pmu_<>/ddr_ddr_writes/ [Kernel PMU event] + mrvl_ddr_pmu_<>/ddr_dfi_cmd_is_retry/ [Kernel PMU event] + mrvl_ddr_pmu_<>/ddr_dfi_cycles/ [Kernel PMU event] + mrvl_ddr_pmu_<>/ddr_dfi_parity_poison/ [Kernel PMU event] + mrvl_ddr_pmu_<>/ddr_dfi_rd_data_access/ [Kernel PMU event] + mrvl_ddr_pmu_<>/ddr_dfi_wr_data_access/ [Kernel PMU event] + mrvl_ddr_pmu_<>/ddr_dqsosc_mpc/ [Kernel PMU event] + mrvl_ddr_pmu_<>/ddr_dqsosc_mrr/ [Kernel PMU event] + mrvl_ddr_pmu_<>/ddr_enter_mpsm/ [Kernel PMU event] + mrvl_ddr_pmu_<>/ddr_enter_powerdown/ [Kernel PMU event] + mrvl_ddr_pmu_<>/ddr_enter_selfref/ [Kernel PMU event] + mrvl_ddr_pmu_<>/ddr_hif_pri_rdaccess/ [Kernel PMU event] + mrvl_ddr_pmu_<>/ddr_hif_rd_access/ [Kernel PMU event] + mrvl_ddr_pmu_<>/ddr_hif_rd_or_wr_access/ [Kernel PMU event] + mrvl_ddr_pmu_<>/ddr_hif_rmw_access/ [Kernel PMU event] + mrvl_ddr_pmu_<>/ddr_hif_wr_access/ [Kernel PMU event] + mrvl_ddr_pmu_<>/ddr_hpri_sched_rd_crit_access/ [Kernel PMU event] + mrvl_ddr_pmu_<>/ddr_load_mode/ [Kernel PMU event] + mrvl_ddr_pmu_<>/ddr_lpri_sched_rd_crit_access/ [Kernel PMU event] + mrvl_ddr_pmu_<>/ddr_precharge/ [Kernel PMU event] + mrvl_ddr_pmu_<>/ddr_precharge_for_other/ [Kernel PMU event] + mrvl_ddr_pmu_<>/ddr_precharge_for_rdwr/ [Kernel PMU event] + mrvl_ddr_pmu_<>/ddr_raw_hazard/ [Kernel PMU event] + mrvl_ddr_pmu_<>/ddr_rd_bypass_access/ [Kernel PMU event] + mrvl_ddr_pmu_<>/ddr_rd_crc_error/ [Kernel PMU event] + mrvl_ddr_pmu_<>/ddr_rd_uc_ecc_error/ [Kernel PMU event] + mrvl_ddr_pmu_<>/ddr_rdwr_transitions/ [Kernel PMU event] + mrvl_ddr_pmu_<>/ddr_refresh/ [Kernel PMU event] + mrvl_ddr_pmu_<>/ddr_retry_fifo_full/ [Kernel PMU event] + mrvl_ddr_pmu_<>/ddr_spec_ref/ [Kernel PMU event] + mrvl_ddr_pmu_<>/ddr_tcr_mrr/ [Kernel PMU event] + mrvl_ddr_pmu_<>/ddr_war_hazard/ [Kernel PMU event] + mrvl_ddr_pmu_<>/ddr_waw_hazard/ [Kernel PMU event] + mrvl_ddr_pmu_<>/ddr_win_limit_reached_rd/ [Kernel PMU event] + mrvl_ddr_pmu_<>/ddr_win_limit_reached_wr/ [Kernel PMU event] + mrvl_ddr_pmu_<>/ddr_wr_crc_error/ [Kernel PMU event] + mrvl_ddr_pmu_<>/ddr_wr_trxn_crit_access/ [Kernel PMU event] + mrvl_ddr_pmu_<>/ddr_write_combine/ [Kernel PMU event] + mrvl_ddr_pmu_<>/ddr_zqcl/ [Kernel PMU event] + mrvl_ddr_pmu_<>/ddr_zqlatch/ [Kernel PMU event] + mrvl_ddr_pmu_<>/ddr_zqstart/ [Kernel PMU event] + + $ perf stat -e ddr_cam_read,ddr_cam_write,ddr_cam_active_access,ddr_cam + rd_or_wr_access,ddr_cam_rd_active_access,ddr_cam_mwr <workload> diff --git a/Documentation/admin-guide/perf/mrvl-odyssey-tad-pmu.rst b/Documentation/admin-guide/perf/mrvl-odyssey-tad-pmu.rst new file mode 100644 index 000000000000..ad1975b14087 --- /dev/null +++ b/Documentation/admin-guide/perf/mrvl-odyssey-tad-pmu.rst @@ -0,0 +1,37 @@ +==================================================================== +Marvell Odyssey LLC-TAD Performance Monitoring Unit (PMU UNCORE) +==================================================================== + +Each TAD provides eight 64-bit counters for monitoring +cache behavior.The driver always configures the same counter for +all the TADs. The user would end up effectively reserving one of +eight counters in every TAD to look across all TADs. +The occurrences of events are aggregated and presented to the user +at the end of running the workload. The driver does not provide a +way for the user to partition TADs so that different TADs are used for +different applications. + +The performance events reflect various internal or interface activities. +By combining the values from multiple performance counters, cache +performance can be measured in terms such as: cache miss rate, cache +allocations, interface retry rate, internal resource occupancy, etc. + +The PMU driver exposes the available events and format options under sysfs:: + + /sys/bus/event_source/devices/tad/events/ + /sys/bus/event_source/devices/tad/format/ + +Examples:: + + $ perf list | grep tad + tad/tad_alloc_any/ [Kernel PMU event] + tad/tad_alloc_dtg/ [Kernel PMU event] + tad/tad_alloc_ltg/ [Kernel PMU event] + tad/tad_hit_any/ [Kernel PMU event] + tad/tad_hit_dtg/ [Kernel PMU event] + tad/tad_hit_ltg/ [Kernel PMU event] + tad/tad_req_msh_in_exlmn/ [Kernel PMU event] + tad/tad_tag_rd/ [Kernel PMU event] + tad/tad_tot_cycle/ [Kernel PMU event] + + $ perf stat -e tad_alloc_dtg,tad_alloc_ltg,tad_alloc_any,tad_hit_dtg,tad_hit_ltg,tad_hit_any,tad_tag_rd <workload> diff --git a/Documentation/admin-guide/perf/mrvl-pem-pmu.rst b/Documentation/admin-guide/perf/mrvl-pem-pmu.rst new file mode 100644 index 000000000000..c39007149b97 --- /dev/null +++ b/Documentation/admin-guide/perf/mrvl-pem-pmu.rst @@ -0,0 +1,56 @@ +================================================================= +Marvell Odyssey PEM Performance Monitoring Unit (PMU UNCORE) +================================================================= + +The PCI Express Interface Units(PEM) are associated with a corresponding +monitoring unit. This includes performance counters to track various +characteristics of the data that is transmitted over the PCIe link. + +The counters track inbound and outbound transactions which +includes separate counters for posted/non-posted/completion TLPs. +Also, inbound and outbound memory read requests along with their +latencies can also be monitored. Address Translation Services(ATS)events +such as ATS Translation, ATS Page Request, ATS Invalidation along with +their corresponding latencies are also tracked. + +There are separate 64 bit counters to measure posted/non-posted/completion +tlps in inbound and outbound transactions. ATS events are measured by +different counters. + +The PMU driver exposes the available events and format options under sysfs, +/sys/bus/event_source/devices/mrvl_pcie_rc_pmu_<>/events/ +/sys/bus/event_source/devices/mrvl_pcie_rc_pmu_<>/format/ + +Examples:: + + # perf list | grep mrvl_pcie_rc_pmu + mrvl_pcie_rc_pmu_<>/ats_inv/ [Kernel PMU event] + mrvl_pcie_rc_pmu_<>/ats_inv_latency/ [Kernel PMU event] + mrvl_pcie_rc_pmu_<>/ats_pri/ [Kernel PMU event] + mrvl_pcie_rc_pmu_<>/ats_pri_latency/ [Kernel PMU event] + mrvl_pcie_rc_pmu_<>/ats_trans/ [Kernel PMU event] + mrvl_pcie_rc_pmu_<>/ats_trans_latency/ [Kernel PMU event] + mrvl_pcie_rc_pmu_<>/ib_inflight/ [Kernel PMU event] + mrvl_pcie_rc_pmu_<>/ib_reads/ [Kernel PMU event] + mrvl_pcie_rc_pmu_<>/ib_req_no_ro_ebus/ [Kernel PMU event] + mrvl_pcie_rc_pmu_<>/ib_req_no_ro_ncb/ [Kernel PMU event] + mrvl_pcie_rc_pmu_<>/ib_tlp_cpl_partid/ [Kernel PMU event] + mrvl_pcie_rc_pmu_<>/ib_tlp_dwords_cpl_partid/ [Kernel PMU event] + mrvl_pcie_rc_pmu_<>/ib_tlp_dwords_npr/ [Kernel PMU event] + mrvl_pcie_rc_pmu_<>/ib_tlp_dwords_pr/ [Kernel PMU event] + mrvl_pcie_rc_pmu_<>/ib_tlp_npr/ [Kernel PMU event] + mrvl_pcie_rc_pmu_<>/ib_tlp_pr/ [Kernel PMU event] + mrvl_pcie_rc_pmu_<>/ob_inflight_partid/ [Kernel PMU event] + mrvl_pcie_rc_pmu_<>/ob_merges_cpl_partid/ [Kernel PMU event] + mrvl_pcie_rc_pmu_<>/ob_merges_npr_partid/ [Kernel PMU event] + mrvl_pcie_rc_pmu_<>/ob_merges_pr_partid/ [Kernel PMU event] + mrvl_pcie_rc_pmu_<>/ob_reads_partid/ [Kernel PMU event] + mrvl_pcie_rc_pmu_<>/ob_tlp_cpl_partid/ [Kernel PMU event] + mrvl_pcie_rc_pmu_<>/ob_tlp_dwords_cpl_partid/ [Kernel PMU event] + mrvl_pcie_rc_pmu_<>/ob_tlp_dwords_npr_partid/ [Kernel PMU event] + mrvl_pcie_rc_pmu_<>/ob_tlp_dwords_pr_partid/ [Kernel PMU event] + mrvl_pcie_rc_pmu_<>/ob_tlp_npr_partid/ [Kernel PMU event] + mrvl_pcie_rc_pmu_<>/ob_tlp_pr_partid/ [Kernel PMU event] + + + # perf stat -e ib_inflight,ib_reads,ib_req_no_ro_ebus,ib_req_no_ro_ncb <workload> diff --git a/Documentation/admin-guide/perf/nvidia-pmu.rst b/Documentation/admin-guide/perf/nvidia-pmu.rst index 2e0d47cfe7ea..f538ef67e0e8 100644 --- a/Documentation/admin-guide/perf/nvidia-pmu.rst +++ b/Documentation/admin-guide/perf/nvidia-pmu.rst @@ -34,7 +34,7 @@ strongly-ordered (SO) PCIE write traffic to local/remote memory. Please see traffic coverage. The events and configuration options of this PMU device are described in sysfs, -see /sys/bus/event_sources/devices/nvidia_scf_pmu_<socket-id>. +see /sys/bus/event_source/devices/nvidia_scf_pmu_<socket-id>. Example usage: @@ -66,7 +66,7 @@ Please see :ref:`NVIDIA_Uncore_PMU_Traffic_Coverage_Section` for more info about the PMU traffic coverage. The events and configuration options of this PMU device are described in sysfs, -see /sys/bus/event_sources/devices/nvidia_nvlink_c2c0_pmu_<socket-id>. +see /sys/bus/event_source/devices/nvidia_nvlink_c2c0_pmu_<socket-id>. Example usage: @@ -86,6 +86,22 @@ Example usage: perf stat -a -e nvidia_nvlink_c2c0_pmu_3/event=0x0/ +The NVLink-C2C has two ports that can be connected to one GPU (occupying both +ports) or to two GPUs (one GPU per port). The user can use "port" bitmap +parameter to select the port(s) to monitor. Each bit represents the port number, +e.g. "port=0x1" corresponds to port 0 and "port=0x3" is for port 0 and 1. The +PMU will monitor both ports by default if not specified. + +Example for port filtering: + +* Count event id 0x0 from the GPU connected with socket 0 on port 0:: + + perf stat -a -e nvidia_nvlink_c2c0_pmu_0/event=0x0,port=0x1/ + +* Count event id 0x0 from the GPUs connected with socket 0 on port 0 and port 1:: + + perf stat -a -e nvidia_nvlink_c2c0_pmu_0/event=0x0,port=0x3/ + NVLink-C2C1 PMU ------------------- @@ -96,7 +112,7 @@ Please see :ref:`NVIDIA_Uncore_PMU_Traffic_Coverage_Section` for more info about the PMU traffic coverage. The events and configuration options of this PMU device are described in sysfs, -see /sys/bus/event_sources/devices/nvidia_nvlink_c2c1_pmu_<socket-id>. +see /sys/bus/event_source/devices/nvidia_nvlink_c2c1_pmu_<socket-id>. Example usage: @@ -116,6 +132,22 @@ Example usage: perf stat -a -e nvidia_nvlink_c2c1_pmu_3/event=0x0/ +The NVLink-C2C has two ports that can be connected to one GPU (occupying both +ports) or to two GPUs (one GPU per port). The user can use "port" bitmap +parameter to select the port(s) to monitor. Each bit represents the port number, +e.g. "port=0x1" corresponds to port 0 and "port=0x3" is for port 0 and 1. The +PMU will monitor both ports by default if not specified. + +Example for port filtering: + +* Count event id 0x0 from the GPU connected with socket 0 on port 0:: + + perf stat -a -e nvidia_nvlink_c2c1_pmu_0/event=0x0,port=0x1/ + +* Count event id 0x0 from the GPUs connected with socket 0 on port 0 and port 1:: + + perf stat -a -e nvidia_nvlink_c2c1_pmu_0/event=0x0,port=0x3/ + CNVLink PMU --------------- @@ -125,13 +157,14 @@ to local memory. For PCIE traffic, this PMU captures read and relaxed ordered for more info about the PMU traffic coverage. The events and configuration options of this PMU device are described in sysfs, -see /sys/bus/event_sources/devices/nvidia_cnvlink_pmu_<socket-id>. +see /sys/bus/event_source/devices/nvidia_cnvlink_pmu_<socket-id>. Each SoC socket can be connected to one or more sockets via CNVLink. The user can use "rem_socket" bitmap parameter to select the remote socket(s) to monitor. Each bit represents the socket number, e.g. "rem_socket=0xE" corresponds to -socket 1 to 3. -/sys/bus/event_sources/devices/nvidia_cnvlink_pmu_<socket-id>/format/rem_socket +socket 1 to 3. The PMU will monitor all remote sockets by default if not +specified. +/sys/bus/event_source/devices/nvidia_cnvlink_pmu_<socket-id>/format/rem_socket shows the valid bits that can be set in the "rem_socket" parameter. The PMU can not distinguish the remote traffic initiator, therefore it does not @@ -165,12 +198,13 @@ local/remote memory. Please see :ref:`NVIDIA_Uncore_PMU_Traffic_Coverage_Section for more info about the PMU traffic coverage. The events and configuration options of this PMU device are described in sysfs, -see /sys/bus/event_sources/devices/nvidia_pcie_pmu_<socket-id>. +see /sys/bus/event_source/devices/nvidia_pcie_pmu_<socket-id>. Each SoC socket can support multiple root ports. The user can use "root_port" bitmap parameter to select the port(s) to monitor, i.e. -"root_port=0xF" corresponds to root port 0 to 3. -/sys/bus/event_sources/devices/nvidia_pcie_pmu_<socket-id>/format/root_port +"root_port=0xF" corresponds to root port 0 to 3. The PMU will monitor all root +ports by default if not specified. +/sys/bus/event_source/devices/nvidia_pcie_pmu_<socket-id>/format/root_port shows the valid bits that can be set in the "root_port" parameter. Example usage: diff --git a/Documentation/admin-guide/perf/qcom_l2_pmu.rst b/Documentation/admin-guide/perf/qcom_l2_pmu.rst index c130178a4a55..c37c6be9b8d8 100644 --- a/Documentation/admin-guide/perf/qcom_l2_pmu.rst +++ b/Documentation/admin-guide/perf/qcom_l2_pmu.rst @@ -10,7 +10,7 @@ There is one logical L2 PMU exposed, which aggregates the results from the physical PMUs. The driver provides a description of its available events and configuration -options in sysfs, see /sys/devices/l2cache_0. +options in sysfs, see /sys/bus/event_source/devices/l2cache_0. The "format" directory describes the format of the events. diff --git a/Documentation/admin-guide/perf/qcom_l3_pmu.rst b/Documentation/admin-guide/perf/qcom_l3_pmu.rst index a3d014a46bfd..a66556b7e985 100644 --- a/Documentation/admin-guide/perf/qcom_l3_pmu.rst +++ b/Documentation/admin-guide/perf/qcom_l3_pmu.rst @@ -9,7 +9,7 @@ PMU with device name l3cache_<socket>_<instance>. User space is responsible for aggregating across slices. The driver provides a description of its available events and configuration -options in sysfs, see /sys/devices/l3cache*. Given that these are uncore PMUs +options in sysfs, see /sys/bus/event_source/devices/l3cache*. Given that these are uncore PMUs the driver also exposes a "cpumask" sysfs attribute which contains a mask consisting of one CPU per socket which will be used to handle all the PMU events on that socket. diff --git a/Documentation/admin-guide/perf/thunderx2-pmu.rst b/Documentation/admin-guide/perf/thunderx2-pmu.rst index 01f158238ae1..9255f7bf9452 100644 --- a/Documentation/admin-guide/perf/thunderx2-pmu.rst +++ b/Documentation/admin-guide/perf/thunderx2-pmu.rst @@ -22,7 +22,7 @@ The thunderx2_pmu driver registers per-socket perf PMUs for the DMC and L3C devices. Each PMU can be used to count up to 4 (DMC/L3C) or up to 8 (CCPI2) events simultaneously. The PMUs provide a description of their available events and configuration options under sysfs, see -/sys/devices/uncore_<l3c_S/dmc_S/ccpi2_S/>; S is the socket id. +/sys/bus/event_source/devices/uncore_<l3c_S/dmc_S/ccpi2_S/>; S is the socket id. The driver does not support sampling, therefore "perf record" will not work. Per-task perf sessions are also not supported. diff --git a/Documentation/admin-guide/perf/xgene-pmu.rst b/Documentation/admin-guide/perf/xgene-pmu.rst index 644f8ed89152..98ccb8e777c4 100644 --- a/Documentation/admin-guide/perf/xgene-pmu.rst +++ b/Documentation/admin-guide/perf/xgene-pmu.rst @@ -13,7 +13,7 @@ PMU (perf) driver The xgene-pmu driver registers several perf PMU drivers. Each of the perf driver provides description of its available events and configuration options -in sysfs, see /sys/devices/<l3cX/iobX/mcbX/mcX>/. +in sysfs, see /sys/bus/event_source/devices/<l3cX/iobX/mcbX/mcX>/. The "format" directory describes format of the config (event ID), config1 (agent ID) fields of the perf_event_attr structure. The "events" diff --git a/Documentation/admin-guide/pm/amd-pstate.rst b/Documentation/admin-guide/pm/amd-pstate.rst index 1e0d101b020a..412423c54f25 100644 --- a/Documentation/admin-guide/pm/amd-pstate.rst +++ b/Documentation/admin-guide/pm/amd-pstate.rst @@ -262,6 +262,17 @@ lowest non-linear performance in `AMD CPPC Performance Capability <perf_cap_>`_.) This attribute is read-only. +``amd_pstate_hw_prefcore`` + +Whether the platform supports the preferred core feature and it has been +enabled. This attribute is read-only. + +``amd_pstate_prefcore_ranking`` + +The performance ranking of the core. This number doesn't have any unit, but +larger numbers are preferred at the time of reading. This can change at +runtime based on platform conditions. This attribute is read-only. + ``energy_performance_available_preferences`` A list of all the supported EPP preferences that could be used for @@ -281,6 +292,22 @@ integer values defined between 0 to 255 when EPP feature is enabled by platform firmware, if EPP feature is disabled, driver will ignore the written value This attribute is read-write. +``boost`` +The `boost` sysfs attribute provides control over the CPU core +performance boost, allowing users to manage the maximum frequency limitation +of the CPU. This attribute can be used to enable or disable the boost feature +on individual CPUs. + +When the boost feature is enabled, the CPU can dynamically increase its frequency +beyond the base frequency, providing enhanced performance for demanding workloads. +On the other hand, disabling the boost feature restricts the CPU to operate at the +base frequency, which may be desirable in certain scenarios to prioritize power +efficiency or manage temperature. + +To manipulate the `boost` attribute, users can write a value of `0` to disable the +boost or `1` to enable it, for the respective CPU using the sysfs path +`/sys/devices/system/cpu/cpuX/cpufreq/boost`, where `X` represents the CPU number. + Other performance and frequency values can be read back from ``/sys/devices/system/cpu/cpuX/acpi_cppc/``, see :ref:`cppc_sysfs`. @@ -406,7 +433,7 @@ control its functionality at the system level. They are located in the ``/sys/devices/system/cpu/amd_pstate/`` directory and affect all CPUs. ``status`` - Operation mode of the driver: "active", "passive" or "disable". + Operation mode of the driver: "active", "passive", "guided" or "disable". "active" The driver is functional and in the ``active mode`` diff --git a/Documentation/admin-guide/pm/cpufreq.rst b/Documentation/admin-guide/pm/cpufreq.rst index 6adb7988e0eb..2d74af7f0efe 100644 --- a/Documentation/admin-guide/pm/cpufreq.rst +++ b/Documentation/admin-guide/pm/cpufreq.rst @@ -231,7 +231,7 @@ are the following: present). The existence of the limit may be a result of some (often unintentional) - BIOS settings, restrictions coming from a service processor or another + BIOS settings, restrictions coming from a service processor or other BIOS/HW-based mechanisms. This does not cover ACPI thermal limitations which can be discovered @@ -248,6 +248,20 @@ are the following: If that frequency cannot be determined, this attribute should not be present. +``cpuinfo_avg_freq`` + An average frequency (in KHz) of all CPUs belonging to a given policy, + derived from a hardware provided feedback and reported on a time frame + spanning at most few milliseconds. + + This is expected to be based on the frequency the hardware actually runs + at and, as such, might require specialised hardware support (such as AMU + extension on ARM). If one cannot be determined, this attribute should + not be present. + + Note that failed attempt to retrieve current frequency for a given + CPU(s) will result in an appropriate error, i.e.: EAGAIN for CPU that + remains idle (raised on ARM). + ``cpuinfo_max_freq`` Maximum possible operating frequency the CPUs belonging to this policy can run at (in kHz). @@ -267,6 +281,10 @@ are the following: ``related_cpus`` List of all (online and offline) CPUs belonging to this policy. +``scaling_available_frequencies`` + List of available frequencies of the CPUs belonging to this policy + (in kHz). + ``scaling_available_governors`` List of ``CPUFreq`` scaling governors present in the kernel that can be attached to this policy or (if the |intel_pstate| scaling driver is @@ -289,7 +307,8 @@ are the following: Some architectures (e.g. ``x86``) may attempt to provide information more precisely reflecting the current CPU frequency through this attribute, but that still may not be the exact current CPU frequency as - seen by the hardware at the moment. + seen by the hardware at the moment. This behavior though, is only + available via c:macro:``CPUFREQ_ARCH_CUR_FREQ`` option. ``scaling_driver`` The scaling driver currently in use. @@ -421,8 +440,8 @@ This governor exposes only one tunable: ``rate_limit_us`` Minimum time (in microseconds) that has to pass between two consecutive - runs of governor computations (default: 1000 times the scaling driver's - transition latency). + runs of governor computations (default: 1.5 times the scaling driver's + transition latency or the maximum 2ms). The purpose of this tunable is to reduce the scheduler context overhead of the governor which might be excessive without it. @@ -470,17 +489,17 @@ This governor exposes the following tunables: This is how often the governor's worker routine should run, in microseconds. - Typically, it is set to values of the order of 10000 (10 ms). Its - default value is equal to the value of ``cpuinfo_transition_latency`` - for each policy this governor is attached to (but since the unit here - is greater by 1000, this means that the time represented by - ``sampling_rate`` is 1000 times greater than the transition latency by - default). + Typically, it is set to values of the order of 2000 (2 ms). Its + default value is to add a 50% breathing room + to ``cpuinfo_transition_latency`` on each policy this governor is + attached to. The minimum is typically the length of two scheduler + ticks. If this tunable is per-policy, the following shell command sets the time - represented by it to be 750 times as high as the transition latency:: + represented by it to be 1.5 times as high as the transition latency + (the default):: - # echo `$(($(cat cpuinfo_transition_latency) * 750 / 1000)) > ondemand/sampling_rate + # echo `$(($(cat cpuinfo_transition_latency) * 3 / 2))` > ondemand/sampling_rate ``up_threshold`` If the estimated CPU load is above this value (in percent), the governor diff --git a/Documentation/admin-guide/pm/cpuidle.rst b/Documentation/admin-guide/pm/cpuidle.rst index 19754beb5a4e..0c090b076224 100644 --- a/Documentation/admin-guide/pm/cpuidle.rst +++ b/Documentation/admin-guide/pm/cpuidle.rst @@ -269,61 +269,56 @@ Namely, when invoked to select an idle state for a CPU (i.e. an idle state that the CPU will ask the processor hardware to enter), it attempts to predict the idle duration and uses the predicted value for idle state selection. -It first obtains the time until the closest timer event with the assumption -that the scheduler tick will be stopped. That time, referred to as the *sleep -length* in what follows, is the upper bound on the time before the next CPU -wakeup. It is used to determine the sleep length range, which in turn is needed -to get the sleep length correction factor. - -The ``menu`` governor maintains two arrays of sleep length correction factors. -One of them is used when tasks previously running on the given CPU are waiting -for some I/O operations to complete and the other one is used when that is not -the case. Each array contains several correction factor values that correspond -to different sleep length ranges organized so that each range represented in the -array is approximately 10 times wider than the previous one. - -The correction factor for the given sleep length range (determined before -selecting the idle state for the CPU) is updated after the CPU has been woken -up and the closer the sleep length is to the observed idle duration, the closer -to 1 the correction factor becomes (it must fall between 0 and 1 inclusive). -The sleep length is multiplied by the correction factor for the range that it -falls into to obtain the first approximation of the predicted idle duration. - -Next, the governor uses a simple pattern recognition algorithm to refine its +It first uses a simple pattern recognition algorithm to obtain a preliminary idle duration prediction. Namely, it saves the last 8 observed idle duration values and, when predicting the idle duration next time, it computes the average and variance of them. If the variance is small (smaller than 400 square milliseconds) or it is small relative to the average (the average is greater that 6 times the standard deviation), the average is regarded as the "typical -interval" value. Otherwise, the longest of the saved observed idle duration +interval" value. Otherwise, either the longest or the shortest (depending on +which one is farther from the average) of the saved observed idle duration values is discarded and the computation is repeated for the remaining ones. + Again, if the variance of them is small (in the above sense), the average is taken as the "typical interval" value and so on, until either the "typical -interval" is determined or too many data points are disregarded, in which case -the "typical interval" is assumed to equal "infinity" (the maximum unsigned -integer value). The "typical interval" computed this way is compared with the -sleep length multiplied by the correction factor and the minimum of the two is -taken as the predicted idle duration. - -Then, the governor computes an extra latency limit to help "interactive" -workloads. It uses the observation that if the exit latency of the selected -idle state is comparable with the predicted idle duration, the total time spent -in that state probably will be very short and the amount of energy to save by -entering it will be relatively small, so likely it is better to avoid the -overhead related to entering that state and exiting it. Thus selecting a -shallower state is likely to be a better option then. The first approximation -of the extra latency limit is the predicted idle duration itself which -additionally is divided by a value depending on the number of tasks that -previously ran on the given CPU and now they are waiting for I/O operations to -complete. The result of that division is compared with the latency limit coming -from the power management quality of service, or `PM QoS <cpu-pm-qos_>`_, -framework and the minimum of the two is taken as the limit for the idle states' -exit latency. +interval" is determined or too many data points are disregarded. In the latter +case, if the size of the set of data points still under consideration is +sufficiently large, the next idle duration is not likely to be above the largest +idle duration value still in that set, so that value is taken as the predicted +next idle duration. Finally, if the set of data points still under +consideration is too small, no prediction is made. + +If the preliminary prediction of the next idle duration computed this way is +long enough, the governor obtains the time until the closest timer event with +the assumption that the scheduler tick will be stopped. That time, referred to +as the *sleep length* in what follows, is the upper bound on the time before the +next CPU wakeup. It is used to determine the sleep length range, which in turn +is needed to get the sleep length correction factor. + +The ``menu`` governor maintains an array containing several correction factor +values that correspond to different sleep length ranges organized so that each +range represented in the array is approximately 10 times wider than the previous +one. + +The correction factor for the given sleep length range (determined before +selecting the idle state for the CPU) is updated after the CPU has been woken +up and the closer the sleep length is to the observed idle duration, the closer +to 1 the correction factor becomes (it must fall between 0 and 1 inclusive). +The sleep length is multiplied by the correction factor for the range that it +falls into to obtain an approximation of the predicted idle duration that is +compared to the "typical interval" determined previously and the minimum of +the two is taken as the final idle duration prediction. + +If the "typical interval" value is small, which means that the CPU is likely +to be woken up soon enough, the sleep length computation is skipped as it may +be costly and the idle duration is simply predicted to equal the "typical +interval" value. Now, the governor is ready to walk the list of idle states and choose one of them. For this purpose, it compares the target residency of each state with -the predicted idle duration and the exit latency of it with the computed latency -limit. It selects the state with the target residency closest to the predicted +the predicted idle duration and the exit latency of it with the with the latency +limit coming from the power management quality of service, or `PM QoS <cpu-pm-qos_>`_, +framework. It selects the state with the target residency closest to the predicted idle duration, but still below it, and exit latency that does not exceed the limit. diff --git a/Documentation/admin-guide/pm/intel_idle.rst b/Documentation/admin-guide/pm/intel_idle.rst index 39bd6ecce7de..ed6f055d4b14 100644 --- a/Documentation/admin-guide/pm/intel_idle.rst +++ b/Documentation/admin-guide/pm/intel_idle.rst @@ -38,6 +38,27 @@ instruction at all. only way to pass early-configuration-time parameters to it is via the kernel command line. +Sysfs Interface +=============== + +The ``intel_idle`` driver exposes the following ``sysfs`` attributes in +``/sys/devices/system/cpu/cpuidle/``: + +``intel_c1_demotion`` + Enable or disable C1 demotion for all CPUs in the system. This file is + only exposed on platforms that support the C1 demotion feature and where + it was tested. Value 0 means that C1 demotion is disabled, value 1 means + that it is enabled. Write 0 or 1 to disable or enable C1 demotion for + all CPUs. + + The C1 demotion feature involves the platform firmware demoting deep + C-state requests from the OS (e.g., C6 requests) to C1. The idea is that + firmware monitors CPU wake-up rate, and if it is higher than a + platform-specific threshold, the firmware demotes deep C-state requests + to C1. For example, Linux requests C6, but firmware noticed too many + wake-ups per second, and it keeps the CPU in C1. When the CPU stays in + C1 long enough, the platform promotes it back to C6. This may improve + some workloads' performance, but it may also increase power consumption. .. _intel-idle-enumeration-of-states: @@ -192,11 +213,19 @@ even if they have been enumerated (see :ref:`cpu-pm-qos` in Documentation/admin-guide/pm/cpuidle.rst). Setting ``max_cstate`` to 0 causes the ``intel_idle`` initialization to fail. -The ``no_acpi`` and ``use_acpi`` module parameters (recognized by ``intel_idle`` -if the kernel has been configured with ACPI support) can be set to make the -driver ignore the system's ACPI tables entirely or use them for all of the -recognized processor models, respectively (they both are unset by default and -``use_acpi`` has no effect if ``no_acpi`` is set). +The ``no_acpi``, ``use_acpi`` and ``no_native`` module parameters are +recognized by ``intel_idle`` if the kernel has been configured with ACPI +support. In the case that ACPI is not configured these flags have no impact +on functionality. + +``no_acpi`` - Do not use ACPI at all. Only native mode is available, no +ACPI mode. + +``use_acpi`` - No-op in ACPI mode, the driver will consult ACPI tables for +C-states on/off status in native mode. + +``no_native`` - Work only in ACPI mode, no native mode available (ignore +all custom tables). The value of the ``states_off`` module parameter (0 by default) represents a list of idle states to be disabled by default in the form of a bitmask. diff --git a/Documentation/admin-guide/pm/intel_pstate.rst b/Documentation/admin-guide/pm/intel_pstate.rst index bf13ad25a32f..26e702c7016e 100644 --- a/Documentation/admin-guide/pm/intel_pstate.rst +++ b/Documentation/admin-guide/pm/intel_pstate.rst @@ -329,6 +329,106 @@ information listed above is the same for all of the processors supporting the HWP feature, which is why ``intel_pstate`` works with all of them.] +Support for Hybrid Processors +============================= + +Some processors supported by ``intel_pstate`` contain two or more types of CPU +cores differing by the maximum turbo P-state, performance vs power characteristics, +cache sizes, and possibly other properties. They are commonly referred to as +hybrid processors. To support them, ``intel_pstate`` requires HWP to be enabled +and it assumes the HWP performance units to be the same for all CPUs in the +system, so a given HWP performance level always represents approximately the +same physical performance regardless of the core (CPU) type. + +Hybrid Processors with SMT +-------------------------- + +On systems where SMT (Simultaneous Multithreading), also referred to as +HyperThreading (HT) in the context of Intel processors, is enabled on at least +one core, ``intel_pstate`` assigns performance-based priorities to CPUs. Namely, +the priority of a given CPU reflects its highest HWP performance level which +causes the CPU scheduler to generally prefer more performant CPUs, so the less +performant CPUs are used when the other ones are fully loaded. However, SMT +siblings (that is, logical CPUs sharing one physical core) are treated in a +special way such that if one of them is in use, the effective priority of the +other ones is lowered below the priorities of the CPUs located in the other +physical cores. + +This approach maximizes performance in the majority of cases, but unfortunately +it also leads to excessive energy usage in some important scenarios, like video +playback, which is not generally desirable. While there is no other viable +choice with SMT enabled because the effective capacity and utilization of SMT +siblings are hard to determine, hybrid processors without SMT can be handled in +more energy-efficient ways. + +.. _CAS: + +Capacity-Aware Scheduling Support +--------------------------------- + +The capacity-aware scheduling (CAS) support in the CPU scheduler is enabled by +``intel_pstate`` by default on hybrid processors without SMT. CAS generally +causes the scheduler to put tasks on a CPU so long as there is a sufficient +amount of spare capacity on it, and if the utilization of a given task is too +high for it, the task will need to go somewhere else. + +Since CAS takes CPU capacities into account, it does not require CPU +prioritization and it allows tasks to be distributed more symmetrically among +the more performant and less performant CPUs. Once placed on a CPU with enough +capacity to accommodate it, a task may just continue to run there regardless of +whether or not the other CPUs are fully loaded, so on average CAS reduces the +utilization of the more performant CPUs which causes the energy usage to be more +balanced because the more performant CPUs are generally less energy-efficient +than the less performant ones. + +In order to use CAS, the scheduler needs to know the capacity of each CPU in +the system and it needs to be able to compute scale-invariant utilization of +CPUs, so ``intel_pstate`` provides it with the requisite information. + +First of all, the capacity of each CPU is represented by the ratio of its highest +HWP performance level, multiplied by 1024, to the highest HWP performance level +of the most performant CPU in the system, which works because the HWP performance +units are the same for all CPUs. Second, the frequency-invariance computations, +carried out by the scheduler to always express CPU utilization in the same units +regardless of the frequency it is currently running at, are adjusted to take the +CPU capacity into account. All of this happens when ``intel_pstate`` has +registered itself with the ``CPUFreq`` core and it has figured out that it is +running on a hybrid processor without SMT. + +Energy-Aware Scheduling Support +------------------------------- + +If ``CONFIG_ENERGY_MODEL`` has been set during kernel configuration and +``intel_pstate`` runs on a hybrid processor without SMT, in addition to enabling +`CAS <CAS_>`_ it registers an Energy Model for the processor. This allows the +Energy-Aware Scheduling (EAS) support to be enabled in the CPU scheduler if +``schedutil`` is used as the ``CPUFreq`` governor which requires ``intel_pstate`` +to operate in the `passive mode <Passive Mode_>`_. + +The Energy Model registered by ``intel_pstate`` is artificial (that is, it is +based on abstract cost values and it does not include any real power numbers) +and it is relatively simple to avoid unnecessary computations in the scheduler. +There is a performance domain in it for every CPU in the system and the cost +values for these performance domains have been chosen so that running a task on +a less performant (small) CPU appears to be always cheaper than running that +task on a more performant (big) CPU. However, for two CPUs of the same type, +the cost difference depends on their current utilization, and the CPU whose +current utilization is higher generally appears to be a more expensive +destination for a given task. This helps to balance the load among CPUs of the +same type. + +Since EAS works on top of CAS, high-utilization tasks are always migrated to +CPUs with enough capacity to accommodate them, but thanks to EAS, low-utilization +tasks tend to be placed on the CPUs that look less expensive to the scheduler. +Effectively, this causes the less performant and less loaded CPUs to be +preferred as long as they have enough spare capacity to run the given task +which generally leads to reduced energy usage. + +The Energy Model created by ``intel_pstate`` can be inspected by looking at +the ``energy_model`` directory in ``debugfs`` (typlically mounted on +``/sys/kernel/debug/``). + + User Space Interface in ``sysfs`` ================================= @@ -696,6 +796,9 @@ of them have to be prepended with the ``intel_pstate=`` prefix. Use per-logical-CPU P-State limits (see `Coordination of P-state Limits`_ for details). +``no_cas`` + Do not enable `capacity-aware scheduling <CAS_>`_ which is enabled by + default on hybrid systems without SMT. Diagnostics and Tuning ====================== diff --git a/Documentation/admin-guide/pm/intel_uncore_frequency_scaling.rst b/Documentation/admin-guide/pm/intel_uncore_frequency_scaling.rst index 5ab3440e6cee..d367ba4d744a 100644 --- a/Documentation/admin-guide/pm/intel_uncore_frequency_scaling.rst +++ b/Documentation/admin-guide/pm/intel_uncore_frequency_scaling.rst @@ -91,12 +91,22 @@ Attributes in each directory: ``domain_id`` This attribute is used to get the power domain id of this instance. +``die_id`` + This attribute is used to get the Linux die id of this instance. + This attribute is only present for domains with core agents and + when the CPUID leaf 0x1f presents die ID. + ``fabric_cluster_id`` This attribute is used to get the fabric cluster id of this instance. ``package_id`` This attribute is used to get the package id of this instance. +``agent_types`` + This attribute displays all the hardware agents present within the + domain. Each agent has the capability to control one or more hardware + subsystems, which include: core, cache, memory, and I/O. + The other attributes are same as presented at package_*_die_* level. In most of current use cases, the "max_freq_khz" and "min_freq_khz" @@ -113,3 +123,62 @@ to apply at each uncore* level. Support for "current_freq_khz" is available only at each fabric cluster level (i.e., in uncore* directory). + +Efficiency vs. Latency Tradeoff +------------------------------- + +The Efficiency Latency Control (ELC) feature improves performance +per watt. With this feature hardware power management algorithms +optimize trade-off between latency and power consumption. For some +latency sensitive workloads further tuning can be done by SW to +get desired performance. + +The hardware monitors the average CPU utilization across all cores +in a power domain at regular intervals and decides an uncore frequency. +While this may result in the best performance per watt, workload may be +expecting higher performance at the expense of power. Consider an +application that intermittently wakes up to perform memory reads on an +otherwise idle system. In such cases, if hardware lowers uncore +frequency, then there may be delay in ramp up of frequency to meet +target performance. + +The ELC control defines some parameters which can be changed from SW. +If the average CPU utilization is below a user-defined threshold +(elc_low_threshold_percent attribute below), the user-defined uncore +floor frequency will be used (elc_floor_freq_khz attribute below) +instead of hardware calculated minimum. + +Similarly in high load scenario where the CPU utilization goes above +the high threshold value (elc_high_threshold_percent attribute below) +instead of jumping to maximum uncore frequency, frequency is increased +in 100MHz steps. This avoids consuming unnecessarily high power +immediately with CPU utilization spikes. + +Attributes for efficiency latency control: + +``elc_floor_freq_khz`` + This attribute is used to get/set the efficiency latency floor frequency. + If this variable is lower than the 'min_freq_khz', it is ignored by + the firmware. + +``elc_low_threshold_percent`` + This attribute is used to get/set the efficiency latency control low + threshold. This attribute is in percentages of CPU utilization. + +``elc_high_threshold_percent`` + This attribute is used to get/set the efficiency latency control high + threshold. This attribute is in percentages of CPU utilization. + +``elc_high_threshold_enable`` + This attribute is used to enable/disable the efficiency latency control + high threshold. Write '1' to enable, '0' to disable. + +Example system configuration below, which does following: + * when CPU utilization is less than 10%: sets uncore frequency to 800MHz + * when CPU utilization is higher than 95%: increases uncore frequency in + 100MHz steps, until power limit is reached + + elc_floor_freq_khz:800000 + elc_high_threshold_percent:95 + elc_high_threshold_enable:1 + elc_low_threshold_percent:10 diff --git a/Documentation/admin-guide/pmf.rst b/Documentation/admin-guide/pmf.rst deleted file mode 100644 index 9ee729ffc19b..000000000000 --- a/Documentation/admin-guide/pmf.rst +++ /dev/null @@ -1,24 +0,0 @@ -.. SPDX-License-Identifier: GPL-2.0 - -Set udev rules for PMF Smart PC Builder ---------------------------------------- - -AMD PMF(Platform Management Framework) Smart PC Solution builder has to set the system states -like S0i3, Screen lock, hibernate etc, based on the output actions provided by the PMF -TA (Trusted Application). - -In order for this to work the PMF driver generates a uevent for userspace to react to. Below are -sample udev rules that can facilitate this experience when a machine has PMF Smart PC solution builder -enabled. - -Please add the following line(s) to -``/etc/udev/rules.d/99-local.rules``:: - - DRIVERS=="amd-pmf", ACTION=="change", ENV{EVENT_ID}=="0", RUN+="/usr/bin/systemctl suspend" - DRIVERS=="amd-pmf", ACTION=="change", ENV{EVENT_ID}=="1", RUN+="/usr/bin/systemctl hibernate" - DRIVERS=="amd-pmf", ACTION=="change", ENV{EVENT_ID}=="2", RUN+="/bin/loginctl lock-sessions" - -EVENT_ID values: -0= Put the system to S0i3/S2Idle -1= Put the system to hibernate -2= Lock the screen diff --git a/Documentation/admin-guide/pnp.rst b/Documentation/admin-guide/pnp.rst index 3eda08191d13..24d80e3eb309 100644 --- a/Documentation/admin-guide/pnp.rst +++ b/Documentation/admin-guide/pnp.rst @@ -129,9 +129,6 @@ pnp_put_protocol pnp_register_protocol use this to register a new PnP protocol -pnp_unregister_protocol - use this function to remove a PnP protocol from the Plug and Play Layer - pnp_register_driver adds a PnP driver to the Plug and Play Layer diff --git a/Documentation/admin-guide/quickly-build-trimmed-linux.rst b/Documentation/admin-guide/quickly-build-trimmed-linux.rst index f08149bc53f8..4a5ffb0996a3 100644 --- a/Documentation/admin-guide/quickly-build-trimmed-linux.rst +++ b/Documentation/admin-guide/quickly-build-trimmed-linux.rst @@ -347,7 +347,7 @@ again. [:ref:`details<uninstall>`] -.. _submit_improvements: +.. _submit_improvements_qbtl: Did you run into trouble following any of the above steps that is not cleared up by the reference section below? Or do you have ideas how to improve the text? @@ -733,7 +733,7 @@ can easily happen that your self-built kernel will lack modules for tasks you did not perform before utilizing this make target. That's because those tasks require kernel modules that are normally autoloaded when you perform that task for the first time; if you didn't perform that task at least once before using -localmodonfig, the latter will thus assume these modules are superfluous and +localmodconfig, the latter will thus assume these modules are superfluous and disable them. You can try to avoid this by performing typical tasks that often will autoload @@ -1070,7 +1070,7 @@ complicated, and harder to follow. That being said: this of course is a balancing act. Hence, if you think an additional use-case is worth describing, suggest it to the maintainers of this -document, as :ref:`described above <submit_improvements>`. +document, as :ref:`described above <submit_improvements_qbtl>`. .. diff --git a/Documentation/admin-guide/ramoops.rst b/Documentation/admin-guide/ramoops.rst index e9f85142182d..2eabef31220d 100644 --- a/Documentation/admin-guide/ramoops.rst +++ b/Documentation/admin-guide/ramoops.rst @@ -23,6 +23,8 @@ and type of the memory area are set using three variables: * ``mem_size`` for the size. The memory size will be rounded down to a power of two. * ``mem_type`` to specify if the memory type (default is pgprot_writecombine). + * ``mem_name`` to specify a memory region defined by ``reserve_mem`` command + line parameter. Typically the default value of ``mem_type=0`` should be used as that sets the pstore mapping to pgprot_writecombine. Setting ``mem_type=1`` attempts to use @@ -118,6 +120,17 @@ Setting the ramoops parameters can be done in several different manners: return ret; } + D. Using a region of memory reserved via ``reserve_mem`` command line + parameter. The address and size will be defined by the ``reserve_mem`` + parameter. Note, that ``reserve_mem`` may not always allocate memory + in the same location, and cannot be relied upon. Testing will need + to be done, and it may not work on every machine, nor every kernel. + Consider this a "best effort" approach. The ``reserve_mem`` option + takes a size, alignment and name as arguments. The name is used + to map the memory to a label that can be retrieved by ramoops. + + reserve_mem=2M:4096:oops ramoops.mem_name=oops + You can specify either RAM memory or peripheral devices' memory. However, when specifying RAM, be sure to reserve the memory by issuing memblock_reserve() very early in the architecture code, e.g.:: diff --git a/Documentation/admin-guide/reporting-issues.rst b/Documentation/admin-guide/reporting-issues.rst index 2fd5a030235a..9a847506f6ec 100644 --- a/Documentation/admin-guide/reporting-issues.rst +++ b/Documentation/admin-guide/reporting-issues.rst @@ -41,7 +41,7 @@ If you are facing multiple issues with the Linux kernel at once, report each separately. While writing your report, include all information relevant to the issue, like the kernel and the distro used. In case of a regression, CC the regressions mailing list (regressions@lists.linux.dev) to your report. Also try -to pin-point the culprit with a bisection; if you succeed, include its +to pinpoint the culprit with a bisection; if you succeed, include its commit-id and CC everyone in the sign-off-by chain. Once the report is out, answer any questions that come up and help where you @@ -206,7 +206,7 @@ Reporting issues only occurring in older kernel version lines This subsection is for you, if you tried the latest mainline kernel as outlined above, but failed to reproduce your issue there; at the same time you want to see the issue fixed in a still supported stable or longterm series or vendor -kernels regularly rebased on those. If that the case, follow these steps: +kernels regularly rebased on those. If that is the case, follow these steps: * Prepare yourself for the possibility that going through the next few steps might not get the issue solved in older releases: the fix might be too big @@ -312,7 +312,7 @@ small modifications to a kernel based on a recent Linux version; that for example often holds true for the mainline kernels shipped by Debian GNU/Linux Sid or Fedora Rawhide. Some developers will also accept reports about issues with kernels from distributions shipping the latest stable kernel, as long as -its only slightly modified; that for example is often the case for Arch Linux, +it's only slightly modified; that for example is often the case for Arch Linux, regular Fedora releases, and openSUSE Tumbleweed. But keep in mind, you better want to use a mainline Linux and avoid using a stable kernel for this process, as outlined in the section 'Install a fresh kernel for testing' in more diff --git a/Documentation/admin-guide/reporting-regressions.rst b/Documentation/admin-guide/reporting-regressions.rst index 76b246ecf21b..946518355a2c 100644 --- a/Documentation/admin-guide/reporting-regressions.rst +++ b/Documentation/admin-guide/reporting-regressions.rst @@ -42,12 +42,12 @@ The important basics -------------------- -What is a "regression" and what is the "no regressions rule"? +What is a "regression" and what is the "no regressions" rule? ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ It's a regression if some application or practical use case running fine with one Linux kernel works worse or not at all with a newer version compiled using a -similar configuration. The "no regressions rule" forbids this to take place; if +similar configuration. The "no regressions" rule forbids this to take place; if it happens by accident, developers that caused it are expected to quickly fix the issue. @@ -173,7 +173,7 @@ Additional details about regressions ------------------------------------ -What is the goal of the "no regressions rule"? +What is the goal of the "no regressions" rule? ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Users should feel safe when updating kernel versions and not have to worry @@ -199,8 +199,8 @@ Exceptions to this rule are extremely rare; in the past developers almost always turned out to be wrong when they assumed a particular situation was warranting an exception. -Who ensures the "no regressions" is actually followed? -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +Who ensures the "no regressions" rule is actually followed? +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ The subsystem maintainers should take care of that, which are watched and supported by the tree maintainers -- e.g. Linus Torvalds for mainline and diff --git a/Documentation/admin-guide/serial-console.rst b/Documentation/admin-guide/serial-console.rst index a3dfc2c66e01..1609e7479249 100644 --- a/Documentation/admin-guide/serial-console.rst +++ b/Documentation/admin-guide/serial-console.rst @@ -78,7 +78,9 @@ If no console device is specified, the first device found capable of acting as a system console will be used. At this time, the system first looks for a VGA card and then for a serial port. So if you don't have a VGA card in your system the first serial port will automatically -become the console. +become the console, unless the kernel is configured with the +CONFIG_NULL_TTY_DEFAULT_CONSOLE option, then it will default to using the +ttynull device. You will need to create a new device to use ``/dev/console``. The official ``/dev/console`` is now character device 5,1. diff --git a/Documentation/admin-guide/sysctl/fs.rst b/Documentation/admin-guide/sysctl/fs.rst index 47499a1742bd..6c54718c9d04 100644 --- a/Documentation/admin-guide/sysctl/fs.rst +++ b/Documentation/admin-guide/sysctl/fs.rst @@ -38,6 +38,11 @@ requests. ``aio-max-nr`` allows you to change the maximum value ``aio-max-nr`` does not result in the pre-allocation or re-sizing of any kernel data structures. +dentry-negative +---------------------------- + +Policy for negative dentries. Set to 1 to always delete the dentry when a +file is removed, and 0 to disable it. By default, this behavior is disabled. dentry-state ------------ @@ -332,3 +337,38 @@ Each "watch" costs roughly 90 bytes on a 32-bit kernel, and roughly 160 bytes on a 64-bit one. The current default value for ``max_user_watches`` is 4% of the available low memory, divided by the "watch" cost in bytes. + +5. /proc/sys/fs/fuse - Configuration options for FUSE filesystems +===================================================================== + +This directory contains the following configuration options for FUSE +filesystems: + +``/proc/sys/fs/fuse/max_pages_limit`` is a read/write file for +setting/getting the maximum number of pages that can be used for servicing +requests in FUSE. + +``/proc/sys/fs/fuse/default_request_timeout`` is a read/write file for +setting/getting the default timeout (in seconds) for a fuse server to +reply to a kernel-issued request in the event where the server did not +specify a timeout at mount. If the server set a timeout, +then default_request_timeout will be ignored. The default +"default_request_timeout" is set to 0. 0 indicates no default timeout. +The maximum value that can be set is 65535. + +``/proc/sys/fs/fuse/max_request_timeout`` is a read/write file for +setting/getting the maximum timeout (in seconds) for a fuse server to +reply to a kernel-issued request. A value greater than 0 automatically opts +the server into a timeout that will be set to at most "max_request_timeout", +even if the server did not specify a timeout and default_request_timeout is +set to 0. If max_request_timeout is greater than 0 and the server set a timeout +greater than max_request_timeout or default_request_timeout is set to a value +greater than max_request_timeout, the system will use max_request_timeout as the +timeout. 0 indicates no max request timeout. The maximum value that can be set +is 65535. + +For timeouts, if the server does not respond to the request by the time +the set timeout elapses, then the connection to the fuse server will be aborted. +Please note that the timeouts are not 100% precise (eg you may set 60 seconds but +the timeout may kick in after 70 seconds). The upper margin of error for the +timeout is roughly FUSE_TIMEOUT_TIMER_FREQ seconds. diff --git a/Documentation/admin-guide/sysctl/kernel.rst b/Documentation/admin-guide/sysctl/kernel.rst index 7fd43947832f..dd49a89a62d3 100644 --- a/Documentation/admin-guide/sysctl/kernel.rst +++ b/Documentation/admin-guide/sysctl/kernel.rst @@ -212,6 +212,17 @@ pid>/``). This value defaults to 0. +core_sort_vma +============= + +The default coredump writes VMAs in address order. By setting +``core_sort_vma`` to 1, VMAs will be written from smallest size +to largest size. This is known to break at least elfutils, but +can be handy when dealing with very large (and truncated) +coredumps where the more useful debugging details are included +in the smaller VMAs. + + core_uses_pid ============= @@ -401,6 +412,15 @@ The upper bound on the number of tasks that are checked. This file shows up if ``CONFIG_DETECT_HUNG_TASK`` is enabled. +hung_task_detect_count +====================== + +Indicates the total number of tasks that have been detected as hung since +the system boot. + +This file shows up if ``CONFIG_DETECT_HUNG_TASK`` is enabled. + + hung_task_timeout_secs ====================== @@ -454,7 +474,7 @@ ignore-unaligned-usertrap On architectures where unaligned accesses cause traps, and where this feature is supported (``CONFIG_SYSCTL_ARCH_UNALIGN_NO_WARN``; -currently, ``arc`` and ``loongarch``), controls whether all +currently, ``arc``, ``parisc`` and ``loongarch``), controls whether all unaligned traps are logged. = ============================================================= @@ -1535,6 +1555,13 @@ constant ``FUTEX_TID_MASK`` (0x3fffffff). If a value outside of this range is written to ``threads-max`` an ``EINVAL`` error occurs. +timer_migration +=============== + +When set to a non-zero value, attempt to migrate timers away from idle cpus to +allow them to remain in low power states longer. + +Default is set (1). traceoff_on_warning =================== diff --git a/Documentation/admin-guide/sysctl/net.rst b/Documentation/admin-guide/sysctl/net.rst index 7250c0542828..7b0c4291c686 100644 --- a/Documentation/admin-guide/sysctl/net.rst +++ b/Documentation/admin-guide/sysctl/net.rst @@ -72,6 +72,7 @@ two flavors of JITs, the newer eBPF JIT currently supported on: - riscv64 - riscv32 - loongarch64 + - arc And the older cBPF JIT supported on the following archs: diff --git a/Documentation/admin-guide/sysctl/vm.rst b/Documentation/admin-guide/sysctl/vm.rst index c59889de122b..9bef46151d53 100644 --- a/Documentation/admin-guide/sysctl/vm.rst +++ b/Documentation/admin-guide/sysctl/vm.rst @@ -28,6 +28,7 @@ Currently, these files are in /proc/sys/vm: - compact_memory - compaction_proactiveness - compact_unevictable_allowed +- defrag_mode - dirty_background_bytes - dirty_background_ratio - dirty_bytes @@ -36,6 +37,7 @@ Currently, these files are in /proc/sys/vm: - dirtytime_expire_seconds - dirty_writeback_centisecs - drop_caches +- enable_soft_offline - extfrag_threshold - highmem_is_dirtyable - hugetlb_shm_group @@ -43,6 +45,7 @@ Currently, these files are in /proc/sys/vm: - legacy_va_layout - lowmem_reserve_ratio - max_map_count +- mem_profiling (only if CONFIG_MEM_ALLOC_PROFILING=y) - memory_failure_early_kill - memory_failure_recovery - min_free_kbytes @@ -72,6 +75,7 @@ Currently, these files are in /proc/sys/vm: - unprivileged_userfaultfd - user_reserve_kbytes - vfs_cache_pressure +- vfs_cache_pressure_denom - watermark_boost_factor - watermark_scale_factor - zone_reclaim_mode @@ -128,6 +132,12 @@ to latency spikes in unsuspecting applications. The kernel employs various heuristics to avoid wasting CPU cycles if it detects that proactive compaction is not being effective. +Setting the value above 80 will, in addition to lowering the acceptable level +of fragmentation, make the compaction code more sensitive to increases in +fragmentation, i.e. compaction will trigger more often, but reduce +fragmentation by a smaller amount. +This makes the fragmentation level more stable over time. + Be careful when setting it to extreme values like 100, as that may cause excessive background compaction activity. @@ -143,6 +153,14 @@ On CONFIG_PREEMPT_RT the default value is 0 in order to avoid a page fault, due to compaction, which would block the task from becoming active until the fault is resolved. +defrag_mode +=========== + +When set to 1, the page allocator tries harder to avoid fragmentation +and maintain the ability to produce huge pages / higher-order pages. + +It is recommended to enable this right after boot, as fragmentation, +once it occurred, can be long-lasting or even permanent. dirty_background_bytes ====================== @@ -266,6 +284,43 @@ used:: These are informational only. They do not mean that anything is wrong with your system. To disable them, echo 4 (bit 2) into drop_caches. +enable_soft_offline +=================== +Correctable memory errors are very common on servers. Soft-offline is kernel's +solution for memory pages having (excessive) corrected memory errors. + +For different types of page, soft-offline has different behaviors / costs. + +- For a raw error page, soft-offline migrates the in-use page's content to + a new raw page. + +- For a page that is part of a transparent hugepage, soft-offline splits the + transparent hugepage into raw pages, then migrates only the raw error page. + As a result, user is transparently backed by 1 less hugepage, impacting + memory access performance. + +- For a page that is part of a HugeTLB hugepage, soft-offline first migrates + the entire HugeTLB hugepage, during which a free hugepage will be consumed + as migration target. Then the original hugepage is dissolved into raw + pages without compensation, reducing the capacity of the HugeTLB pool by 1. + +It is user's call to choose between reliability (staying away from fragile +physical memory) vs performance / capacity implications in transparent and +HugeTLB cases. + +For all architectures, enable_soft_offline controls whether to soft offline +memory pages. When set to 1, kernel attempts to soft offline the pages +whenever it thinks needed. When set to 0, kernel returns EOPNOTSUPP to +the request to soft offline the pages. Its default value is 1. + +It is worth mentioning that after setting enable_soft_offline to 0, the +following requests to soft offline pages will not be performed: + +- Request to soft offline pages from RAS Correctable Errors Collector. + +- On ARM, the request to soft offline pages from GHES driver. + +- On PARISC, the request to soft offline pages from Page Deallocation Table. extfrag_threshold ================= @@ -425,6 +480,21 @@ e.g., up to one or two maps per allocation. The default value is 65530. +mem_profiling +============== + +Enable memory profiling (when CONFIG_MEM_ALLOC_PROFILING=y) + +1: Enable memory profiling. + +0: Disable memory profiling. + +Enabling memory profiling introduces a small performance overhead for all +memory allocations. + +The default value depends on CONFIG_MEM_ALLOC_PROFILING_ENABLED_BY_DEFAULT. + + memory_failure_early_kill: ========================== @@ -954,19 +1024,28 @@ vfs_cache_pressure This percentage value controls the tendency of the kernel to reclaim the memory which is used for caching of directory and inode objects. -At the default value of vfs_cache_pressure=100 the kernel will attempt to -reclaim dentries and inodes at a "fair" rate with respect to pagecache and -swapcache reclaim. Decreasing vfs_cache_pressure causes the kernel to prefer -to retain dentry and inode caches. When vfs_cache_pressure=0, the kernel will -never reclaim dentries and inodes due to memory pressure and this can easily -lead to out-of-memory conditions. Increasing vfs_cache_pressure beyond 100 -causes the kernel to prefer to reclaim dentries and inodes. +At the default value of vfs_cache_pressure=vfs_cache_pressure_denom the kernel +will attempt to reclaim dentries and inodes at a "fair" rate with respect to +pagecache and swapcache reclaim. Decreasing vfs_cache_pressure causes the +kernel to prefer to retain dentry and inode caches. When vfs_cache_pressure=0, +the kernel will never reclaim dentries and inodes due to memory pressure and +this can easily lead to out-of-memory conditions. Increasing vfs_cache_pressure +beyond vfs_cache_pressure_denom causes the kernel to prefer to reclaim dentries +and inodes. -Increasing vfs_cache_pressure significantly beyond 100 may have negative -performance impact. Reclaim code needs to take various locks to find freeable -directory and inode objects. With vfs_cache_pressure=1000, it will look for -ten times more freeable objects than there are. +Increasing vfs_cache_pressure significantly beyond vfs_cache_pressure_denom may +have negative performance impact. Reclaim code needs to take various locks to +find freeable directory and inode objects. When vfs_cache_pressure equals +(10 * vfs_cache_pressure_denom), it will look for ten times more freeable +objects than there are. + +Note: This setting should always be used together with vfs_cache_pressure_denom. + +vfs_cache_pressure_denom +======================== +Defaults to 100 (minimum allowed value). Requires corresponding +vfs_cache_pressure setting to take effect. watermark_boost_factor ====================== diff --git a/Documentation/admin-guide/sysrq.rst b/Documentation/admin-guide/sysrq.rst index 2f2e5bd440f9..9c7aa817adc7 100644 --- a/Documentation/admin-guide/sysrq.rst +++ b/Documentation/admin-guide/sysrq.rst @@ -49,26 +49,26 @@ How do I use the magic SysRq key? ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ On x86 - You press the key combo :kbd:`ALT-SysRq-<command key>`. + You press the key combo `ALT-SysRq-<command key>`. .. note:: Some keyboards may not have a key labeled 'SysRq'. The 'SysRq' key is also known as the 'Print Screen' key. Also some keyboards cannot handle so many keys being pressed at the same time, so you might - have better luck with press :kbd:`Alt`, press :kbd:`SysRq`, - release :kbd:`SysRq`, press :kbd:`<command key>`, release everything. + have better luck with press `Alt`, press `SysRq`, + release `SysRq`, press `<command key>`, release everything. On SPARC - You press :kbd:`ALT-STOP-<command key>`, I believe. + You press `ALT-STOP-<command key>`, I believe. On the serial console (PC style standard serial ports only) You send a ``BREAK``, then within 5 seconds a command key. Sending ``BREAK`` twice is interpreted as a normal BREAK. On PowerPC - Press :kbd:`ALT - Print Screen` (or :kbd:`F13`) - :kbd:`<command key>`. - :kbd:`Print Screen` (or :kbd:`F13`) - :kbd:`<command key>` may suffice. + Press `ALT - Print Screen` (or `F13`) - `<command key>`. + `Print Screen` (or `F13`) - `<command key>` may suffice. On other If you know of the key combos for other architectures, please @@ -88,7 +88,7 @@ On all echo _reisub > /proc/sysrq-trigger -The :kbd:`<command key>` is case sensitive. +The `<command key>` is case sensitive. What are the 'command' keys? ~~~~~~~~~~~~~~~~~~~~~~~~~~~~ @@ -161,6 +161,8 @@ Command Function will be printed to your console. (``0``, for example would make it so that only emergency messages like PANICs or OOPSes would make it to your console.) + +``R`` Replay the kernel log messages on consoles. =========== =================================================================== Okay, so what can I use them for? @@ -211,14 +213,21 @@ processes. "just thaw ``it(j)``" is useful if your system becomes unresponsive due to a frozen (probably root) filesystem via the FIFREEZE ioctl. +``Replay logs(R)`` is useful to view the kernel log messages when system is hung +or you are not able to use dmesg command to view the messages in printk buffer. +User may have to press the key combination multiple times if console system is +busy. If it is completely locked up, then messages won't be printed. Output +messages depend on current console loglevel, which can be modified using +sysrq[0-9] (see above). + Sometimes SysRq seems to get 'stuck' after using it, what can I do? ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ When this happens, try tapping shift, alt and control on both sides of the keyboard, and hitting an invalid sysrq sequence again. (i.e., something like -:kbd:`alt-sysrq-z`). +`alt-sysrq-z`). -Switching to another virtual console (:kbd:`ALT+Fn`) and then back again +Switching to another virtual console (`ALT+Fn`) and then back again should also help. I hit SysRq, but nothing seems to happen, what's wrong? @@ -281,7 +290,7 @@ exception the header line from the sysrq command is passed to all console consumers as if the current loglevel was maximum. If only the header is emitted it is almost certain that the kernel loglevel is too low. Should you require the output on the console channel then you will need -to temporarily up the console loglevel using :kbd:`alt-sysrq-8` or:: +to temporarily up the console loglevel using `alt-sysrq-8` or:: echo 8 > /proc/sysrq-trigger diff --git a/Documentation/admin-guide/tainted-kernels.rst b/Documentation/admin-guide/tainted-kernels.rst index f92551539e8a..a0cc017e4424 100644 --- a/Documentation/admin-guide/tainted-kernels.rst +++ b/Documentation/admin-guide/tainted-kernels.rst @@ -101,6 +101,7 @@ Bit Log Number Reason that got the kernel tainted 16 _/X 65536 auxiliary taint, defined for and used by distros 17 _/T 131072 kernel was built with the struct randomization plugin 18 _/N 262144 an in-kernel test has been run + 19 _/J 524288 userspace used a mutating debug operation in fwctl === === ====== ======================================================== Note: The character ``_`` is representing a blank in this table to make reading @@ -182,3 +183,9 @@ More detailed explanation for tainting produce extremely unusual kernel structure layouts (even performance pathological ones), which is important to know when debugging. Set at build time. + + 18) ``N`` if an in-kernel test, such as a KUnit test, has been run. + + 19) ``J`` if userpace opened /dev/fwctl/* and performed a FWTCL_RPC_DEBUG_WRITE + to use the devices debugging features. Device debugging features could + cause the device to malfunction in undefined ways. diff --git a/Documentation/admin-guide/thunderbolt.rst b/Documentation/admin-guide/thunderbolt.rst index 2ed79f41a411..d0502691dfa1 100644 --- a/Documentation/admin-guide/thunderbolt.rst +++ b/Documentation/admin-guide/thunderbolt.rst @@ -28,7 +28,7 @@ should be a userspace tool that handles all the low-level details, keeps a database of the authorized devices and prompts users for new connections. More details about the sysfs interface for Thunderbolt devices can be -found in ``Documentation/ABI/testing/sysfs-bus-thunderbolt``. +found in Documentation/ABI/testing/sysfs-bus-thunderbolt. Those users who just want to connect any device without any sort of manual work can add following line to diff --git a/Documentation/admin-guide/verify-bugs-and-bisect-regressions.rst b/Documentation/admin-guide/verify-bugs-and-bisect-regressions.rst index c389d4fd7599..d8946b084b1e 100644 --- a/Documentation/admin-guide/verify-bugs-and-bisect-regressions.rst +++ b/Documentation/admin-guide/verify-bugs-and-bisect-regressions.rst @@ -23,7 +23,7 @@ mistakes occasionally made even by experienced developers. up in the reference section, then jump back to where you left off. .. Find the latest rendered version of this text here: - https://docs.kernel.org/admin-guide/verify-bugs-and-bisect-regressions.rst.html + https://docs.kernel.org/admin-guide/verify-bugs-and-bisect-regressions.html The essence of the process (aka 'TL;DR') ======================================== @@ -267,7 +267,7 @@ culprit might be known already. For further details on what actually qualifies as a regression check out Documentation/admin-guide/reporting-regressions.rst. If you run into any problems while following this guide or have ideas how to -improve it, :ref:`please let the kernel developers know <submit_improvements>`. +improve it, :ref:`please let the kernel developers know <submit_improvements_vbbr>`. .. _introprep_bissbs: @@ -1055,7 +1055,7 @@ follow these instructions. [:ref:`details <introoptional_bisref>`] -.. _submit_improvements: +.. _submit_improvements_vbbr: Conclusion ---------- @@ -1431,7 +1431,7 @@ can easily happen that your self-built kernels will lack modules for tasks you did not perform at least once before utilizing this make target. That happens when a task requires kernel modules which are only autoloaded when you execute it for the first time. So when you never performed that task since starting your -kernel the modules will not have been loaded -- and from localmodonfig's point +kernel the modules will not have been loaded -- and from localmodconfig's point of view look superfluous, which thus disables them to reduce the amount of code to be compiled. diff --git a/Documentation/admin-guide/workload-tracing.rst b/Documentation/admin-guide/workload-tracing.rst index b2e254ec8ee8..d6313890ee41 100644 --- a/Documentation/admin-guide/workload-tracing.rst +++ b/Documentation/admin-guide/workload-tracing.rst @@ -82,8 +82,8 @@ Install tools to build Linux kernel and tools in kernel repository. scripts/ver_linux is a good way to check if your system already has the necessary tools:: - sudo apt-get build-essentials flex bison yacc - sudo apt install libelf-dev systemtap-sdt-dev libaudit-dev libslang2-dev libperl-dev libdw-dev + sudo apt-get install build-essential flex bison yacc + sudo apt install libelf-dev systemtap-sdt-dev libslang2-dev libperl-dev libdw-dev cscope is a good tool to browse kernel sources. Let's install it now:: diff --git a/Documentation/admin-guide/xfs.rst b/Documentation/admin-guide/xfs.rst index b67772cf36d6..a18328a5fb93 100644 --- a/Documentation/admin-guide/xfs.rst +++ b/Documentation/admin-guide/xfs.rst @@ -124,6 +124,14 @@ When mounting an XFS filesystem, the following options are accepted. controls the size of each buffer and so is also relevant to this case. + lifetime (default) or nolifetime + Enable data placement based on write life time hints provided + by the user. This turns on co-allocation of data of similar + life times when statistically favorable to reduce garbage + collection cost. + + These options are only available for zoned rt file systems. + logbsize=value Set the size of each in-memory log buffer. The size may be specified in bytes, or in kilobytes with a "k" suffix. @@ -143,6 +151,25 @@ When mounting an XFS filesystem, the following options are accepted. optional, and the log section can be separate from the data section or contained within it. + max_atomic_write=value + Set the maximum size of an atomic write. The size may be + specified in bytes, in kilobytes with a "k" suffix, in megabytes + with a "m" suffix, or in gigabytes with a "g" suffix. The size + cannot be larger than the maximum write size, larger than the + size of any allocation group, or larger than the size of a + remapping operation that the log can complete atomically. + + The default value is to set the maximum I/O completion size + to allow each CPU to handle one at a time. + + max_open_zones=value + Specify the max number of zones to keep open for writing on a + zoned rt device. Many open zones aids file data separation + but may impact performance on HDDs. + + If ``max_open_zones`` is not specified, the value is determined + by the capabilities and the size of the zoned rt device. + noalign Data allocations will not be aligned at stripe unit boundaries. This is only relevant to filesystems created @@ -542,3 +569,24 @@ The interesting knobs for XFS workqueues are as follows: nice Relative priority of scheduling the threads. These are the same nice levels that can be applied to userspace processes. ============ =========== + +Zoned Filesystems +================= + +For zoned file systems, the following attributes are exposed in: + + /sys/fs/xfs/<dev>/zoned/ + + max_open_zones (Min: 1 Default: Varies Max: UINTMAX) + This read-only attribute exposes the maximum number of open zones + available for data placement. The value is determined at mount time and + is limited by the capabilities of the backing zoned device, file system + size and the max_open_zones mount option. + + zonegc_low_space (Min: 0 Default: 0 Max: 100) + Define a percentage for how much of the unused space that GC should keep + available for writing. A high value will reclaim more of the space + occupied by unused blocks, creating a larger buffer against write + bursts at the cost of increased write amplification. Regardless + of this value, garbage collection will always aim to free a minimum + amount of blocks to keep max_open_zones open for data placement purposes. |