diff options
Diffstat (limited to 'Documentation/admin-guide')
90 files changed, 4669 insertions, 1471 deletions
diff --git a/Documentation/admin-guide/LSM/apparmor.rst b/Documentation/admin-guide/LSM/apparmor.rst index 6cf81bbd7ce8..47939ee89d74 100644 --- a/Documentation/admin-guide/LSM/apparmor.rst +++ b/Documentation/admin-guide/LSM/apparmor.rst @@ -18,8 +18,11 @@ set ``CONFIG_SECURITY_APPARMOR=y`` If AppArmor should be selected as the default security module then set:: - CONFIG_DEFAULT_SECURITY="apparmor" - CONFIG_SECURITY_APPARMOR_BOOTPARAM_VALUE=1 + CONFIG_DEFAULT_SECURITY_APPARMOR=y + +The CONFIG_LSM parameter manages the order and selection of LSMs. +Specify apparmor as the first "major" module (e.g. AppArmor, SELinux, Smack) +in the list. Build the kernel diff --git a/Documentation/admin-guide/LSM/index.rst b/Documentation/admin-guide/LSM/index.rst index a6ba95fbaa9f..ce63be6d64ad 100644 --- a/Documentation/admin-guide/LSM/index.rst +++ b/Documentation/admin-guide/LSM/index.rst @@ -47,3 +47,4 @@ subdirectories. tomoyo Yama SafeSetID + ipe diff --git a/Documentation/admin-guide/LSM/ipe.rst b/Documentation/admin-guide/LSM/ipe.rst new file mode 100644 index 000000000000..f93a467db628 --- /dev/null +++ b/Documentation/admin-guide/LSM/ipe.rst @@ -0,0 +1,793 @@ +.. SPDX-License-Identifier: GPL-2.0 + +Integrity Policy Enforcement (IPE) +================================== + +.. NOTE:: + + This is the documentation for admins, system builders, or individuals + attempting to use IPE. If you're looking for more developer-focused + documentation about IPE please see :doc:`the design docs </security/ipe>`. + +Overview +-------- + +Integrity Policy Enforcement (IPE) is a Linux Security Module that takes a +complementary approach to access control. Unlike traditional access control +mechanisms that rely on labels and paths for decision-making, IPE focuses +on the immutable security properties inherent to system components. These +properties are fundamental attributes or features of a system component +that cannot be altered, ensuring a consistent and reliable basis for +security decisions. + +To elaborate, in the context of IPE, system components primarily refer to +files or the devices these files reside on. However, this is just a +starting point. The concept of system components is flexible and can be +extended to include new elements as the system evolves. The immutable +properties include the origin of a file, which remains constant and +unchangeable over time. For example, IPE policies can be crafted to trust +files originating from the initramfs. Since initramfs is typically verified +by the bootloader, its files are deemed trustworthy; "file is from +initramfs" becomes an immutable property under IPE's consideration. + +The immutable property concept extends to the security features enabled on +a file's origin, such as dm-verity or fs-verity, which provide a layer of +integrity and trust. For example, IPE allows the definition of policies +that trust files from a dm-verity protected device. dm-verity ensures the +integrity of an entire device by providing a verifiable and immutable state +of its contents. Similarly, fs-verity offers filesystem-level integrity +checks, allowing IPE to enforce policies that trust files protected by +fs-verity. These two features cannot be turned off once established, so +they are considered immutable properties. These examples demonstrate how +IPE leverages immutable properties, such as a file's origin and its +integrity protection mechanisms, to make access control decisions. + +For the IPE policy, specifically, it grants the ability to enforce +stringent access controls by assessing security properties against +reference values defined within the policy. This assessment can be based on +the existence of a security property (e.g., verifying if a file originates +from initramfs) or evaluating the internal state of an immutable security +property. The latter includes checking the roothash of a dm-verity +protected device, determining whether dm-verity possesses a valid +signature, assessing the digest of a fs-verity protected file, or +determining whether fs-verity possesses a valid built-in signature. This +nuanced approach to policy enforcement enables a highly secure and +customizable system defense mechanism, tailored to specific security +requirements and trust models. + +To enable IPE, ensure that ``CONFIG_SECURITY_IPE`` (under +:menuselection:`Security -> Integrity Policy Enforcement (IPE)`) config +option is enabled. + +Use Cases +--------- + +IPE works best in fixed-function devices: devices in which their purpose +is clearly defined and not supposed to be changed (e.g. network firewall +device in a data center, an IoT device, etcetera), where all software and +configuration is built and provisioned by the system owner. + +IPE is a long-way off for use in general-purpose computing: the Linux +community as a whole tends to follow a decentralized trust model (known as +the web of trust), which IPE has no support for it yet. Instead, IPE +supports PKI (public key infrastructure), which generally designates a +set of trusted entities that provide a measure of absolute trust. + +Additionally, while most packages are signed today, the files inside +the packages (for instance, the executables), tend to be unsigned. This +makes it difficult to utilize IPE in systems where a package manager is +expected to be functional, without major changes to the package manager +and ecosystem behind it. + +The digest_cache LSM [#digest_cache_lsm]_ is a system that when combined with IPE, +could be used to enable and support general-purpose computing use cases. + +Known Limitations +----------------- + +IPE cannot verify the integrity of anonymous executable memory, such as +the trampolines created by gcc closures and libffi (<3.4.2), or JIT'd code. +Unfortunately, as this is dynamically generated code, there is no way +for IPE to ensure the integrity of this code to form a trust basis. + +IPE cannot verify the integrity of programs written in interpreted +languages when these scripts are invoked by passing these program files +to the interpreter. This is because the way interpreters execute these +files; the scripts themselves are not evaluated as executable code +through one of IPE's hooks, but they are merely text files that are read +(as opposed to compiled executables) [#interpreters]_. + +Threat Model +------------ + +IPE specifically targets the risk of tampering with user-space executable +code after the kernel has initially booted, including the kernel modules +loaded from userspace via ``modprobe`` or ``insmod``. + +To illustrate, consider a scenario where an untrusted binary, possibly +malicious, is downloaded along with all necessary dependencies, including a +loader and libc. The primary function of IPE in this context is to prevent +the execution of such binaries and their dependencies. + +IPE achieves this by verifying the integrity and authenticity of all +executable code before allowing them to run. It conducts a thorough +check to ensure that the code's integrity is intact and that they match an +authorized reference value (digest, signature, etc) as per the defined +policy. If a binary does not pass this verification process, either +because its integrity has been compromised or it does not meet the +authorization criteria, IPE will deny its execution. Additionally, IPE +generates audit logs which may be utilized to detect and analyze failures +resulting from policy violation. + +Tampering threat scenarios include modification or replacement of +executable code by a range of actors including: + +- Actors with physical access to the hardware +- Actors with local network access to the system +- Actors with access to the deployment system +- Compromised internal systems under external control +- Malicious end users of the system +- Compromised end users of the system +- Remote (external) compromise of the system + +IPE does not mitigate threats arising from malicious but authorized +developers (with access to a signing certificate), or compromised +developer tools used by them (i.e. return-oriented programming attacks). +Additionally, IPE draws hard security boundary between userspace and +kernelspace. As a result, kernel-level exploits are considered outside +the scope of IPE and mitigation is left to other mechanisms. + +Policy +------ + +IPE policy is a plain-text [#devdoc]_ policy composed of multiple statements +over several lines. There is one required line, at the top of the +policy, indicating the policy name, and the policy version, for +instance:: + + policy_name=Ex_Policy policy_version=0.0.0 + +The policy name is a unique key identifying this policy in a human +readable name. This is used to create nodes under securityfs as well as +uniquely identify policies to deploy new policies vs update existing +policies. + +The policy version indicates the current version of the policy (NOT the +policy syntax version). This is used to prevent rollback of policy to +potentially insecure previous versions of the policy. + +The next portion of IPE policy are rules. Rules are formed by key=value +pairs, known as properties. IPE rules require two properties: ``action``, +which determines what IPE does when it encounters a match against the +rule, and ``op``, which determines when the rule should be evaluated. +The ordering is significant, a rule must start with ``op``, and end with +``action``. Thus, a minimal rule is:: + + op=EXECUTE action=ALLOW + +This example will allow any execution. Additional properties are used to +assess immutable security properties about the files being evaluated. +These properties are intended to be descriptions of systems within the +kernel that can provide a measure of integrity verification, such that IPE +can determine the trust of the resource based on the value of the property. + +Rules are evaluated top-to-bottom. As a result, any revocation rules, +or denies should be placed early in the file to ensure that these rules +are evaluated before a rule with ``action=ALLOW``. + +IPE policy supports comments. The character '#' will function as a +comment, ignoring all characters to the right of '#' until the newline. + +The default behavior of IPE evaluations can also be expressed in policy, +through the ``DEFAULT`` statement. This can be done at a global level, +or a per-operation level:: + + # Global + DEFAULT action=ALLOW + + # Operation Specific + DEFAULT op=EXECUTE action=ALLOW + +A default must be set for all known operations in IPE. If you want to +preserve older policies being compatible with newer kernels that can introduce +new operations, set a global default of ``ALLOW``, then override the +defaults on a per-operation basis (as above). + +With configurable policy-based LSMs, there's several issues with +enforcing the configurable policies at startup, around reading and +parsing the policy: + +1. The kernel *should* not read files from userspace, so directly reading + the policy file is prohibited. +2. The kernel command line has a character limit, and one kernel module + should not reserve the entire character limit for its own + configuration. +3. There are various boot loaders in the kernel ecosystem, so handing + off a memory block would be costly to maintain. + +As a result, IPE has addressed this problem through a concept of a "boot +policy". A boot policy is a minimal policy which is compiled into the +kernel. This policy is intended to get the system to a state where +userspace is set up and ready to receive commands, at which point a more +complex policy can be deployed via securityfs. The boot policy can be +specified via ``SECURITY_IPE_BOOT_POLICY`` config option, which accepts +a path to a plain-text version of the IPE policy to apply. This policy +will be compiled into the kernel. If not specified, IPE will be disabled +until a policy is deployed and activated through securityfs. + +Deploying Policies +~~~~~~~~~~~~~~~~~~ + +Policies can be deployed from userspace through securityfs. These policies +are signed through the PKCS#7 message format to enforce some level of +authorization of the policies (prohibiting an attacker from gaining +unconstrained root, and deploying an "allow all" policy). These +policies must be signed by a certificate that chains to the +``SYSTEM_TRUSTED_KEYRING``, or to the secondary and/or platform keyrings if +``CONFIG_IPE_POLICY_SIG_SECONDARY_KEYRING`` and/or +``CONFIG_IPE_POLICY_SIG_PLATFORM_KEYRING`` are enabled, respectively. +With openssl, the policy can be signed by:: + + openssl smime -sign \ + -in "$MY_POLICY" \ + -signer "$MY_CERTIFICATE" \ + -inkey "$MY_PRIVATE_KEY" \ + -noattr \ + -nodetach \ + -nosmimecap \ + -outform der \ + -out "$MY_POLICY.p7b" + +Deploying the policies is done through securityfs, through the +``new_policy`` node. To deploy a policy, simply cat the file into the +securityfs node:: + + cat "$MY_POLICY.p7b" > /sys/kernel/security/ipe/new_policy + +Upon success, this will create one subdirectory under +``/sys/kernel/security/ipe/policies/``. The subdirectory will be the +``policy_name`` field of the policy deployed, so for the example above, +the directory will be ``/sys/kernel/security/ipe/policies/Ex_Policy``. +Within this directory, there will be seven files: ``pkcs7``, ``policy``, +``name``, ``version``, ``active``, ``update``, and ``delete``. + +The ``pkcs7`` file is read-only. Reading it returns the raw PKCS#7 data +that was provided to the kernel, representing the policy. If the policy being +read is the boot policy, this will return ``ENOENT``, as it is not signed. + +The ``policy`` file is read only. Reading it returns the PKCS#7 inner +content of the policy, which will be the plain text policy. + +The ``active`` file is used to set a policy as the currently active policy. +This file is rw, and accepts a value of ``"1"`` to set the policy as active. +Since only a single policy can be active at one time, all other policies +will be marked inactive. The policy being marked active must have a policy +version greater or equal to the currently-running version. + +The ``update`` file is used to update a policy that is already present +in the kernel. This file is write-only and accepts a PKCS#7 signed +policy. Two checks will always be performed on this policy: First, the +``policy_names`` must match with the updated version and the existing +version. Second the updated policy must have a policy version greater than +the currently-running version. This is to prevent rollback attacks. + +The ``delete`` file is used to remove a policy that is no longer needed. +This file is write-only and accepts a value of ``1`` to delete the policy. +On deletion, the securityfs node representing the policy will be removed. +However, delete the current active policy is not allowed and will return +an operation not permitted error. + +Similarly, writing to both ``update`` and ``new_policy`` could result in +bad message(policy syntax error) or file exists error. The latter error happens +when trying to deploy a policy with a ``policy_name`` while the kernel already +has a deployed policy with the same ``policy_name``. + +Deploying a policy will *not* cause IPE to start enforcing the policy. IPE will +only enforce the policy marked active. Note that only one policy can be active +at a time. + +Once deployment is successful, the policy can be activated, by writing file +``/sys/kernel/security/ipe/policies/$policy_name/active``. +For example, the ``Ex_Policy`` can be activated by:: + + echo 1 > "/sys/kernel/security/ipe/policies/Ex_Policy/active" + +From above point on, ``Ex_Policy`` is now the enforced policy on the +system. + +IPE also provides a way to delete policies. This can be done via the +``delete`` securityfs node, +``/sys/kernel/security/ipe/policies/$policy_name/delete``. +Writing ``1`` to that file deletes the policy:: + + echo 1 > "/sys/kernel/security/ipe/policies/$policy_name/delete" + +There is only one requirement to delete a policy: the policy being deleted +must be inactive. + +.. NOTE:: + + If a traditional MAC system is enabled (SELinux, apparmor, smack), all + writes to ipe's securityfs nodes require ``CAP_MAC_ADMIN``. + +Modes +~~~~~ + +IPE supports two modes of operation: permissive (similar to SELinux's +permissive mode) and enforced. In permissive mode, all events are +checked and policy violations are logged, but the policy is not really +enforced. This allows users to test policies before enforcing them. + +The default mode is enforce, and can be changed via the kernel command +line parameter ``ipe.enforce=(0|1)``, or the securityfs node +``/sys/kernel/security/ipe/enforce``. + +.. NOTE:: + + If a traditional MAC system is enabled (SELinux, apparmor, smack, etcetera), + all writes to ipe's securityfs nodes require ``CAP_MAC_ADMIN``. + +Audit Events +~~~~~~~~~~~~ + +1420 AUDIT_IPE_ACCESS +^^^^^^^^^^^^^^^^^^^^^ +Event Examples:: + + type=1420 audit(1653364370.067:61): ipe_op=EXECUTE ipe_hook=MMAP enforcing=1 pid=2241 comm="ld-linux.so" path="/deny/lib/libc.so.6" dev="sda2" ino=14549020 rule="DEFAULT action=DENY" + type=1300 audit(1653364370.067:61): SYSCALL arch=c000003e syscall=9 success=no exit=-13 a0=7f1105a28000 a1=195000 a2=5 a3=812 items=0 ppid=2219 pid=2241 auid=0 uid=0 gid=0 euid=0 suid=0 fsuid=0 egid=0 sgid=0 fsgid=0 tty=pts0 ses=2 comm="ld-linux.so" exe="/tmp/ipe-test/lib/ld-linux.so" subj=unconfined key=(null) + type=1327 audit(1653364370.067:61): 707974686F6E3300746573742F6D61696E2E7079002D6E00 + + type=1420 audit(1653364735.161:64): ipe_op=EXECUTE ipe_hook=MMAP enforcing=1 pid=2472 comm="mmap_test" path=? dev=? ino=? rule="DEFAULT action=DENY" + type=1300 audit(1653364735.161:64): SYSCALL arch=c000003e syscall=9 success=no exit=-13 a0=0 a1=1000 a2=4 a3=21 items=0 ppid=2219 pid=2472 auid=0 uid=0 gid=0 euid=0 suid=0 fsuid=0 egid=0 sgid=0 fsgid=0 tty=pts0 ses=2 comm="mmap_test" exe="/root/overlake_test/upstream_test/vol_fsverity/bin/mmap_test" subj=unconfined key=(null) + type=1327 audit(1653364735.161:64): 707974686F6E3300746573742F6D61696E2E7079002D6E00 + +This event indicates that IPE made an access control decision; the IPE +specific record (1420) is always emitted in conjunction with a +``AUDITSYSCALL`` record. + +Determining whether IPE is in permissive or enforced mode can be derived +from ``success`` property and exit code of the ``AUDITSYSCALL`` record. + + +Field descriptions: + ++-----------+------------+-----------+---------------------------------------------------------------------------------+ +| Field | Value Type | Optional? | Description of Value | ++===========+============+===========+=================================================================================+ +| ipe_op | string | No | The IPE operation name associated with the log | ++-----------+------------+-----------+---------------------------------------------------------------------------------+ +| ipe_hook | string | No | The name of the LSM hook that triggered the IPE event | ++-----------+------------+-----------+---------------------------------------------------------------------------------+ +| enforcing | integer | No | The current IPE enforcing state 1 is in enforcing mode, 0 is in permissive mode | ++-----------+------------+-----------+---------------------------------------------------------------------------------+ +| pid | integer | No | The pid of the process that triggered the IPE event. | ++-----------+------------+-----------+---------------------------------------------------------------------------------+ +| comm | string | No | The command line program name of the process that triggered the IPE event | ++-----------+------------+-----------+---------------------------------------------------------------------------------+ +| path | string | Yes | The absolute path to the evaluated file | ++-----------+------------+-----------+---------------------------------------------------------------------------------+ +| ino | integer | Yes | The inode number of the evaluated file | ++-----------+------------+-----------+---------------------------------------------------------------------------------+ +| dev | string | Yes | The device name of the evaluated file, e.g. vda | ++-----------+------------+-----------+---------------------------------------------------------------------------------+ +| rule | string | No | The matched policy rule | ++-----------+------------+-----------+---------------------------------------------------------------------------------+ + +1421 AUDIT_IPE_CONFIG_CHANGE +^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +Event Example:: + + type=1421 audit(1653425583.136:54): old_active_pol_name="Allow_All" old_active_pol_version=0.0.0 old_policy_digest=sha256:E3B0C44298FC1C149AFBF4C8996FB92427AE41E4649B934CA495991B7852B855 new_active_pol_name="boot_verified" new_active_pol_version=0.0.0 new_policy_digest=sha256:820EEA5B40CA42B51F68962354BA083122A20BB846F26765076DD8EED7B8F4DB auid=4294967295 ses=4294967295 lsm=ipe res=1 + type=1300 audit(1653425583.136:54): SYSCALL arch=c000003e syscall=1 success=yes exit=2 a0=3 a1=5596fcae1fb0 a2=2 a3=2 items=0 ppid=184 pid=229 auid=4294967295 uid=0 gid=0 euid=0 suid=0 fsuid=0 egid=0 sgid=0 fsgid=0 tty=pts0 ses=4294967295 comm="python3" exe="/usr/bin/python3.10" key=(null) + type=1327 audit(1653425583.136:54): PROCTITLE proctitle=707974686F6E3300746573742F6D61696E2E7079002D66002E2 + +This event indicates that IPE switched the active poliy from one to another +along with the version and the hash digest of the two policies. +Note IPE can only have one policy active at a time, all access decision +evaluation is based on the current active policy. +The normal procedure to deploy a new policy is loading the policy to deploy +into the kernel first, then switch the active policy to it. + +This record will always be emitted in conjunction with a ``AUDITSYSCALL`` record for the ``write`` syscall. + +Field descriptions: + ++------------------------+------------+-----------+---------------------------------------------------+ +| Field | Value Type | Optional? | Description of Value | ++========================+============+===========+===================================================+ +| old_active_pol_name | string | Yes | The name of previous active policy | ++------------------------+------------+-----------+---------------------------------------------------+ +| old_active_pol_version | string | Yes | The version of previous active policy | ++------------------------+------------+-----------+---------------------------------------------------+ +| old_policy_digest | string | Yes | The hash of previous active policy | ++------------------------+------------+-----------+---------------------------------------------------+ +| new_active_pol_name | string | No | The name of current active policy | ++------------------------+------------+-----------+---------------------------------------------------+ +| new_active_pol_version | string | No | The version of current active policy | ++------------------------+------------+-----------+---------------------------------------------------+ +| new_policy_digest | string | No | The hash of current active policy | ++------------------------+------------+-----------+---------------------------------------------------+ +| auid | integer | No | The login user ID | ++------------------------+------------+-----------+---------------------------------------------------+ +| ses | integer | No | The login session ID | ++------------------------+------------+-----------+---------------------------------------------------+ +| lsm | string | No | The lsm name associated with the event | ++------------------------+------------+-----------+---------------------------------------------------+ +| res | integer | No | The result of the audited operation(success/fail) | ++------------------------+------------+-----------+---------------------------------------------------+ + +1422 AUDIT_IPE_POLICY_LOAD +^^^^^^^^^^^^^^^^^^^^^^^^^^ + +Event Example:: + + type=1422 audit(1653425529.927:53): policy_name="boot_verified" policy_version=0.0.0 policy_digest=sha256:820EEA5B40CA42B51F68962354BA083122A20BB846F26765076DD8EED7B8F4DB auid=4294967295 ses=4294967295 lsm=ipe res=1 + type=1300 audit(1653425529.927:53): arch=c000003e syscall=1 success=yes exit=2567 a0=3 a1=5596fcae1fb0 a2=a07 a3=2 items=0 ppid=184 pid=229 auid=4294967295 uid=0 gid=0 euid=0 suid=0 fsuid=0 egid=0 sgid=0 fsgid=0 tty=pts0 ses=4294967295 comm="python3" exe="/usr/bin/python3.10" key=(null) + type=1327 audit(1653425529.927:53): PROCTITLE proctitle=707974686F6E3300746573742F6D61696E2E7079002D66002E2E + +This record indicates a new policy has been loaded into the kernel with the policy name, policy version and policy hash. + +This record will always be emitted in conjunction with a ``AUDITSYSCALL`` record for the ``write`` syscall. + +Field descriptions: + ++----------------+------------+-----------+---------------------------------------------------+ +| Field | Value Type | Optional? | Description of Value | ++================+============+===========+===================================================+ +| policy_name | string | No | The policy_name | ++----------------+------------+-----------+---------------------------------------------------+ +| policy_version | string | No | The policy_version | ++----------------+------------+-----------+---------------------------------------------------+ +| policy_digest | string | No | The policy hash | ++----------------+------------+-----------+---------------------------------------------------+ +| auid | integer | No | The login user ID | ++----------------+------------+-----------+---------------------------------------------------+ +| ses | integer | No | The login session ID | ++----------------+------------+-----------+---------------------------------------------------+ +| lsm | string | No | The lsm name associated with the event | ++----------------+------------+-----------+---------------------------------------------------+ +| res | integer | No | The result of the audited operation(success/fail) | ++----------------+------------+-----------+---------------------------------------------------+ + + +1404 AUDIT_MAC_STATUS +^^^^^^^^^^^^^^^^^^^^^ + +Event Examples:: + + type=1404 audit(1653425689.008:55): enforcing=0 old_enforcing=1 auid=4294967295 ses=4294967295 enabled=1 old-enabled=1 lsm=ipe res=1 + type=1300 audit(1653425689.008:55): arch=c000003e syscall=1 success=yes exit=2 a0=1 a1=55c1065e5c60 a2=2 a3=0 items=0 ppid=405 pid=441 auid=0 uid=0 gid=0 euid=0 suid=0 fsuid=0 egid=0 sgid=) + type=1327 audit(1653425689.008:55): proctitle="-bash" + + type=1404 audit(1653425689.008:55): enforcing=1 old_enforcing=0 auid=4294967295 ses=4294967295 enabled=1 old-enabled=1 lsm=ipe res=1 + type=1300 audit(1653425689.008:55): arch=c000003e syscall=1 success=yes exit=2 a0=1 a1=55c1065e5c60 a2=2 a3=0 items=0 ppid=405 pid=441 auid=0 uid=0 gid=0 euid=0 suid=0 fsuid=0 egid=0 sgid=) + type=1327 audit(1653425689.008:55): proctitle="-bash" + +This record will always be emitted in conjunction with a ``AUDITSYSCALL`` record for the ``write`` syscall. + +Field descriptions: + ++---------------+------------+-----------+-------------------------------------------------------------------------------------------------+ +| Field | Value Type | Optional? | Description of Value | ++===============+============+===========+=================================================================================================+ +| enforcing | integer | No | The enforcing state IPE is being switched to, 1 is in enforcing mode, 0 is in permissive mode | ++---------------+------------+-----------+-------------------------------------------------------------------------------------------------+ +| old_enforcing | integer | No | The enforcing state IPE is being switched from, 1 is in enforcing mode, 0 is in permissive mode | ++---------------+------------+-----------+-------------------------------------------------------------------------------------------------+ +| auid | integer | No | The login user ID | ++---------------+------------+-----------+-------------------------------------------------------------------------------------------------+ +| ses | integer | No | The login session ID | ++---------------+------------+-----------+-------------------------------------------------------------------------------------------------+ +| enabled | integer | No | The new TTY audit enabled setting | ++---------------+------------+-----------+-------------------------------------------------------------------------------------------------+ +| old-enabled | integer | No | The old TTY audit enabled setting | ++---------------+------------+-----------+-------------------------------------------------------------------------------------------------+ +| lsm | string | No | The lsm name associated with the event | ++---------------+------------+-----------+-------------------------------------------------------------------------------------------------+ +| res | integer | No | The result of the audited operation(success/fail) | ++---------------+------------+-----------+-------------------------------------------------------------------------------------------------+ + + +Success Auditing +^^^^^^^^^^^^^^^^ + +IPE supports success auditing. When enabled, all events that pass IPE +policy and are not blocked will emit an audit event. This is disabled by +default, and can be enabled via the kernel command line +``ipe.success_audit=(0|1)`` or +``/sys/kernel/security/ipe/success_audit`` securityfs file. + +This is *very* noisy, as IPE will check every userspace binary on the +system, but is useful for debugging policies. + +.. NOTE:: + + If a traditional MAC system is enabled (SELinux, apparmor, smack, etcetera), + all writes to ipe's securityfs nodes require ``CAP_MAC_ADMIN``. + +Properties +---------- + +As explained above, IPE properties are ``key=value`` pairs expressed in IPE +policy. Two properties are built-into the policy parser: 'op' and 'action'. +The other properties are used to restrict immutable security properties +about the files being evaluated. Currently those properties are: +'``boot_verified``', '``dmverity_signature``', '``dmverity_roothash``', +'``fsverity_signature``', '``fsverity_digest``'. A description of all +properties supported by IPE are listed below: + +op +~~ + +Indicates the operation for a rule to apply to. Must be in every rule, +as the first token. IPE supports the following operations: + + ``EXECUTE`` + + Pertains to any file attempting to be executed, or loaded as an + executable. + + ``FIRMWARE``: + + Pertains to firmware being loaded via the firmware_class interface. + This covers both the preallocated buffer and the firmware file + itself. + + ``KMODULE``: + + Pertains to loading kernel modules via ``modprobe`` or ``insmod``. + + ``KEXEC_IMAGE``: + + Pertains to kernel images loading via ``kexec``. + + ``KEXEC_INITRAMFS`` + + Pertains to initrd images loading via ``kexec --initrd``. + + ``POLICY``: + + Controls loading policies via reading a kernel-space initiated read. + + An example of such is loading IMA policies by writing the path + to the policy file to ``$securityfs/ima/policy`` + + ``X509_CERT``: + + Controls loading IMA certificates through the Kconfigs, + ``CONFIG_IMA_X509_PATH`` and ``CONFIG_EVM_X509_PATH``. + +action +~~~~~~ + + Determines what IPE should do when a rule matches. Must be in every + rule, as the final clause. Can be one of: + + ``ALLOW``: + + If the rule matches, explicitly allow access to the resource to proceed + without executing any more rules. + + ``DENY``: + + If the rule matches, explicitly prohibit access to the resource to + proceed without executing any more rules. + +boot_verified +~~~~~~~~~~~~~ + + This property can be utilized for authorization of files from initramfs. + The format of this property is:: + + boot_verified=(TRUE|FALSE) + + + .. WARNING:: + + This property will trust files from initramfs(rootfs). It should + only be used during early booting stage. Before mounting the real + rootfs on top of the initramfs, initramfs script will recursively + remove all files and directories on the initramfs. This is typically + implemented by using switch_root(8) [#switch_root]_. Therefore the + initramfs will be empty and not accessible after the real + rootfs takes over. It is advised to switch to a different policy + that doesn't rely on the property after this point. + This ensures that the trust policies remain relevant and effective + throughout the system's operation. + +dmverity_roothash +~~~~~~~~~~~~~~~~~ + + This property can be utilized for authorization or revocation of + specific dm-verity volumes, identified via their root hashes. It has a + dependency on the DM_VERITY module. This property is controlled by + the ``IPE_PROP_DM_VERITY`` config option, it will be automatically + selected when ``SECURITY_IPE`` and ``DM_VERITY`` are all enabled. + The format of this property is:: + + dmverity_roothash=DigestName:HexadecimalString + + The supported DigestNames for dmverity_roothash are [#dmveritydigests]_ + + + blake2b-512 + + blake2s-256 + + sha256 + + sha384 + + sha512 + + sha3-224 + + sha3-256 + + sha3-384 + + sha3-512 + + sm3 + + rmd160 + +dmverity_signature +~~~~~~~~~~~~~~~~~~ + + This property can be utilized for authorization of all dm-verity + volumes that have a signed roothash that validated by a keyring + specified by dm-verity's configuration, either the system trusted + keyring, or the secondary keyring. It depends on + ``DM_VERITY_VERIFY_ROOTHASH_SIG`` config option and is controlled by + the ``IPE_PROP_DM_VERITY_SIGNATURE`` config option, it will be automatically + selected when ``SECURITY_IPE``, ``DM_VERITY`` and + ``DM_VERITY_VERIFY_ROOTHASH_SIG`` are all enabled. + The format of this property is:: + + dmverity_signature=(TRUE|FALSE) + +fsverity_digest +~~~~~~~~~~~~~~~ + + This property can be utilized for authorization of specific fsverity + enabled files, identified via their fsverity digests. + It depends on ``FS_VERITY`` config option and is controlled by + the ``IPE_PROP_FS_VERITY`` config option, it will be automatically + selected when ``SECURITY_IPE`` and ``FS_VERITY`` are all enabled. + The format of this property is:: + + fsverity_digest=DigestName:HexadecimalString + + The supported DigestNames for fsverity_digest are [#fsveritydigest]_ + + + sha256 + + sha512 + +fsverity_signature +~~~~~~~~~~~~~~~~~~ + + This property is used to authorize all fs-verity enabled files that have + been verified by fs-verity's built-in signature mechanism. The signature + verification relies on a key stored within the ".fs-verity" keyring. It + depends on ``FS_VERITY_BUILTIN_SIGNATURES`` config option and + it is controlled by the ``IPE_PROP_FS_VERITY`` config option, + it will be automatically selected when ``SECURITY_IPE``, ``FS_VERITY`` + and ``FS_VERITY_BUILTIN_SIGNATURES`` are all enabled. + The format of this property is:: + + fsverity_signature=(TRUE|FALSE) + +Policy Examples +--------------- + +Allow all +~~~~~~~~~ + +:: + + policy_name=Allow_All policy_version=0.0.0 + DEFAULT action=ALLOW + +Allow only initramfs +~~~~~~~~~~~~~~~~~~~~ + +:: + + policy_name=Allow_Initramfs policy_version=0.0.0 + DEFAULT action=DENY + + op=EXECUTE boot_verified=TRUE action=ALLOW + +Allow any signed and validated dm-verity volume and the initramfs +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +:: + + policy_name=Allow_Signed_DMV_And_Initramfs policy_version=0.0.0 + DEFAULT action=DENY + + op=EXECUTE boot_verified=TRUE action=ALLOW + op=EXECUTE dmverity_signature=TRUE action=ALLOW + +Prohibit execution from a specific dm-verity volume +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +:: + + policy_name=Deny_DMV_By_Roothash policy_version=0.0.0 + DEFAULT action=DENY + + op=EXECUTE dmverity_roothash=sha256:cd2c5bae7c6c579edaae4353049d58eb5f2e8be0244bf05345bc8e5ed257baff action=DENY + + op=EXECUTE boot_verified=TRUE action=ALLOW + op=EXECUTE dmverity_signature=TRUE action=ALLOW + +Allow only a specific dm-verity volume +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +:: + + policy_name=Allow_DMV_By_Roothash policy_version=0.0.0 + DEFAULT action=DENY + + op=EXECUTE dmverity_roothash=sha256:401fcec5944823ae12f62726e8184407a5fa9599783f030dec146938 action=ALLOW + +Allow any fs-verity file with a valid built-in signature +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +:: + + policy_name=Allow_Signed_And_Validated_FSVerity policy_version=0.0.0 + DEFAULT action=DENY + + op=EXECUTE fsverity_signature=TRUE action=ALLOW + +Allow execution of a specific fs-verity file +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +:: + + policy_name=ALLOW_FSV_By_Digest policy_version=0.0.0 + DEFAULT action=DENY + + op=EXECUTE fsverity_digest=sha256:fd88f2b8824e197f850bf4c5109bea5cf0ee38104f710843bb72da796ba5af9e action=ALLOW + +Additional Information +---------------------- + +- `Github Repository <https://github.com/microsoft/ipe>`_ +- :doc:`Developer and design docs for IPE </security/ipe>` + +FAQ +--- + +Q: + What's the difference between other LSMs which provide a measure of + trust-based access control? + +A: + + In general, there's two other LSMs that can provide similar functionality: + IMA, and Loadpin. + + IMA and IPE are functionally very similar. The significant difference between + the two is the policy. [#devdoc]_ + + Loadpin and IPE differ fairly dramatically, as Loadpin only covers the IPE's + kernel read operations, whereas IPE is capable of controlling execution + on top of kernel read. The trust model is also different; Loadpin roots its + trust in the initial super-block, whereas trust in IPE is stemmed from kernel + itself (via ``SYSTEM_TRUSTED_KEYS``). + +----------- + +.. [#digest_cache_lsm] https://lore.kernel.org/lkml/20240415142436.2545003-1-roberto.sassu@huaweicloud.com/ + +.. [#interpreters] There is `some interest in solving this issue <https://lore.kernel.org/lkml/20220321161557.495388-1-mic@digikod.net/>`_. + +.. [#devdoc] Please see :doc:`the design docs </security/ipe>` for more on + this topic. + +.. [#switch_root] https://man7.org/linux/man-pages/man8/switch_root.8.html + +.. [#dmveritydigests] These hash algorithms are based on values accepted by + the Linux crypto API; IPE does not impose any + restrictions on the digest algorithm itself; + thus, this list may be out of date. + +.. [#fsveritydigest] These hash algorithms are based on values accepted by the + kernel's fsverity support; IPE does not impose any + restrictions on the digest algorithm itself; + thus, this list may be out of date. diff --git a/Documentation/admin-guide/LSM/tomoyo.rst b/Documentation/admin-guide/LSM/tomoyo.rst index 4bc9c2b4da6f..bdb2c2e2a1b2 100644 --- a/Documentation/admin-guide/LSM/tomoyo.rst +++ b/Documentation/admin-guide/LSM/tomoyo.rst @@ -9,8 +9,8 @@ TOMOYO is a name-based MAC extension (LSM module) for the Linux kernel. LiveCD-based tutorials are available at -http://tomoyo.sourceforge.jp/1.8/ubuntu12.04-live.html -http://tomoyo.sourceforge.jp/1.8/centos6-live.html +https://tomoyo.sourceforge.net/1.8/ubuntu12.04-live.html +https://tomoyo.sourceforge.net/1.8/centos6-live.html Though these tutorials use non-LSM version of TOMOYO, they are useful for you to know what TOMOYO is. @@ -21,45 +21,32 @@ How to enable TOMOYO? Build the kernel with ``CONFIG_SECURITY_TOMOYO=y`` and pass ``security=tomoyo`` on kernel's command line. -Please see http://tomoyo.osdn.jp/2.5/ for details. +Please see https://tomoyo.sourceforge.net/2.6/ for details. Where is documentation? ======================= User <-> Kernel interface documentation is available at -https://tomoyo.osdn.jp/2.5/policy-specification/index.html . +https://tomoyo.sourceforge.net/2.6/policy-specification/index.html . Materials we prepared for seminars and symposiums are available at -https://osdn.jp/projects/tomoyo/docs/?category_id=532&language_id=1 . +https://sourceforge.net/projects/tomoyo/files/docs/ . Below lists are chosen from three aspects. What is TOMOYO? TOMOYO Linux Overview - https://osdn.jp/projects/tomoyo/docs/lca2009-takeda.pdf + https://sourceforge.net/projects/tomoyo/files/docs/lca2009-takeda.pdf TOMOYO Linux: pragmatic and manageable security for Linux - https://osdn.jp/projects/tomoyo/docs/freedomhectaipei-tomoyo.pdf + https://sourceforge.net/projects/tomoyo/files/docs/freedomhectaipei-tomoyo.pdf TOMOYO Linux: A Practical Method to Understand and Protect Your Own Linux Box - https://osdn.jp/projects/tomoyo/docs/PacSec2007-en-no-demo.pdf + https://sourceforge.net/projects/tomoyo/files/docs/PacSec2007-en-no-demo.pdf What can TOMOYO do? Deep inside TOMOYO Linux - https://osdn.jp/projects/tomoyo/docs/lca2009-kumaneko.pdf + https://sourceforge.net/projects/tomoyo/files/docs/lca2009-kumaneko.pdf The role of "pathname based access control" in security. - https://osdn.jp/projects/tomoyo/docs/lfj2008-bof.pdf + https://sourceforge.net/projects/tomoyo/files/docs/lfj2008-bof.pdf History of TOMOYO? Realities of Mainlining - https://osdn.jp/projects/tomoyo/docs/lfj2008.pdf - -What is future plan? -==================== - -We believe that inode based security and name based security are complementary -and both should be used together. But unfortunately, so far, we cannot enable -multiple LSM modules at the same time. We feel sorry that you have to give up -SELinux/SMACK/AppArmor etc. when you want to use TOMOYO. - -We hope that LSM becomes stackable in future. Meanwhile, you can use non-LSM -version of TOMOYO, available at http://tomoyo.osdn.jp/1.8/ . -LSM version of TOMOYO is a subset of non-LSM version of TOMOYO. We are planning -to port non-LSM version's functionalities to LSM versions. + https://sourceforge.net/projects/tomoyo/files/docs/lfj2008.pdf diff --git a/Documentation/admin-guide/README.rst b/Documentation/admin-guide/README.rst index f2bebff6a733..eb9452668909 100644 --- a/Documentation/admin-guide/README.rst +++ b/Documentation/admin-guide/README.rst @@ -356,5 +356,5 @@ instructions at 'Documentation/admin-guide/reporting-issues.rst'. Hints on understanding kernel bug reports are in 'Documentation/admin-guide/bug-hunting.rst'. More on debugging the kernel -with gdb is in 'Documentation/dev-tools/gdb-kernel-debugging.rst' and -'Documentation/dev-tools/kgdb.rst'. +with gdb is in 'Documentation/process/debugging/gdb-kernel-debugging.rst' and +'Documentation/process/debugging/kgdb.rst'. diff --git a/Documentation/admin-guide/blockdev/zram.rst b/Documentation/admin-guide/blockdev/zram.rst index ee2b0030d416..1576fb93f06c 100644 --- a/Documentation/admin-guide/blockdev/zram.rst +++ b/Documentation/admin-guide/blockdev/zram.rst @@ -47,6 +47,8 @@ The list of possible return codes: -ENOMEM zram was not able to allocate enough memory to fulfil your needs. -EINVAL invalid input has been provided. +-EAGAIN re-try operation later (e.g. when attempting to run recompress + and writeback simultaneously). ======== ============================================================= If you use 'echo', the returned value is set by the 'echo' utility, @@ -102,17 +104,41 @@ Examples:: #select lzo compression algorithm echo lzo > /sys/block/zram0/comp_algorithm -For the time being, the `comp_algorithm` content does not necessarily -show every compression algorithm supported by the kernel. We keep this -list primarily to simplify device configuration and one can configure -a new device with a compression algorithm that is not listed in -`comp_algorithm`. The thing is that, internally, ZRAM uses Crypto API -and, if some of the algorithms were built as modules, it's impossible -to list all of them using, for instance, /proc/crypto or any other -method. This, however, has an advantage of permitting the usage of -custom crypto compression modules (implementing S/W or H/W compression). - -4) Set Disksize +For the time being, the `comp_algorithm` content shows only compression +algorithms that are supported by zram. + +4) Set compression algorithm parameters: Optional +================================================= + +Compression algorithms may support specific parameters which can be +tweaked for particular dataset. ZRAM has an `algorithm_params` device +attribute which provides a per-algorithm params configuration. + +For example, several compression algorithms support `level` parameter. +In addition, certain compression algorithms support pre-trained dictionaries, +which significantly change algorithms' characteristics. In order to configure +compression algorithm to use external pre-trained dictionary, pass full +path to the `dict` along with other parameters:: + + #pass path to pre-trained zstd dictionary + echo "algo=zstd dict=/etc/dictionary" > /sys/block/zram0/algorithm_params + + #same, but using algorithm priority + echo "priority=1 dict=/etc/dictionary" > \ + /sys/block/zram0/algorithm_params + + #pass path to pre-trained zstd dictionary and compression level + echo "algo=zstd level=8 dict=/etc/dictionary" > \ + /sys/block/zram0/algorithm_params + +Parameters are algorithm specific: not all algorithms support pre-trained +dictionaries, not all algorithms support `level`. Furthermore, for certain +algorithms `level` controls the compression level (the higher the value the +better the compression ratio, it even can take negatives values for some +algorithms), for other algorithms `level` is acceleration level (the higher +the value the lower the compression ratio). + +5) Set Disksize =============== Set disk size by writing the value to sysfs node 'disksize'. @@ -132,7 +158,7 @@ There is little point creating a zram of greater than twice the size of memory since we expect a 2:1 compression ratio. Note that zram uses about 0.1% of the size of the disk when not in use so a huge zram is wasteful. -5) Set memory limit: Optional +6) Set memory limit: Optional ============================= Set memory limit by writing the value to sysfs node 'mem_limit'. @@ -151,7 +177,7 @@ Examples:: # To disable memory limit echo 0 > /sys/block/zram0/mem_limit -6) Activate +7) Activate =========== :: @@ -162,7 +188,7 @@ Examples:: mkfs.ext4 /dev/zram1 mount /dev/zram1 /tmp -7) Add/remove zram devices +8) Add/remove zram devices ========================== zram provides a control interface, which enables dynamic (on-demand) device @@ -182,7 +208,7 @@ execute:: echo X > /sys/class/zram-control/hot_remove -8) Stats +9) Stats ======== Per-device statistics are exported as various nodes under /sys/block/zram<id>/ @@ -205,6 +231,7 @@ writeback_limit_enable RW show and set writeback_limit feature max_comp_streams RW the number of possible concurrent compress operations comp_algorithm RW show and change the compression algorithm +algorithm_params WO setup compression algorithm parameters compact WO trigger memory compaction debug_stat RO this file is used for zram debugging purposes backing_dev RW set up backend storage for zram to write out @@ -283,15 +310,15 @@ a single line of text and contains the following stats separated by whitespace: Unit: 4K bytes ============== ============================================================= -9) Deactivate -============= +10) Deactivate +============== :: swapoff /dev/zram0 umount /dev/zram1 -10) Reset +11) Reset ========= Write any positive value to 'reset' sysfs node:: @@ -466,6 +493,11 @@ of equal or greater size::: #recompress idle pages larger than 2000 bytes echo "type=idle threshold=2000" > /sys/block/zramX/recompress +It is also possible to limit the number of pages zram re-compression will +attempt to recompress::: + + echo "type=huge_idle max_pages=42" > /sys/block/zramX/recompress + Recompression of idle pages requires memory tracking. During re-compression for every page, that matches re-compression criteria, @@ -482,11 +514,14 @@ registered compression algorithms, increases our chances of finding the algorithm that successfully compresses a particular page. Sometimes, however, it is convenient (and sometimes even necessary) to limit recompression to only one particular algorithm so that it will not try any other algorithms. -This can be achieved by providing a algo=NAME parameter::: +This can be achieved by providing a `algo` or `priority` parameter::: #use zstd algorithm only (if registered) echo "type=huge algo=zstd" > /sys/block/zramX/recompress + #use zstd algorithm only (if zstd was registered under priority 1) + echo "type=huge priority=1" > /sys/block/zramX/recompress + memory tracking =============== diff --git a/Documentation/admin-guide/braille-console.rst b/Documentation/admin-guide/braille-console.rst index 18e79337dcfd..153472e93cae 100644 --- a/Documentation/admin-guide/braille-console.rst +++ b/Documentation/admin-guide/braille-console.rst @@ -21,8 +21,8 @@ override the baud rate to 115200, etc. By default, the braille device will just show the last kernel message (console mode). To review previous messages, press the Insert key to switch to the VT review mode. In review mode, the arrow keys permit to browse in the VT content, -:kbd:`PAGE-UP`/:kbd:`PAGE-DOWN` keys go at the top/bottom of the screen, and -the :kbd:`HOME` key goes back +`PAGE-UP`/`PAGE-DOWN` keys go at the top/bottom of the screen, and +the `HOME` key goes back to the cursor, hence providing very basic screen reviewing facility. Sound feedback can be obtained by adding the ``braille_console.sound=1`` kernel diff --git a/Documentation/admin-guide/bug-bisect.rst b/Documentation/admin-guide/bug-bisect.rst index 325c5d0ed34a..f4f867cabb17 100644 --- a/Documentation/admin-guide/bug-bisect.rst +++ b/Documentation/admin-guide/bug-bisect.rst @@ -1,76 +1,165 @@ -Bisecting a bug -+++++++++++++++ +.. SPDX-License-Identifier: (GPL-2.0+ OR CC-BY-4.0) +.. [see the bottom of this file for redistribution information] -Last updated: 28 October 2016 +====================== +Bisecting a regression +====================== -Introduction -============ +This document describes how to use a ``git bisect`` to find the source code +change that broke something -- for example when some functionality stopped +working after upgrading from Linux 6.0 to 6.1. -Always try the latest kernel from kernel.org and build from source. If you are -not confident in doing that please report the bug to your distribution vendor -instead of to a kernel developer. +The text focuses on the gist of the process. If you are new to bisecting the +kernel, better follow Documentation/admin-guide/verify-bugs-and-bisect-regressions.rst +instead: it depicts everything from start to finish while covering multiple +aspects even kernel developers occasionally forget. This includes detecting +situations early where a bisection would be a waste of time, as nobody would +care about the result -- for example, because the problem happens after the +kernel marked itself as 'tainted', occurs in an abandoned version, was already +fixed, or is caused by a .config change you or your Linux distributor performed. -Finding bugs is not always easy. Have a go though. If you can't find it don't -give up. Report as much as you have found to the relevant maintainer. See -MAINTAINERS for who that is for the subsystem you have worked on. +Finding the change causing a kernel issue using a bisection +=========================================================== -Before you submit a bug report read -'Documentation/admin-guide/reporting-issues.rst'. +*Note: the following process assumes you prepared everything for a bisection. +This includes having a Git clone with the appropriate sources, installing the +software required to build and install kernels, as well as a .config file stored +in a safe place (the following example assumes '~/prepared_kernel_.config') to +use as pristine base at each bisection step; ideally, you have also worked out +a fully reliable and straight-forward way to reproduce the regression, too.* -Devices not appearing -===================== +* Preparation: start the bisection and tell Git about the points in the history + you consider to be working and broken, which Git calls 'good' and 'bad':: -Often this is caused by udev/systemd. Check that first before blaming it -on the kernel. + git bisect start + git bisect good v6.0 + git bisect bad v6.1 -Finding patch that caused a bug -=============================== + Instead of Git tags like 'v6.0' and 'v6.1' you can specify commit-ids, too. -Using the provided tools with ``git`` makes finding bugs easy provided the bug -is reproducible. +1. Copy your prepared .config into the build directory and adjust it to the + needs of the codebase Git checked out for testing:: -Steps to do it: + cp ~/prepared_kernel_.config .config + make olddefconfig -- build the Kernel from its git source -- start bisect with [#f1]_:: - - $ git bisect start - -- mark the broken changeset with:: - - $ git bisect bad [commit] - -- mark a changeset where the code is known to work with:: - - $ git bisect good [commit] - -- rebuild the Kernel and test -- interact with git bisect by using either:: - - $ git bisect good - - or:: - - $ git bisect bad - - depending if the bug happened on the changeset you're testing -- After some interactions, git bisect will give you the changeset that - likely caused the bug. - -- For example, if you know that the current version is bad, and version - 4.8 is good, you could do:: - - $ git bisect start - $ git bisect bad # Current version is bad - $ git bisect good v4.8 - - -.. [#f1] You can, optionally, provide both good and bad arguments at git - start with ``git bisect start [BAD] [GOOD]`` - -For further references, please read: - -- The man page for ``git-bisect`` -- `Fighting regressions with git bisect <https://www.kernel.org/pub/software/scm/git/docs/git-bisect-lk2009.html>`_ -- `Fully automated bisecting with "git bisect run" <https://lwn.net/Articles/317154>`_ -- `Using Git bisect to figure out when brokenness was introduced <http://webchick.net/node/99>`_ +2. Now build, install, and boot a kernel. This might fail for unrelated reasons, + for example, when a compile error happens at the current stage of the + bisection a later change resolves. In such cases run ``git bisect skip`` and + go back to step 1. + +3. Check if the functionality that regressed works in the kernel you just built. + + If it works, execute:: + + git bisect good + + If it is broken, run:: + + git bisect bad + + Note, getting this wrong just once will send the rest of the bisection + totally off course. To prevent having to start anew later you thus want to + ensure what you tell Git is correct; it is thus often wise to spend a few + minutes more on testing in case your reproducer is unreliable. + + After issuing one of these two commands, Git will usually check out another + bisection point and print something like 'Bisecting: 675 revisions left to + test after this (roughly 10 steps)'. In that case go back to step 1. + + If Git instead prints something like 'cafecaca0c0dacafecaca0c0dacafecaca0c0da + is the first bad commit', then you have finished the bisection. In that case + move to the next point below. Note, right after displaying that line Git will + show some details about the culprit including its patch description; this can + easily fill your terminal, so you might need to scroll up to see the message + mentioning the culprit's commit-id. + + In case you missed Git's output, you can always run ``git bisect log`` to + print the status: it will show how many steps remain or mention the result of + the bisection. + +* Recommended complementary task: put the bisection log and the current .config + file aside for the bug report; furthermore tell Git to reset the sources to + the state before the bisection:: + + git bisect log > ~/bisection-log + cp .config ~/bisection-config-culprit + git bisect reset + +* Recommended optional task: try reverting the culprit on top of the latest + codebase and check if that fixes your bug; if that is the case, it validates + the bisection and enables developers to resolve the regression through a + revert. + + To try this, update your clone and check out latest mainline. Then tell Git + to revert the change by specifying its commit-id:: + + git revert --no-edit cafec0cacaca0 + + Git might reject this, for example when the bisection landed on a merge + commit. In that case, abandon the attempt. Do the same, if Git fails to revert + the culprit on its own because later changes depend on it -- at least unless + you bisected a stable or longterm kernel series, in which case you want to + check out its latest codebase and try a revert there. + + If a revert succeeds, build and test another kernel to check if reverting + resolved your regression. + +With that the process is complete. Now report the regression as described by +Documentation/admin-guide/reporting-issues.rst. + +Bisecting linux-next +-------------------- + +If you face a problem only happening in linux-next, bisect between the +linux-next branches 'stable' and 'master'. The following commands will start +the process for a linux-next tree you added as a remote called 'next':: + + git bisect start + git bisect good next/stable + git bisect bad next/master + +The 'stable' branch refers to the state of linux-mainline that the current +linux-next release (found in the 'master' branch) is based on -- the former +thus should be free of any problems that show up in -next, but not in Linus' +tree. + +This will bisect across a wide range of changes, some of which you might have +used in earlier linux-next releases without problems. Sadly there is no simple +way to avoid checking them: bisecting from one linux-next release to a later +one (say between 'next-20241020' and 'next-20241021') is impossible, as they +share no common history. + +Additional reading material +--------------------------- + +* The `man page for 'git bisect' <https://git-scm.com/docs/git-bisect>`_ and + `fighting regressions with 'git bisect' <https://git-scm.com/docs/git-bisect-lk2009.html>`_ + in the Git documentation. +* `Working with git bisect <https://nathanchance.dev/posts/working-with-git-bisect/>`_ + from kernel developer Nathan Chancellor. +* `Using Git bisect to figure out when brokenness was introduced <http://webchick.net/node/99>`_. +* `Fully automated bisecting with 'git bisect run' <https://lwn.net/Articles/317154>`_. + +.. + end-of-content +.. + This document is maintained by Thorsten Leemhuis <linux@leemhuis.info>. If + you spot a typo or small mistake, feel free to let him know directly and + he'll fix it. You are free to do the same in a mostly informal way if you + want to contribute changes to the text -- but for copyright reasons please CC + linux-doc@vger.kernel.org and 'sign-off' your contribution as + Documentation/process/submitting-patches.rst explains in the section 'Sign + your work - the Developer's Certificate of Origin'. +.. + This text is available under GPL-2.0+ or CC-BY-4.0, as stated at the top + of the file. If you want to distribute this text under CC-BY-4.0 only, + please use 'The Linux kernel development community' for author attribution + and link this as source: + https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/plain/Documentation/admin-guide/bug-bisect.rst + +.. + Note: Only the content of this RST file as found in the Linux kernel sources + is available under CC-BY-4.0, as versions of this text that were processed + (for example by the kernel's build system) might contain content taken from + files which use a more restrictive license. diff --git a/Documentation/admin-guide/bug-hunting.rst b/Documentation/admin-guide/bug-hunting.rst index 95299b08c405..ce6f4e8ca487 100644 --- a/Documentation/admin-guide/bug-hunting.rst +++ b/Documentation/admin-guide/bug-hunting.rst @@ -244,14 +244,14 @@ Reporting the bug Once you find where the bug happened, by inspecting its location, you could either try to fix it yourself or report it upstream. -In order to report it upstream, you should identify the mailing list -used for the development of the affected code. This can be done by using -the ``get_maintainer.pl`` script. +In order to report it upstream, you should identify the bug tracker, if any, or +mailing list used for the development of the affected code. This can be done by +using the ``get_maintainer.pl`` script. For example, if you find a bug at the gspca's sonixj.c file, you can get its maintainers with:: - $ ./scripts/get_maintainer.pl -f drivers/media/usb/gspca/sonixj.c + $ ./scripts/get_maintainer.pl --bug -f drivers/media/usb/gspca/sonixj.c Hans Verkuil <hverkuil@xs4all.nl> (odd fixer:GSPCA USB WEBCAM DRIVER,commit_signer:1/1=100%) Mauro Carvalho Chehab <mchehab@kernel.org> (maintainer:MEDIA INPUT INFRASTRUCTURE (V4L/DVB),commit_signer:1/1=100%) Tejun Heo <tj@kernel.org> (commit_signer:1/1=100%) @@ -267,11 +267,12 @@ Please notice that it will point to: - The driver maintainer (Hans Verkuil); - The subsystem maintainer (Mauro Carvalho Chehab); - The driver and/or subsystem mailing list (linux-media@vger.kernel.org); -- the Linux Kernel mailing list (linux-kernel@vger.kernel.org). +- The Linux Kernel mailing list (linux-kernel@vger.kernel.org); +- The bug reporting URIs for the driver/subsystem (none in the above example). -Usually, the fastest way to have your bug fixed is to report it to mailing -list used for the development of the code (linux-media ML) copying the -driver maintainer (Hans). +If the listing contains bug reporting URIs at the end, please prefer them over +email. Otherwise, please report bugs to the mailing list used for the +development of the code (linux-media ML) copying the driver maintainer (Hans). If you are totally stumped as to whom to send the report, and ``get_maintainer.pl`` didn't provide you anything useful, send it to @@ -367,12 +368,3 @@ processed by ``klogd``:: Aug 29 09:51:01 blizard kernel: Call Trace: [oops:_oops_ioctl+48/80] [_sys_ioctl+254/272] [_system_call+82/128] Aug 29 09:51:01 blizard kernel: Code: c7 00 05 00 00 00 eb 08 90 90 90 90 90 90 90 90 89 ec 5d c3 ---------------------------------------------------------------------------- - -:: - - Dr. G.W. Wettstein Oncology Research Div. Computing Facility - Roger Maris Cancer Center INTERNET: greg@wind.rmcc.com - 820 4th St. N. - Fargo, ND 58122 - Phone: 701-234-7556 diff --git a/Documentation/admin-guide/cgroup-v1/cgroups.rst b/Documentation/admin-guide/cgroup-v1/cgroups.rst index 9343148ee993..a3e2edb3d274 100644 --- a/Documentation/admin-guide/cgroup-v1/cgroups.rst +++ b/Documentation/admin-guide/cgroup-v1/cgroups.rst @@ -570,7 +570,7 @@ visible to cgroup_for_each_child/descendant_*() iterators. The subsystem may choose to fail creation by returning -errno. This callback can be used to implement reliable state sharing and propagation along the hierarchy. See the comment on -cgroup_for_each_descendant_pre() for details. +cgroup_for_each_live_descendant_pre() for details. ``void css_offline(struct cgroup *cgrp);`` (cgroup_mutex held by caller) diff --git a/Documentation/admin-guide/cgroup-v1/cpusets.rst b/Documentation/admin-guide/cgroup-v1/cpusets.rst index 7d3415eea05d..f401af5e2f09 100644 --- a/Documentation/admin-guide/cgroup-v1/cpusets.rst +++ b/Documentation/admin-guide/cgroup-v1/cpusets.rst @@ -568,7 +568,7 @@ on the next tick. For some applications in special situation, waiting The 'cpuset.sched_relax_domain_level' file allows you to request changing this searching range as you like. This file takes int value which -indicates size of searching range in levels ideally as follows, +indicates size of searching range in levels approximately as follows, otherwise initial value -1 that indicates the cpuset has no request. ====== =========================================================== @@ -581,6 +581,11 @@ otherwise initial value -1 that indicates the cpuset has no request. 5 search system wide [on NUMA system] ====== =========================================================== +Not all levels can be present and values can change depending on the +system architecture and kernel configuration. Check +/sys/kernel/debug/sched/domains/cpu*/domain*/ for system-specific +details. + The system default is architecture dependent. The system default can be changed using the relax_domain_level= boot parameter. diff --git a/Documentation/admin-guide/cgroup-v1/memcg_test.rst b/Documentation/admin-guide/cgroup-v1/memcg_test.rst index 1f128458ddea..9f8e27355cba 100644 --- a/Documentation/admin-guide/cgroup-v1/memcg_test.rst +++ b/Documentation/admin-guide/cgroup-v1/memcg_test.rst @@ -102,7 +102,7 @@ Under below explanation, we assume CONFIG_SWAP=y. The logic is very clear. (About migration, see below) Note: - __remove_from_page_cache() is called by remove_from_page_cache() + __filemap_remove_folio() is called by filemap_remove_folio() and __remove_mapping(). 6. Shmem(tmpfs) Page Cache diff --git a/Documentation/admin-guide/cgroup-v1/memory.rst b/Documentation/admin-guide/cgroup-v1/memory.rst index ca7d9402f6be..286d16fc22eb 100644 --- a/Documentation/admin-guide/cgroup-v1/memory.rst +++ b/Documentation/admin-guide/cgroup-v1/memory.rst @@ -78,18 +78,22 @@ Brief summary of control files. memory.memsw.max_usage_in_bytes show max memory+Swap usage recorded memory.soft_limit_in_bytes set/show soft limit of memory usage This knob is not available on CONFIG_PREEMPT_RT systems. + This knob is deprecated and shouldn't be + used. memory.stat show various statistics memory.use_hierarchy set/show hierarchical account enabled This knob is deprecated and shouldn't be used. memory.force_empty trigger forced page reclaim memory.pressure_level set memory pressure notifications + This knob is deprecated and shouldn't be + used. memory.swappiness set/show swappiness parameter of vmscan (See sysctl's vm.swappiness) - memory.move_charge_at_immigrate set/show controls of moving charges + memory.move_charge_at_immigrate This knob is deprecated. + memory.oom_control set/show oom controls. This knob is deprecated and shouldn't be used. - memory.oom_control set/show oom controls. memory.numa_stat show the number of memory usage per numa node memory.kmem.limit_in_bytes Deprecated knob to set and read the kernel @@ -105,10 +109,18 @@ Brief summary of control files. memory.kmem.max_usage_in_bytes show max kernel memory usage recorded memory.kmem.tcp.limit_in_bytes set/show hard limit for tcp buf memory + This knob is deprecated and shouldn't be + used. memory.kmem.tcp.usage_in_bytes show current tcp buf memory allocation + This knob is deprecated and shouldn't be + used. memory.kmem.tcp.failcnt show the number of tcp buf memory usage hits limits + This knob is deprecated and shouldn't be + used. memory.kmem.tcp.max_usage_in_bytes show max tcp buf memory usage recorded + This knob is deprecated and shouldn't be + used. ==================================== ========================================== 1. History @@ -229,10 +241,6 @@ behind this approach is that a cgroup that aggressively uses a shared page will eventually get charged for it (once it is uncharged from the cgroup that brought it in -- this will happen on memory pressure). -But see :ref:`section 8.2 <cgroup-v1-memory-movable-charges>` when moving a -task to another cgroup, its pages may be recharged to the new cgroup, if -move_charge_at_immigrate has been chosen. - 2.4 Swap Extension -------------------------------------- @@ -300,14 +308,14 @@ When oom event notifier is registered, event will be delivered. Lock order is as follows:: - Page lock (PG_locked bit of page->flags) + folio_lock mm->page_table_lock or split pte_lock folio_memcg_lock (memcg->move_lock) mapping->i_pages lock lruvec->lru_lock. Per-node-per-memcgroup LRU (cgroup's private LRU) is guarded by -lruvec->lru_lock; PG_lru bit of page->flags is cleared before +lruvec->lru_lock; the folio LRU flag is cleared before isolating a page from its LRU under lruvec->lru_lock. .. _cgroup-v1-memory-kernel-extension: @@ -693,8 +701,10 @@ For compatibility reasons writing 1 to memory.use_hierarchy will always pass:: # echo 1 > memory.use_hierarchy -7. Soft limits -============== +7. Soft limits (DEPRECATED) +=========================== + +THIS IS DEPRECATED! Soft limits allow for greater sharing of memory. The idea behind soft limits is to allow control groups to use as much of the memory as needed, provided @@ -740,78 +750,8 @@ If we want to change this to 1G, we can at any time use:: THIS IS DEPRECATED! -It's expensive and unreliable! It's better practice to launch workload -tasks directly from inside their target cgroup. Use dedicated workload -cgroups to allow fine-grained policy adjustments without having to -move physical pages between control domains. - -Users can move charges associated with a task along with task migration, that -is, uncharge task's pages from the old cgroup and charge them to the new cgroup. -This feature is not supported in !CONFIG_MMU environments because of lack of -page tables. - -8.1 Interface -------------- - -This feature is disabled by default. It can be enabled (and disabled again) by -writing to memory.move_charge_at_immigrate of the destination cgroup. - -If you want to enable it:: - - # echo (some positive value) > memory.move_charge_at_immigrate - -.. note:: - Each bits of move_charge_at_immigrate has its own meaning about what type - of charges should be moved. See :ref:`section 8.2 - <cgroup-v1-memory-movable-charges>` for details. - -.. note:: - Charges are moved only when you move mm->owner, in other words, - a leader of a thread group. - -.. note:: - If we cannot find enough space for the task in the destination cgroup, we - try to make space by reclaiming memory. Task migration may fail if we - cannot make enough space. - -.. note:: - It can take several seconds if you move charges much. - -And if you want disable it again:: - - # echo 0 > memory.move_charge_at_immigrate - -.. _cgroup-v1-memory-movable-charges: - -8.2 Type of charges which can be moved --------------------------------------- - -Each bit in move_charge_at_immigrate has its own meaning about what type of -charges should be moved. But in any case, it must be noted that an account of -a page or a swap can be moved only when it is charged to the task's current -(old) memory cgroup. - -+---+--------------------------------------------------------------------------+ -|bit| what type of charges would be moved ? | -+===+==========================================================================+ -| 0 | A charge of an anonymous page (or swap of it) used by the target task. | -| | You must enable Swap Extension (see 2.4) to enable move of swap charges. | -+---+--------------------------------------------------------------------------+ -| 1 | A charge of file pages (normal file, tmpfs file (e.g. ipc shared memory) | -| | and swaps of tmpfs file) mmapped by the target task. Unlike the case of | -| | anonymous pages, file pages (and swaps) in the range mmapped by the task | -| | will be moved even if the task hasn't done page fault, i.e. they might | -| | not be the task's "RSS", but other task's "RSS" that maps the same file. | -| | And mapcount of the page is ignored (the page can be moved even if | -| | page_mapcount(page) > 1). You must enable Swap Extension (see 2.4) to | -| | enable move of swap charges. | -+---+--------------------------------------------------------------------------+ - -8.3 TODO --------- - -- All of moving charge operations are done under cgroup_mutex. It's not good - behavior to hold the mutex too long, so we may need some trick. +Reading memory.move_charge_at_immigrate will always return 0 and writing +to it will always return -EINVAL. 9. Memory thresholds ==================== @@ -834,8 +774,10 @@ It's applicable for root and non-root cgroup. .. _cgroup-v1-memory-oom-control: -10. OOM Control -=============== +10. OOM Control (DEPRECATED) +============================ + +THIS IS DEPRECATED! memory.oom_control file is for OOM notification and other controls. @@ -882,8 +824,10 @@ At reading, current status of OOM is shown. The number of processes belonging to this cgroup killed by any kind of OOM killer. -11. Memory Pressure -=================== +11. Memory Pressure (DEPRECATED) +================================ + +THIS IS DEPRECATED! The pressure level notifications can be used to monitor the memory allocation cost; based on the pressure, applications can implement diff --git a/Documentation/admin-guide/cgroup-v1/pids.rst b/Documentation/admin-guide/cgroup-v1/pids.rst index 6acebd9e72c8..0f9f9a7b1f6c 100644 --- a/Documentation/admin-guide/cgroup-v1/pids.rst +++ b/Documentation/admin-guide/cgroup-v1/pids.rst @@ -36,7 +36,8 @@ superset of parent/child/pids.current. The pids.events file contains event counters: - - max: Number of times fork failed because limit was hit. + - max: Number of times fork failed in the cgroup because limit was hit in + self or ancestors. Example ------- diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst index 17e6e9565156..cb1b4e759b7e 100644 --- a/Documentation/admin-guide/cgroup-v2.rst +++ b/Documentation/admin-guide/cgroup-v2.rst @@ -64,13 +64,14 @@ v1 is available under :ref:`Documentation/admin-guide/cgroup-v1/index.rst <cgrou 5-6. Device 5-7. RDMA 5-7-1. RDMA Interface Files - 5-8. HugeTLB - 5.8-1. HugeTLB Interface Files - 5-9. Misc - 5.9-1 Miscellaneous cgroup Interface Files - 5.9-2 Migration and Ownership - 5-10. Others - 5-10-1. perf_event + 5-8. DMEM + 5-9. HugeTLB + 5.9-1. HugeTLB Interface Files + 5-10. Misc + 5.10-1 Miscellaneous cgroup Interface Files + 5.10-2 Migration and Ownership + 5-11. Others + 5-11-1. perf_event 5-N. Non-normative information 5-N-1. CPU controller root cgroup process behaviour 5-N-2. IO controller root cgroup process behaviour @@ -239,6 +240,13 @@ cgroup v2 currently supports the following mount options. will not be tracked by the memory controller (even if cgroup v2 is remounted later on). + pids_localevents + The option restores v1-like behavior of pids.events:max, that is only + local (inside cgroup proper) fork failures are counted. Without this + option pids.events.max represents any pids.max enforcemnt across + cgroup's subtree. + + Organizing Processes and Threads -------------------------------- @@ -526,10 +534,12 @@ cgroup namespace on namespace creation. Because the resource control interface files in a given directory control the distribution of the parent's resources, the delegatee shouldn't be allowed to write to them. For the first method, this is -achieved by not granting access to these files. For the second, the -kernel rejects writes to all files other than "cgroup.procs" and -"cgroup.subtree_control" on a namespace root from inside the -namespace. +achieved by not granting access to these files. For the second, files +outside the namespace should be hidden from the delegatee by the means +of at least mount namespacing, and the kernel rejects writes to all +files on a namespace root from inside the cgroup namespace, except for +those files listed in "/sys/kernel/cgroup/delegate" (including +"cgroup.procs", "cgroup.threads", "cgroup.subtree_control", etc.). The end results are equivalent for both delegation types. Once delegated, the user can build sub-hierarchy under the directory, @@ -974,6 +984,14 @@ All cgroup core files are prefixed with "cgroup." A dying cgroup can consume system resources not exceeding limits, which were active at the moment of cgroup deletion. + nr_subsys_<cgroup_subsys> + Total number of live cgroup subsystems (e.g memory + cgroup) at and beneath the current cgroup. + + nr_dying_subsys_<cgroup_subsys> + Total number of dying cgroup subsystems (e.g. memory + cgroup) at and beneath the current cgroup. + cgroup.freeze A read-write single value file which exists on non-root cgroups. Allowed values are "0" and "1". The default is "0". @@ -1058,12 +1076,15 @@ cpufreq governor about the minimum desired frequency which should always be provided by a CPU, as well as the maximum desired frequency, which should not be exceeded by a CPU. -WARNING: cgroup2 doesn't yet support control of realtime processes and -the cpu controller can only be enabled when all RT processes are in -the root cgroup. Be aware that system management software may already -have placed RT processes into nonroot cgroups during the system boot -process, and these processes may need to be moved to the root cgroup -before the cpu controller can be enabled. +WARNING: cgroup2 doesn't yet support control of realtime processes. For +a kernel built with the CONFIG_RT_GROUP_SCHED option enabled for group +scheduling of realtime processes, the cpu controller can only be enabled +when all RT processes are in the root cgroup. This limitation does +not apply if CONFIG_RT_GROUP_SCHED is disabled. Be aware that system +management software may already have placed RT processes into nonroot +cgroups during the system boot process, and these processes may need +to be moved to the root cgroup before the cpu controller can be enabled +with a CONFIG_RT_GROUP_SCHED enabled kernel. CPU Interface Files @@ -1296,17 +1317,10 @@ PAGE_SIZE multiple when read back. This is a simple interface to trigger memory reclaim in the target cgroup. - This file accepts a single key, the number of bytes to reclaim. - No nested keys are currently supported. - Example:: echo "1G" > memory.reclaim - The interface can be later extended with nested keys to - configure the reclaim behavior. For example, specify the - type of memory to reclaim from (anon, file, ..). - Please note that the kernel can over or under reclaim from the target cgroup. If less bytes are reclaimed than the specified amount, -EAGAIN is returned. @@ -1318,12 +1332,26 @@ PAGE_SIZE multiple when read back. This means that the networking layer will not adapt based on reclaim induced by memory.reclaim. +The following nested keys are defined. + + ========== ================================ + swappiness Swappiness value to reclaim with + ========== ================================ + + Specifying a swappiness value instructs the kernel to perform + the reclaim with that swappiness value. Note that this has the + same semantics as vm.swappiness applied to memcg reclaim with + all the existing limitations and potential future extensions. + memory.peak - A read-only single value file which exists on non-root - cgroups. + A read-write single value file which exists on non-root cgroups. - The max memory usage recorded for the cgroup and its - descendants since the creation of the cgroup. + The max memory usage recorded for the cgroup and its descendants since + either the creation of the cgroup or the most recent reset for that FD. + + A write of any non-empty string to this file resets it to the + current memory usage for subsequent reads through the same + file descriptor. memory.oom.group A read-write single value file which exists on non-root @@ -1432,7 +1460,7 @@ PAGE_SIZE multiple when read back. sec_pagetables Amount of memory allocated for secondary page tables, this currently includes KVM mmu allocations on x86 - and arm64. + and arm64 and IOMMU page tables. percpu (npn) Amount of memory used for storing per-cpu kernel @@ -1572,6 +1600,24 @@ PAGE_SIZE multiple when read back. pglazyfreed (npn) Amount of reclaimed lazyfree pages + swpin_zero + Number of pages swapped into memory and filled with zero, where I/O + was optimized out because the page content was detected to be zero + during swapout. + + swpout_zero + Number of zero-filled pages swapped out with I/O skipped due to the + content being detected as zero. + + zswpin + Number of pages moved in to memory from zswap. + + zswpout + Number of pages moved out of memory to zswap. + + zswpwb + Number of pages written from zswap to swap. + thp_fault_alloc (npn) Number of transparent hugepages which were allocated to satisfy a page fault. This counter is not present when CONFIG_TRANSPARENT_HUGEPAGE @@ -1591,6 +1637,30 @@ PAGE_SIZE multiple when read back. Usually because failed to allocate some continuous swap space for the huge page. + numa_pages_migrated (npn) + Number of pages migrated by NUMA balancing. + + numa_pte_updates (npn) + Number of pages whose page table entries are modified by + NUMA balancing to produce NUMA hinting faults on access. + + numa_hint_faults (npn) + Number of NUMA hinting faults. + + pgdemote_kswapd + Number of pages demoted by kswapd. + + pgdemote_direct + Number of pages demoted directly. + + pgdemote_khugepaged + Number of pages demoted by khugepaged. + + hugetlb + Amount of memory used by hugetlb pages. This metric only shows + up if hugetlb usage is accounted for in memory.current (i.e. + cgroup is mounted with the memory_hugetlb_accounting option). + memory.numa_stat A read-only nested-keyed file which exists on non-root cgroups. @@ -1640,11 +1710,14 @@ PAGE_SIZE multiple when read back. Healthy workloads are not expected to reach this limit. memory.swap.peak - A read-only single value file which exists on non-root - cgroups. + A read-write single value file which exists on non-root cgroups. + + The max swap usage recorded for the cgroup and its descendants since + the creation of the cgroup or the most recent reset for that FD. - The max swap usage recorded for the cgroup and its - descendants since the creation of the cgroup. + A write of any non-empty string to this file resets it to the + current memory usage for subsequent reads through the same + file descriptor. memory.swap.max A read-write single value file which exists on non-root @@ -1694,9 +1767,10 @@ PAGE_SIZE multiple when read back. entries fault back in or are written out to disk. memory.zswap.writeback - A read-write single value file. The default value is "1". The - initial value of the root cgroup is 1, and when a new cgroup is - created, it inherits the current value of its parent. + A read-write single value file. The default value is "1". + Note that this setting is hierarchical, i.e. the writeback would be + implicitly disabled for child cgroups if the upper hierarchy + does so. When this is set to 0, all swapping attempts to swapping devices are disabled. This included both zswap writebacks, and swapping due @@ -1707,6 +1781,8 @@ PAGE_SIZE multiple when read back. Note that this is subtly different from setting memory.swap.max to 0, as it still allows for pages to be written to the zswap pool. + This setting has no effect if zswap is disabled, and swapping + is allowed unless memory.swap.max is set to 0. memory.pressure A read-only nested-keyed file. @@ -2181,11 +2257,31 @@ PID Interface Files Hard limit of number of processes. pids.current - A read-only single value file which exists on all cgroups. + A read-only single value file which exists on non-root cgroups. The number of processes currently in the cgroup and its descendants. + pids.peak + A read-only single value file which exists on non-root cgroups. + + The maximum value that the number of processes in the cgroup and its + descendants has ever reached. + + pids.events + A read-only flat-keyed file which exists on non-root cgroups. Unless + specified otherwise, a value change in this file generates a file + modified event. The following entries are defined. + + max + The number of times the cgroup's total number of processes hit the pids.max + limit (see also pids_localevents). + + pids.events.local + Similar to pids.events but the fields in the file are local + to the cgroup i.e. not hierarchical. The file modified event + generated on this file reflects only the local events. + Organisational operations are not blocked by cgroup policies, so it is possible to have pids.current > pids.max. This can be done by either setting the limit to be smaller than pids.current, or attaching enough @@ -2320,8 +2416,12 @@ Cpuset Interface Files is always a subset of it. Users can manually set it to a value that is different from - "cpuset.cpus". The only constraint in setting it is that the - list of CPUs must be exclusive with respect to its sibling. + "cpuset.cpus". One constraint in setting it is that the list of + CPUs must be exclusive with respect to "cpuset.cpus.exclusive" + of its sibling. If "cpuset.cpus.exclusive" of a sibling cgroup + isn't set, its "cpuset.cpus" value, if set, cannot be a subset + of it to leave at least one CPU available when the exclusive + CPUs are taken away. For a parent cgroup, any one of its exclusive CPUs can only be distributed to at most one of its child cgroups. Having an @@ -2337,8 +2437,8 @@ Cpuset Interface Files cpuset-enabled cgroups. This file shows the effective set of exclusive CPUs that - can be used to create a partition root. The content of this - file will always be a subset of "cpuset.cpus" and its parent's + can be used to create a partition root. The content + of this file will always be a subset of its parent's "cpuset.cpus.exclusive.effective" if its parent is not the root cgroup. It will also be a subset of "cpuset.cpus.exclusive" if it is set. If "cpuset.cpus.exclusive" is not set, it is @@ -2527,6 +2627,49 @@ RDMA Interface Files mlx4_0 hca_handle=1 hca_object=20 ocrdma1 hca_handle=1 hca_object=23 +DMEM +---- + +The "dmem" controller regulates the distribution and accounting of +device memory regions. Because each memory region may have its own page size, +which does not have to be equal to the system page size, the units are always bytes. + +DMEM Interface Files +~~~~~~~~~~~~~~~~~~~~ + + dmem.max, dmem.min, dmem.low + A readwrite nested-keyed file that exists for all the cgroups + except root that describes current configured resource limit + for a region. + + An example for xe follows:: + + drm/0000:03:00.0/vram0 1073741824 + drm/0000:03:00.0/stolen max + + The semantics are the same as for the memory cgroup controller, and are + calculated in the same way. + + dmem.capacity + A read-only file that describes maximum region capacity. + It only exists on the root cgroup. Not all memory can be + allocated by cgroups, as the kernel reserves some for + internal use. + + An example for xe follows:: + + drm/0000:03:00.0/vram0 8514437120 + drm/0000:03:00.0/stolen 67108864 + + dmem.current + A read-only file that describes current resource usage. + It exists for all the cgroup except root. + + An example for xe follows:: + + drm/0000:03:00.0/vram0 12550144 + drm/0000:03:00.0/stolen 8650752 + HugeTLB ------- @@ -2599,6 +2742,15 @@ Miscellaneous controller provides 3 interface files. If two misc resources (res_ res_a 3 res_b 0 + misc.peak + A read-only flat-keyed file shown in all cgroups. It shows the + historical maximum usage of the resources in the cgroup and its + children.:: + + $ cat misc.peak + res_a 10 + res_b 8 + misc.max A read-write flat-keyed file shown in the non root cgroups. Allowed maximum usage of the resources in the cgroup and its children.:: @@ -2628,6 +2780,11 @@ Miscellaneous controller provides 3 interface files. If two misc resources (res_ The number of times the cgroup's resource usage was about to go over the max boundary. + misc.events.local + Similar to misc.events but the fields in the file are local to the + cgroup i.e. not hierarchical. The file modified event generated on + this file reflects only the local events. + Migration and Ownership ~~~~~~~~~~~~~~~~~~~~~~~ @@ -2846,7 +3003,7 @@ following two functions. a queue (device) has been associated with the bio and before submission. - wbc_account_cgroup_owner(@wbc, @page, @bytes) + wbc_account_cgroup_owner(@wbc, @folio, @bytes) Should be called for each data segment being written out. While this function doesn't care exactly when it's called during the writeback session, it's the easiest and most @@ -2878,8 +3035,8 @@ Deprecated v1 Core Features - "cgroup.clone_children" is removed. -- /proc/cgroups is meaningless for v2. Use "cgroup.controllers" file - at the root instead. +- /proc/cgroups is meaningless for v2. Use "cgroup.controllers" or + "cgroup.stat" files at the root instead. Issues with v1 and Rationales for v2 diff --git a/Documentation/admin-guide/cifs/usage.rst b/Documentation/admin-guide/cifs/usage.rst index aa8290a29dc8..c09674a75a9e 100644 --- a/Documentation/admin-guide/cifs/usage.rst +++ b/Documentation/admin-guide/cifs/usage.rst @@ -723,40 +723,26 @@ Configuration pseudo-files: ======================= ======================================================= SecurityFlags Flags which control security negotiation and also packet signing. Authentication (may/must) - flags (e.g. for NTLM and/or NTLMv2) may be combined with + flags (e.g. for NTLMv2) may be combined with the signing flags. Specifying two different password hashing mechanisms (as "must use") on the other hand does not make much sense. Default flags are:: - 0x07007 - - (NTLM, NTLMv2 and packet signing allowed). The maximum - allowable flags if you want to allow mounts to servers - using weaker password hashes is 0x37037 (lanman, - plaintext, ntlm, ntlmv2, signing allowed). Some - SecurityFlags require the corresponding menuconfig - options to be enabled. Enabling plaintext - authentication currently requires also enabling - lanman authentication in the security flags - because the cifs module only supports sending - laintext passwords using the older lanman dialect - form of the session setup SMB. (e.g. for authentication - using plain text passwords, set the SecurityFlags - to 0x30030):: + 0x00C5 + + (NTLMv2 and packet signing allowed). Some SecurityFlags + may require enabling a corresponding menuconfig option. may use packet signing 0x00001 must use packet signing 0x01001 - may use NTLM (most common password hash) 0x00002 - must use NTLM 0x02002 may use NTLMv2 0x00004 must use NTLMv2 0x04004 - may use Kerberos security 0x00008 - must use Kerberos 0x08008 - may use lanman (weak) password hash 0x00010 - must use lanman password hash 0x10010 - may use plaintext passwords 0x00020 - must use plaintext passwords 0x20020 - (reserved for future packet encryption) 0x00040 + may use Kerberos security (krb5) 0x00008 + must use Kerberos 0x08008 + may use NTLMSSP 0x00080 + must use NTLMSSP 0x80080 + seal (packet encryption) 0x00040 + must seal 0x40040 cifsFYI If set to non-zero value, additional debug information will be logged to the system error log. This field diff --git a/Documentation/admin-guide/device-mapper/delay.rst b/Documentation/admin-guide/device-mapper/delay.rst index 917ba8c33359..4d667228e744 100644 --- a/Documentation/admin-guide/device-mapper/delay.rst +++ b/Documentation/admin-guide/device-mapper/delay.rst @@ -3,29 +3,52 @@ dm-delay ======== Device-Mapper's "delay" target delays reads and/or writes -and maps them to different devices. +and/or flushs and optionally maps them to different devices. -Parameters:: +Arguments:: <device> <offset> <delay> [<write_device> <write_offset> <write_delay> [<flush_device> <flush_offset> <flush_delay>]] -With separate write parameters, the first set is only used for reads. +Table line has to either have 3, 6 or 9 arguments: + +3: apply offset and delay to read, write and flush operations on device + +6: apply offset and delay to device, also apply write_offset and write_delay + to write and flush operations on optionally different write_device with + optionally different sector offset + +9: same as 6 arguments plus define flush_offset and flush_delay explicitely + on/with optionally different flush_device/flush_offset. + Offsets are specified in sectors. + Delays are specified in milliseconds. + Example scripts =============== :: - #!/bin/sh - # Create device delaying rw operation for 500ms - echo "0 `blockdev --getsz $1` delay $1 0 500" | dmsetup create delayed + # + # Create mapped device named "delayed" delaying read, write and flush operations for 500ms. + # + dmsetup create delayed --table "0 `blockdev --getsz $1` delay $1 0 500" :: + #!/bin/sh + # + # Create mapped device delaying write and flush operations for 400ms and + # splitting reads to device $1 but writes and flushs to different device $2 + # to different offsets of 2048 and 4096 sectors respectively. + # + dmsetup create delayed --table "0 `blockdev --getsz $1` delay $1 2048 0 $2 4096 400" +:: #!/bin/sh - # Create device delaying only write operation for 500ms and - # splitting reads and writes to different devices $1 $2 - echo "0 `blockdev --getsz $1` delay $1 0 0 $2 0 500" | dmsetup create delayed + # + # Create mapped device delaying reads for 50ms, writes for 100ms and flushs for 333ms + # onto the same backing device at offset 0 sectors. + # + dmsetup create delayed --table "0 `blockdev --getsz $1` delay $1 0 50 $2 0 100 $1 0 333" diff --git a/Documentation/admin-guide/device-mapper/dm-crypt.rst b/Documentation/admin-guide/device-mapper/dm-crypt.rst index aa2d04d95df6..9f8139ff97d6 100644 --- a/Documentation/admin-guide/device-mapper/dm-crypt.rst +++ b/Documentation/admin-guide/device-mapper/dm-crypt.rst @@ -113,6 +113,11 @@ same_cpu_crypt The default is to use an unbound workqueue so that encryption work is automatically balanced between available CPUs. +high_priority + Set dm-crypt workqueues and the writer thread to high priority. This + improves throughput and latency of dm-crypt while degrading general + responsiveness of the system. + submit_from_crypt_cpus Disable offloading writes to a separate thread after encryption. There are some situations where offloading write bios from the @@ -155,6 +160,27 @@ iv_large_sectors The <iv_offset> must be multiple of <sector_size> (in 512 bytes units) if this flag is specified. +integrity_key_size:<bytes> + Use an integrity key of <bytes> size instead of using an integrity key size + of the digest size of the used HMAC algorithm. + + +Module parameters:: + max_read_size + Maximum size of read requests. When a request larger than this size + is received, dm-crypt will split the request. The splitting improves + concurrency (the split requests could be encrypted in parallel by multiple + cores), but it also causes overhead. The user should tune this parameters to + fit the actual workload. + + max_write_size + Maximum size of write requests. When a request larger than this size + is received, dm-crypt will split the request. The splitting improves + concurrency (the split requests could be encrypted in parallel by multiple + cores), but it also causes overhead. The user should tune this parameters to + fit the actual workload. + + Example scripts =============== LUKS (Linux Unified Key Setup) is now the preferred way to set up disk diff --git a/Documentation/admin-guide/device-mapper/vdo.rst b/Documentation/admin-guide/device-mapper/vdo.rst index 7e1ecafdf91e..a14e6d3e787c 100644 --- a/Documentation/admin-guide/device-mapper/vdo.rst +++ b/Documentation/admin-guide/device-mapper/vdo.rst @@ -241,6 +241,7 @@ Messages All vdo devices accept messages in the form: :: + dmsetup message <target-name> 0 <message-name> <message-parameters> The messages are: @@ -250,7 +251,12 @@ The messages are: by the vdostats userspace program to interpret the output buffer. - dump: + config: + Outputs useful vdo configuration information. Mostly used + by users who want to recreate a similar VDO volume and + want to know the creation configuration used. + + dump: Dumps many internal structures to the system log. This is not always safe to run, so it should only be used to debug a hung vdo. Optional parameters to specify structures to diff --git a/Documentation/admin-guide/dynamic-debug-howto.rst b/Documentation/admin-guide/dynamic-debug-howto.rst index 0e9b48daf690..7c036590cd07 100644 --- a/Documentation/admin-guide/dynamic-debug-howto.rst +++ b/Documentation/admin-guide/dynamic-debug-howto.rst @@ -26,6 +26,11 @@ Dynamic debug provides: - format string - class name (as known/declared by each module) +NOTE: To actually get the debug-print output on the console, you may +need to adjust the kernel ``loglevel=``, or use ``ignore_loglevel``. +Read about these kernel parameters in +Documentation/admin-guide/kernel-parameters.rst. + Viewing Dynamic Debug Behaviour =============================== diff --git a/Documentation/admin-guide/ext4.rst b/Documentation/admin-guide/ext4.rst index 5740d85439ff..2418b0c2d3df 100644 --- a/Documentation/admin-guide/ext4.rst +++ b/Documentation/admin-guide/ext4.rst @@ -212,16 +212,6 @@ When mounting an ext4 filesystem, the following option are accepted: that ext4's inode table readahead algorithm will pre-read into the buffer cache. The default value is 32 blocks. - nouser_xattr - Disables Extended User Attributes. See the attr(5) manual page for - more information about extended attributes. - - noacl - This option disables POSIX Access Control List support. If ACL support - is enabled in the kernel configuration (CONFIG_EXT4_FS_POSIX_ACL), ACL - is enabled by default on mount. See the acl(5) manual page for more - information about acl. - bsddf (*) Make 'df' act like BSD. diff --git a/Documentation/admin-guide/gpio/gpio-virtuser.rst b/Documentation/admin-guide/gpio/gpio-virtuser.rst new file mode 100644 index 000000000000..2aca70db9f3b --- /dev/null +++ b/Documentation/admin-guide/gpio/gpio-virtuser.rst @@ -0,0 +1,177 @@ +.. SPDX-License-Identifier: GPL-2.0-only + +Virtual GPIO Consumer +===================== + +The virtual GPIO Consumer module allows users to instantiate virtual devices +that request GPIOs and then control their behavior over debugfs. Virtual +consumer devices can be instantiated from device-tree or over configfs. + +A virtual consumer uses the driver-facing GPIO APIs and allows to cover it with +automated tests driven by user-space. The GPIOs are requested using +``gpiod_get_array()`` and so we support multiple GPIOs per connector ID. + +Creating GPIO consumers +----------------------- + +The gpio-consumer module registers a configfs subsystem called +``'gpio-virtuser'``. For details of the configfs filesystem, please refer to +the configfs documentation. + +The user can create a hierarchy of configfs groups and items as well as modify +values of exposed attributes. Once the consumer is instantiated, this hierarchy +will be translated to appropriate device properties. The general structure is: + +**Group:** ``/config/gpio-virtuser`` + +This is the top directory of the gpio-consumer configfs tree. + +**Group:** ``/config/gpio-consumer/example-name`` + +**Attribute:** ``/config/gpio-consumer/example-name/live`` + +**Attribute:** ``/config/gpio-consumer/example-name/dev_name`` + +This is a directory representing a GPIO consumer device. + +The read-only ``dev_name`` attribute exposes the name of the device as it will +appear in the system on the platform bus. This is useful for locating the +associated debugfs directory under +``/sys/kernel/debug/gpio-virtuser/$dev_name``. + +The ``'live'`` attribute allows to trigger the actual creation of the device +once it's fully configured. The accepted values are: ``'1'`` to enable the +virtual device and ``'0'`` to disable and tear it down. + +Creating GPIO lookup tables +--------------------------- + +Users can create a number of configfs groups under the device group: + +**Group:** ``/config/gpio-consumer/example-name/con_id`` + +The ``'con_id'`` directory represents a single GPIO lookup and its value maps +to the ``'con_id'`` argument of the ``gpiod_get()`` function. For example: +``con_id`` == ``'reset'`` maps to the ``reset-gpios`` device property. + +Users can assign a number of GPIOs to each lookup. Each GPIO is a sub-directory +with a user-defined name under the ``'con_id'`` group. + +**Attribute:** ``/config/gpio-consumer/example-name/con_id/0/key`` + +**Attribute:** ``/config/gpio-consumer/example-name/con_id/0/offset`` + +**Attribute:** ``/config/gpio-consumer/example-name/con_id/0/drive`` + +**Attribute:** ``/config/gpio-consumer/example-name/con_id/0/pull`` + +**Attribute:** ``/config/gpio-consumer/example-name/con_id/0/active_low`` + +**Attribute:** ``/config/gpio-consumer/example-name/con_id/0/transitory`` + +This is a group describing a single GPIO in the ``con_id-gpios`` property. + +For virtual consumers created using configfs we use machine lookup tables so +this group can be considered as a mapping between the filesystem and the fields +of a single entry in ``'struct gpiod_lookup'``. + +The ``'key'`` attribute represents either the name of the chip this GPIO +belongs to or the GPIO line name. This depends on the value of the ``'offset'`` +attribute: if its value is >= 0, then ``'key'`` represents the label of the +chip to lookup while ``'offset'`` represents the offset of the line in that +chip. If ``'offset'`` is < 0, then ``'key'`` represents the name of the line. + +The remaining attributes map to the ``'flags'`` field of the GPIO lookup +struct. The first two take string values as arguments: + +**``'drive'``:** ``'push-pull'``, ``'open-drain'``, ``'open-source'`` +**``'pull'``:** ``'pull-up'``, ``'pull-down'``, ``'pull-disabled'``, ``'as-is'`` + +``'active_low'`` and ``'transitory'`` are boolean attributes. + +Activating GPIO consumers +------------------------- + +Once the confiuration is complete, the ``'live'`` attribute must be set to 1 in +order to instantiate the consumer. It can be set back to 0 to destroy the +virtual device. The module will synchronously wait for the new simulated device +to be successfully probed and if this doesn't happen, writing to ``'live'`` will +result in an error. + +Device-tree +----------- + +Virtual GPIO consumers can also be defined in device-tree. The compatible string +must be: ``"gpio-virtuser"`` with at least one property following the +standardized GPIO pattern. + +An example device-tree code defining a virtual GPIO consumer: + +.. code-block :: none + + gpio-virt-consumer { + compatible = "gpio-virtuser"; + + foo-gpios = <&gpio0 5 GPIO_ACTIVE_LOW>, <&gpio1 2 0>; + bar-gpios = <&gpio0 6 0>; + }; + +Controlling virtual GPIO consumers +---------------------------------- + +Once active, the device will export debugfs attributes for controlling GPIO +arrays as well as each requested GPIO line separately. Let's consider the +following device property: ``foo-gpios = <&gpio0 0 0>, <&gpio0 4 0>;``. + +The following debugfs attribute groups will be created: + +**Group:** ``/sys/kernel/debug/gpio-virtuser/$dev_name/gpiod:foo/`` + +This is the group that will contain the attributes for the entire GPIO array. + +**Attribute:** ``/sys/kernel/debug/gpio-virtuser/$dev_name/gpiod:foo/values`` + +**Attribute:** ``/sys/kernel/debug/gpio-virtuser/$dev_name/gpiod:foo/values_atomic`` + +Both attributes allow to read and set arrays of GPIO values. User must pass +exactly the number of values that the array contains in the form of a string +containing zeroes and ones representing inactive and active GPIO states +respectively. In this example: ``echo 11 > values``. + +The ``values_atomic`` attribute works the same as ``values`` but the kernel +will execute the GPIO driver callbacks in interrupt context. + +**Group:** ``/sys/kernel/debug/gpio-virtuser/$dev_name/gpiod:foo:$index/`` + +This is a group that represents a single GPIO with ``$index`` being its offset +in the array. + +**Attribute:** ``/sys/kernel/debug/gpio-virtuser/$dev_name/gpiod:foo:$index/consumer`` + +Allows to set and read the consumer label of the GPIO line. + +**Attribute:** ``/sys/kernel/debug/gpio-virtuser/$dev_name/gpiod:foo:$index/debounce`` + +Allows to set and read the debounce period of the GPIO line. + +**Attribute:** ``/sys/kernel/debug/gpio-virtuser/$dev_name/gpiod:foo:$index/direction`` + +**Attribute:** ``/sys/kernel/debug/gpio-virtuser/$dev_name/gpiod:foo:$index/direction_atomic`` + +These two attributes allow to set the direction of the GPIO line. They accept +"input" and "output" as values. The atomic variant executes the driver callback +in interrupt context. + +**Attribute:** ``/sys/kernel/debug/gpio-virtuser/$dev_name/gpiod:foo:$index/interrupts`` + +If the line is requested in input mode, writing ``1`` to this attribute will +make the module listen for edge interrupts on the GPIO. Writing ``0`` disables +the monitoring. Reading this attribute returns the current number of registered +interrupts (both edges). + +**Attribute:** ``/sys/kernel/debug/gpio-virtuser/$dev_name/gpiod:foo:$index/value`` + +**Attribute:** ``/sys/kernel/debug/gpio-virtuser/$dev_name/gpiod:foo:$index/value_atomic`` + +Both attributes allow to read and set values of individual requested GPIO lines. +They accept the following values: ``1`` and ``0``. diff --git a/Documentation/admin-guide/gpio/index.rst b/Documentation/admin-guide/gpio/index.rst index 460afd29617e..712f379731cb 100644 --- a/Documentation/admin-guide/gpio/index.rst +++ b/Documentation/admin-guide/gpio/index.rst @@ -10,6 +10,7 @@ GPIO Character Device Userspace API <../../userspace-api/gpio/chardev> gpio-aggregator gpio-sim + gpio-virtuser Obsolete APIs <obsolete> .. only:: subproject and html diff --git a/Documentation/admin-guide/hw-vuln/core-scheduling.rst b/Documentation/admin-guide/hw-vuln/core-scheduling.rst index cf1eeefdfc32..a92e10ec402e 100644 --- a/Documentation/admin-guide/hw-vuln/core-scheduling.rst +++ b/Documentation/admin-guide/hw-vuln/core-scheduling.rst @@ -67,8 +67,8 @@ arg4: will be performed for all tasks in the task group of ``pid``. arg5: - userspace pointer to an unsigned long for storing the cookie returned by - ``PR_SCHED_CORE_GET`` command. Should be 0 for all other commands. + userspace pointer to an unsigned long long for storing the cookie returned + by ``PR_SCHED_CORE_GET`` command. Should be 0 for all other commands. In order for a process to push a cookie to, or pull a cookie from a process, it is required to have the ptrace access mode: `PTRACE_MODE_READ_REALCREDS` to the diff --git a/Documentation/admin-guide/hw-vuln/spectre.rst b/Documentation/admin-guide/hw-vuln/spectre.rst index 25a04cda4c2c..132e0bc6007e 100644 --- a/Documentation/admin-guide/hw-vuln/spectre.rst +++ b/Documentation/admin-guide/hw-vuln/spectre.rst @@ -592,85 +592,19 @@ Spectre variant 2 Mitigation control on the kernel command line --------------------------------------------- -Spectre variant 2 mitigation can be disabled or force enabled at the -kernel command line. +In general the kernel selects reasonable default mitigations for the +current CPU. - nospectre_v1 +Spectre default mitigations can be disabled or changed at the kernel +command line with the following options: - [X86,PPC] Disable mitigations for Spectre Variant 1 - (bounds check bypass). With this option data leaks are - possible in the system. + - nospectre_v1 + - nospectre_v2 + - spectre_v2={option} + - spectre_v2_user={option} + - spectre_bhi={option} - nospectre_v2 - - [X86] Disable all mitigations for the Spectre variant 2 - (indirect branch prediction) vulnerability. System may - allow data leaks with this option, which is equivalent - to spectre_v2=off. - - - spectre_v2= - - [X86] Control mitigation of Spectre variant 2 - (indirect branch speculation) vulnerability. - The default operation protects the kernel from - user space attacks. - - on - unconditionally enable, implies - spectre_v2_user=on - off - unconditionally disable, implies - spectre_v2_user=off - auto - kernel detects whether your CPU model is - vulnerable - - Selecting 'on' will, and 'auto' may, choose a - mitigation method at run time according to the - CPU, the available microcode, the setting of the - CONFIG_MITIGATION_RETPOLINE configuration option, - and the compiler with which the kernel was built. - - Selecting 'on' will also enable the mitigation - against user space to user space task attacks. - - Selecting 'off' will disable both the kernel and - the user space protections. - - Specific mitigations can also be selected manually: - - retpoline auto pick between generic,lfence - retpoline,generic Retpolines - retpoline,lfence LFENCE; indirect branch - retpoline,amd alias for retpoline,lfence - eibrs Enhanced/Auto IBRS - eibrs,retpoline Enhanced/Auto IBRS + Retpolines - eibrs,lfence Enhanced/Auto IBRS + LFENCE - ibrs use IBRS to protect kernel - - Not specifying this option is equivalent to - spectre_v2=auto. - - In general the kernel by default selects - reasonable mitigations for the current CPU. To - disable Spectre variant 2 mitigations, boot with - spectre_v2=off. Spectre variant 1 mitigations - cannot be disabled. - - spectre_bhi= - - [X86] Control mitigation of Branch History Injection - (BHI) vulnerability. This setting affects the deployment - of the HW BHI control and the SW BHB clearing sequence. - - on - (default) Enable the HW or SW mitigation as - needed. - off - Disable the mitigation. - -For spectre_v2_user see Documentation/admin-guide/kernel-parameters.txt +For more details on the available options, refer to Documentation/admin-guide/kernel-parameters.txt Mitigation selection guide -------------------------- diff --git a/Documentation/admin-guide/hw-vuln/srso.rst b/Documentation/admin-guide/hw-vuln/srso.rst index e715bfc09879..2ad1c05b8c88 100644 --- a/Documentation/admin-guide/hw-vuln/srso.rst +++ b/Documentation/admin-guide/hw-vuln/srso.rst @@ -135,7 +135,7 @@ and does not want to suffer the performance impact, one can always disable the mitigation with spec_rstack_overflow=off. Similarly, 'Mitigation: IBPB' is another full mitigation type employing -an indrect branch prediction barrier after having applied the required +an indirect branch prediction barrier after having applied the required microcode patch for one's system. This mitigation comes also at a performance cost. @@ -158,3 +158,72 @@ poisoned BTB entry and using that safe one for all function returns. In older Zen1 and Zen2, this is accomplished using a reinterpretation technique similar to Retbleed one: srso_untrain_ret() and srso_safe_ret(). + +Checking the safe RET mitigation actually works +----------------------------------------------- + +In case one wants to validate whether the SRSO safe RET mitigation works +on a kernel, one could use two performance counters + +* PMC_0xc8 - Count of RET/RET lw retired +* PMC_0xc9 - Count of RET/RET lw retired mispredicted + +and compare the number of RETs retired properly vs those retired +mispredicted, in kernel mode. Another way of specifying those events +is:: + + # perf list ex_ret_near_ret + + List of pre-defined events (to be used in -e or -M): + + core: + ex_ret_near_ret + [Retired Near Returns] + ex_ret_near_ret_mispred + [Retired Near Returns Mispredicted] + +Either the command using the event mnemonics:: + + # perf stat -e ex_ret_near_ret:k -e ex_ret_near_ret_mispred:k sleep 10s + +or using the raw PMC numbers:: + + # perf stat -e cpu/event=0xc8,umask=0/k -e cpu/event=0xc9,umask=0/k sleep 10s + +should give the same amount. I.e., every RET retired should be +mispredicted:: + + [root@brent: ~/kernel/linux/tools/perf> ./perf stat -e cpu/event=0xc8,umask=0/k -e cpu/event=0xc9,umask=0/k sleep 10s + + Performance counter stats for 'sleep 10s': + + 137,167 cpu/event=0xc8,umask=0/k + 137,173 cpu/event=0xc9,umask=0/k + + 10.004110303 seconds time elapsed + + 0.000000000 seconds user + 0.004462000 seconds sys + +vs the case when the mitigation is disabled (spec_rstack_overflow=off) +or not functioning properly, showing usually a lot smaller number of +mispredicted retired RETs vs the overall count of retired RETs during +a workload:: + + [root@brent: ~/kernel/linux/tools/perf> ./perf stat -e cpu/event=0xc8,umask=0/k -e cpu/event=0xc9,umask=0/k sleep 10s + + Performance counter stats for 'sleep 10s': + + 201,627 cpu/event=0xc8,umask=0/k + 4,074 cpu/event=0xc9,umask=0/k + + 10.003267252 seconds time elapsed + + 0.002729000 seconds user + 0.000000000 seconds sys + +Also, there is a selftest which performs the above, go to +tools/testing/selftests/x86/ and do:: + + make srso + ./srso diff --git a/Documentation/admin-guide/index.rst b/Documentation/admin-guide/index.rst index 32ea52f1d150..c8af32a8f800 100644 --- a/Documentation/admin-guide/index.rst +++ b/Documentation/admin-guide/index.rst @@ -7,6 +7,9 @@ added to the kernel over time. There is, as yet, little overall order or organization here — this material was not written to be a single, coherent document! With luck things will improve quickly over time. +General guides to kernel administration +--------------------------------------- + This initial section contains overall information, including the README file describing the kernel as a whole, documentation on kernel parameters, etc. @@ -15,19 +18,44 @@ etc. :maxdepth: 1 README - kernel-parameters devices - sysctl/index - abi features -This section describes CPU vulnerabilities and their mitigations. +A big part of the kernel's administrative interface is the /proc and sysfs +virtual filesystems; these documents describe how to interact with tem + +.. toctree:: + :maxdepth: 1 + + sysfs-rules + sysctl/index + cputopology + abi + +Security-related documentation: .. toctree:: :maxdepth: 1 hw-vuln/index + LSM/index + perf-security + +Booting the kernel +------------------ + +.. toctree:: + :maxdepth: 1 + + bootconfig + kernel-parameters + efi-stub + initrd + + +Tracking down and identifying problems +-------------------------------------- Here is a set of documents aimed at users who are trying to track down problems and bugs in particular. @@ -48,95 +76,120 @@ problems and bugs in particular. kdump/index perf/index pstore-blk + clearing-warn-once + kernel-per-CPU-kthreads + lockup-watchdogs + RAS/index + sysrq -This is the beginning of a section with information of interest to -application developers. Documents covering various aspects of the kernel -ABI will be found here. + +Core-kernel subsystems +---------------------- + +These documents describe core-kernel administration interfaces that are +likely to be of interest on almost any system. .. toctree:: :maxdepth: 1 - sysfs-rules + cgroup-v2 + cgroup-v1/index + cpu-load + mm/index + module-signing + namespaces/index + numastat + pm/index + syscall-user-dispatch -This is the beginning of a section with information of interest to -application developers and system integrators doing analysis of the -Linux kernel for safety critical applications. Documents supporting -analysis of kernel interactions with applications, and key kernel -subsystems expectations will be found here. +Support for non-native binary formats. Note that some of these +documents are ... old ... .. toctree:: :maxdepth: 1 - workload-tracing + binfmt-misc + java + mono + -The rest of this manual consists of various unordered guides on how to -configure specific aspects of kernel behavior to your liking. +Block-layer and filesystem administration +----------------------------------------- .. toctree:: :maxdepth: 1 - acpi/index - aoe/index - auxdisplay/index bcache binderfs - binfmt-misc blockdev/index - bootconfig - braille-console - btmrvl - cgroup-v1/index - cgroup-v2 cifs/index - clearing-warn-once - cpu-load - cputopology - dell_rbu device-mapper/index - edid - efi-stub ext4 filesystem-monitoring nfs/index - gpio/index - highuid - hw_random - initrd iostats - java jfs - kernel-per-CPU-kthreads + md + ufs + xfs + +Device-specific guides +---------------------- + +How to configure your hardware within your Linux system. + +.. toctree:: + :maxdepth: 1 + + acpi/index + aoe/index + auxdisplay/index + braille-console + btmrvl + dell_rbu + edid + gpio/index + hw_random laptops/index lcd-panel-cgram - ldm - lockup-watchdogs - LSM/index - md media/index - mm/index - module-signing - mono - namespaces/index - numastat + nvme-multipath parport - perf-security - pm/index - pmf pnp rapidio - RAS/index rtc serial-console svga - syscall-user-dispatch - sysrq thermal/index thunderbolt - ufs - unicode vga-softcursor video-output - xfs + +Workload analysis +----------------- + +This is the beginning of a section with information of interest to +application developers and system integrators doing analysis of the +Linux kernel for safety critical applications. Documents supporting +analysis of kernel interactions with applications, and key kernel +subsystems expectations will be found here. + +.. toctree:: + :maxdepth: 1 + + workload-tracing + +Everything else +--------------- + +A few hard-to-categorize and generally obsolete documents. + +.. toctree:: + :maxdepth: 1 + + highuid + ldm + unicode .. only:: subproject and html diff --git a/Documentation/admin-guide/kdump/kdump.rst b/Documentation/admin-guide/kdump/kdump.rst index 0302a93b1d40..5376890adbeb 100644 --- a/Documentation/admin-guide/kdump/kdump.rst +++ b/Documentation/admin-guide/kdump/kdump.rst @@ -136,10 +136,6 @@ System kernel config options CONFIG_KEXEC_CORE=y - Subsequently, CRASH_CORE is selected by KEXEC_CORE:: - - CONFIG_CRASH_CORE=y - 2) Enable "sysfs file system support" in "Filesystem" -> "Pseudo filesystems." This is usually enabled by default:: @@ -168,6 +164,10 @@ Dump-capture kernel config options (Arch Independent) CONFIG_CRASH_DUMP=y + And this will select VMCORE_INFO and CRASH_RESERVE:: + CONFIG_VMCORE_INFO=y + CONFIG_CRASH_RESERVE=y + 2) Enable "/proc/vmcore support" under "Filesystems" -> "Pseudo filesystems":: CONFIG_PROC_VMCORE=y diff --git a/Documentation/admin-guide/kernel-parameters.rst b/Documentation/admin-guide/kernel-parameters.rst index e8bdf5e86a9b..39d0e7ff0965 100644 --- a/Documentation/admin-guide/kernel-parameters.rst +++ b/Documentation/admin-guide/kernel-parameters.rst @@ -27,6 +27,16 @@ kernel command line (/proc/cmdline) and collects module parameters when it loads a module, so the kernel command line can be used for loadable modules too. +This document may not be entirely up to date and comprehensive. The command +"modinfo -p ${modulename}" shows a current list of all parameters of a loadable +module. Loadable modules, after being loaded into the running kernel, also +reveal their parameters in /sys/module/${modulename}/parameters/. Some of these +parameters may be changed at runtime by the command +``echo -n ${value} > /sys/module/${modulename}/parameters/${parm}``. + +Special handling +---------------- + Hyphens (dashes) and underscores are equivalent in parameter names, so:: log_buf_len=1M print-fatal-signals=1 @@ -39,8 +49,8 @@ Double-quotes can be used to protect spaces in values, e.g.:: param="spaces in here" -cpu lists: ----------- +cpu lists +~~~~~~~~~ Some kernel parameters take a list of CPUs as a value, e.g. isolcpus, nohz_full, irqaffinity, rcu_nocbs. The format of this list is: @@ -82,12 +92,17 @@ so that "nohz_full=all" is the equivalent of "nohz_full=0-N". The semantics of "N" and "all" is supported on a level of bitmaps and holds for all users of bitmap_parselist(). -This document may not be entirely up to date and comprehensive. The command -"modinfo -p ${modulename}" shows a current list of all parameters of a loadable -module. Loadable modules, after being loaded into the running kernel, also -reveal their parameters in /sys/module/${modulename}/parameters/. Some of these -parameters may be changed at runtime by the command -``echo -n ${value} > /sys/module/${modulename}/parameters/${parm}``. +Metric suffixes +~~~~~~~~~~~~~~~ + +The [KMG] suffix is commonly described after a number of kernel +parameter values. 'K', 'M', 'G', 'T', 'P', and 'E' suffixes are allowed. +These letters represent the _binary_ multipliers 'Kilo', 'Mega', 'Giga', +'Tera', 'Peta', and 'Exa', equaling 2^10, 2^20, 2^30, 2^40, 2^50, and +2^60 bytes respectively. Such letter suffixes can also be entirely omitted. + +Kernel Build Options +-------------------- The parameters listed below are only valid if certain kernel build options were enabled and if respective hardware is present. This list should be kept @@ -118,7 +133,6 @@ is applicable:: HIBERNATION HIBERNATION is enabled. HW Appropriate hardware is enabled. HYPER_V HYPERV support is enabled. - IA-64 IA-64 architecture is enabled. IMA Integrity measurement architecture is enabled. IP_PNP IP DHCP, BOOTP, or RARP is enabled. IPV6 IPv6 support is enabled. @@ -160,6 +174,7 @@ is applicable:: SCSI Appropriate SCSI support is enabled. A lot of drivers have their options described inside the Documentation/scsi/ sub-directory. + SDW SoundWire support is enabled. SECURITY Different security models are enabled. SELINUX SELinux support is enabled. SERIAL Serial support is enabled. @@ -179,8 +194,6 @@ is applicable:: WDT Watchdog support is enabled. X86-32 X86-32, aka i386 architecture is enabled. X86-64 X86-64 architecture is enabled. - More X86-64 boot options can be found in - Documentation/arch/x86/x86_64/boot-options.rst. X86 Either 32-bit or 64-bit x86 (same as X86-32+X86-64) X86_UV SGI UV support is enabled. XEN Xen support is enabled @@ -198,7 +211,6 @@ Do not modify the syntax of boot loader parameters without extreme need or coordination with <Documentation/arch/x86/boot.rst>. There are also arch-specific kernel-parameters not documented here. -See for example <Documentation/arch/x86/x86_64/boot-options.rst>. Note that ALL kernel parameters listed below are CASE SENSITIVE, and that a trailing = on the name of any parameter states that that parameter will @@ -212,10 +224,5 @@ a fixed number of characters. This limit depends on the architecture and is between 256 and 4096 characters. It is defined in the file ./include/uapi/asm-generic/setup.h as COMMAND_LINE_SIZE. -Finally, the [KMG] suffix is commonly described after a number of kernel -parameter values. These 'K', 'M', and 'G' letters represent the _binary_ -multipliers 'Kilo', 'Mega', and 'Giga', equaling 2^10, 2^20, and 2^30 -bytes respectively. Such letter suffixes can also be entirely omitted: - .. include:: kernel-parameters.txt :literal: diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt index 213d0719e2b7..fb8752b42ec8 100644 --- a/Documentation/admin-guide/kernel-parameters.txt +++ b/Documentation/admin-guide/kernel-parameters.txt @@ -12,7 +12,7 @@ acpi= [HW,ACPI,X86,ARM64,RISCV64,EARLY] Advanced Configuration and Power Interface Format: { force | on | off | strict | noirq | rsdt | - copy_dsdt } + copy_dsdt | nospcr } force -- enable ACPI if default was off on -- enable ACPI but allow fallback to DT [arm64,riscv64] off -- disable ACPI if default was on @@ -21,8 +21,16 @@ strictly ACPI specification compliant. rsdt -- prefer RSDT over (default) XSDT copy_dsdt -- copy DSDT to memory - For ARM64 and RISCV64, ONLY "acpi=off", "acpi=on" or - "acpi=force" are available + nocmcff -- Disable firmware first mode for corrected + errors. This disables parsing the HEST CMC error + source to check if firmware has set the FF flag. This + may result in duplicate corrected error reports. + nospcr -- disable console in ACPI SPCR table as + default _serial_ console on ARM64 + For ARM64, ONLY "acpi=off", "acpi=on", "acpi=force" or + "acpi=nospcr" are available + For RISCV64, ONLY "acpi=off", "acpi=on" or "acpi=force" + are available See also Documentation/power/runtime_pm.rst, pci=noacpi @@ -329,12 +337,17 @@ allowed anymore to lift isolation requirements as needed. This option does not override iommu=pt - force_enable - Force enable the IOMMU on platforms known - to be buggy with IOMMU enabled. Use this - option with care. - pgtbl_v1 - Use v1 page table for DMA-API (Default). - pgtbl_v2 - Use v2 page table for DMA-API. - irtcachedis - Disable Interrupt Remapping Table (IRT) caching. + force_enable - Force enable the IOMMU on platforms known + to be buggy with IOMMU enabled. Use this + option with care. + pgtbl_v1 - Use v1 page table for DMA-API (Default). + pgtbl_v2 - Use v2 page table for DMA-API. + irtcachedis - Disable Interrupt Remapping Table (IRT) caching. + nohugepages - Limit page-sizes used for v1 page-tables + to 4 KiB. + v2_pgsizes_only - Limit page-sizes used for v1 page-tables + to 4KiB/2Mib/1GiB. + amd_iommu_dump= [HW,X86-64] Enable AMD IOMMU driver option to dump the ACPI table @@ -396,6 +409,8 @@ not play well with APC CPU idle - disable it if you have APC and your system crashes randomly. + apic [APIC,X86-64] Use IO-APIC. Default. + apic= [APIC,X86,EARLY] Advanced Programmable Interrupt Controller Change the output verbosity while booting Format: { quiet (default) | verbose | debug } @@ -415,6 +430,10 @@ useful so that a dump capture kernel won't be shot down by NMI + apicpmtimer Do APIC timer calibration using the pmtimer. Implies + apicmaintimer. Useful when your PIT timer is totally + broken. + autoconf= [IPV6] See Documentation/networking/ipv6.rst. @@ -431,9 +450,15 @@ arcrimi= [HW,NET] ARCnet - "RIM I" (entirely mem-mapped) cards Format: <io>,<irq>,<nodeID> + arm64.no32bit_el0 [ARM64] Unconditionally disable the execution of + 32 bit applications. + arm64.nobti [ARM64] Unconditionally disable Branch Target Identification support + arm64.nogcs [ARM64] Unconditionally disable Guarded Control Stack + support + arm64.nomops [ARM64] Unconditionally disable Memory Copy and Memory Set instructions support @@ -510,6 +535,18 @@ Format: <io>,<irq>,<mode> See header of drivers/net/hamradio/baycom_ser_hdx.c. + bdev_allow_write_mounted= + Format: <bool> + Control the ability to open a mounted block device + for writing, i.e., allow / disallow writes that bypass + the FS. This was implemented as a means to prevent + fuzzers from crashing the kernel by overwriting the + metadata underneath a mounted FS without its awareness. + This also prevents destructive formatting of mounted + filesystems by naive storage tooling that don't use + O_EXCL. Default is Y and can be changed through the + Kconfig option CONFIG_BLK_DEV_WRITE_MOUNTED. + bert_disable [ACPI] Disable BERT OS support on buggy BIOSes. @@ -785,6 +822,25 @@ Documentation/networking/netconsole.rst for an alternative. + <DEVNAME>:<n>.<n>[,options] + Use the specified serial port on the serial core bus. + The addressing uses DEVNAME of the physical serial port + device, followed by the serial core controller instance, + and the serial port instance. The options are the same + as documented for the ttyS addressing above. + + The mapping of the serial ports to the tty instances + can be viewed with: + + $ ls -d /sys/bus/serial-base/devices/*:*.*/tty/* + /sys/bus/serial-base/devices/00:04:0.0/tty/ttyS0 + + In the above example, the console can be addressed with + console=00:04:0.0. Note that a console addressed this + way will only get added when the related device driver + is ready. The use of an earlycon parameter in addition to + the console may be desired for console output early on. + uart[8250],io,<addr>[,options] uart[8250],mmio,<addr>[,options] uart[8250],mmio16,<addr>[,options] @@ -875,12 +931,16 @@ the parameter has no effect. crash_kexec_post_notifiers - Run kdump after running panic-notifiers and dumping - kmsg. This only for the users who doubt kdump always - succeeds in any situation. - Note that this also increases risks of kdump failure, - because some panic notifiers can make the crashed - kernel more unstable. + Only jump to kdump kernel after running the panic + notifiers and dumping kmsg. This option increases + the risks of a kdump failure, since some panic + notifiers can make the crashed kernel more unstable. + In configurations where kdump may not be reliable, + running the panic notifiers could allow collecting + more data on dmesg, like stack traces from other CPUS + or extra data dumped by panic_print. Note that some + configurations enable this option unconditionally, + like Hyper-V, PowerPC (fadump) and AMD SEV-SNP. crashkernel=size[KMG][@offset[KMG]] [KNL,EARLY] Using kexec, Linux can switch to a 'crash kernel' @@ -1428,27 +1488,6 @@ you are really sure that your UEFI does sane gc and fulfills the spec otherwise your board may brick. - efi_fake_mem= nn[KMG]@ss[KMG]:aa[,nn[KMG]@ss[KMG]:aa,..] [EFI,X86,EARLY] - Add arbitrary attribute to specific memory range by - updating original EFI memory map. - Region of memory which aa attribute is added to is - from ss to ss+nn. - - If efi_fake_mem=2G@4G:0x10000,2G@0x10a0000000:0x10000 - is specified, EFI_MEMORY_MORE_RELIABLE(0x10000) - attribute is added to range 0x100000000-0x180000000 and - 0x10a0000000-0x1120000000. - - If efi_fake_mem=8G@9G:0x40000 is specified, the - EFI_MEMORY_SP(0x40000) attribute is added to - range 0x240000000-0x43fffffff. - - Using this parameter you can do debugging of EFI memmap - related features. For example, you can do debugging of - Address Range Mirroring feature even if your box - doesn't support it, or mark specific memory as - "soft reserved". - efivar_ssdt= [EFI; X86] Name of an EFI variable that contains an SSDT that is to be dynamically loaded by Linux. If there are multiple variables with the same name but with different @@ -1524,6 +1563,7 @@ failslab= fail_usercopy= fail_page_alloc= + fail_skb_realloc= fail_make_request=[KNL] General fault injection mechanism. Format: <interval>,<probability>,<space>,<times> @@ -1696,6 +1736,8 @@ off: Disable GDS mitigation. + gbpages [X86] Use GB pages for kernel direct mappings. + gcov_persist= [GCOV] When non-zero (default), profiling data for kernel modules is saved and remains accessible via debugfs, even when the module is unloaded/reloaded. @@ -1756,8 +1798,6 @@ for 64-bit NUMA, off otherwise. Format: 0 | 1 (for off | on) - hcl= [IA-64] SGI's Hardware Graph compatibility layer - hd= [EIDE] (E)IDE hard drive subsystem geometry Format: <cyl>,<head>,<sect> @@ -1899,6 +1939,28 @@ Format: <bus_id>,<clkrate> + i2c_touchscreen_props= [HW,ACPI,X86] + Set device-properties for ACPI-enumerated I2C-attached + touchscreen, to e.g. fix coordinates of upside-down + mounted touchscreens. If you need this option please + submit a drivers/platform/x86/touchscreen_dmi.c patch + adding a DMI quirk for this. + + Format: + <ACPI_HW_ID>:<prop_name>=<val>[:prop_name=val][:...] + Where <val> is one of: + Omit "=<val>" entirely Set a boolean device-property + Unsigned number Set a u32 device-property + Anything else Set a string device-property + + Examples (split over multiple lines): + i2c_touchscreen_props=GDIX1001:touchscreen-inverted-x: + touchscreen-inverted-y + + i2c_touchscreen_props=MSSL1680:touchscreen-size-x=1920: + touchscreen-size-y=1080:touchscreen-inverted-y: + firmware-name=gsl1680-vendor-model.fw:silead,home-button + i8042.debug [HW] Toggle i8042 debug mode i8042.unmask_kbd_data [HW] Enable printing of interrupt data from the KBD port @@ -1958,12 +2020,21 @@ idle= [X86,EARLY] Format: idle=poll, idle=halt, idle=nomwait - Poll forces a polling idle loop that can slightly - improve the performance of waking up a idle CPU, but - will use a lot of power and make the system run hot. - Not recommended. + + idle=poll: Don't do power saving in the idle loop + using HLT, but poll for rescheduling event. This will + make the CPUs eat a lot more power, but may be useful + to get slightly better performance in multiprocessor + benchmarks. It also makes some profiling using + performance counters more accurate. Please note that + on systems with MONITOR/MWAIT support (like Intel + EM64T CPUs) this option has no performance advantage + over the normal idle loop. It may also interact badly + with hyperthreading. + idle=halt: Halt is forced to be used for CPU idle. In such case C2/C3 won't be used again. + idle=nomwait: Disable mwait for CPU C-states idxd.sva= [HW] @@ -1978,7 +2049,7 @@ for the device. By default it is set to false (0). ieee754= [MIPS] Select IEEE Std 754 conformance mode - Format: { strict | legacy | 2008 | relaxed } + Format: { strict | legacy | 2008 | relaxed | emulated } Default: strict Choose which programs will be accepted for execution @@ -1998,6 +2069,8 @@ by the FPU relaxed accept any binaries regardless of whether supported by the FPU + emulated accept any binaries but enable FPU emulator + if binary mode is unsupported by the FPU. The FPU emulator is always able to support both NaN encodings, so if no FPU hardware is present or it has @@ -2251,26 +2324,81 @@ no_x2apic_optout BIOS x2APIC opt-out request will be ignored nopost disable Interrupt Posting + posted_msi + enable MSIs delivered as posted interrupts iomem= Disable strict checking of access to MMIO memory strict regions from userspace. relaxed iommu= [X86,EARLY] + off + Don't initialize and use any kind of IOMMU. + force + Force the use of the hardware IOMMU even when + it is not actually needed (e.g. because < 3 GB + memory). + noforce + Don't force hardware IOMMU usage when it is not + needed. (default). + biomerge panic nopanic merge nomerge + soft - pt [X86] - nopt [X86] - nobypass [PPC/POWERNV] + Use software bounce buffering (SWIOTLB) (default for + Intel machines). This can be used to prevent the usage + of an available hardware IOMMU. + + [X86] + pt + [X86] + nopt + [PPC/POWERNV] + nobypass Disable IOMMU bypass, using IOMMU for PCI devices. + [X86] + AMD Gart HW IOMMU-specific options: + + <size> + Set the size of the remapping area in bytes. + + allowed + Overwrite iommu off workarounds for specific chipsets + + fullflush + Flush IOMMU on each allocation (default). + + nofullflush + Don't use IOMMU fullflush. + + memaper[=<order>] + Allocate an own aperture over RAM with size + 32MB<<order. (default: order=1, i.e. 64MB) + + merge + Do scatter-gather (SG) merging. Implies "force" + (experimental). + + nomerge + Don't do scatter-gather (SG) merging. + + noaperture + Ask the IOMMU not to touch the aperture for AGP. + + noagp + Don't initialize the AGP driver and use full aperture. + + panic + Always panic when IOMMU overflows. + iommu.forcedac= [ARM64,X86,EARLY] Control IOVA allocation for PCI devices. Format: { "0" | "1" } 0 - Try to allocate a 32-bit DMA address first, before @@ -2321,6 +2449,18 @@ ipcmni_extend [KNL,EARLY] Extend the maximum number of unique System V IPC identifiers from 32,768 to 16,777,216. + ipe.enforce= [IPE] + Format: <bool> + Determine whether IPE starts in permissive (0) or + enforce (1) mode. The default is enforce. + + ipe.success_audit= + [IPE] + Format: <bool> + Start IPE with success auditing enabled, emitting + an audit event when a binary is allowed. The default + is 0. + irqaffinity= [SMP] Set the default irq affinity mask The argument is a cpu list, as described above. @@ -2366,7 +2506,9 @@ specified in the flag list (default: domain): nohz - Disable the tick when a single task runs. + Disable the tick when a single task runs as well as + disabling other kernel noises like having RCU callbacks + offloaded. This is equivalent to the nohz_full parameter. A residual 1Hz tick is offloaded to workqueues, which you need to affine to housekeeping through the global @@ -2492,7 +2634,7 @@ keepinitrd [HW,ARM] See retain_initrd. - kernelcore= [KNL,X86,IA-64,PPC,EARLY] + kernelcore= [KNL,X86,PPC,EARLY] Format: nn[KMGTPE] | nn% | "mirror" This parameter specifies the amount of memory usable by the kernel for non-movable allocations. The requested @@ -2619,6 +2761,23 @@ Default is Y (on). + kvm.enable_virt_at_load=[KVM,ARM64,LOONGARCH,MIPS,RISCV,X86] + If enabled, KVM will enable virtualization in hardware + when KVM is loaded, and disable virtualization when KVM + is unloaded (if KVM is built as a module). + + If disabled, KVM will dynamically enable and disable + virtualization on-demand when creating and destroying + VMs, i.e. on the 0=>1 and 1=>0 transitions of the + number of VMs. + + Enabling virtualization at module load avoids potential + latency for creation of the 0=>1 VM, as KVM serializes + virtualization enabling across all online CPUs. The + "cost" of enabling virtualization when KVM is loaded, + is that doing so may interfere with using out-of-tree + hypervisors that want to "own" virtualization hardware. + kvm.enable_vmware_backdoor=[KVM] Support VMware backdoor PV interface. Default is false (don't support). @@ -2665,17 +2824,21 @@ nvhe: Standard nVHE-based mode, without support for protected guests. - protected: nVHE-based mode with support for guests whose - state is kept private from the host. + protected: Mode with support for guests whose state is + kept private from the host, using VHE or + nVHE depending on HW support. nested: VHE-based mode with support for nested - virtualization. Requires at least ARMv8.3 - hardware. + virtualization. Requires at least ARMv8.4 + hardware (with FEAT_NV2). Defaults to VHE/nVHE based on hardware support. Setting mode to "protected" will disable kexec and hibernation - for the host. "nested" is experimental and should be - used with extreme caution. + for the host. To force nVHE on VHE hardware, add + "arm64_sw.hvhe=0 id_aa64mmfr1.vh=0" to the + command-line. + "nested" is experimental and should be used with + extreme caution. kvm-arm.vgic_v3_group0_trap= [KVM,ARM,EARLY] Trap guest accesses to GICv3 group-0 @@ -2693,6 +2856,24 @@ [KVM,ARM,EARLY] Allow use of GICv4 for direct injection of LPIs. + kvm-arm.wfe_trap_policy= + [KVM,ARM] Control when to set WFE instruction trap for + KVM VMs. Traps are allowed but not guaranteed by the + CPU architecture. + + trap: set WFE instruction trap + + notrap: clear WFE instruction trap + + kvm-arm.wfi_trap_policy= + [KVM,ARM] Control when to set WFI instruction trap for + KVM VMs. Traps are allowed but not guaranteed by the + CPU architecture. + + trap: set WFI instruction trap + + notrap: clear WFI instruction trap + kvm_cma_resv_ratio=n [PPC,EARLY] Reserves given percentage from system memory area for contiguous memory allocation for KVM hash pagetable @@ -3132,26 +3313,16 @@ unlikely, in the extreme case this might damage your hardware. - ltpc= [NET] - Format: <io>,<irq>,<dma> - lsm.debug [SECURITY] Enable LSM initialization debugging output. lsm=lsm1,...,lsmN [SECURITY] Choose order of LSM initialization. This overrides CONFIG_LSM, and the "security=" parameter. - machvec= [IA-64] Force the use of a particular machine-vector - (machvec) in a generic kernel. - Example: machvec=hpzx1 - machtype= [Loongson] Share the same kernel image file between different yeeloong laptops. Example: machtype=lemote-yeeloong-2f-7inch - max_addr=nn[KMG] [KNL,BOOT,IA-64] All physical memory greater - than or equal to this physical address is ignored. - maxcpus= [SMP,EARLY] Maximum number of processors that an SMP kernel will bring up during bootup. maxcpus=n : n >= 0 limits the kernel to bring up 'n' processors. Surely after @@ -3168,9 +3339,77 @@ devices can be requested on-demand with the /dev/loop-control interface. - mce [X86-32] Machine Check Exception + mce= [X86-{32,64}] + + Please see Documentation/arch/x86/x86_64/machinecheck.rst for sysfs runtime tunables. + + off + disable machine check + + no_cmci + disable CMCI(Corrected Machine Check Interrupt) that + Intel processor supports. Usually this disablement is + not recommended, but it might be handy if your + hardware is misbehaving. + + Note that you'll get more problems without CMCI than + with due to the shared banks, i.e. you might get + duplicated error logs. + + dont_log_ce + don't make logs for corrected errors. All events + reported as corrected are silently cleared by OS. This + option will be useful if you have no interest in any + of corrected errors. + + ignore_ce + disable features for corrected errors, e.g. + polling timer and CMCI. All events reported as + corrected are not cleared by OS and remained in its + error banks. + + Usually this disablement is not recommended, however + if there is an agent checking/clearing corrected + errors (e.g. BIOS or hardware monitoring + applications), conflicting with OS's error handling, + and you cannot deactivate the agent, then this option + will be a help. + + no_lmce + do not opt-in to Local MCE delivery. Use legacy method + to broadcast MCEs. + + bootlog + enable logging of machine checks left over from + booting. Disabled by default on AMD Fam10h and older + because some BIOS leave bogus ones. + + If your BIOS doesn't do that it's a good idea to + enable though to make sure you log even machine check + events that result in a reboot. On Intel systems it is + enabled by default. + + nobootlog + disable boot machine check logging. + + monarchtimeout (number) + sets the time in us to wait for other CPUs on machine + checks. 0 to disable. + + bios_cmci_threshold + don't overwrite the bios-set CMCI threshold. This boot + option prevents Linux from overwriting the CMCI + threshold set by the bios. Without this option, Linux + always sets the CMCI threshold to 1. Enabling this may + make memory predictive failure analysis less effective + if the bios sets thresholds for memory errors since we + will not see details for all errors. + + recovery + force-enable recoverable machine check code paths + + Everything else is in sysfs now. - mce=option [X86-64] See Documentation/arch/x86/x86_64/boot-options.rst md= [HW] RAID subsystems devices and level See Documentation/admin-guide/md.rst. @@ -3260,8 +3499,8 @@ [KNL] Set the initial state for the memory hotplug onlining policy. If not specified, the default value is set according to the - CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE kernel config - option. + CONFIG_MHP_DEFAULT_ONLINE_TYPE kernel config + options. See Documentation/admin-guide/mm/memory-hotplug.rst. memmap=exactmap [KNL,X86,EARLY] Enable setting of an exact @@ -3377,10 +3616,6 @@ deep - Suspend-To-RAM or equivalent (if supported) See Documentation/admin-guide/pm/sleep-states.rst. - mfgpt_irq= [IA-32] Specify the IRQ to use for the - Multi-Function General Purpose Timers on AMD Geode - platforms. - mfgptfix [X86-32] Fix MFGPT timers on AMD Geode platforms when the BIOS has incorrectly applied a workaround. TinyBIOS version 0.98 is known to be affected, 0.99 fixes the @@ -3393,9 +3628,6 @@ Enable or disable the microcode minimal revision enforcement for the runtime microcode loader. - min_addr=nn[KMG] [KNL,BOOT,IA-64] All physical memory below this - physical address is ignored. - mini2440= [ARM,HW,KNL] Format:[0..2][b][c][t] Default: "0tb" @@ -3560,7 +3792,7 @@ mousedev.yres= [MOUSE] Vertical screen resolution, used for devices reporting absolute coordinates, such as tablets - movablecore= [KNL,X86,IA-64,PPC,EARLY] + movablecore= [KNL,X86,PPC,EARLY] Format: nn[KMGTPE] | nn% This parameter is the complement to kernelcore=, it specifies the amount of memory used for migratable @@ -3586,11 +3818,6 @@ mtdparts= [MTD] See drivers/mtd/parsers/cmdlinepart.c - mtdset= [ARM] - ARM/S3C2412 JIVE boot control - - See arch/arm/mach-s3c/mach-jive.c - mtouchusb.raw_coordinates= [HW] Make the MicroTouch USB driver use raw coordinates ('y', default) or cooked coordinates ('n') @@ -3776,10 +4003,12 @@ Format: [state][,regs][,debounce][,die] nmi_watchdog= [KNL,BUGS=X86] Debugging features for SMP kernels - Format: [panic,][nopanic,][num] + Format: [panic,][nopanic,][rNNN,][num] Valid num: 0 or 1 0 - turn hardlockup detector in nmi_watchdog off 1 - turn hardlockup detector in nmi_watchdog on + rNNN - configure the watchdog with raw perf event 0xNNN + When panic is specified, panic when an NMI watchdog timeout occurs (or 'nopanic' to not panic on an NMI watchdog, if CONFIG_BOOTPARAM_HARDLOCKUP_PANIC is set) @@ -3803,12 +4032,11 @@ noalign [KNL,ARM] - noaltinstr [S390,EARLY] Disables alternative instructions - patching (CPU alternatives feature). - noapic [SMP,APIC,EARLY] Tells the kernel to not make use of any IOAPICs that may be present in the system. + noapictimer [APIC,X86] Don't set up the APIC timer + noautogroup Disable scheduler automatic task group creation. nocache [ARM,EARLY] @@ -3837,8 +4065,6 @@ no_entry_flush [PPC,EARLY] Don't flush the L1-D cache when entering the kernel. - noexec [IA-64] - noexec32 [X86-64] This affects only 32-bit executables. noexec32=on: enable non-executable mappings (default) @@ -3858,12 +4084,7 @@ register save and restore. The kernel will only save legacy floating-point registers on task switch. - nohalt [IA-64] Tells the kernel not to use the power saving - function PAL_HALT_LIGHT when idle. This increases - power-consumption. On the positive side, it reduces - interrupt wake-up latency, which may improve performance - in certain environments such as networked servers or - real-time systems. + nogbpages [X86] Do not use GB pages for kernel direct mappings. no_hash_pointers [KNL,EARLY] @@ -3882,7 +4103,7 @@ nohibernate [HIBERNATION] Disable hibernation and resume. - nohlt [ARM,ARM64,MICROBLAZE,MIPS,PPC,SH] Forces the kernel to + nohlt [ARM,ARM64,MICROBLAZE,MIPS,PPC,RISCV,SH] Forces the kernel to busy wait in do_idle() and not use the arch_cpu_idle() implementation; requires CONFIG_GENERIC_IDLE_POLL_SETUP to be effective. This is useful on platforms where the @@ -3891,6 +4112,8 @@ the impact of the sleep instructions. This is also useful when using JTAG debugger. + nohpet [X86] Don't use the HPET timer. + nohugeiomap [KNL,X86,PPC,ARM64,EARLY] Disable kernel huge I/O mappings. nohugevmalloc [KNL,X86,PPC,ARM64,EARLY] Disable kernel huge vmalloc mappings. @@ -3919,8 +4142,6 @@ remapping. [Deprecated - use intremap=off] - nointroute [IA-64] - noinvpcid [X86,EARLY] Disable the INVPCID cpu feature. noiotrap [SH] Disables trapped I/O port accesses. @@ -3930,8 +4151,6 @@ noisapnp [ISAPNP] Disables ISA PnP code. - nojitter [IA-64] Disables jitter checking for ITC timers. - nokaslr [KNL,EARLY] When CONFIG_RANDOMIZE_BASE is set, this disables kernel and module base offset ASLR (Address Space @@ -3946,8 +4165,6 @@ nolapic_timer [X86-32,APIC,EARLY] Do not use the local APIC timer. - nomca [IA-64] Disable machine check abort handling - nomce [X86-32] Disable Machine Check Exception nomfgpt [X86-32] Disable Multi-Function General Purpose @@ -3999,8 +4216,6 @@ noresume [SWSUSP] Disables resume and restores original swap space. - nosbagart [IA-64] - no-scroll [VGA] Disables scrollback. This is required for the Braillex ib80-piezo Braille reader made by F.H. Papenmeier (Germany). @@ -4044,14 +4259,16 @@ prediction) vulnerability. System may allow data leaks with this option. - no-steal-acc [X86,PV_OPS,ARM64,PPC/PSERIES,RISCV,EARLY] Disable - paravirtualized steal time accounting. steal time is - computed, but won't influence scheduler behaviour + no-steal-acc [X86,PV_OPS,ARM64,PPC/PSERIES,RISCV,LOONGARCH,EARLY] + Disable paravirtualized steal time accounting. steal time + is computed, but won't influence scheduler behaviour nosync [HW,M68K] Disables sync negotiation for all devices. - no_timer_check [X86,APIC] Disables the code which tests for - broken timer IRQ sources. + no_timer_check [X86,APIC] Disables the code which tests for broken + timer IRQ sources, i.e., the IO-APIC timer. This can + work around problems with incorrect timer + initialization on some boards. no_uaccess_flush [PPC,EARLY] Don't flush the L1-D cache after accessing user data. @@ -4101,19 +4318,6 @@ parameter, xsave area per process might occupy more memory on xsaves enabled systems. - nps_mtm_hs_ctr= [KNL,ARC] - This parameter sets the maximum duration, in - cycles, each HW thread of the CTOP can run - without interruptions, before HW switches it. - The actual maximum duration is 16 times this - parameter's value. - Format: integer between 1 and 255 - Default: 255 - - nptcg= [IA-64] Override max number of concurrent global TLB - purges which is reported from either PAL_VM_SUMMARY or - SAL PALO. - nr_cpus= [SMP,EARLY] Maximum number of processors that an SMP kernel could support. nr_cpus=n : n >= 1 limits the kernel to support 'n' processors. It could be larger than the @@ -4129,6 +4333,26 @@ Disable NUMA, Only set up a single NUMA node spanning all memory. + numa=fake=<size>[MG] + [KNL, ARM64, RISCV, X86, EARLY] + If given as a memory unit, fills all system RAM with + nodes of size interleaved over physical nodes. + + numa=fake=<N> + [KNL, ARM64, RISCV, X86, EARLY] + If given as an integer, fills all system RAM with N + fake nodes interleaved over physical nodes. + + numa=fake=<N>U + [KNL, ARM64, RISCV, X86, EARLY] + If given as an integer followed by 'U', it will + divide each physical node into N emulated nodes. + + numa=noacpi [X86] Don't parse the SRAT table for NUMA setup + + numa=nohmat [X86] Don't parse the HMAT table for NUMA setup, or + soft-reserved memory partitioning. + numa_balancing= [KNL,ARM64,PPC,RISCV,S390,X86] Enable or disable automatic NUMA balancing. Allowed values are enable and disable @@ -4173,13 +4397,11 @@ page_alloc.shuffle= [KNL] Boolean flag to control whether the page allocator - should randomize its free lists. The randomization may - be automatically enabled if the kernel detects it is - running on a platform with a direct-mapped memory-side - cache, and this parameter can be used to - override/disable that behavior. The state of the flag - can be read from sysfs at: + should randomize its free lists. This parameter can be + used to enable/disable page randomization. The state of + the flag can be read from sysfs at: /sys/module/page_alloc/parameters/shuffle. + This parameter is only available if CONFIG_SHUFFLE_PAGE_ALLOCATOR=y. page_owner= [KNL,EARLY] Boot-time page_owner enabling option. Storage of the information about who allocated @@ -4589,14 +4811,51 @@ bridges without forcing it upstream. Note: this removes isolation between devices and may put more devices in an IOMMU group. + config_acs= + Format: + <ACS flags>@<pci_dev>[; ...] + Specify one or more PCI devices (in the format + specified above) optionally prepended with flags + and separated by semicolons. The respective + capabilities will be enabled, disabled or + unchanged based on what is specified in + flags. + + ACS Flags is defined as follows: + bit-0 : ACS Source Validation + bit-1 : ACS Translation Blocking + bit-2 : ACS P2P Request Redirect + bit-3 : ACS P2P Completion Redirect + bit-4 : ACS Upstream Forwarding + bit-5 : ACS P2P Egress Control + bit-6 : ACS Direct Translated P2P + Each bit can be marked as: + '0' – force disabled + '1' – force enabled + 'x' – unchanged + For example, + pci=config_acs=10x@pci:0:0 + would configure all devices that support + ACS to enable P2P Request Redirect, disable + Translation Blocking, and leave Source + Validation unchanged from whatever power-up + or firmware set it to. + + Note: this may remove isolation between devices + and may put more devices in an IOMMU group. force_floating [S390] Force usage of floating interrupts. nomio [S390] Do not use MIO instructions. norid [S390] ignore the RID field and force use of one PCI domain per PCI function + notph [PCIE] If the PCIE_TPH kernel config parameter + is enabled, this kernel boot option can be used + to disable PCIe TLP Processing Hints support + system-wide. - pcie_aspm= [PCIE] Forcibly enable or disable PCIe Active State Power + pcie_aspm= [PCIE] Forcibly enable or ignore PCIe Active State Power Management. - off Disable ASPM. + off Don't touch ASPM configuration at all. Leave any + configuration done by firmware unchanged. force Enable ASPM even on devices that claim not to support it. WARNING: Forcing ASPM on may cause system lockups. @@ -4721,7 +4980,14 @@ none - Limited to cond_resched() calls voluntary - Limited to cond_resched() and might_sleep() calls full - Any section that isn't explicitly preempt disabled - can be preempted anytime. + can be preempted anytime. Tasks will also yield + contended spinlocks (if the critical section isn't + explicitly preempt disabled beyond the lock itself). + lazy - Scheduler controlled. Similar to full but instead + of preempting the task immediately, the task gets + one HZ tick time to yield itself before the + preemption will be forced. One preemption is when the + task returns to user space. print-fatal-signals= [KNL] debug: print fatal signals @@ -4761,6 +5027,16 @@ printk.time= Show timing data prefixed to each printk message line Format: <bool> (1/Y/y=enable, 0/N/n=disable) + proc_mem.force_override= [KNL] + Format: {always | ptrace | never} + Traditionally /proc/pid/mem allows memory permissions to be + overridden without restrictions. This option may be set to + restrict that. Can be one of: + - 'always': traditional behavior always allows mem overrides. + - 'ptrace': only allow mem overrides for active ptracers. + - 'never': never allow mem overrides. + If not specified, default is the CONFIG_PROC_MEM_* choice. + processor.max_cstate= [HW,ACPI] Limit processor to maximum C-state max_cstate=9 overrides any DMI blacklist limit. @@ -4771,11 +5047,9 @@ profile= [KNL] Enable kernel profiling via /proc/profile Format: [<profiletype>,]<number> - Param: <profiletype>: "schedule", "sleep", or "kvm" + Param: <profiletype>: "schedule" or "kvm" [defaults to kernel profiling] Param: "schedule" - profile schedule points. - Param: "sleep" - profile D-state sleeping (millisecs). - Requires CONFIG_SCHEDSTATS Param: "kvm" - profile VM exits. Param: <number> - step/bucket size as a power of 2 for statistical time based profiling. @@ -4784,7 +5058,9 @@ prot_virt= [S390] enable hosting protected virtual machines isolated from the hypervisor (if hardware supports - that). + that). If enabled, the default kernel base address + might be overridden even when Kernel Address Space + Layout Randomization is disabled. Format: <bool> psi= [KNL] Enable or disable pressure stall information @@ -4908,6 +5184,10 @@ Set maximum number of finished RCU callbacks to process in one batch. + rcutree.csd_lock_suppress_rcu_stall= [KNL] + Do only a one-line RCU CPU stall warning when + there is an ongoing too-long CSD-lock wait. + rcutree.do_rcu_barrier= [KNL] Request a call to rcu_barrier(). This is throttled so that userspace tests can safely @@ -4985,6 +5265,14 @@ the ->nocb_bypass queue. The definition of "too many" is supplied by this kernel boot parameter. + rcutree.nohz_full_patience_delay= [KNL] + On callback-offloaded (rcu_nocbs) CPUs, avoid + disturbing RCU unless the grace period has + reached the specified age in milliseconds. + Defaults to zero. Large values will be capped + at five seconds. All values will be rounded down + to the nearest value representable by jiffies. + rcutree.qhimark= [KNL] Set threshold of queued RCU callbacks beyond which batch limiting is disabled. @@ -5095,6 +5383,20 @@ delay, memory pressure or callback list growing too big. + rcutree.rcu_normal_wake_from_gp= [KNL] + Reduces a latency of synchronize_rcu() call. This approach + maintains its own track of synchronize_rcu() callers, so it + does not interact with regular callbacks because it does not + use a call_rcu[_hurry]() path. Please note, this is for a + normal grace period. + + How to enable it: + + echo 1 > /sys/module/rcutree/parameters/rcu_normal_wake_from_gp + or pass a boot parameter "rcutree.rcu_normal_wake_from_gp=1" + + Default is 0. + rcuscale.gp_async= [KNL] Measure performance of asynchronous grace-period primitives such as call_rcu(). @@ -5226,7 +5528,42 @@ rcutorture.gp_cond= [KNL] Use conditional/asynchronous update-side - primitives, if available. + normal-grace-period primitives, if available. + + rcutorture.gp_cond_exp= [KNL] + Use conditional/asynchronous update-side + expedited-grace-period primitives, if available. + + rcutorture.gp_cond_full= [KNL] + Use conditional/asynchronous update-side + normal-grace-period primitives that also take + concurrent expedited grace periods into account, + if available. + + rcutorture.gp_cond_exp_full= [KNL] + Use conditional/asynchronous update-side + expedited-grace-period primitives that also take + concurrent normal grace periods into account, + if available. + + rcutorture.gp_cond_wi= [KNL] + Nominal wait interval for normal conditional + grace periods (specified by rcutorture's + gp_cond and gp_cond_full module parameters), + in microseconds. The actual wait interval will + be randomly selected to nanosecond granularity up + to this wait interval. Defaults to 16 jiffies, + for example, 16,000 microseconds on a system + with HZ=1000. + + rcutorture.gp_cond_wi_exp= [KNL] + Nominal wait interval for expedited conditional + grace periods (specified by rcutorture's + gp_cond_exp and gp_cond_exp_full module + parameters), in microseconds. The actual wait + interval will be randomly selected to nanosecond + granularity up to this wait interval. Defaults to + 128 microseconds. rcutorture.gp_exp= [KNL] Use expedited update-side primitives, if available. @@ -5235,6 +5572,43 @@ Use normal (non-expedited) asynchronous update-side primitives, if available. + rcutorture.gp_poll= [KNL] + Use polled update-side normal-grace-period + primitives, if available. + + rcutorture.gp_poll_exp= [KNL] + Use polled update-side expedited-grace-period + primitives, if available. + + rcutorture.gp_poll_full= [KNL] + Use polled update-side normal-grace-period + primitives that also take concurrent expedited + grace periods into account, if available. + + rcutorture.gp_poll_exp_full= [KNL] + Use polled update-side expedited-grace-period + primitives that also take concurrent normal + grace periods into account, if available. + + rcutorture.gp_poll_wi= [KNL] + Nominal wait interval for normal conditional + grace periods (specified by rcutorture's + gp_poll and gp_poll_full module parameters), + in microseconds. The actual wait interval will + be randomly selected to nanosecond granularity up + to this wait interval. Defaults to 16 jiffies, + for example, 16,000 microseconds on a system + with HZ=1000. + + rcutorture.gp_poll_wi_exp= [KNL] + Nominal wait interval for expedited conditional + grace periods (specified by rcutorture's + gp_poll_exp and gp_poll_exp_full module + parameters), in microseconds. The actual wait + interval will be randomly selected to nanosecond + granularity up to this wait interval. Defaults to + 128 microseconds. + rcutorture.gp_sync= [KNL] Use normal (non-expedited) synchronous update-side primitives, if available. If all @@ -5288,10 +5662,21 @@ Set time (jiffies) between CPU-hotplug operations, or zero to disable CPU-hotplug testing. - rcutorture.read_exit= [KNL] - Set the number of read-then-exit kthreads used - to test the interaction of RCU updaters and - task-exit processing. + rcutorture.preempt_duration= [KNL] + Set duration (in milliseconds) of preemptions + by a high-priority FIFO real-time task. Set to + zero (the default) to disable. The CPUs to + preempt are selected randomly from the set that + are online at a given point in time. Races with + CPUs going offline are ignored, with that attempt + at preemption skipped. + + rcutorture.preempt_interval= [KNL] + Set interval (in milliseconds, defaulting to one + second) between preemptions by a high-priority + FIFO real-time task. This delay is mediated + by an hrtimer and is further fuzzed to avoid + inadvertent synchronizations. rcutorture.read_exit_burst= [KNL] The number of times in a given read-then-exit @@ -5302,6 +5687,14 @@ The delay, in seconds, between successive read-then-exit testing episodes. + rcutorture.reader_flavor= [KNL] + A bit mask indicating which readers to use. + If there is more than one bit set, the readers + are entered from low-order bit up, and are + exited in the opposite order. For SRCU, the + 0x1 bit is normal readers, 0x2 NMI-safe readers, + and 0x4 light-weight readers. + rcutorture.shuffle_interval= [KNL] Set task-shuffle interval (s). Shuffling tasks allows some CPUs to go into dyntick-idle mode @@ -5333,7 +5726,13 @@ Time to wait (s) after boot before inducing stall. rcutorture.stall_cpu_irqsoff= [KNL] - Disable interrupts while stalling if set. + Disable interrupts while stalling if set, but only + on the first stall in the set. + + rcutorture.stall_cpu_repeat= [KNL] + Number of times to repeat the stall sequence, + so that rcutorture.stall_cpu_repeat=3 will result + in four stall sequences. rcutorture.stall_gp_kthread= [KNL] Duration (s) of forced sleep within RCU @@ -5521,14 +5920,6 @@ of zero will disable batching. Batching is always disabled for synchronize_rcu_tasks(). - rcupdate.rcu_tasks_rude_lazy_ms= [KNL] - Set timeout in milliseconds RCU Tasks - Rude asynchronous callback batching for - call_rcu_tasks_rude(). A negative value - will take the default. A value of zero will - disable batching. Batching is always disabled - for synchronize_rcu_tasks_rude(). - rcupdate.rcu_tasks_trace_lazy_ms= [KNL] Set timeout in milliseconds RCU Tasks Trace asynchronous callback batching for @@ -5573,6 +5964,55 @@ reboot_cpu is s[mp]#### with #### being the processor to be used for rebooting. + acpi + Use the ACPI RESET_REG in the FADT. If ACPI is not + configured or the ACPI reset does not work, the reboot + path attempts the reset using the keyboard controller. + + bios + Use the CPU reboot vector for warm reset + + cold + Set the cold reboot flag + + default + There are some built-in platform specific "quirks" + - you may see: "reboot: <name> series board detected. + Selecting <type> for reboots." In the case where you + think the quirk is in error (e.g. you have newer BIOS, + or newer board) using this option will ignore the + built-in quirk table, and use the generic default + reboot actions. + + efi + Use efi reset_system runtime service. If EFI is not + configured or the EFI reset does not work, the reboot + path attempts the reset using the keyboard controller. + + force + Don't stop other CPUs on reboot. This can make reboot + more reliable in some cases. + + kbd + Use the keyboard controller. cold reset (default) + + pci + Use a write to the PCI config space register 0xcf9 to + trigger reboot. + + triple + Force a triple fault (init) + + warm + Don't set the cold reboot flag + + Using warm reset will be much faster especially on big + memory systems because the BIOS will not go through + the memory check. Disadvantage is that not all + hardware will be completely reinitialized on reboot so + there may be boot problems on some systems. + + refscale.holdoff= [KNL] Set test-start holdoff period. The purpose of this parameter is to delay the start of the @@ -5641,6 +6081,28 @@ them. If <base> is less than 0x10000, the region is assumed to be I/O ports; otherwise it is memory. + reserve_mem= [RAM] + Format: nn[KNG]:<align>:<label> + Reserve physical memory and label it with a name that + other subsystems can use to access it. This is typically + used for systems that do not wipe the RAM, and this command + line will try to reserve the same physical memory on + soft reboots. Note, it is not guaranteed to be the same + location. For example, if anything about the system changes + or if booting a different kernel. It can also fail if KASLR + places the kernel at the location of where the RAM reservation + was from a previous boot, the new reservation will be at a + different location. + Any subsystem using this feature must add a way to verify + that the contents of the physical memory is from a previous + boot, as there may be cases where the memory will not be + located at the same location. + + The format is size:align:label for example, to request + 12 megabytes of 4096 alignment for ramoops: + + reserve_mem=12M:4096:oops ramoops.mem_name=oops + reservetop= [X86-32,EARLY] Format: nn[KMG] Reserves a hole at the top of the kernel virtual @@ -5719,9 +6181,6 @@ 2 The "airplane mode" button toggles between everything blocked and everything unblocked. - rhash_entries= [KNL,NET] - Set number of hash buckets for route cache - ring3mwait=disable [KNL] Disable ring 3 MONITOR/MWAIT feature on supported CPUs. @@ -5811,6 +6270,7 @@ but is useful for debugging and performance tuning. sched_thermal_decay_shift= + [Deprecated] [KNL, SMP] Set a decay shift for scheduler thermal pressure signal. Thermal pressure signal follows the default decay period of other scheduler pelt @@ -5918,6 +6378,10 @@ non-zero "wait" parameter. See weight_single and weight_many. + sdw_mclk_divider=[SDW] + Specify the MCLK divider for Intel SoundWire buses in + case the BIOS does not provide the clock rate properly. + skew_tick= [KNL,EARLY] Offset the periodic timer tick per cpu to mitigate xtime_lock contention on larger systems, and/or RCU lock contention on all systems with CONFIG_MAXSMP set. @@ -5940,7 +6404,16 @@ serialnumber [BUGS=X86-32] - sev=option[,option...] [X86-64] See Documentation/arch/x86/x86_64/boot-options.rst + sev=option[,option...] [X86-64] + + debug + Enable debug messages. + + nosnp + Do not enable SEV-SNP (applies to host/hypervisor + only). Setting 'nosnp' avoids the RMP check overhead + in memory accesses when users do not want to run + SEV-SNP guests. shapers= [NET] Maximal number of shapers. @@ -5954,9 +6427,6 @@ apic=verbose is specified. Example: apic=debug show_lapic=all - simeth= [IA-64] - simscsi= - slab_debug[=options[,slabs][;[options[,slabs]]...] [MM] Enabling slab_debug allows one to determine the culprit if slab objects become corrupted. Enabling @@ -6008,6 +6478,16 @@ For more information see Documentation/mm/slub.rst. (slub_nomerge legacy name also accepted for now) + slab_strict_numa [MM] + Support memory policies on a per object level + in the slab allocator. The default is for memory + policies to be applied at the folio level when + a new folio is needed or a partial folio is + retrieved from the lists. Increases overhead + in the slab fastpaths but gains more accurate + NUMA kernel object placement which helps with slow + interconnects in NUMA systems. + slram= [HW,MTD] smart2= [HW] @@ -6072,9 +6552,15 @@ deployment of the HW BHI control and the SW BHB clearing sequence. - on - (default) Enable the HW or SW mitigation - as needed. - off - Disable the mitigation. + on - (default) Enable the HW or SW mitigation as + needed. This protects the kernel from + both syscalls and VMs. + vmexit - On systems which don't have the HW mitigation + available, enable the SW mitigation on vmexit + ONLY. On such systems, the host kernel is + protected from VM-originated BHI attacks, but + may still be vulnerable to syscall attacks. + off - Disable the mitigation. spectre_v2= [X86,EARLY] Control mitigation of Spectre variant 2 (indirect branch speculation) vulnerability. @@ -6218,11 +6704,6 @@ Not specifying this option is equivalent to spec_store_bypass_disable=auto. - spia_io_base= [HW,MTD] - spia_fio_base= - spia_pedr= - spia_peddr= - split_lock_detect= [X86] Enable split lock detection or bus lock detection @@ -6478,7 +6959,7 @@ This parameter controls use of the Protected Execution Facility on pSeries. - swiotlb= [ARM,IA-64,PPC,MIPS,X86,EARLY] + swiotlb= [ARM,PPC,MIPS,X86,S390,EARLY] Format: { <int> [,<int>] | force | noforce } <int> -- Number of I/O TLB slabs <int> -- Second integer after comma. Number of swiotlb @@ -6547,10 +7028,29 @@ <deci-seconds>: poll all this frequency 0: no polling (default) + thp_anon= [KNL] + Format: <size>[KMG],<size>[KMG]:<state>;<size>[KMG]-<size>[KMG]:<state> + state is one of "always", "madvise", "never" or "inherit". + Control the default behavior of the system with respect + to anonymous transparent hugepages. + Can be used multiple times for multiple anon THP sizes. + See Documentation/admin-guide/mm/transhuge.rst for more + details. + threadirqs [KNL,EARLY] Force threading of all interrupt handlers except those marked explicitly IRQF_NO_THREAD. + thp_shmem= [KNL] + Format: <size>[KMG],<size>[KMG]:<policy>;<size>[KMG]-<size>[KMG]:<policy> + Control the default policy of each hugepage size for the + internal shmem mount. <policy> is one of policies available + for the shmem mount ("always", "inherit", "never", "within_size", + and "advise"). + It can be used multiple times for multiple shmem THP sizes. + See Documentation/admin-guide/mm/transhuge.rst for more + details. + topology= [S390,EARLY] Format: {off | on} Specify if the kernel should make use of the cpu @@ -6559,12 +7059,6 @@ e.g. base its process migration decisions on it. Default is on. - topology_updates= [KNL, PPC, NUMA] - Format: {off} - Specify if the kernel should ignore (off) - topology updates sent by the hypervisor to this - LPAR. - torture.disable_onoff_at_boot= [KNL] Prevent the CPU-hotplug component of torturing until after init has spawned. @@ -6584,7 +7078,14 @@ torture.verbose_sleep_duration= [KNL] Duration of each verbose-printk() sleep in jiffies. - tp720= [HW,PS2] + tpm.disable_pcr_integrity= [HW,TPM] + Do not protect PCR registers from unintended physical + access, or interposers in the bus by the means of + having an integrity protected session wrapped around + TPM2_PCR_Extend command. Consider this in a situation + where TPM is heavily utilized by IMA, thus protection + causing a major performance hit, and the space where + machines are deployed is by other means guarded. tpm_suspend_pcr=[HW,TPM] Format: integer pcr id @@ -6664,6 +7165,14 @@ comma-separated list of trace events to enable. See also Documentation/trace/events.rst + To enable modules, use :mod: keyword: + + trace_event=:mod:<module> + + The value before :mod: will only enable specific events + that are part of the module. See the above mentioned + document for more information. + trace_instance=[instance-info] [FTRACE] Create a ring buffer instance early in boot up. This will be listed in: @@ -6684,6 +7193,57 @@ the same thing would happen if it was left off). The irq_handler_entry event, and all events under the "initcall" system. + Flags can be added to the instance to modify its behavior when it is + created. The flags are separated by '^'. + + The available flags are: + + traceoff - Have the tracing instance tracing disabled after it is created. + traceprintk - Have trace_printk() write into this trace instance + (note, "printk" and "trace_printk" can also be used) + + trace_instance=foo^traceoff^traceprintk,sched,irq + + The flags must come before the defined events. + + If memory has been reserved (see memmap for x86), the instance + can use that memory: + + memmap=12M$0x284500000 trace_instance=boot_map@0x284500000:12M + + The above will create a "boot_map" instance that uses the physical + memory at 0x284500000 that is 12Megs. The per CPU buffers of that + instance will be split up accordingly. + + Alternatively, the memory can be reserved by the reserve_mem option: + + reserve_mem=12M:4096:trace trace_instance=boot_map@trace + + This will reserve 12 megabytes at boot up with a 4096 byte alignment + and place the ring buffer in this memory. Note that due to KASLR, the + memory may not be the same location each time, which will not preserve + the buffer content. + + Also note that the layout of the ring buffer data may change between + kernel versions where the validator will fail and reset the ring buffer + if the layout is not the same as the previous kernel. + + If the ring buffer is used for persistent bootups and has events enabled, + it is recommend to disable tracing so that events from a previous boot do not + mix with events of the current boot (unless you are debugging a random crash + at boot up). + + reserve_mem=12M:4096:trace trace_instance=boot_map^traceoff^traceprintk@trace,sched,irq + + Note, saving the trace buffer across reboots does require that the system + is set up to not wipe memory. For instance, CONFIG_RESET_ATTACK_MITIGATION + can force a memory reset on boot which will clear any trace that was stored. + This is just one of many ways that can clear memory. Make sure your system + keeps the content of memory across reboots before relying on this option. + + See also Documentation/trace/debugging.rst + + trace_options=[option-list] [FTRACE] Enable or disable tracer options at boot. The option-list is a comma delimited list of options @@ -6740,6 +7300,20 @@ See Documentation/admin-guide/mm/transhuge.rst for more details. + transparent_hugepage_shmem= [KNL] + Format: [always|within_size|advise|never|deny|force] + Can be used to control the hugepage allocation policy for + the internal shmem mount. + See Documentation/admin-guide/mm/transhuge.rst + for more details. + + transparent_hugepage_tmpfs= [KNL] + Format: [always|within_size|advise|never] + Can be used to control the default hugepage allocation policy + for the tmpfs mount. + See Documentation/admin-guide/mm/transhuge.rst + for more details. + trusted.source= [KEYS] Format: <string> This parameter identifies the trust source as a backend @@ -6748,6 +7322,7 @@ - "tpm" - "tee" - "caam" + - "dcp" If not specified then it defaults to iterating through the trust source list starting with TPM and assigns the first trust source as a backend which is initialized @@ -6763,6 +7338,18 @@ If not specified, "default" is used. In this case, the RNG's choice is left to each individual trust source. + trusted.dcp_use_otp_key + This is intended to be used in combination with + trusted.source=dcp and will select the DCP OTP key + instead of the DCP UNIQUE key blob encryption. + + trusted.dcp_skip_zk_test + This is intended to be used in combination with + trusted.source=dcp and will disable the check if the + blob key is all zeros. This is helpful for situations where + having this key zero'ed is acceptable. E.g. in testing + scenarios. + tsc= Disable clocksource stability checks for TSC. Format: <string> [x86] reliable: mark tsc clocksource as reliable, this @@ -7016,6 +7603,9 @@ usb-storage.delay_use= [UMS] The delay in seconds before a new device is scanned for Logical Units (default 1). + Optionally the delay in milliseconds if the value has + suffix with "ms". + Example: delay_use=2567ms usb-storage.quirks= [UMS] A list of quirks entries to supplement or @@ -7109,9 +7699,6 @@ Try vdso32=0 if you encounter an error that says: dl_main: Assertion `(void *) ph->p_vaddr == _rtld_local._dl_sysinfo_dso' failed! - vector= [IA-64,SMP] - vector=percpu: enable percpu vector domain - video= [FB,EARLY] Frame buffer configuration See Documentation/fb/modedb.rst. @@ -7162,9 +7749,12 @@ vmalloc=nn[KMG] [KNL,BOOT,EARLY] Forces the vmalloc area to have an exact size of <nn>. This can be used to increase - the minimum size (128MB on x86). It can also be - used to decrease the size and leave more room - for directly mapped kernel RAM. + the minimum size (128MB on x86, arm32 platforms). + It can also be used to decrease the size and leave more room + for directly mapped kernel RAM. Note that this parameter does + not exist on many other platforms (including arm64, alpha, + loongarch, arc, csky, hexagon, microblaze, mips, nios2, openrisc, + parisc, m64k, powerpc, riscv, sh, um, xtensa, s390, sparc). vmcp_cma=nn[MG] [KNL,S390,EARLY] Sets the memory size reserved for contiguous memory @@ -7206,7 +7796,7 @@ vt.cur_default= [VT] Default cursor shape. Format: 0xCCBBAA, where AA, BB, and CC are the same as the parameters of the <Esc>[?A;B;Cc escape sequence; - see VGA-softcursor.txt. Default: 2 = underline. + see vga-softcursor.rst. Default: 2 = underline. vt.default_blu= [VT] Format: <blue0>,<blue1>,<blue2>,...,<blue15> @@ -7277,6 +7867,13 @@ it can be updated at runtime by writing to the corresponding sysfs file. + workqueue.panic_on_stall=<uint> + Panic when workqueue stall is detected by + CONFIG_WQ_WATCHDOG. It sets the number times of the + stall to trigger panic. + + The default is 0, which disables the panic on stall. + workqueue.cpu_intensive_thresh_us= Per-cpu work items which run for longer than this threshold are automatically considered CPU intensive @@ -7323,7 +7920,7 @@ This can be changed after boot by writing to the matching /sys/module/workqueue/parameters file. All workqueues with the "default" affinity scope will be - updated accordignly. + updated accordingly. workqueue.debug_force_rr_cpu Workqueue used to implicitly guarantee that work @@ -7369,17 +7966,18 @@ Crash from Xen panic notifier, without executing late panic() code such as dumping handler. + xen_mc_debug [X86,XEN,EARLY] + Enable multicall debugging when running as a Xen PV guest. + Enabling this feature will reduce performance a little + bit, so it should only be enabled for obtaining extended + debug data in case of multicall errors. + xen_msr_safe= [X86,XEN,EARLY] Format: <bool> Select whether to always use non-faulting (safe) MSR access functions when running as Xen PV guest. The default value is controlled by CONFIG_XEN_PV_MSR_SAFE. - xen_nopvspin [X86,XEN,EARLY] - Disables the qspinlock slowpath using Xen PV optimizations. - This parameter is obsoleted by "nopvspin" parameter, which - has equivalent effect for XEN platform. - xen_nopv [X86] Disables the PV optimizations forcing the HVM guest to run as generic HVM guest with no PV drivers. @@ -7467,4 +8065,3 @@ memory, and other data can't be written using xmon commands. off xmon is disabled. - diff --git a/Documentation/admin-guide/kernel-per-CPU-kthreads.rst b/Documentation/admin-guide/kernel-per-CPU-kthreads.rst index b6aeae3327ce..ea7fa2a8bbf0 100644 --- a/Documentation/admin-guide/kernel-per-CPU-kthreads.rst +++ b/Documentation/admin-guide/kernel-per-CPU-kthreads.rst @@ -315,7 +315,7 @@ To reduce its OS jitter, do at least one of the following: to do. Name: - rcuop/%d and rcuos/%d + rcuop/%d, rcuos/%d, and rcuog/%d Purpose: Offload RCU callbacks from the corresponding CPU. diff --git a/Documentation/admin-guide/laptops/thinkpad-acpi.rst b/Documentation/admin-guide/laptops/thinkpad-acpi.rst index 7f674a6cfa8a..4ab0fef7d440 100644 --- a/Documentation/admin-guide/laptops/thinkpad-acpi.rst +++ b/Documentation/admin-guide/laptops/thinkpad-acpi.rst @@ -445,8 +445,10 @@ event code Key Notes 0x1008 0x07 FN+F8 IBM: toggle screen expand Lenovo: configure UltraNav, or toggle screen expand. - On newer platforms (2024+) - replaced by 0x131f (see below) + On 2024 platforms replaced by + 0x131f (see below) and on newer + platforms (2025 +) keycode is + replaced by 0x1401 (see below). 0x1009 0x08 FN+F9 - @@ -506,9 +508,11 @@ event code Key Notes 0x1019 0x18 unknown -0x131f ... FN+F8 Platform Mode change. +0x131f ... FN+F8 Platform Mode change (2024 systems). Implemented in driver. +0x1401 ... FN+F8 Platform Mode change (2025 + systems). + Implemented in driver. ... ... ... 0x1020 0x1F unknown diff --git a/Documentation/admin-guide/media/building.rst b/Documentation/admin-guide/media/building.rst index a06473429916..7a413ba07f93 100644 --- a/Documentation/admin-guide/media/building.rst +++ b/Documentation/admin-guide/media/building.rst @@ -15,7 +15,7 @@ Please notice, however, that, if: you should use the main media development tree ``master`` branch: - https://git.linuxtv.org/media_tree.git/ + https://git.linuxtv.org/media.git/ In this case, you may find some useful information at the `LinuxTv wiki pages <https://linuxtv.org/wiki>`_: diff --git a/Documentation/admin-guide/media/cec.rst b/Documentation/admin-guide/media/cec.rst index 6b30e355cf23..92690e1f2183 100644 --- a/Documentation/admin-guide/media/cec.rst +++ b/Documentation/admin-guide/media/cec.rst @@ -42,10 +42,14 @@ dongles): ``persistent_config``: by default this is off, but when set to 1 the driver will store the current settings to the device's internal eeprom and restore it the next time the device is connected to the USB port. + - RainShadow Tech. Note: this driver does not support the persistent_config module option of the Pulse-Eight driver. The hardware supports it, but I have no plans to add this feature. But I accept patches :-) +- Extron DA HD 4K PLUS HDMI Distribution Amplifier. See + :ref:`extron_da_hd_4k_plus` for more information. + Miscellaneous: - vivid: emulates a CEC receiver and CEC transmitter. @@ -378,3 +382,86 @@ it later using ``--analyze-pin``. You can also use this as a full-fledged CEC device by configuring it using ``cec-ctl --tv -p0.0.0.0`` or ``cec-ctl --playback -p1.0.0.0``. + +.. _extron_da_hd_4k_plus: + +Extron DA HD 4K PLUS CEC Adapter driver +======================================= + +This driver is for the Extron DA HD 4K PLUS series of HDMI Distribution +Amplifiers: https://www.extron.com/product/dahd4kplusseries + +The 2, 4 and 6 port models are supported. + +Firmware version 1.02.0001 or higher is required. + +Note that older Extron hardware revisions have a problem with the CEC voltage, +which may mean that CEC will not work. This is fixed in hardware revisions +E34814 and up. + +The CEC support has two modes: the first is a manual mode where userspace has +to manually control CEC for the HDMI Input and all HDMI Outputs. While this gives +full control, it is also complicated. + +The second mode is an automatic mode, which is selected if the module option +``vendor_id`` is set. In that case the driver controls CEC and CEC messages +received in the input will be distributed to the outputs. It is still possible +to use the /dev/cecX devices to talk to the connected devices directly, but it is +the driver that configures everything and deals with things like Hotplug Detect +changes. + +The driver also takes care of the EDIDs: /dev/videoX devices are created to +read the EDIDs and (for the HDMI Input port) to set the EDID. + +By default userspace is responsible to set the EDID for the HDMI Input +according to the EDIDs of the connected displays. But if the ``manufacturer_name`` +module option is set, then the driver will take care of setting the EDID +of the HDMI Input based on the supported resolutions of the connected displays. +Currently the driver only supports resolutions 1080p60 and 4kp60: if all connected +displays support 4kp60, then it will advertise 4kp60 on the HDMI input, otherwise +it will fall back to an EDID that just reports 1080p60. + +The status of the Extron is reported in ``/sys/kernel/debug/cec/cecX/status``. + +The extron-da-hd-4k-plus driver implements the following module options: + +``debug`` +--------- + +If set to 1, then all serial port traffic is shown. + +``vendor_id`` +------------- + +The CEC Vendor ID to report to connected displays. + +If set, then the driver will take care of distributing CEC messages received +on the input to the HDMI outputs. This is done for the following CEC messages: + +- <Standby> +- <Image View On> and <Text View On> +- <Give Device Power Status> +- <Set System Audio Mode> +- <Request Current Latency> + +If not set, then userspace is responsible for this, and it will have to +configure the CEC devices for HDMI Input and the HDMI Outputs manually. + +``manufacturer_name`` +--------------------- + +A three character manufacturer name that is used in the EDID for the HDMI +Input. If not set, then userspace is reponsible for configuring an EDID. +If set, then the driver will update the EDID automatically based on the +resolutions supported by the connected displays, and it will not be possible +anymore to manually set the EDID for the HDMI Input. + +``hpd_never_low`` +----------------- + +If set, then the Hotplug Detect pin of the HDMI Input will always be high, +even if nothing is connected to the HDMI Outputs. If not set (the default) +then the Hotplug Detect pin of the HDMI input will go low if all the detected +Hotplug Detect pins of the HDMI Outputs are also low. + +This option may be changed dynamically. diff --git a/Documentation/admin-guide/media/em28xx-cardlist.rst b/Documentation/admin-guide/media/em28xx-cardlist.rst index ace65718ea22..7dac07986d91 100644 --- a/Documentation/admin-guide/media/em28xx-cardlist.rst +++ b/Documentation/admin-guide/media/em28xx-cardlist.rst @@ -438,3 +438,11 @@ EM28xx cards list - MyGica iGrabber - em2860 - 1f4d:1abe + * - 106 + - Hauppauge USB QuadHD ATSC + - em28274 + - 2040:846d + * - 107 + - MyGica UTV3 Analog USB2.0 TV Box + - em2860 + - eb1a:2860 diff --git a/Documentation/admin-guide/media/index.rst b/Documentation/admin-guide/media/index.rst index be7e0e4482ca..b11737ae6c04 100644 --- a/Documentation/admin-guide/media/index.rst +++ b/Documentation/admin-guide/media/index.rst @@ -20,6 +20,11 @@ Documentation/driver-api/media/index.rst - for driver development information and Kernel APIs used by media devices; +Documentation/process/debugging/media_specific_debugging_guide.rst + + - for advice about essential tools and techniques to debug drivers on this + subsystem + .. toctree:: :caption: Table of Contents :maxdepth: 2 diff --git a/Documentation/admin-guide/media/ipu3.rst b/Documentation/admin-guide/media/ipu3.rst index 83b3cd03b35c..9c190942932e 100644 --- a/Documentation/admin-guide/media/ipu3.rst +++ b/Documentation/admin-guide/media/ipu3.rst @@ -98,7 +98,7 @@ frames in packed raw Bayer format to IPU3 CSI2 receiver. # and that ov5670 sensor is connected to i2c bus 10 with address 0x36 export SDEV=$(media-ctl -d $MDEV -e "ov5670 10-0036") - # Establish the link for the media devices using media-ctl [#f3]_ + # Establish the link for the media devices using media-ctl media-ctl -d $MDEV -l "ov5670:0 -> ipu3-csi2 0:0[1]" # Set the format for the media devices @@ -589,12 +589,8 @@ preserved. References ========== -.. [#f5] drivers/staging/media/ipu3/include/uapi/intel-ipu3.h - .. [#f1] https://github.com/intel/nvt .. [#f2] http://git.ideasonboard.org/yavta.git -.. [#f3] http://git.ideasonboard.org/?p=media-ctl.git;a=summary - .. [#f4] ImgU limitation requires an additional 16x16 for all input resolutions diff --git a/Documentation/admin-guide/media/ipu6-isys.rst b/Documentation/admin-guide/media/ipu6-isys.rst new file mode 100644 index 000000000000..d05086824a74 --- /dev/null +++ b/Documentation/admin-guide/media/ipu6-isys.rst @@ -0,0 +1,161 @@ +.. SPDX-License-Identifier: GPL-2.0 + +.. include:: <isonum.txt> + +======================================================== +Intel Image Processing Unit 6 (IPU6) Input System driver +======================================================== + +Copyright |copy| 2023--2024 Intel Corporation + +Introduction +============ + +This file documents the Intel IPU6 (6th generation Image Processing Unit) +Input System (MIPI CSI2 receiver) drivers located under +drivers/media/pci/intel/ipu6. + +The Intel IPU6 can be found in certain Intel SoCs but not in all SKUs: + +* Tiger Lake +* Jasper Lake +* Alder Lake +* Raptor Lake +* Meteor Lake + +Intel IPU6 is made up of two components - Input System (ISYS) and Processing +System (PSYS). + +The Input System mainly works as MIPI CSI-2 receiver which receives and +processes the image data from the sensors and outputs the frames to memory. + +There are 2 driver modules - intel-ipu6 and intel-ipu6-isys. intel-ipu6 is an +IPU6 common driver which does PCI configuration, firmware loading and parsing, +firmware authentication, DMA mapping and IPU-MMU (internal Memory mapping Unit) +configuration. intel_ipu6_isys implements V4L2, Media Controller and V4L2 +sub-device interfaces. The IPU6 ISYS driver supports camera sensors connected +to the IPU6 ISYS through V4L2 sub-device sensor drivers. + +.. Note:: See Documentation/driver-api/media/drivers/ipu6.rst for more + information about the IPU6 hardware. + +Input system driver +=================== + +The Input System driver mainly configures CSI-2 D-PHY, constructs the firmware +stream configuration, sends commands to firmware, gets response from hardware +and firmware and then returns buffers to user. The ISYS is represented as +several V4L2 sub-devices as well as video nodes. + +.. kernel-figure:: ipu6_isys_graph.svg + :alt: ipu6 isys media graph with multiple streams support + + IPU6 ISYS media graph with multiple streams support + +The graph has been produced using the following command: + +.. code-block:: none + + fdp -Gsplines=true -Tsvg < dot > dot.svg + +Capturing frames with IPU6 ISYS +------------------------------- + +IPU6 ISYS is used to capture frames from the camera sensors connected to the +CSI2 ports. The supported input formats of ISYS are listed in table below: + +.. tabularcolumns:: |p{0.8cm}|p{4.0cm}|p{4.0cm}| + +.. flat-table:: + :header-rows: 1 + + * - IPU6 ISYS supported input formats + + * - RGB565, RGB888 + + * - UYVY8, YUYV8 + + * - RAW8, RAW10, RAW12 + +.. _ipu6_isys_capture_examples: + +Examples +~~~~~~~~ + +Here is an example of IPU6 ISYS raw capture on Dell XPS 9315 laptop. On this +machine, ov01a10 sensor is connected to IPU ISYS CSI-2 port 2, which can +generate images at sBGGR10 with resolution 1280x800. + +Using the media controller APIs, we can configure ov01a10 sensor by +media-ctl [#f1]_ and yavta [#f2]_ to transmit frames to IPU6 ISYS. + +.. code-block:: none + + # Example 1 capture frame from ov01a10 camera sensor + # This example assumes /dev/media0 as the IPU ISYS media device + export MDEV=/dev/media0 + + # Establish the link for the media devices using media-ctl + media-ctl -d $MDEV -l "\"ov01a10 3-0036\":0 -> \"Intel IPU6 CSI2 2\":0[1]" + + # Set the format for the media devices + media-ctl -d $MDEV -V "ov01a10:0 [fmt:SBGGR10/1280x800]" + media-ctl -d $MDEV -V "Intel IPU6 CSI2 2:0 [fmt:SBGGR10/1280x800]" + media-ctl -d $MDEV -V "Intel IPU6 CSI2 2:1 [fmt:SBGGR10/1280x800]" + +Once the media pipeline is configured, desired sensor specific settings +(such as exposure and gain settings) can be set, using the yavta tool. + +e.g + +.. code-block:: none + + # and that ov01a10 sensor is connected to i2c bus 3 with address 0x36 + export SDEV=$(media-ctl -d $MDEV -e "ov01a10 3-0036") + + yavta -w 0x009e0903 400 $SDEV + yavta -w 0x009e0913 1000 $SDEV + yavta -w 0x009e0911 2000 $SDEV + +Once the desired sensor settings are set, frame captures can be done as below. + +e.g + +.. code-block:: none + + yavta --data-prefix -u -c10 -n5 -I -s 1280x800 --file=/tmp/frame-#.bin \ + -f SBGGR10 $(media-ctl -d $MDEV -e "Intel IPU6 ISYS Capture 0") + +With the above command, 10 frames are captured at 1280x800 resolution with +sBGGR10 format. The captured frames are available as /tmp/frame-#.bin files. + +Here is another example of IPU6 ISYS RAW and metadata capture from camera +sensor ov2740 on Lenovo X1 Yoga laptop. + +.. code-block:: none + + media-ctl -l "\"ov2740 14-0036\":0 -> \"Intel IPU6 CSI2 1\":0[1]" + media-ctl -l "\"Intel IPU6 CSI2 1\":1 -> \"Intel IPU6 ISYS Capture 0\":0[1]" + media-ctl -l "\"Intel IPU6 CSI2 1\":2 -> \"Intel IPU6 ISYS Capture 1\":0[1]" + + # set routing + media-ctl -R "\"Intel IPU6 CSI2 1\" [0/0->1/0[1],0/1->2/1[1]]" + + media-ctl -V "\"Intel IPU6 CSI2 1\":0/0 [fmt:SGRBG10/1932x1092]" + media-ctl -V "\"Intel IPU6 CSI2 1\":0/1 [fmt:GENERIC_8/97x1]" + media-ctl -V "\"Intel IPU6 CSI2 1\":1/0 [fmt:SGRBG10/1932x1092]" + media-ctl -V "\"Intel IPU6 CSI2 1\":2/1 [fmt:GENERIC_8/97x1]" + + CAPTURE_DEV=$(media-ctl -e "Intel IPU6 ISYS Capture 0") + ./yavta --data-prefix -c100 -n5 -I -s1932x1092 --file=/tmp/frame-#.bin \ + -f SGRBG10 ${CAPTURE_DEV} + + CAPTURE_META=$(media-ctl -e "Intel IPU6 ISYS Capture 1") + ./yavta --data-prefix -c100 -n5 -I -s97x1 -B meta-capture \ + --file=/tmp/meta-#.bin -f GENERIC_8 ${CAPTURE_META} + +References +========== + +.. [#f1] https://git.ideasonboard.org/media-ctl.git +.. [#f2] https://git.ideasonboard.org/yavta.git diff --git a/Documentation/admin-guide/media/ipu6_isys_graph.svg b/Documentation/admin-guide/media/ipu6_isys_graph.svg new file mode 100644 index 000000000000..c8539ef320d2 --- /dev/null +++ b/Documentation/admin-guide/media/ipu6_isys_graph.svg @@ -0,0 +1,548 @@ +<?xml version="1.0" encoding="UTF-8" standalone="no"?> +<!DOCTYPE svg PUBLIC "-//W3C//DTD SVG 1.1//EN" + "http://www.w3.org/Graphics/SVG/1.1/DTD/svg11.dtd"> +<!-- Generated by graphviz version 2.43.0 (0) + --> +<!-- Title: board Pages: 1 --> +<svg width="1703pt" height="1473pt" + viewBox="0.00 0.00 1703.00 1473.00" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink"> +<g id="graph0" class="graph" transform="scale(1 1) rotate(0) translate(4 1469)"> +<title>board</title> +<polygon fill="white" stroke="transparent" points="-4,4 -4,-1469 1699,-1469 1699,4 -4,4"/> +<!-- n00000001 --> +<g id="node1" class="node"> +<title>n00000001</title> +<polygon fill="yellow" stroke="black" points="832.99,-750.08 629.99,-750.08 629.99,-712.08 832.99,-712.08 832.99,-750.08"/> +<text text-anchor="middle" x="731.49" y="-734.88" font-family="Times,serif" font-size="14.00">Intel IPU6 ISYS Capture 0</text> +<text text-anchor="middle" x="731.49" y="-719.88" font-family="Times,serif" font-size="14.00">/dev/video0</text> +</g> +<!-- n00000005 --> +<g id="node2" class="node"> +<title>n00000005</title> +<polygon fill="yellow" stroke="black" points="1396.59,-771.88 1193.59,-771.88 1193.59,-733.88 1396.59,-733.88 1396.59,-771.88"/> +<text text-anchor="middle" x="1295.09" y="-756.68" font-family="Times,serif" font-size="14.00">Intel IPU6 ISYS Capture 1</text> +<text text-anchor="middle" x="1295.09" y="-741.68" font-family="Times,serif" font-size="14.00">/dev/video1</text> +</g> +<!-- n00000009 --> +<g id="node3" class="node"> +<title>n00000009</title> +<polygon fill="yellow" stroke="black" points="1118.52,-690.47 915.52,-690.47 915.52,-652.47 1118.52,-652.47 1118.52,-690.47"/> +<text text-anchor="middle" x="1017.02" y="-675.27" font-family="Times,serif" font-size="14.00">Intel IPU6 ISYS Capture 2</text> +<text text-anchor="middle" x="1017.02" y="-660.27" font-family="Times,serif" font-size="14.00">/dev/video2</text> +</g> +<!-- n0000000d --> +<g id="node4" class="node"> +<title>n0000000d</title> +<polygon fill="yellow" stroke="black" points="1105.89,-838.84 902.89,-838.84 902.89,-800.84 1105.89,-800.84 1105.89,-838.84"/> +<text text-anchor="middle" x="1004.39" y="-823.64" font-family="Times,serif" font-size="14.00">Intel IPU6 ISYS Capture 3</text> +<text text-anchor="middle" x="1004.39" y="-808.64" font-family="Times,serif" font-size="14.00">/dev/video3</text> +</g> +<!-- n00000011 --> +<g id="node5" class="node"> +<title>n00000011</title> +<polygon fill="yellow" stroke="black" points="1279.22,-992.95 1076.22,-992.95 1076.22,-954.95 1279.22,-954.95 1279.22,-992.95"/> +<text text-anchor="middle" x="1177.72" y="-977.75" font-family="Times,serif" font-size="14.00">Intel IPU6 ISYS Capture 4</text> +<text text-anchor="middle" x="1177.72" y="-962.75" font-family="Times,serif" font-size="14.00">/dev/video4</text> +</g> +<!-- n00000015 --> +<g id="node6" class="node"> +<title>n00000015</title> +<polygon fill="yellow" stroke="black" points="939.18,-984.91 736.18,-984.91 736.18,-946.91 939.18,-946.91 939.18,-984.91"/> +<text text-anchor="middle" x="837.68" y="-969.71" font-family="Times,serif" font-size="14.00">Intel IPU6 ISYS Capture 5</text> +<text text-anchor="middle" x="837.68" y="-954.71" font-family="Times,serif" font-size="14.00">/dev/video5</text> +</g> +<!-- n00000019 --> +<g id="node7" class="node"> +<title>n00000019</title> +<polygon fill="yellow" stroke="black" points="957.87,-527.99 754.87,-527.99 754.87,-489.99 957.87,-489.99 957.87,-527.99"/> +<text text-anchor="middle" x="856.37" y="-512.79" font-family="Times,serif" font-size="14.00">Intel IPU6 ISYS Capture 6</text> +<text text-anchor="middle" x="856.37" y="-497.79" font-family="Times,serif" font-size="14.00">/dev/video6</text> +</g> +<!-- n0000001d --> +<g id="node8" class="node"> +<title>n0000001d</title> +<polygon fill="yellow" stroke="black" points="1291.02,-542.15 1088.02,-542.15 1088.02,-504.15 1291.02,-504.15 1291.02,-542.15"/> +<text text-anchor="middle" x="1189.52" y="-526.95" font-family="Times,serif" font-size="14.00">Intel IPU6 ISYS Capture 7</text> +<text text-anchor="middle" x="1189.52" y="-511.95" font-family="Times,serif" font-size="14.00">/dev/video7</text> +</g> +<!-- n00000021 --> +<g id="node9" class="node"> +<title>n00000021</title> +<polygon fill="yellow" stroke="black" points="202.74,-611.46 -0.26,-611.46 -0.26,-573.46 202.74,-573.46 202.74,-611.46"/> +<text text-anchor="middle" x="101.24" y="-596.26" font-family="Times,serif" font-size="14.00">Intel IPU6 ISYS Capture 8</text> +<text text-anchor="middle" x="101.24" y="-581.26" font-family="Times,serif" font-size="14.00">/dev/video8</text> +</g> +<!-- n00000025 --> +<g id="node10" class="node"> +<title>n00000025</title> +<polygon fill="yellow" stroke="black" points="764.86,-637.89 561.86,-637.89 561.86,-599.89 764.86,-599.89 764.86,-637.89"/> +<text text-anchor="middle" x="663.36" y="-622.69" font-family="Times,serif" font-size="14.00">Intel IPU6 ISYS Capture 9</text> +<text text-anchor="middle" x="663.36" y="-607.69" font-family="Times,serif" font-size="14.00">/dev/video9</text> +</g> +<!-- n00000029 --> +<g id="node11" class="node"> +<title>n00000029</title> +<polygon fill="yellow" stroke="black" points="358.62,-519.5 146.62,-519.5 146.62,-481.5 358.62,-481.5 358.62,-519.5"/> +<text text-anchor="middle" x="252.62" y="-504.3" font-family="Times,serif" font-size="14.00">Intel IPU6 ISYS Capture 10</text> +<text text-anchor="middle" x="252.62" y="-489.3" font-family="Times,serif" font-size="14.00">/dev/video10</text> +</g> +<!-- n0000002d --> +<g id="node12" class="node"> +<title>n0000002d</title> +<polygon fill="yellow" stroke="black" points="481.4,-662.59 269.4,-662.59 269.4,-624.59 481.4,-624.59 481.4,-662.59"/> +<text text-anchor="middle" x="375.4" y="-647.39" font-family="Times,serif" font-size="14.00">Intel IPU6 ISYS Capture 11</text> +<text text-anchor="middle" x="375.4" y="-632.39" font-family="Times,serif" font-size="14.00">/dev/video11</text> +</g> +<!-- n00000031 --> +<g id="node13" class="node"> +<title>n00000031</title> +<polygon fill="yellow" stroke="black" points="637.17,-837.47 425.17,-837.47 425.17,-799.47 637.17,-799.47 637.17,-837.47"/> +<text text-anchor="middle" x="531.17" y="-822.27" font-family="Times,serif" font-size="14.00">Intel IPU6 ISYS Capture 12</text> +<text text-anchor="middle" x="531.17" y="-807.27" font-family="Times,serif" font-size="14.00">/dev/video12</text> +</g> +<!-- n00000035 --> +<g id="node14" class="node"> +<title>n00000035</title> +<polygon fill="yellow" stroke="black" points="337.75,-833.67 125.75,-833.67 125.75,-795.67 337.75,-795.67 337.75,-833.67"/> +<text text-anchor="middle" x="231.75" y="-818.47" font-family="Times,serif" font-size="14.00">Intel IPU6 ISYS Capture 13</text> +<text text-anchor="middle" x="231.75" y="-803.47" font-family="Times,serif" font-size="14.00">/dev/video13</text> +</g> +<!-- n00000039 --> +<g id="node15" class="node"> +<title>n00000039</title> +<polygon fill="yellow" stroke="black" points="393.07,-317.96 181.07,-317.96 181.07,-279.96 393.07,-279.96 393.07,-317.96"/> +<text text-anchor="middle" x="287.07" y="-302.76" font-family="Times,serif" font-size="14.00">Intel IPU6 ISYS Capture 14</text> +<text text-anchor="middle" x="287.07" y="-287.76" font-family="Times,serif" font-size="14.00">/dev/video14</text> +</g> +<!-- n0000003d --> +<g id="node16" class="node"> +<title>n0000003d</title> +<polygon fill="yellow" stroke="black" points="701.46,-391.04 489.46,-391.04 489.46,-353.04 701.46,-353.04 701.46,-391.04"/> +<text text-anchor="middle" x="595.46" y="-375.84" font-family="Times,serif" font-size="14.00">Intel IPU6 ISYS Capture 15</text> +<text text-anchor="middle" x="595.46" y="-360.84" font-family="Times,serif" font-size="14.00">/dev/video15</text> +</g> +<!-- n00000041 --> +<g id="node17" class="node"> +<title>n00000041</title> +<polygon fill="yellow" stroke="black" points="212.45,-1228.8 0.45,-1228.8 0.45,-1190.8 212.45,-1190.8 212.45,-1228.8"/> +<text text-anchor="middle" x="106.45" y="-1213.6" font-family="Times,serif" font-size="14.00">Intel IPU6 ISYS Capture 16</text> +<text text-anchor="middle" x="106.45" y="-1198.6" font-family="Times,serif" font-size="14.00">/dev/video16</text> +</g> +<!-- n00000045 --> +<g id="node18" class="node"> +<title>n00000045</title> +<polygon fill="yellow" stroke="black" points="784.86,-1252.38 572.86,-1252.38 572.86,-1214.38 784.86,-1214.38 784.86,-1252.38"/> +<text text-anchor="middle" x="678.86" y="-1237.18" font-family="Times,serif" font-size="14.00">Intel IPU6 ISYS Capture 17</text> +<text text-anchor="middle" x="678.86" y="-1222.18" font-family="Times,serif" font-size="14.00">/dev/video17</text> +</g> +<!-- n00000049 --> +<g id="node19" class="node"> +<title>n00000049</title> +<polygon fill="yellow" stroke="black" points="503.14,-1169.96 291.14,-1169.96 291.14,-1131.96 503.14,-1131.96 503.14,-1169.96"/> +<text text-anchor="middle" x="397.14" y="-1154.76" font-family="Times,serif" font-size="14.00">Intel IPU6 ISYS Capture 18</text> +<text text-anchor="middle" x="397.14" y="-1139.76" font-family="Times,serif" font-size="14.00">/dev/video18</text> +</g> +<!-- n0000004d --> +<g id="node20" class="node"> +<title>n0000004d</title> +<polygon fill="yellow" stroke="black" points="492.62,-1319.4 280.62,-1319.4 280.62,-1281.4 492.62,-1281.4 492.62,-1319.4"/> +<text text-anchor="middle" x="386.62" y="-1304.2" font-family="Times,serif" font-size="14.00">Intel IPU6 ISYS Capture 19</text> +<text text-anchor="middle" x="386.62" y="-1289.2" font-family="Times,serif" font-size="14.00">/dev/video19</text> +</g> +<!-- n00000051 --> +<g id="node21" class="node"> +<title>n00000051</title> +<polygon fill="yellow" stroke="black" points="680.74,-1464.66 468.74,-1464.66 468.74,-1426.66 680.74,-1426.66 680.74,-1464.66"/> +<text text-anchor="middle" x="574.74" y="-1449.46" font-family="Times,serif" font-size="14.00">Intel IPU6 ISYS Capture 20</text> +<text text-anchor="middle" x="574.74" y="-1434.46" font-family="Times,serif" font-size="14.00">/dev/video20</text> +</g> +<!-- n00000055 --> +<g id="node22" class="node"> +<title>n00000055</title> +<polygon fill="yellow" stroke="black" points="302.42,-1452.56 90.42,-1452.56 90.42,-1414.56 302.42,-1414.56 302.42,-1452.56"/> +<text text-anchor="middle" x="196.42" y="-1437.36" font-family="Times,serif" font-size="14.00">Intel IPU6 ISYS Capture 21</text> +<text text-anchor="middle" x="196.42" y="-1422.36" font-family="Times,serif" font-size="14.00">/dev/video21</text> +</g> +<!-- n00000059 --> +<g id="node23" class="node"> +<title>n00000059</title> +<polygon fill="yellow" stroke="black" points="319.89,-1018.32 107.89,-1018.32 107.89,-980.32 319.89,-980.32 319.89,-1018.32"/> +<text text-anchor="middle" x="213.89" y="-1003.12" font-family="Times,serif" font-size="14.00">Intel IPU6 ISYS Capture 22</text> +<text text-anchor="middle" x="213.89" y="-988.12" font-family="Times,serif" font-size="14.00">/dev/video22</text> +</g> +<!-- n0000005d --> +<g id="node24" class="node"> +<title>n0000005d</title> +<polygon fill="yellow" stroke="black" points="692.62,-1031.39 480.62,-1031.39 480.62,-993.39 692.62,-993.39 692.62,-1031.39"/> +<text text-anchor="middle" x="586.62" y="-1016.19" font-family="Times,serif" font-size="14.00">Intel IPU6 ISYS Capture 23</text> +<text text-anchor="middle" x="586.62" y="-1001.19" font-family="Times,serif" font-size="14.00">/dev/video23</text> +</g> +<!-- n00000061 --> +<g id="node25" class="node"> +<title>n00000061</title> +<polygon fill="yellow" stroke="black" points="1122.45,-248.8 910.45,-248.8 910.45,-210.8 1122.45,-210.8 1122.45,-248.8"/> +<text text-anchor="middle" x="1016.45" y="-233.6" font-family="Times,serif" font-size="14.00">Intel IPU6 ISYS Capture 24</text> +<text text-anchor="middle" x="1016.45" y="-218.6" font-family="Times,serif" font-size="14.00">/dev/video24</text> +</g> +<!-- n00000065 --> +<g id="node26" class="node"> +<title>n00000065</title> +<polygon fill="yellow" stroke="black" points="1694.86,-272.38 1482.86,-272.38 1482.86,-234.38 1694.86,-234.38 1694.86,-272.38"/> +<text text-anchor="middle" x="1588.86" y="-257.18" font-family="Times,serif" font-size="14.00">Intel IPU6 ISYS Capture 25</text> +<text text-anchor="middle" x="1588.86" y="-242.18" font-family="Times,serif" font-size="14.00">/dev/video25</text> +</g> +<!-- n00000069 --> +<g id="node27" class="node"> +<title>n00000069</title> +<polygon fill="yellow" stroke="black" points="1413.14,-189.96 1201.14,-189.96 1201.14,-151.96 1413.14,-151.96 1413.14,-189.96"/> +<text text-anchor="middle" x="1307.14" y="-174.76" font-family="Times,serif" font-size="14.00">Intel IPU6 ISYS Capture 26</text> +<text text-anchor="middle" x="1307.14" y="-159.76" font-family="Times,serif" font-size="14.00">/dev/video26</text> +</g> +<!-- n0000006d --> +<g id="node28" class="node"> +<title>n0000006d</title> +<polygon fill="yellow" stroke="black" points="1402.62,-339.4 1190.62,-339.4 1190.62,-301.4 1402.62,-301.4 1402.62,-339.4"/> +<text text-anchor="middle" x="1296.62" y="-324.2" font-family="Times,serif" font-size="14.00">Intel IPU6 ISYS Capture 27</text> +<text text-anchor="middle" x="1296.62" y="-309.2" font-family="Times,serif" font-size="14.00">/dev/video27</text> +</g> +<!-- n00000071 --> +<g id="node29" class="node"> +<title>n00000071</title> +<polygon fill="yellow" stroke="black" points="1590.74,-484.66 1378.74,-484.66 1378.74,-446.66 1590.74,-446.66 1590.74,-484.66"/> +<text text-anchor="middle" x="1484.74" y="-469.46" font-family="Times,serif" font-size="14.00">Intel IPU6 ISYS Capture 28</text> +<text text-anchor="middle" x="1484.74" y="-454.46" font-family="Times,serif" font-size="14.00">/dev/video28</text> +</g> +<!-- n00000075 --> +<g id="node30" class="node"> +<title>n00000075</title> +<polygon fill="yellow" stroke="black" points="1212.42,-472.56 1000.42,-472.56 1000.42,-434.56 1212.42,-434.56 1212.42,-472.56"/> +<text text-anchor="middle" x="1106.42" y="-457.36" font-family="Times,serif" font-size="14.00">Intel IPU6 ISYS Capture 29</text> +<text text-anchor="middle" x="1106.42" y="-442.36" font-family="Times,serif" font-size="14.00">/dev/video29</text> +</g> +<!-- n00000079 --> +<g id="node31" class="node"> +<title>n00000079</title> +<polygon fill="yellow" stroke="black" points="1229.89,-38.32 1017.89,-38.32 1017.89,-0.32 1229.89,-0.32 1229.89,-38.32"/> +<text text-anchor="middle" x="1123.89" y="-23.12" font-family="Times,serif" font-size="14.00">Intel IPU6 ISYS Capture 30</text> +<text text-anchor="middle" x="1123.89" y="-8.12" font-family="Times,serif" font-size="14.00">/dev/video30</text> +</g> +<!-- n0000007d --> +<g id="node32" class="node"> +<title>n0000007d</title> +<polygon fill="yellow" stroke="black" points="1602.62,-51.39 1390.62,-51.39 1390.62,-13.39 1602.62,-13.39 1602.62,-51.39"/> +<text text-anchor="middle" x="1496.62" y="-36.19" font-family="Times,serif" font-size="14.00">Intel IPU6 ISYS Capture 31</text> +<text text-anchor="middle" x="1496.62" y="-21.19" font-family="Times,serif" font-size="14.00">/dev/video31</text> +</g> +<!-- n00000081 --> +<g id="node33" class="node"> +<title>n00000081</title> +<path fill="green" stroke="black" d="M924.28,-700.28C924.28,-700.28 1108.28,-700.28 1108.28,-700.28 1114.28,-700.28 1120.28,-706.28 1120.28,-712.28 1120.28,-712.28 1120.28,-772.28 1120.28,-772.28 1120.28,-778.28 1114.28,-784.28 1108.28,-784.28 1108.28,-784.28 924.28,-784.28 924.28,-784.28 918.28,-784.28 912.28,-778.28 912.28,-772.28 912.28,-772.28 912.28,-712.28 912.28,-712.28 912.28,-706.28 918.28,-700.28 924.28,-700.28"/> +<text text-anchor="middle" x="1016.28" y="-769.08" font-family="Times,serif" font-size="14.00">0</text> +<polyline fill="none" stroke="black" points="912.28,-761.28 1120.28,-761.28 "/> +<text text-anchor="middle" x="1016.28" y="-746.08" font-family="Times,serif" font-size="14.00">Intel IPU6 CSI2 0</text> +<text text-anchor="middle" x="1016.28" y="-731.08" font-family="Times,serif" font-size="14.00">/dev/v4l-subdev0</text> +<polyline fill="none" stroke="black" points="912.28,-723.28 1120.28,-723.28 "/> +<text text-anchor="middle" x="925.28" y="-708.08" font-family="Times,serif" font-size="14.00">1</text> +<polyline fill="none" stroke="black" points="938.28,-700.28 938.28,-723.28 "/> +<text text-anchor="middle" x="951.28" y="-708.08" font-family="Times,serif" font-size="14.00">2</text> +<polyline fill="none" stroke="black" points="964.28,-700.28 964.28,-723.28 "/> +<text text-anchor="middle" x="977.28" y="-708.08" font-family="Times,serif" font-size="14.00">3</text> +<polyline fill="none" stroke="black" points="990.28,-700.28 990.28,-723.28 "/> +<text text-anchor="middle" x="1003.28" y="-708.08" font-family="Times,serif" font-size="14.00">4</text> +<polyline fill="none" stroke="black" points="1016.28,-700.28 1016.28,-723.28 "/> +<text text-anchor="middle" x="1029.28" y="-708.08" font-family="Times,serif" font-size="14.00">5</text> +<polyline fill="none" stroke="black" points="1042.28,-700.28 1042.28,-723.28 "/> +<text text-anchor="middle" x="1055.28" y="-708.08" font-family="Times,serif" font-size="14.00">6</text> +<polyline fill="none" stroke="black" points="1068.28,-700.28 1068.28,-723.28 "/> +<text text-anchor="middle" x="1081.28" y="-708.08" font-family="Times,serif" font-size="14.00">7</text> +<polyline fill="none" stroke="black" points="1094.28,-700.28 1094.28,-723.28 "/> +<text text-anchor="middle" x="1107.28" y="-708.08" font-family="Times,serif" font-size="14.00">8</text> +</g> +<!-- n00000081->n00000001 --> +<g id="edge1" class="edge"> +<title>n00000081:port1->n00000001</title> +<path fill="none" stroke="black" stroke-dasharray="5,2" d="M912.28,-711.28C912.28,-711.28 880.33,-714.78 843.28,-718.84"/> +<polygon fill="black" stroke="black" points="842.81,-715.37 833.25,-719.94 843.57,-722.33 842.81,-715.37"/> +</g> +<!-- n00000081->n00000005 --> +<g id="edge2" class="edge"> +<title>n00000081:port2->n00000005</title> +<path fill="none" stroke="black" stroke-dasharray="5,2" d="M951.38,-700.28C951.38,-700.28 1086.18,-688.61 1123.48,-697.08 1155.93,-704.45 1158.99,-719.67 1190.39,-730.68 1190.49,-730.71 1190.59,-730.75 1190.69,-730.78"/> +<polygon fill="black" stroke="black" points="1189.45,-734.06 1200.05,-733.86 1191.64,-727.41 1189.45,-734.06"/> +</g> +<!-- n00000081->n00000009 --> +<g id="edge3" class="edge"> +<title>n00000081:port3->n00000009</title> +<path fill="none" stroke="black" stroke-dasharray="5,2" d="M977.28,-700.28C977.28,-700.28 979.31,-698.81 982.45,-696.54"/> +<polygon fill="black" stroke="black" points="984.7,-699.23 990.74,-690.53 980.59,-693.56 984.7,-699.23"/> +</g> +<!-- n00000081->n0000000d --> +<g id="edge4" class="edge"> +<title>n00000081:port4->n0000000d</title> +<path fill="none" stroke="black" stroke-dasharray="5,2" d="M1003.38,-700.26C1003.38,-700.26 916.62,-689.8 909.08,-697.08 880.2,-725.01 885.68,-754.82 909.08,-787.48 910.88,-789.99 918.96,-793.59 929.7,-797.47"/> +<polygon fill="black" stroke="black" points="928.69,-800.82 939.28,-800.79 930.98,-794.21 928.69,-800.82"/> +</g> +<!-- n00000081->n00000011 --> +<g id="edge5" class="edge"> +<title>n00000081:port5->n00000011</title> +<path fill="none" stroke="black" stroke-dasharray="5,2" d="M1029.19,-700.26C1029.19,-700.26 1115.28,-690.56 1123.48,-697.08 1198.37,-756.64 1190.55,-886.51 1182.64,-944.71"/> +<polygon fill="black" stroke="black" points="1179.16,-944.31 1181.18,-954.71 1186.09,-945.32 1179.16,-944.31"/> +</g> +<!-- n00000081->n00000015 --> +<g id="edge6" class="edge"> +<title>n00000081:port6->n00000015</title> +<path fill="none" stroke="black" stroke-dasharray="5,2" d="M1055.18,-700.28C1055.18,-700.28 915.57,-692.2 909.08,-697.08 834.02,-753.51 831.79,-879.34 835.06,-936.56"/> +<polygon fill="black" stroke="black" points="831.58,-936.99 835.74,-946.73 838.56,-936.52 831.58,-936.99"/> +</g> +<!-- n00000081->n00000019 --> +<g id="edge7" class="edge"> +<title>n00000081:port7->n00000019</title> +<path fill="none" stroke="black" stroke-dasharray="5,2" d="M1081.28,-700.28C1081.28,-700.28 916.04,-696.54 912.32,-693.67 864.52,-656.73 856.3,-580.22 855.62,-538.2"/> +<polygon fill="black" stroke="black" points="859.11,-538.05 855.59,-528.06 852.11,-538.07 859.11,-538.05"/> +</g> +<!-- n00000081->n0000001d --> +<g id="edge8" class="edge"> +<title>n00000081:port8->n0000001d</title> +<path fill="none" stroke="black" stroke-dasharray="5,2" d="M1107.28,-700.28C1107.28,-700.28 1119.29,-696.23 1121.72,-693.67 1159.76,-653.62 1177.38,-589.6 1184.78,-552.46"/> +<polygon fill="black" stroke="black" points="1188.29,-552.76 1186.69,-542.29 1181.41,-551.47 1188.29,-552.76"/> +</g> +<!-- n0000008b --> +<g id="node34" class="node"> +<title>n0000008b</title> +<path fill="green" stroke="black" d="M293.1,-532.08C293.1,-532.08 477.1,-532.08 477.1,-532.08 483.1,-532.08 489.1,-538.08 489.1,-544.08 489.1,-544.08 489.1,-604.08 489.1,-604.08 489.1,-610.08 483.1,-616.08 477.1,-616.08 477.1,-616.08 293.1,-616.08 293.1,-616.08 287.1,-616.08 281.1,-610.08 281.1,-604.08 281.1,-604.08 281.1,-544.08 281.1,-544.08 281.1,-538.08 287.1,-532.08 293.1,-532.08"/> +<text text-anchor="middle" x="385.1" y="-600.88" font-family="Times,serif" font-size="14.00">0</text> +<polyline fill="none" stroke="black" points="281.1,-593.08 489.1,-593.08 "/> +<text text-anchor="middle" x="385.1" y="-577.88" font-family="Times,serif" font-size="14.00">Intel IPU6 CSI2 1</text> +<text text-anchor="middle" x="385.1" y="-562.88" font-family="Times,serif" font-size="14.00">/dev/v4l-subdev1</text> +<polyline fill="none" stroke="black" points="281.1,-555.08 489.1,-555.08 "/> +<text text-anchor="middle" x="294.1" y="-539.88" font-family="Times,serif" font-size="14.00">1</text> +<polyline fill="none" stroke="black" points="307.1,-532.08 307.1,-555.08 "/> +<text text-anchor="middle" x="320.1" y="-539.88" font-family="Times,serif" font-size="14.00">2</text> +<polyline fill="none" stroke="black" points="333.1,-532.08 333.1,-555.08 "/> +<text text-anchor="middle" x="346.1" y="-539.88" font-family="Times,serif" font-size="14.00">3</text> +<polyline fill="none" stroke="black" points="359.1,-532.08 359.1,-555.08 "/> +<text text-anchor="middle" x="372.1" y="-539.88" font-family="Times,serif" font-size="14.00">4</text> +<polyline fill="none" stroke="black" points="385.1,-532.08 385.1,-555.08 "/> +<text text-anchor="middle" x="398.1" y="-539.88" font-family="Times,serif" font-size="14.00">5</text> +<polyline fill="none" stroke="black" points="411.1,-532.08 411.1,-555.08 "/> +<text text-anchor="middle" x="424.1" y="-539.88" font-family="Times,serif" font-size="14.00">6</text> +<polyline fill="none" stroke="black" points="437.1,-532.08 437.1,-555.08 "/> +<text text-anchor="middle" x="450.1" y="-539.88" font-family="Times,serif" font-size="14.00">7</text> +<polyline fill="none" stroke="black" points="463.1,-532.08 463.1,-555.08 "/> +<text text-anchor="middle" x="476.1" y="-539.88" font-family="Times,serif" font-size="14.00">8</text> +</g> +<!-- n0000008b->n00000021 --> +<g id="edge9" class="edge"> +<title>n0000008b:port1->n00000021</title> +<path fill="none" stroke="black" d="M281.1,-543.08C281.1,-543.08 240.1,-560.51 205.94,-570.26 205.35,-570.43 204.77,-570.59 204.18,-570.76"/> +<polygon fill="black" stroke="black" points="203.2,-567.39 194.47,-573.39 205.03,-574.15 203.2,-567.39"/> +</g> +<!-- n0000008b->n00000025 --> +<g id="edge10" class="edge"> +<title>n0000008b:port2->n00000025</title> +<path fill="none" stroke="black" d="M320.2,-532.07C320.2,-532.07 456.9,-514.37 492.3,-528.88 528.42,-543.68 522.86,-571.78 556.11,-594.53"/> +<polygon fill="black" stroke="black" points="554.54,-597.67 564.9,-599.88 558.18,-591.69 554.54,-597.67"/> +</g> +<!-- n0000008b->n00000029 --> +<g id="edge11" class="edge"> +<title>n0000008b:port3->n00000029</title> +<path fill="none" stroke="black" stroke-dasharray="5,2" d="M346.1,-532.08C346.1,-532.08 333.93,-527.96 318.37,-522.71"/> +<polygon fill="black" stroke="black" points="319.48,-519.39 308.88,-519.5 317.24,-526.02 319.48,-519.39"/> +</g> +<!-- n0000008b->n0000002d --> +<g id="edge12" class="edge"> +<title>n0000008b:port4->n0000002d</title> +<path fill="none" stroke="black" stroke-dasharray="5,2" d="M372.19,-532.05C372.19,-532.05 292.97,-514.3 277.9,-528.88 249.01,-556.8 253.16,-587.62 277.9,-619.28 278.34,-619.85 280.33,-620.69 283.45,-621.71"/> +<polygon fill="black" stroke="black" points="282.71,-625.14 293.29,-624.58 284.67,-618.42 282.71,-625.14"/> +</g> +<!-- n0000008b->n00000031 --> +<g id="edge13" class="edge"> +<title>n0000008b:port5->n00000031</title> +<path fill="none" stroke="black" stroke-dasharray="5,2" d="M398,-532.05C398,-532.05 476.28,-515.34 492.3,-528.88 568.49,-593.29 550.55,-729.67 538.14,-789.41"/> +<polygon fill="black" stroke="black" points="534.69,-788.79 535.99,-799.31 541.53,-790.28 534.69,-788.79"/> +</g> +<!-- n0000008b->n00000035 --> +<g id="edge14" class="edge"> +<title>n0000008b:port6->n00000035</title> +<path fill="none" stroke="black" stroke-dasharray="5,2" d="M424,-532.07C424,-532.07 290.37,-518.48 277.9,-528.88 202.27,-591.86 215.34,-725.69 225.66,-785.15"/> +<polygon fill="black" stroke="black" points="222.29,-786.14 227.54,-795.35 229.17,-784.88 222.29,-786.14"/> +</g> +<!-- n0000008b->n00000039 --> +<g id="edge15" class="edge"> +<title>n0000008b:port7->n00000039</title> +<path fill="none" stroke="black" stroke-dasharray="5,2" d="M450.1,-532.08C450.1,-532.08 395.22,-528.13 383.45,-518.65 375.46,-512.21 322.64,-385.46 298.76,-327.47"/> +<polygon fill="black" stroke="black" points="301.96,-326.05 294.92,-318.14 295.49,-328.72 301.96,-326.05"/> +</g> +<!-- n0000008b->n0000003d --> +<g id="edge16" class="edge"> +<title>n0000008b:port8->n0000003d</title> +<path fill="none" stroke="black" stroke-dasharray="5,2" d="M476.1,-532.08C476.1,-532.08 522.37,-522.39 526.85,-518.65 563.15,-488.33 581.38,-434.52 589.6,-401.2"/> +<polygon fill="black" stroke="black" points="593.08,-401.69 591.93,-391.16 586.26,-400.11 593.08,-401.69"/> +</g> +<!-- n00000095 --> +<g id="node35" class="node"> +<title>n00000095</title> +<path fill="green" stroke="black" d="M301.38,-1180.11C301.38,-1180.11 485.38,-1180.11 485.38,-1180.11 491.38,-1180.11 497.38,-1186.11 497.38,-1192.11 497.38,-1192.11 497.38,-1252.11 497.38,-1252.11 497.38,-1258.11 491.38,-1264.11 485.38,-1264.11 485.38,-1264.11 301.38,-1264.11 301.38,-1264.11 295.38,-1264.11 289.38,-1258.11 289.38,-1252.11 289.38,-1252.11 289.38,-1192.11 289.38,-1192.11 289.38,-1186.11 295.38,-1180.11 301.38,-1180.11"/> +<text text-anchor="middle" x="393.38" y="-1248.91" font-family="Times,serif" font-size="14.00">0</text> +<polyline fill="none" stroke="black" points="289.38,-1241.11 497.38,-1241.11 "/> +<text text-anchor="middle" x="393.38" y="-1225.91" font-family="Times,serif" font-size="14.00">Intel IPU6 CSI2 2</text> +<text text-anchor="middle" x="393.38" y="-1210.91" font-family="Times,serif" font-size="14.00">/dev/v4l-subdev2</text> +<polyline fill="none" stroke="black" points="289.38,-1203.11 497.38,-1203.11 "/> +<text text-anchor="middle" x="302.38" y="-1187.91" font-family="Times,serif" font-size="14.00">1</text> +<polyline fill="none" stroke="black" points="315.38,-1180.11 315.38,-1203.11 "/> +<text text-anchor="middle" x="328.38" y="-1187.91" font-family="Times,serif" font-size="14.00">2</text> +<polyline fill="none" stroke="black" points="341.38,-1180.11 341.38,-1203.11 "/> +<text text-anchor="middle" x="354.38" y="-1187.91" font-family="Times,serif" font-size="14.00">3</text> +<polyline fill="none" stroke="black" points="367.38,-1180.11 367.38,-1203.11 "/> +<text text-anchor="middle" x="380.38" y="-1187.91" font-family="Times,serif" font-size="14.00">4</text> +<polyline fill="none" stroke="black" points="393.38,-1180.11 393.38,-1203.11 "/> +<text text-anchor="middle" x="406.38" y="-1187.91" font-family="Times,serif" font-size="14.00">5</text> +<polyline fill="none" stroke="black" points="419.38,-1180.11 419.38,-1203.11 "/> +<text text-anchor="middle" x="432.38" y="-1187.91" font-family="Times,serif" font-size="14.00">6</text> +<polyline fill="none" stroke="black" points="445.38,-1180.11 445.38,-1203.11 "/> +<text text-anchor="middle" x="458.38" y="-1187.91" font-family="Times,serif" font-size="14.00">7</text> +<polyline fill="none" stroke="black" points="471.38,-1180.11 471.38,-1203.11 "/> +<text text-anchor="middle" x="484.38" y="-1187.91" font-family="Times,serif" font-size="14.00">8</text> +</g> +<!-- n00000095->n00000041 --> +<g id="edge17" class="edge"> +<title>n00000095:port1->n00000041</title> +<path fill="none" stroke="black" stroke-dasharray="5,2" d="M289.38,-1191.11C289.38,-1191.11 258.94,-1194.22 222.89,-1197.91"/> +<polygon fill="black" stroke="black" points="222.19,-1194.46 212.6,-1198.96 222.9,-1201.42 222.19,-1194.46"/> +</g> +<!-- n00000095->n00000045 --> +<g id="edge18" class="edge"> +<title>n00000095:port2->n00000045</title> +<path fill="none" stroke="black" stroke-dasharray="5,2" d="M328.48,-1180.11C328.48,-1180.11 463.26,-1168.53 500.58,-1176.91 534.02,-1184.43 537.24,-1200.06 569.66,-1211.18 569.76,-1211.22 569.86,-1211.25 569.96,-1211.29"/> +<polygon fill="black" stroke="black" points="568.86,-1214.61 579.45,-1214.34 571,-1207.95 568.86,-1214.61"/> +</g> +<!-- n00000095->n00000049 --> +<g id="edge19" class="edge"> +<title>n00000095:port3->n00000049</title> +<path fill="none" stroke="black" stroke-dasharray="5,2" d="M354.38,-1180.11C354.38,-1180.11 356.8,-1178.46 360.49,-1175.94"/> +<polygon fill="black" stroke="black" points="362.56,-1178.77 368.86,-1170.24 358.62,-1172.98 362.56,-1178.77"/> +</g> +<!-- n00000095->n0000004d --> +<g id="edge20" class="edge"> +<title>n00000095:port4->n0000004d</title> +<path fill="none" stroke="black" stroke-dasharray="5,2" d="M380.47,-1180.09C380.47,-1180.09 293.71,-1169.63 286.18,-1176.91 257.29,-1204.84 262.63,-1234.76 286.18,-1267.31 288.16,-1270.05 297.33,-1273.96 309.38,-1278.13"/> +<polygon fill="black" stroke="black" points="308.49,-1281.53 319.09,-1281.36 310.7,-1274.88 308.49,-1281.53"/> +</g> +<!-- n00000095->n00000051 --> +<g id="edge21" class="edge"> +<title>n00000095:port5->n00000051</title> +<path fill="none" stroke="black" stroke-dasharray="5,2" d="M406.28,-1180.09C406.28,-1180.09 492.13,-1170.7 500.58,-1176.91 576.41,-1232.66 579.83,-1358.79 577.09,-1416.2"/> +<polygon fill="black" stroke="black" points="573.59,-1416.23 576.51,-1426.41 580.58,-1416.63 573.59,-1416.23"/> +</g> +<!-- n00000095->n00000055 --> +<g id="edge22" class="edge"> +<title>n00000095:port6->n00000055</title> +<path fill="none" stroke="black" stroke-dasharray="5,2" d="M432.28,-1180.11C432.28,-1180.11 292.85,-1172.29 286.18,-1176.91 211.26,-1228.86 198.3,-1348.49 196.45,-1404.12"/> +<polygon fill="black" stroke="black" points="192.94,-1404.28 196.21,-1414.36 199.94,-1404.44 192.94,-1404.28"/> +</g> +<!-- n00000095->n00000059 --> +<g id="edge23" class="edge"> +<title>n00000095:port7->n00000059</title> +<path fill="none" stroke="black" stroke-dasharray="5,2" d="M458.38,-1180.11C458.38,-1180.11 291.84,-1175.85 287.94,-1173.16 239.87,-1139.96 222.85,-1068.83 216.94,-1028.6"/> +<polygon fill="black" stroke="black" points="220.39,-1028.06 215.6,-1018.61 213.46,-1028.98 220.39,-1028.06"/> +</g> +<!-- n00000095->n0000005d --> +<g id="edge24" class="edge"> +<title>n00000095:port8->n0000005d</title> +<path fill="none" stroke="black" stroke-dasharray="5,2" d="M484.38,-1180.11C484.38,-1180.11 502.45,-1176.49 506.34,-1173.16 547.25,-1138.2 569.47,-1077.38 579.62,-1041.41"/> +<polygon fill="black" stroke="black" points="583.06,-1042.09 582.28,-1031.53 576.3,-1040.27 583.06,-1042.09"/> +</g> +<!-- n0000009f --> +<g id="node36" class="node"> +<title>n0000009f</title> +<path fill="green" stroke="black" d="M1211.38,-200.11C1211.38,-200.11 1395.38,-200.11 1395.38,-200.11 1401.38,-200.11 1407.38,-206.11 1407.38,-212.11 1407.38,-212.11 1407.38,-272.11 1407.38,-272.11 1407.38,-278.11 1401.38,-284.11 1395.38,-284.11 1395.38,-284.11 1211.38,-284.11 1211.38,-284.11 1205.38,-284.11 1199.38,-278.11 1199.38,-272.11 1199.38,-272.11 1199.38,-212.11 1199.38,-212.11 1199.38,-206.11 1205.38,-200.11 1211.38,-200.11"/> +<text text-anchor="middle" x="1303.38" y="-268.91" font-family="Times,serif" font-size="14.00">0</text> +<polyline fill="none" stroke="black" points="1199.38,-261.11 1407.38,-261.11 "/> +<text text-anchor="middle" x="1303.38" y="-245.91" font-family="Times,serif" font-size="14.00">Intel IPU6 CSI2 3</text> +<text text-anchor="middle" x="1303.38" y="-230.91" font-family="Times,serif" font-size="14.00">/dev/v4l-subdev3</text> +<polyline fill="none" stroke="black" points="1199.38,-223.11 1407.38,-223.11 "/> +<text text-anchor="middle" x="1212.38" y="-207.91" font-family="Times,serif" font-size="14.00">1</text> +<polyline fill="none" stroke="black" points="1225.38,-200.11 1225.38,-223.11 "/> +<text text-anchor="middle" x="1238.38" y="-207.91" font-family="Times,serif" font-size="14.00">2</text> +<polyline fill="none" stroke="black" points="1251.38,-200.11 1251.38,-223.11 "/> +<text text-anchor="middle" x="1264.38" y="-207.91" font-family="Times,serif" font-size="14.00">3</text> +<polyline fill="none" stroke="black" points="1277.38,-200.11 1277.38,-223.11 "/> +<text text-anchor="middle" x="1290.38" y="-207.91" font-family="Times,serif" font-size="14.00">4</text> +<polyline fill="none" stroke="black" points="1303.38,-200.11 1303.38,-223.11 "/> +<text text-anchor="middle" x="1316.38" y="-207.91" font-family="Times,serif" font-size="14.00">5</text> +<polyline fill="none" stroke="black" points="1329.38,-200.11 1329.38,-223.11 "/> +<text text-anchor="middle" x="1342.38" y="-207.91" font-family="Times,serif" font-size="14.00">6</text> +<polyline fill="none" stroke="black" points="1355.38,-200.11 1355.38,-223.11 "/> +<text text-anchor="middle" x="1368.38" y="-207.91" font-family="Times,serif" font-size="14.00">7</text> +<polyline fill="none" stroke="black" points="1381.38,-200.11 1381.38,-223.11 "/> +<text text-anchor="middle" x="1394.38" y="-207.91" font-family="Times,serif" font-size="14.00">8</text> +</g> +<!-- n0000009f->n00000061 --> +<g id="edge25" class="edge"> +<title>n0000009f:port1->n00000061</title> +<path fill="none" stroke="black" stroke-dasharray="5,2" d="M1199.38,-211.11C1199.38,-211.11 1168.94,-214.22 1132.89,-217.91"/> +<polygon fill="black" stroke="black" points="1132.19,-214.46 1122.6,-218.96 1132.9,-221.42 1132.19,-214.46"/> +</g> +<!-- n0000009f->n00000065 --> +<g id="edge26" class="edge"> +<title>n0000009f:port2->n00000065</title> +<path fill="none" stroke="black" stroke-dasharray="5,2" d="M1238.48,-200.11C1238.48,-200.11 1373.26,-188.53 1410.58,-196.91 1444.02,-204.43 1447.24,-220.06 1479.66,-231.18 1479.76,-231.22 1479.86,-231.25 1479.96,-231.29"/> +<polygon fill="black" stroke="black" points="1478.86,-234.61 1489.45,-234.34 1481,-227.95 1478.86,-234.61"/> +</g> +<!-- n0000009f->n00000069 --> +<g id="edge27" class="edge"> +<title>n0000009f:port3->n00000069</title> +<path fill="none" stroke="black" stroke-dasharray="5,2" d="M1264.38,-200.11C1264.38,-200.11 1266.8,-198.46 1270.49,-195.94"/> +<polygon fill="black" stroke="black" points="1272.56,-198.77 1278.86,-190.24 1268.62,-192.98 1272.56,-198.77"/> +</g> +<!-- n0000009f->n0000006d --> +<g id="edge28" class="edge"> +<title>n0000009f:port4->n0000006d</title> +<path fill="none" stroke="black" stroke-dasharray="5,2" d="M1290.47,-200.09C1290.47,-200.09 1203.71,-189.63 1196.18,-196.91 1167.29,-224.84 1172.63,-254.76 1196.18,-287.31 1198.16,-290.05 1207.33,-293.96 1219.38,-298.13"/> +<polygon fill="black" stroke="black" points="1218.49,-301.53 1229.09,-301.36 1220.7,-294.88 1218.49,-301.53"/> +</g> +<!-- n0000009f->n00000071 --> +<g id="edge29" class="edge"> +<title>n0000009f:port5->n00000071</title> +<path fill="none" stroke="black" stroke-dasharray="5,2" d="M1316.28,-200.09C1316.28,-200.09 1402.13,-190.7 1410.58,-196.91 1486.41,-252.66 1489.83,-378.79 1487.09,-436.2"/> +<polygon fill="black" stroke="black" points="1483.59,-436.23 1486.51,-446.41 1490.58,-436.63 1483.59,-436.23"/> +</g> +<!-- n0000009f->n00000075 --> +<g id="edge30" class="edge"> +<title>n0000009f:port6->n00000075</title> +<path fill="none" stroke="black" stroke-dasharray="5,2" d="M1342.28,-200.11C1342.28,-200.11 1202.85,-192.29 1196.18,-196.91 1121.26,-248.86 1108.3,-368.49 1106.45,-424.12"/> +<polygon fill="black" stroke="black" points="1102.94,-424.28 1106.21,-434.36 1109.94,-424.44 1102.94,-424.28"/> +</g> +<!-- n0000009f->n00000079 --> +<g id="edge31" class="edge"> +<title>n0000009f:port7->n00000079</title> +<path fill="none" stroke="black" stroke-dasharray="5,2" d="M1368.38,-200.11C1368.38,-200.11 1201.84,-195.85 1197.94,-193.16 1149.87,-159.96 1132.85,-88.83 1126.94,-48.6"/> +<polygon fill="black" stroke="black" points="1130.39,-48.06 1125.6,-38.61 1123.46,-48.98 1130.39,-48.06"/> +</g> +<!-- n0000009f->n0000007d --> +<g id="edge32" class="edge"> +<title>n0000009f:port8->n0000007d</title> +<path fill="none" stroke="black" stroke-dasharray="5,2" d="M1394.38,-200.11C1394.38,-200.11 1412.45,-196.49 1416.34,-193.16 1457.25,-158.2 1479.47,-97.38 1489.62,-61.41"/> +<polygon fill="black" stroke="black" points="1493.06,-62.09 1492.28,-51.53 1486.3,-60.27 1493.06,-62.09"/> +</g> +<!-- n000000e9 --> +<g id="node37" class="node"> +<title>n000000e9</title> +<path fill="green" stroke="black" d="M398.65,-431.45C398.65,-431.45 511.65,-431.45 511.65,-431.45 517.65,-431.45 523.65,-437.45 523.65,-443.45 523.65,-443.45 523.65,-503.45 523.65,-503.45 523.65,-509.45 517.65,-515.45 511.65,-515.45 511.65,-515.45 398.65,-515.45 398.65,-515.45 392.65,-515.45 386.65,-509.45 386.65,-503.45 386.65,-503.45 386.65,-443.45 386.65,-443.45 386.65,-437.45 392.65,-431.45 398.65,-431.45"/> +<text text-anchor="middle" x="420.65" y="-500.25" font-family="Times,serif" font-size="14.00">1</text> +<polyline fill="none" stroke="black" points="454.65,-492.45 454.65,-515.45 "/> +<text text-anchor="middle" x="489.15" y="-500.25" font-family="Times,serif" font-size="14.00">2</text> +<polyline fill="none" stroke="black" points="386.65,-492.45 523.65,-492.45 "/> +<text text-anchor="middle" x="455.15" y="-477.25" font-family="Times,serif" font-size="14.00">ov2740 4-0036</text> +<text text-anchor="middle" x="455.15" y="-462.25" font-family="Times,serif" font-size="14.00">/dev/v4l-subdev4</text> +<polyline fill="none" stroke="black" points="386.65,-454.45 523.65,-454.45 "/> +<text text-anchor="middle" x="455.15" y="-439.25" font-family="Times,serif" font-size="14.00">0</text> +</g> +<!-- n000000e9->n0000008b --> +<g id="edge33" class="edge"> +<title>n000000e9:port0->n0000008b:port0</title> +<path fill="none" stroke="black" stroke-width="2" d="M386.14,-442.55C386.14,-442.55 361.11,-493.23 383.45,-518.65 391.47,-527.78 484.31,-519.72 492.3,-528.88 508.64,-547.6 499.26,-579.87 493.12,-595.68"/> +<polygon fill="black" stroke="black" stroke-width="2" points="489.86,-594.41 489.11,-604.98 496.29,-597.19 489.86,-594.41"/> +</g> +</g> +</svg> diff --git a/Documentation/admin-guide/media/mgb4.rst b/Documentation/admin-guide/media/mgb4.rst index 2977f74d7e26..b9da127c074d 100644 --- a/Documentation/admin-guide/media/mgb4.rst +++ b/Documentation/admin-guide/media/mgb4.rst @@ -1,8 +1,10 @@ .. SPDX-License-Identifier: GPL-2.0 -==================== -mgb4 sysfs interface -==================== +The mgb4 driver +=============== + +sysfs interface +--------------- The mgb4 driver provides a sysfs interface, that is used to configure video stream related parameters (some of them must be set properly before the v4l2 @@ -12,9 +14,8 @@ There are two types of parameters - global / PCI card related, found under ``/sys/class/video4linux/videoX/device`` and module specific found under ``/sys/class/video4linux/videoX``. - Global (PCI card) parameters -============================ +~~~~~~~~~~~~~~~~~~~~~~~~~~~~ **module_type** (R): Module type. @@ -42,9 +43,8 @@ Global (PCI card) parameters where each component is a 8b number. - Common FPDL3/GMSL input parameters -================================== +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ **input_id** (R): Input number ID, zero based. @@ -190,9 +190,8 @@ Common FPDL3/GMSL input parameters *Note: This parameter can not be changed while the input v4l2 device is open.* - Common FPDL3/GMSL output parameters -=================================== +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ **output_id** (R): Output number ID, zero based. @@ -228,8 +227,13 @@ Common FPDL3/GMSL output parameters open.* **frame_rate** (RW): - Output video frame rate in frames per second. The default frame rate is - 60Hz. + Output video signal frame rate limit in frames per second. Due to + the limited output pixel clock steps, the card can not always generate + a frame rate perfectly matching the value required by the connected display. + Using this parameter one can limit the frame rate by "crippling" the signal + so that the lines are not equal (the porches of the last line differ) but + the signal appears like having the exact frame rate to the connected display. + The default frame rate limit is 60Hz. **hsync_polarity** (RW): HSYNC signal polarity. @@ -254,37 +258,36 @@ Common FPDL3/GMSL output parameters and there is a non-linear stepping between two consecutive allowed frequencies. The driver finds the nearest allowed frequency to the given value and sets it. When reading this property, you get the exact - frequency set by the driver. The default frequency is 70000kHz. + frequency set by the driver. The default frequency is 61150kHz. *Note: This parameter can not be changed while the output v4l2 device is open.* **hsync_width** (RW): - Width of the HSYNC signal in pixels. The default value is 16. + Width of the HSYNC signal in pixels. The default value is 40. **vsync_width** (RW): - Width of the VSYNC signal in video lines. The default value is 2. + Width of the VSYNC signal in video lines. The default value is 20. **hback_porch** (RW): Number of PCLK pulses between deassertion of the HSYNC signal and the first - valid pixel in the video line (marked by DE=1). The default value is 32. + valid pixel in the video line (marked by DE=1). The default value is 50. **hfront_porch** (RW): Number of PCLK pulses between the end of the last valid pixel in the video line (marked by DE=1) and assertion of the HSYNC signal. The default value - is 32. + is 50. **vback_porch** (RW): Number of video lines between deassertion of the VSYNC signal and the video - line with the first valid pixel (marked by DE=1). The default value is 2. + line with the first valid pixel (marked by DE=1). The default value is 31. **vfront_porch** (RW): Number of video lines between the end of the last valid pixel line (marked - by DE=1) and assertion of the VSYNC signal. The default value is 2. - + by DE=1) and assertion of the VSYNC signal. The default value is 30. FPDL3 specific input parameters -=============================== +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ **fpdl3_input_width** (RW): Number of deserializer input lines. @@ -294,7 +297,7 @@ FPDL3 specific input parameters | 2 - dual FPDL3 specific output parameters -================================ +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ **fpdl3_output_width** (RW): Number of serializer output lines. @@ -304,7 +307,7 @@ FPDL3 specific output parameters | 2 - dual GMSL specific input parameters -============================== +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ **gmsl_mode** (RW): GMSL speed mode. @@ -328,10 +331,8 @@ GMSL specific input parameters | 0 - disabled | 1 - enabled (default) - -==================== -mgb4 mtd partitions -==================== +MTD partitions +-------------- The mgb4 driver creates a MTD device with two partitions: - mgb4-fw.X - FPGA firmware. @@ -344,9 +345,8 @@ also have a third partition named *mgb4-flash* available in the system. This partition represents the whole, unpartitioned, card's FLASH memory and one should not fiddle with it... -==================== -mgb4 iio (triggers) -==================== +IIO (triggers) +-------------- The mgb4 driver creates an Industrial I/O (IIO) device that provides trigger and signal level status capability. The following scan elements are available: diff --git a/Documentation/admin-guide/media/omap4_camera.rst b/Documentation/admin-guide/media/omap4_camera.rst deleted file mode 100644 index 2ada9b1e6897..000000000000 --- a/Documentation/admin-guide/media/omap4_camera.rst +++ /dev/null @@ -1,62 +0,0 @@ -.. SPDX-License-Identifier: GPL-2.0 - -OMAP4 ISS Driver -================ - -Author: Sergio Aguirre <sergio.a.aguirre@gmail.com> - -Copyright (C) 2012, Texas Instruments - -Introduction ------------- - -The OMAP44XX family of chips contains the Imaging SubSystem (a.k.a. ISS), -Which contains several components that can be categorized in 3 big groups: - -- Interfaces (2 Interfaces: CSI2-A & CSI2-B/CCP2) -- ISP (Image Signal Processor) -- SIMCOP (Still Image Coprocessor) - -For more information, please look in [#f1]_ for latest version of: -"OMAP4430 Multimedia Device Silicon Revision 2.x" - -As of Revision AB, the ISS is described in detail in section 8. - -This driver is supporting **only** the CSI2-A/B interfaces for now. - -It makes use of the Media Controller framework [#f2]_, and inherited most of the -code from OMAP3 ISP driver (found under drivers/media/platform/ti/omap3isp/\*), -except that it doesn't need an IOMMU now for ISS buffers memory mapping. - -Supports usage of MMAP buffers only (for now). - -Tested platforms ----------------- - -- OMAP4430SDP, w/ ES2.1 GP & SEVM4430-CAM-V1-0 (Contains IMX060 & OV5640, in - which only the last one is supported, outputting YUV422 frames). - -- TI Blaze MDP, w/ OMAP4430 ES2.2 EMU (Contains 1 IMX060 & 2 OV5650 sensors, in - which only the OV5650 are supported, outputting RAW10 frames). - -- PandaBoard, Rev. A2, w/ OMAP4430 ES2.1 GP & OV adapter board, tested with - following sensors: - * OV5640 - * OV5650 - -- Tested on mainline kernel: - - http://git.kernel.org/?p=linux/kernel/git/torvalds/linux.git;a=summary - - Tag: v3.3 (commit c16fa4f2ad19908a47c63d8fa436a1178438c7e7) - -File list ---------- -drivers/staging/media/omap4iss/ -include/linux/platform_data/media/omap4iss.h - -References ----------- - -.. [#f1] http://focus.ti.com/general/docs/wtbu/wtbudocumentcenter.tsp?navigationId=12037&templateId=6123#62 -.. [#f2] http://lwn.net/Articles/420485/ diff --git a/Documentation/admin-guide/media/raspberrypi-pisp-be.dot b/Documentation/admin-guide/media/raspberrypi-pisp-be.dot new file mode 100644 index 000000000000..55671dc1d443 --- /dev/null +++ b/Documentation/admin-guide/media/raspberrypi-pisp-be.dot @@ -0,0 +1,20 @@ +digraph board { + rankdir=TB + n00000001 [label="{{<port0> 0 | <port1> 1 | <port2> 2 | <port7> 7} | pispbe\n | {<port3> 3 | <port4> 4 | <port5> 5 | <port6> 6}}", shape=Mrecord, style=filled, fillcolor=green] + n00000001:port3 -> n0000001c [style=bold] + n00000001:port4 -> n00000022 [style=bold] + n00000001:port5 -> n00000028 [style=bold] + n00000001:port6 -> n0000002e [style=bold] + n0000000a [label="pispbe-input\n/dev/video0", shape=box, style=filled, fillcolor=yellow] + n0000000a -> n00000001:port0 [style=bold] + n00000010 [label="pispbe-tdn_input\n/dev/video1", shape=box, style=filled, fillcolor=yellow] + n00000010 -> n00000001:port1 [style=bold] + n00000016 [label="pispbe-stitch_input\n/dev/video2", shape=box, style=filled, fillcolor=yellow] + n00000016 -> n00000001:port2 [style=bold] + n0000001c [label="pispbe-output0\n/dev/video3", shape=box, style=filled, fillcolor=yellow] + n00000022 [label="pispbe-output1\n/dev/video4", shape=box, style=filled, fillcolor=yellow] + n00000028 [label="pispbe-tdn_output\n/dev/video5", shape=box, style=filled, fillcolor=yellow] + n0000002e [label="pispbe-stitch_output\n/dev/video6", shape=box, style=filled, fillcolor=yellow] + n00000034 [label="pispbe-config\n/dev/video7", shape=box, style=filled, fillcolor=yellow] + n00000034 -> n00000001:port7 [style=bold] +} diff --git a/Documentation/admin-guide/media/raspberrypi-pisp-be.rst b/Documentation/admin-guide/media/raspberrypi-pisp-be.rst new file mode 100644 index 000000000000..0fcf46f26276 --- /dev/null +++ b/Documentation/admin-guide/media/raspberrypi-pisp-be.rst @@ -0,0 +1,109 @@ +.. SPDX-License-Identifier: GPL-2.0 + +========================================================= +Raspberry Pi PiSP Back End Memory-to-Memory ISP (pisp-be) +========================================================= + +The PiSP Back End +================= + +The PiSP Back End is a memory-to-memory Image Signal Processor (ISP) which reads +image data from DRAM memory and performs image processing as specified by the +application through the parameters in a configuration buffer, before writing +pixel data back to memory through two distinct output channels. + +The ISP registers and programming model are documented in the `Raspberry Pi +Image Signal Processor (PiSP) Specification document`_ + +The PiSP Back End ISP processes images in tiles. The handling of image +tessellation and the computation of low-level configuration parameters is +realized by a free software library called `libpisp +<https://github.com/raspberrypi/libpisp>`_. + +The full image processing pipeline, which involves capturing RAW Bayer data from +an image sensor through a MIPI CSI-2 compatible capture interface, storing them +in DRAM memory and processing them in the PiSP Back End to obtain images usable +by an application is implemented in `libcamera <https://libcamera.org>`_ as +part of the Raspberry Pi platform support. + +The pisp-be driver +================== + +The Raspberry Pi PiSP Back End (pisp-be) driver is located under +drivers/media/platform/raspberrypi/pisp-be. It uses the `V4L2 API` to register +a number of video capture and output devices, the `V4L2 subdev API` to register +a subdevice for the ISP that connects the video devices in a single media graph +realized using the `Media Controller (MC) API`. + +The media topology registered by the `pisp-be` driver is represented below: + +.. _pips-be-topology: + +.. kernel-figure:: raspberrypi-pisp-be.dot + :alt: Diagram of the default media pipeline topology + :align: center + + +The media graph registers the following video device nodes: + +- pispbe-input: output device for images to be submitted to the ISP for + processing. +- pispbe-tdn_input: output device for temporal denoise. +- pispbe-stitch_input: output device for image stitching (HDR). +- pispbe-output0: first capture device for processed images. +- pispbe-output1: second capture device for processed images. +- pispbe-tdn_output: capture device for temporal denoise. +- pispbe-stitch_output: capture device for image stitching (HDR). +- pispbe-config: output device for ISP configuration parameters. + +pispbe-input +------------ + +Images to be processed by the ISP are queued to the `pispbe-input` output device +node. For a list of image formats supported as input to the ISP refer to the +`Raspberry Pi Image Signal Processor (PiSP) Specification document`_. + +pispbe-tdn_input, pispbe-tdn_output +----------------------------------- + +The `pispbe-tdn_input` output video device receives images to be processed by +the temporal denoise block which are captured from the `pispbe-tdn_output` +capture video device. Userspace is responsible for maintaining queues on both +devices, and ensuring that buffers completed on the output are queued to the +input. + +pispbe-stitch_input, pispbe-stitch_output +----------------------------------------- + +To realize HDR (high dynamic range) image processing the image stitching and +tonemapping blocks are used. The `pispbe-stitch_output` writes images to memory +and the `pispbe-stitch_input` receives the previously written frame to process +it along with the current input image. Userspace is responsible for maintaining +queues on both devices, and ensuring that buffers completed on the output are +queued to the input. + +pispbe-output0, pispbe-output1 +------------------------------ + +The two capture devices write to memory the pixel data as processed by the ISP. + +pispbe-config +------------- + +The `pispbe-config` output video devices receives a buffer of configuration +parameters that define the desired image processing to be performed by the ISP. + +The format of the ISP configuration parameter is defined by +:c:type:`pisp_be_tiles_config` C structure and the meaning of each parameter is +described in the `Raspberry Pi Image Signal Processor (PiSP) Specification +document`_. + +ISP configuration +================= + +The ISP configuration is described solely by the content of the parameters +buffer. The only parameter that userspace needs to configure using the V4L2 API +is the image format on the output and capture video devices for validation of +the content of the parameters buffer. + +.. _Raspberry Pi Image Signal Processor (PiSP) Specification document: https://datasheets.raspberrypi.com/camera/raspberry-pi-image-signal-processor-specification.pdf diff --git a/Documentation/admin-guide/media/raspberrypi-rp1-cfe.dot b/Documentation/admin-guide/media/raspberrypi-rp1-cfe.dot new file mode 100644 index 000000000000..7717f2291049 --- /dev/null +++ b/Documentation/admin-guide/media/raspberrypi-rp1-cfe.dot @@ -0,0 +1,27 @@ +digraph board { + rankdir=TB + n00000001 [label="{{<port0> 0} | csi2\n/dev/v4l-subdev0 | {<port1> 1 | <port2> 2 | <port3> 3 | <port4> 4}}", shape=Mrecord, style=filled, fillcolor=green] + n00000001:port1 -> n00000011 [style=dashed] + n00000001:port1 -> n00000007:port0 + n00000001:port2 -> n00000015 + n00000001:port2 -> n00000007:port0 [style=dashed] + n00000001:port3 -> n00000019 [style=dashed] + n00000001:port3 -> n00000007:port0 [style=dashed] + n00000001:port4 -> n0000001d [style=dashed] + n00000001:port4 -> n00000007:port0 [style=dashed] + n00000007 [label="{{<port0> 0 | <port1> 1} | pisp-fe\n/dev/v4l-subdev1 | {<port2> 2 | <port3> 3 | <port4> 4}}", shape=Mrecord, style=filled, fillcolor=green] + n00000007:port2 -> n00000021 + n00000007:port3 -> n00000025 [style=dashed] + n00000007:port4 -> n00000029 + n0000000d [label="{imx219 6-0010\n/dev/v4l-subdev2 | {<port0> 0}}", shape=Mrecord, style=filled, fillcolor=green] + n0000000d:port0 -> n00000001:port0 [style=bold] + n00000011 [label="rp1-cfe-csi2-ch0\n/dev/video0", shape=box, style=filled, fillcolor=yellow] + n00000015 [label="rp1-cfe-csi2-ch1\n/dev/video1", shape=box, style=filled, fillcolor=yellow] + n00000019 [label="rp1-cfe-csi2-ch2\n/dev/video2", shape=box, style=filled, fillcolor=yellow] + n0000001d [label="rp1-cfe-csi2-ch3\n/dev/video3", shape=box, style=filled, fillcolor=yellow] + n00000021 [label="rp1-cfe-fe-image0\n/dev/video4", shape=box, style=filled, fillcolor=yellow] + n00000025 [label="rp1-cfe-fe-image1\n/dev/video5", shape=box, style=filled, fillcolor=yellow] + n00000029 [label="rp1-cfe-fe-stats\n/dev/video6", shape=box, style=filled, fillcolor=yellow] + n0000002d [label="rp1-cfe-fe-config\n/dev/video7", shape=box, style=filled, fillcolor=yellow] + n0000002d -> n00000007:port1 +} diff --git a/Documentation/admin-guide/media/raspberrypi-rp1-cfe.rst b/Documentation/admin-guide/media/raspberrypi-rp1-cfe.rst new file mode 100644 index 000000000000..668d978a9875 --- /dev/null +++ b/Documentation/admin-guide/media/raspberrypi-rp1-cfe.rst @@ -0,0 +1,78 @@ +.. SPDX-License-Identifier: GPL-2.0 + +============================================ +Raspberry Pi PiSP Camera Front End (rp1-cfe) +============================================ + +The PiSP Camera Front End +========================= + +The PiSP Camera Front End (CFE) is a module which combines a CSI-2 receiver with +a simple ISP, called the Front End (FE). + +The CFE has four DMA engines and can write frames from four separate streams +received from the CSI-2 to the memory. One of those streams can also be routed +directly to the FE, which can do minimal image processing, write two versions +(e.g. non-scaled and downscaled versions) of the received frames to memory and +provide statistics of the received frames. + +The FE registers are documented in the `Raspberry Pi Image Signal Processor +(ISP) Specification document +<https://datasheets.raspberrypi.com/camera/raspberry-pi-image-signal-processor-specification.pdf>`_, +and example code for FE can be found in `libpisp +<https://github.com/raspberrypi/libpisp>`_. + +The rp1-cfe driver +================== + +The Raspberry Pi PiSP Camera Front End (rp1-cfe) driver is located under +drivers/media/platform/raspberrypi/rp1-cfe. It uses the `V4L2 API` to register +a number of video capture and output devices, the `V4L2 subdev API` to register +subdevices for the CSI-2 received and the FE that connects the video devices in +a single media graph realized using the `Media Controller (MC) API`. + +The media topology registered by the `rp1-cfe` driver, in this particular +example connected to an imx219 sensor, is the following one: + +.. _rp1-cfe-topology: + +.. kernel-figure:: raspberrypi-rp1-cfe.dot + :alt: Diagram of an example media pipeline topology + :align: center + +The media graph contains the following video device nodes: + +- rp1-cfe-csi2-ch0: capture device for the first CSI-2 stream +- rp1-cfe-csi2-ch1: capture device for the second CSI-2 stream +- rp1-cfe-csi2-ch2: capture device for the third CSI-2 stream +- rp1-cfe-csi2-ch3: capture device for the fourth CSI-2 stream +- rp1-cfe-fe-image0: capture device for the first FE output +- rp1-cfe-fe-image1: capture device for the second FE output +- rp1-cfe-fe-stats: capture device for the FE statistics +- rp1-cfe-fe-config: output device for FE configuration + +rp1-cfe-csi2-chX +---------------- + +The rp1-cfe-csi2-chX capture devices are normal V4L2 capture devices which +can be used to capture video frames or metadata received from the CSI-2. + +rp1-cfe-fe-image0, rp1-cfe-fe-image1 +------------------------------------ + +The rp1-cfe-fe-image0 and rp1-cfe-fe-image1 capture devices are used to write +the processed frames to memory. + +rp1-cfe-fe-stats +---------------- + +The format of the FE statistics buffer is defined by +:c:type:`pisp_statistics` C structure and the meaning of each parameter is +described in the `PiSP specification` document. + +rp1-cfe-fe-config +----------------- + +The format of the FE configuration buffer is defined by +:c:type:`pisp_fe_config` C structure and the meaning of each parameter is +described in the `PiSP specification` document. diff --git a/Documentation/admin-guide/media/rkisp1.rst b/Documentation/admin-guide/media/rkisp1.rst index 6f14d9561fa5..6c878c71442f 100644 --- a/Documentation/admin-guide/media/rkisp1.rst +++ b/Documentation/admin-guide/media/rkisp1.rst @@ -114,11 +114,18 @@ to be applied to the hardware during a video stream, allowing userspace to dynamically modify values such as black level, cross talk corrections and others. -The buffer format is defined by struct :c:type:`rkisp1_params_cfg`, and -userspace should set +The ISP driver supports two different parameters configuration methods, the +`fixed parameters format` or the `extensible parameters format`. + +When using the `fixed parameters` method the buffer format is defined by struct +:c:type:`rkisp1_params_cfg`, and userspace should set :ref:`V4L2_META_FMT_RK_ISP1_PARAMS <v4l2-meta-fmt-rk-isp1-params>` as the dataformat. +When using the `extensible parameters` method the buffer format is defined by +struct :c:type:`rkisp1_ext_params_cfg`, and userspace should set +:ref:`V4L2_META_FMT_RK_ISP1_EXT_PARAMS <v4l2-meta-fmt-rk-isp1-ext-params>` as +the dataformat. Capturing Video Frames Example ============================== diff --git a/Documentation/admin-guide/media/saa7134.rst b/Documentation/admin-guide/media/saa7134.rst index 51eae7eb5ab7..18d7cbc897db 100644 --- a/Documentation/admin-guide/media/saa7134.rst +++ b/Documentation/admin-guide/media/saa7134.rst @@ -67,7 +67,7 @@ Changes / Fixes Please mail to linux-media AT vger.kernel.org unified diffs against the linux media git tree: - https://git.linuxtv.org/media_tree.git/ + https://git.linuxtv.org/media.git/ This is done by committing a patch at a clone of the git tree and submitting the patch using ``git send-email``. Don't forget to diff --git a/Documentation/admin-guide/media/tuner-cardlist.rst b/Documentation/admin-guide/media/tuner-cardlist.rst index 362617c59c5d..65ecf48ddf24 100644 --- a/Documentation/admin-guide/media/tuner-cardlist.rst +++ b/Documentation/admin-guide/media/tuner-cardlist.rst @@ -97,4 +97,6 @@ Tuner number Card name 89 Sony BTF-PG472Z PAL/SECAM 90 Sony BTF-PK467Z NTSC-M-JP 91 Sony BTF-PB463Z NTSC-M +92 Silicon Labs Si2157 tuner +93 Tena TNF931D-DFDR1 ============ ===================================================== diff --git a/Documentation/admin-guide/media/v4l-drivers.rst b/Documentation/admin-guide/media/v4l-drivers.rst index f4bb2605f07e..e8761561b2fe 100644 --- a/Documentation/admin-guide/media/v4l-drivers.rst +++ b/Documentation/admin-guide/media/v4l-drivers.rst @@ -16,14 +16,16 @@ Video4Linux (V4L) driver-specific documentation imx imx7 ipu3 + ipu6-isys ivtv mgb4 omap3isp - omap4_camera philips qcom_camss + raspberrypi-pisp-be rcar-fdp1 rkisp1 + raspberrypi-rp1-cfe saa7134 si470x si4713 diff --git a/Documentation/admin-guide/media/vivid.rst b/Documentation/admin-guide/media/vivid.rst index b6f658c0997e..034ca7c77fb9 100644 --- a/Documentation/admin-guide/media/vivid.rst +++ b/Documentation/admin-guide/media/vivid.rst @@ -302,6 +302,15 @@ all configurable using the following module options: - 0: forbid hints - 1: allow hints +- supports_requests: + + specifies if the device should support the Request API. There are + three possible values, default is 1: + + - 0: no request + - 1: supports requests + - 2: requires requests + Taken together, all these module options allow you to precisely customize the driver behavior and test your application with all sorts of permutations. It is also very suitable to emulate hardware that is not yet available, e.g. @@ -313,13 +322,13 @@ Video Capture This is probably the most frequently used feature. The video capture device can be configured by using the module options num_inputs, input_types and -ccs_cap_mode (see section 1 for more detailed information), but by default -four inputs are configured: a webcam, a TV tuner, an S-Video and an HDMI -input, one input for each input type. Those are described in more detail -below. +ccs_cap_mode (see "Configuring the driver" for more detailed information), +but by default four inputs are configured: a webcam, a TV tuner, an S-Video +and an HDMI input, one input for each input type. Those are described in more +detail below. Special attention has been given to the rate at which new frames become -available. The jitter will be around 1 jiffie (that depends on the HZ +available. The jitter will be around 1 jiffy (that depends on the HZ configuration of your kernel, so usually 1/100, 1/250 or 1/1000 of a second), but the long-term behavior is exactly following the framerate. So a framerate of 59.94 Hz is really different from 60 Hz. If the framerate @@ -434,10 +443,10 @@ Video Output ------------ The video output device can be configured by using the module options -num_outputs, output_types and ccs_out_mode (see section 1 for more detailed -information), but by default two outputs are configured: an S-Video and an -HDMI input, one output for each output type. Those are described in more detail -below. +num_outputs, output_types and ccs_out_mode (see "Configuring the driver" +for more detailed information), but by default two outputs are configured: +an S-Video and an HDMI input, one output for each output type. Those are +described in more detail below. Like with video capture the framerate is also exact in the long term. @@ -1011,11 +1020,6 @@ Digital Video Controls affects the reported colorspace since DVI_D outputs will always use sRGB. -- Display Present: - - sets the presence of a "display" on the HDMI output. This affects - the tx_edid_present, tx_hotplug and tx_rxsense controls. - FM Radio Receiver Controls ~~~~~~~~~~~~~~~~~~~~~~~~~~ @@ -1130,35 +1134,34 @@ Metadata Capture Controls if set, then the generated metadata stream contains Source Clock information. -Video, VBI and RDS Looping --------------------------- -The vivid driver supports looping of video output to video input, VBI output -to VBI input and RDS output to RDS input. For video/VBI looping this emulates -as if a cable was hooked up between the output and input connector. So video -and VBI looping is only supported between S-Video and HDMI inputs and outputs. -VBI is only valid for S-Video as it makes no sense for HDMI. +Video, Sliced VBI and HDMI CEC Looping +-------------------------------------- -Since radio is wireless this looping always happens if the radio receiver -frequency is close to the radio transmitter frequency. In that case the radio -transmitter will 'override' the emulated radio stations. - -Looping is currently supported only between devices created by the same -vivid driver instance. +Video Looping functionality is supported for devices created by the same +vivid driver instance, as well as across multiple instances of the vivid driver. +The vivid driver supports looping of video and Sliced VBI data between an S-Video output +and an S-Video input. It also supports looping of video and HDMI CEC data between an +HDMI output and an HDMI input. +To enable looping, set the 'HDMI/S-Video XXX-N Is Connected To' control(s) to select +whether an input uses the Test Pattern Generator, or is disconnected, or is connected +to an output. An input can be connected to an output from any vivid instance. +The inputs and outputs are numbered XXX-N where XXX is the vivid instance number +(see module option n_devs). If there is only one vivid instance (the default), then +XXX will be 000. And N is the Nth S-Video/HDMI input or output of that instance. +If vivid is loaded without module options, then you can connect the S-Video 000-0 input +to the S-Video 000-0 output, or the HDMI 000-0 input to the HDMI 000-0 output. +This is the equivalent of connecting or disconnecting a cable between an input and an +output in a physical device. -Video and Sliced VBI looping -~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +If an 'HDMI/S-Video XXX-N Is Connected To' control selected an output, then the video +output will be looped to the video input provided that: -The way to enable video/VBI looping is currently fairly crude. A 'Loop Video' -control is available in the "Vivid" control class of the video -capture and VBI capture devices. When checked the video looping will be enabled. -Once enabled any video S-Video or HDMI input will show a static test pattern -until the video output has started. At that time the video output will be -looped to the video input provided that: +- the currently selected input matches the input indicated by the control name. -- the input type matches the output type. So the HDMI input cannot receive - video from the S-Video output. +- in the vivid instance of the output connector, the currently selected output matches + the output indicated by the control's value. - the video resolution of the video input must match that of the video output. So it is not possible to loop a 50 Hz (720x576) S-Video output to a 60 Hz @@ -1185,6 +1188,8 @@ looped to the video input provided that: "DV Timings Signal Mode" for the HDMI input should be configured so that a valid signal is passed to the video input. +If any condition is not valid, then the 'Noise' test pattern is shown. + The framerates do not have to match, although this might change in the future. By default you will see the OSD text superimposed on top of the looped video. @@ -1198,17 +1203,26 @@ and WSS (50 Hz formats) VBI data is looped. Teletext VBI data is not looped. Radio & RDS Looping -~~~~~~~~~~~~~~~~~~~ - -As mentioned in section 6 the radio receiver emulates stations are regular -frequency intervals. Depending on the frequency of the radio receiver a -signal strength value is calculated (this is returned by VIDIOC_G_TUNER). -However, it will also look at the frequency set by the radio transmitter and -if that results in a higher signal strength than the settings of the radio -transmitter will be used as if it was a valid station. This also includes -the RDS data (if any) that the transmitter 'transmits'. This is received -faithfully on the receiver side. Note that when the driver is loaded the -frequencies of the radio receiver and transmitter are not identical, so +------------------- + +The vivid driver supports looping of RDS output to RDS input. + +Since radio is wireless this looping always happens if the radio receiver +frequency is close to the radio transmitter frequency. In that case the radio +transmitter will 'override' the emulated radio stations. + +RDS looping is currently supported only between devices created by the same +vivid driver instance. + +As mentioned in the "Radio Receiver" section, the radio receiver emulates +stations at regular frequency intervals. Depending on the frequency of the +radio receiver a signal strength value is calculated (this is returned by +VIDIOC_G_TUNER). However, it will also look at the frequency set by the radio +transmitter and if that results in a higher signal strength than the settings +of the radio transmitter will be used as if it was a valid station. This also +includes the RDS data (if any) that the transmitter 'transmits'. This is +received faithfully on the receiver side. Note that when the driver is loaded +the frequencies of the radio receiver and transmitter are not identical, so initially no looping takes place. @@ -1218,8 +1232,8 @@ Cropping, Composing, Scaling This driver supports cropping, composing and scaling in any combination. Normally which features are supported can be selected through the Vivid controls, but it is also possible to hardcode it when the module is loaded through the -ccs_cap_mode and ccs_out_mode module options. See section 1 on the details of -these module options. +ccs_cap_mode and ccs_out_mode module options. See "Configuring the driver" on +the details of these module options. This allows you to test your application for all these variations. @@ -1260,7 +1274,8 @@ is set, then the alpha component is only used for the color red and set to The driver has to be configured to support the multiplanar formats. By default the driver instances are single-planar. This can be changed by setting the -multiplanar module option, see section 1 for more details on that option. +multiplanar module option, see "Configuring the driver" for more details on that +option. If the driver instance is using the multiplanar formats/API, then the first single planar format (YUYV) and the multiplanar NV16M and NV61M formats the @@ -1270,74 +1285,6 @@ data_offset to be non-zero, so this is a useful feature for testing applications Video output will also honor any data_offset that the application set. -Capture Overlay ---------------- - -Note: capture overlay support is implemented primarily to test the existing -V4L2 capture overlay API. In practice few if any GPUs support such overlays -anymore, and neither are they generally needed anymore since modern hardware -is so much more capable. By setting flag 0x10000 in the node_types module -option the vivid driver will create a simple framebuffer device that can be -used for testing this API. Whether this API should be used for new drivers is -questionable. - -This driver has support for a destructive capture overlay with bitmap clipping -and list clipping (up to 16 rectangles) capabilities. Overlays are not -supported for multiplanar formats. It also honors the struct v4l2_window field -setting: if it is set to FIELD_TOP or FIELD_BOTTOM and the capture setting is -FIELD_ALTERNATE, then only the top or bottom fields will be copied to the overlay. - -The overlay only works if you are also capturing at that same time. This is a -vivid limitation since it copies from a buffer to the overlay instead of -filling the overlay directly. And if you are not capturing, then no buffers -are available to fill. - -In addition, the pixelformat of the capture format and that of the framebuffer -must be the same for the overlay to work. Otherwise VIDIOC_OVERLAY will return -an error. - -In order to really see what it going on you will need to create two vivid -instances: the first with a framebuffer enabled. You configure the capture -overlay of the second instance to use the framebuffer of the first, then -you start capturing in the second instance. For the first instance you setup -the output overlay for the video output, turn on video looping and capture -to see the blended framebuffer overlay that's being written to by the second -instance. This setup would require the following commands: - -.. code-block:: none - - $ sudo modprobe vivid n_devs=2 node_types=0x10101,0x1 - $ v4l2-ctl -d1 --find-fb - /dev/fb1 is the framebuffer associated with base address 0x12800000 - $ sudo v4l2-ctl -d2 --set-fbuf fb=1 - $ v4l2-ctl -d1 --set-fbuf fb=1 - $ v4l2-ctl -d0 --set-fmt-video=pixelformat='AR15' - $ v4l2-ctl -d1 --set-fmt-video-out=pixelformat='AR15' - $ v4l2-ctl -d2 --set-fmt-video=pixelformat='AR15' - $ v4l2-ctl -d0 -i2 - $ v4l2-ctl -d2 -i2 - $ v4l2-ctl -d2 -c horizontal_movement=4 - $ v4l2-ctl -d1 --overlay=1 - $ v4l2-ctl -d0 -c loop_video=1 - $ v4l2-ctl -d2 --stream-mmap --overlay=1 - -And from another console: - -.. code-block:: none - - $ v4l2-ctl -d1 --stream-out-mmap - -And yet another console: - -.. code-block:: none - - $ qv4l2 - -and start streaming. - -As you can see, this is not for the faint of heart... - - Output Overlay -------------- @@ -1396,7 +1343,7 @@ Some Future Improvements Just as a reminder and in no particular order: - Add a virtual alsa driver to test audio -- Add virtual sub-devices and media controller support +- Add virtual sub-devices - Some support for testing compressed video - Add support to loop raw VBI output to raw VBI input - Add support to loop teletext sliced VBI output to VBI input @@ -1405,12 +1352,10 @@ Just as a reminder and in no particular order: - Add ARGB888 overlay support: better testing of the alpha channel - Improve pixel aspect support in the tpg code by passing a real v4l2_fract - Use per-queue locks and/or per-device locks to improve throughput -- Add support to loop from a specific output to a specific input across - vivid instances - The SDR radio should use the same 'frequencies' for stations as the normal radio receiver, and give back noise if the frequency doesn't match up with a station frequency - Make a thread for the RDS generation, that would help in particular for the "Controls" RDS Rx I/O Mode as the read-only RDS controls could be updated in real-time. -- Changing the EDID should cause hotplug detect emulation to happen. +- Changing the EDID doesn't wait 100 ms before setting the HPD signal. diff --git a/Documentation/admin-guide/mm/damon/start.rst b/Documentation/admin-guide/mm/damon/start.rst index 7aa0071ff1c3..ede14b679d02 100644 --- a/Documentation/admin-guide/mm/damon/start.rst +++ b/Documentation/admin-guide/mm/damon/start.rst @@ -7,7 +7,7 @@ Getting Started This document briefly describes how you can use DAMON by demonstrating its default user space tool. Please note that this document describes only a part of its features for brevity. Please refer to the usage `doc -<https://github.com/awslabs/damo/blob/next/USAGE.md>`_ of the tool for more +<https://github.com/damonitor/damo/blob/next/USAGE.md>`_ of the tool for more details. @@ -26,7 +26,7 @@ User Space Tool For the demonstration, we will use the default user space tool for DAMON, called DAMON Operator (DAMO). It is available at -https://github.com/awslabs/damo. The examples below assume that ``damo`` is on +https://github.com/damonitor/damo. The examples below assume that ``damo`` is on your ``$PATH``. It's not mandatory, though. Because DAMO is using the sysfs interface (refer to :doc:`usage` for the @@ -34,18 +34,69 @@ detail) of DAMON, you should ensure :doc:`sysfs </filesystems/sysfs>` is mounted. +Snapshot Data Access Patterns +============================= + +The commands below show the memory access pattern of a program at the moment of +the execution. :: + + $ git clone https://github.com/sjp38/masim; cd masim; make + $ sudo damo start "./masim ./configs/stairs.cfg --quiet" + $ sudo damo report access + heatmap: 641111111000000000000000000000000000000000000000000000[...]33333333333333335557984444[...]7 + # min/max temperatures: -1,840,000,000, 370,010,000, column size: 3.925 MiB + 0 addr 86.182 TiB size 8.000 KiB access 0 % age 14.900 s + 1 addr 86.182 TiB size 8.000 KiB access 60 % age 0 ns + 2 addr 86.182 TiB size 3.422 MiB access 0 % age 4.100 s + 3 addr 86.182 TiB size 2.004 MiB access 95 % age 2.200 s + 4 addr 86.182 TiB size 29.688 MiB access 0 % age 14.100 s + 5 addr 86.182 TiB size 29.516 MiB access 0 % age 16.700 s + 6 addr 86.182 TiB size 29.633 MiB access 0 % age 17.900 s + 7 addr 86.182 TiB size 117.652 MiB access 0 % age 18.400 s + 8 addr 126.990 TiB size 62.332 MiB access 0 % age 9.500 s + 9 addr 126.990 TiB size 13.980 MiB access 0 % age 5.200 s + 10 addr 126.990 TiB size 9.539 MiB access 100 % age 3.700 s + 11 addr 126.990 TiB size 16.098 MiB access 0 % age 6.400 s + 12 addr 127.987 TiB size 132.000 KiB access 0 % age 2.900 s + total size: 314.008 MiB + $ sudo damo stop + +The first command of the above example downloads and builds an artificial +memory access generator program called ``masim``. The second command asks DAMO +to start the program via the given command and make DAMON monitors the newly +started process. The third command retrieves the current snapshot of the +monitored access pattern of the process from DAMON and shows the pattern in a +human readable format. + +The first line of the output shows the relative access temperature (hotness) of +the regions in a single row hetmap format. Each column on the heatmap +represents regions of same size on the monitored virtual address space. The +position of the colun on the row and the number on the column represents the +relative location and access temperature of the region. ``[...]`` means +unmapped huge regions on the virtual address spaces. The second line shows +additional information for better understanding the heatmap. + +Each line of the output from the third line shows which virtual address range +(``addr XX size XX``) of the process is how frequently (``access XX %``) +accessed for how long time (``age XX``). For example, the evelenth region of +~9.5 MiB size is being most frequently accessed for last 3.7 seconds. Finally, +the fourth command stops DAMON. + +Note that DAMON can monitor not only virtual address spaces but multiple types +of address spaces including the physical address space. + + Recording Data Access Patterns ============================== The commands below record the memory access patterns of a program and save the monitoring results to a file. :: - $ git clone https://github.com/sjp38/masim - $ cd masim; make; ./masim ./configs/zigzag.cfg & + $ ./masim ./configs/zigzag.cfg & $ sudo damo record -o damon.data $(pidof masim) -The first two lines of the commands download an artificial memory access -generator program and run it in the background. The generator will repeatedly +The line of the commands run the artificial memory access +generator program again. The generator will repeatedly access two 100 MiB sized memory regions one by one. You can substitute this with your real workload. The last line asks ``damo`` to record the access pattern in the ``damon.data`` file. @@ -57,7 +108,7 @@ Visualizing Recorded Patterns You can visualize the pattern in a heatmap, showing which memory region (x-axis) got accessed when (y-axis) and how frequently (number).:: - $ sudo damo report heats --heatmap stdout + $ sudo damo report heatmap 22222222222222222222222222222222222222211111111111111111111111111111111111111100 44444444444444444444444444444444444444434444444444444444444444444444444444443200 44444444444444444444444444444444444444433444444444444444444444444444444444444200 @@ -122,6 +173,6 @@ Data Access Pattern Aware Memory Management Below command makes every memory region of size >=4K that has not accessed for >=60 seconds in your workload to be swapped out. :: - $ sudo damo schemes --damos_access_rate 0 0 --damos_sz_region 4K max \ - --damos_age 60s max --damos_action pageout \ - <pid of your workload> + $ sudo damo start --damos_access_rate 0 0 --damos_sz_region 4K max \ + --damos_age 60s max --damos_action pageout \ + <pid of your workload> diff --git a/Documentation/admin-guide/mm/damon/usage.rst b/Documentation/admin-guide/mm/damon/usage.rst index 6fce035fdbf5..47a44bd348ab 100644 --- a/Documentation/admin-guide/mm/damon/usage.rst +++ b/Documentation/admin-guide/mm/damon/usage.rst @@ -7,31 +7,25 @@ Detailed Usages DAMON provides below interfaces for different users. - *DAMON user space tool.* - `This <https://github.com/awslabs/damo>`_ is for privileged people such as + `This <https://github.com/damonitor/damo>`_ is for privileged people such as system administrators who want a just-working human-friendly interface. Using this, users can use the DAMON’s major features in a human-friendly way. It may not be highly tuned for special cases, though. For more detail, please refer to its `usage document - <https://github.com/awslabs/damo/blob/next/USAGE.md>`_. + <https://github.com/damonitor/damo/blob/next/USAGE.md>`_. - *sysfs interface.* :ref:`This <sysfs_interface>` is for privileged user space programmers who want more optimized use of DAMON. Using this, users can use DAMON’s major features by reading from and writing to special sysfs files. Therefore, you can write and use your personalized DAMON sysfs wrapper programs that reads/writes the sysfs files instead of you. The `DAMON user space tool - <https://github.com/awslabs/damo>`_ is one example of such programs. + <https://github.com/damonitor/damo>`_ is one example of such programs. - *Kernel Space Programming Interface.* :doc:`This </mm/damon/api>` is for kernel space programmers. Using this, users can utilize every feature of DAMON most flexibly and efficiently by writing kernel space DAMON application programs for you. You can even extend DAMON for various address spaces. For detail, please refer to the interface :doc:`document </mm/damon/api>`. -- *debugfs interface. (DEPRECATED!)* - :ref:`This <debugfs_interface>` is almost identical to :ref:`sysfs interface - <sysfs_interface>`. This is deprecated, so users should move to the - :ref:`sysfs interface <sysfs_interface>`. If you depend on this and cannot - move, please report your usecase to damon@lists.linux.dev and - linux-mm@kvack.org. .. _sysfs_interface: @@ -78,7 +72,7 @@ comma (","). │ │ │ │ │ │ │ │ ... │ │ │ │ │ │ ... │ │ │ │ │ :ref:`schemes <sysfs_schemes>`/nr_schemes - │ │ │ │ │ │ :ref:`0 <sysfs_scheme>`/action,apply_interval_us + │ │ │ │ │ │ :ref:`0 <sysfs_scheme>`/action,target_nid,apply_interval_us │ │ │ │ │ │ │ :ref:`access_pattern <sysfs_access_pattern>`/ │ │ │ │ │ │ │ │ sz/min,max │ │ │ │ │ │ │ │ nr_accesses/min,max @@ -89,10 +83,10 @@ comma (","). │ │ │ │ │ │ │ │ │ 0/target_metric,target_value,current_value │ │ │ │ │ │ │ :ref:`watermarks <sysfs_watermarks>`/metric,interval_us,high,mid,low │ │ │ │ │ │ │ :ref:`filters <sysfs_filters>`/nr_filters - │ │ │ │ │ │ │ │ 0/type,matching,memcg_id - │ │ │ │ │ │ │ :ref:`stats <sysfs_schemes_stats>`/nr_tried,sz_tried,nr_applied,sz_applied,qt_exceeds + │ │ │ │ │ │ │ │ 0/type,matching,allow,memcg_path,addr_start,addr_end,target_idx + │ │ │ │ │ │ │ :ref:`stats <sysfs_schemes_stats>`/nr_tried,sz_tried,nr_applied,sz_applied,sz_ops_filter_passed,qt_exceeds │ │ │ │ │ │ │ :ref:`tried_regions <sysfs_schemes_tried_regions>`/total_bytes - │ │ │ │ │ │ │ │ 0/start,end,nr_accesses,age + │ │ │ │ │ │ │ │ 0/start,end,nr_accesses,age,sz_filter_passed │ │ │ │ │ │ │ │ ... │ │ │ │ │ │ ... │ │ │ │ ... @@ -153,7 +147,7 @@ Users can write below commands for the kdamond to the ``state`` file. - ``clear_schemes_tried_regions``: Clear the DAMON-based operating scheme action tried regions directory for each DAMON-based operation scheme of the kdamond. -- ``update_schemes_effective_bytes``: Update the contents of +- ``update_schemes_effective_quotas``: Update the contents of ``effective_bytes`` files for each DAMON-based operation scheme of the kdamond. For more details, refer to :ref:`quotas directory <sysfs_quotas>`. @@ -289,14 +283,18 @@ schemes/<N>/ ------------ In each scheme directory, five directories (``access_pattern``, ``quotas``, -``watermarks``, ``filters``, ``stats``, and ``tried_regions``) and two files -(``action`` and ``apply_interval``) exist. +``watermarks``, ``filters``, ``stats``, and ``tried_regions``) and three files +(``action``, ``target_nid`` and ``apply_interval``) exist. The ``action`` file is for setting and getting the scheme's :ref:`action <damon_design_damos_action>`. The keywords that can be written to and read from the file and their meaning are same to those of the list on :ref:`design doc <damon_design_damos_action>`. +The ``target_nid`` file is for setting the migration target node, which is +only meaningful when the ``action`` is either ``migrate_hot`` or +``migrate_cold``. + The ``apply_interval_us`` file is for setting and getting the scheme's :ref:`apply_interval <damon_design_damos>` in microseconds. @@ -342,7 +340,7 @@ Based on the user-specified :ref:`goal <sysfs_schemes_quota_goals>`, the effective size quota is further adjusted. Reading ``effective_bytes`` returns the current effective size quota. The file is not updated in real time, so users should ask DAMON sysfs interface to update the content of the file for -the stats by writing a special keyword, ``update_schemes_effective_bytes`` to +the stats by writing a special keyword, ``update_schemes_effective_quotas`` to the relevant ``kdamonds/<N>/state`` file. Under ``weights`` directory, three files (``sz_permil``, @@ -408,59 +406,62 @@ number (``N``) to the file creates the number of child directories named ``0`` to ``N-1``. Each directory represents each filter. The filters are evaluated in the numeric order. -Each filter directory contains six files, namely ``type``, ``matcing``, -``memcg_path``, ``addr_start``, ``addr_end``, and ``target_idx``. To ``type`` -file, you can write one of four special keywords: ``anon`` for anonymous pages, -``memcg`` for specific memory cgroup, ``addr`` for specific address range (an -open-ended interval), or ``target`` for specific DAMON monitoring target -filtering. In case of the memory cgroup filtering, you can specify the memory -cgroup of the interest by writing the path of the memory cgroup from the -cgroups mount point to ``memcg_path`` file. In case of the address range -filtering, you can specify the start and end address of the range to -``addr_start`` and ``addr_end`` files, respectively. For the DAMON monitoring -target filtering, you can specify the index of the target between the list of -the DAMON context's monitoring targets list to ``target_idx`` file. You can -write ``Y`` or ``N`` to ``matching`` file to filter out pages that does or does -not match to the type, respectively. Then, the scheme's action will not be -applied to the pages that specified to be filtered out. +Each filter directory contains seven files, namely ``type``, ``matching``, +``allow``, ``memcg_path``, ``addr_start``, ``addr_end``, and ``target_idx``. +To ``type`` file, you can write one of five special keywords: ``anon`` for +anonymous pages, ``memcg`` for specific memory cgroup, ``young`` for young +pages, ``addr`` for specific address range (an open-ended interval), or +``target`` for specific DAMON monitoring target filtering. Meaning of the +types are same to the description on the :ref:`design doc +<damon_design_damos_filters>`. + +In case of the memory cgroup filtering, you can specify the memory cgroup of +the interest by writing the path of the memory cgroup from the cgroups mount +point to ``memcg_path`` file. In case of the address range filtering, you can +specify the start and end address of the range to ``addr_start`` and +``addr_end`` files, respectively. For the DAMON monitoring target filtering, +you can specify the index of the target between the list of the DAMON context's +monitoring targets list to ``target_idx`` file. + +You can write ``Y`` or ``N`` to ``matching`` file to specify whether the filter +is for memory that matches the ``type``. You can write ``Y`` or ``N`` to +``allow`` file to specify if applying the action to the memory that satisfies +the ``type`` and ``matching`` should be allowed or not. For example, below restricts a DAMOS action to be applied to only non-anonymous pages of all memory cgroups except ``/having_care_already``.:: # echo 2 > nr_filters - # # filter out anonymous pages + # # disallow anonymous pages echo anon > 0/type echo Y > 0/matching + echo N > 0/allow # # further filter out all cgroups except one at '/having_care_already' echo memcg > 1/type echo /having_care_already > 1/memcg_path - echo N > 1/matching - -Note that ``anon`` and ``memcg`` filters are currently supported only when -``paddr`` :ref:`implementation <sysfs_context>` is being used. + echo Y > 1/matching + echo N > 1/allow -Also, memory regions that are filtered out by ``addr`` or ``target`` filters -are not counted as the scheme has tried to those, while regions that filtered -out by other type filters are counted as the scheme has tried to. The -difference is applied to :ref:`stats <damos_stats>` and -:ref:`tried regions <sysfs_schemes_tried_regions>`. +Refer to the :ref:`DAMOS filters design documentation +<damon_design_damos_filters>` for more details including how multiple filters +of different ``allow`` works, when each of the filters are supported, and +differences on stats. .. _sysfs_schemes_stats: schemes/<N>/stats/ ------------------ -DAMON counts the total number and bytes of regions that each scheme is tried to -be applied, the two numbers for the regions that each scheme is successfully -applied, and the total number of the quota limit exceeds. This statistics can -be used for online analysis or tuning of the schemes. +DAMON counts statistics for each scheme. This statistics can be used for +online analysis or tuning of the schemes. Refer to :ref:`design doc +<damon_design_damos_stat>` for more details about the stats. The statistics can be retrieved by reading the files under ``stats`` directory -(``nr_tried``, ``sz_tried``, ``nr_applied``, ``sz_applied``, and -``qt_exceeds``), respectively. The files are not updated in real time, so you -should ask DAMON sysfs interface to update the content of the files for the -stats by writing a special keyword, ``update_schemes_stats`` to the relevant -``kdamonds/<N>/state`` file. +(``nr_tried``, ``sz_tried``, ``nr_applied``, ``sz_applied``, +``sz_ops_filter_passed``, and ``qt_exceeds``), respectively. The files are not +updated in real time, so you should ask DAMON sysfs interface to update the +content of the files for the stats by writing a special keyword, +``update_schemes_stats`` to the relevant ``kdamonds/<N>/state`` file. .. _sysfs_schemes_tried_regions: @@ -497,10 +498,10 @@ set the ``access pattern`` as their interested pattern that they want to query. tried_regions/<N>/ ------------------ -In each region directory, you will find four files (``start``, ``end``, -``nr_accesses``, and ``age``). Reading the files will show the start and end -addresses, ``nr_accesses``, and ``age`` of the region that corresponding -DAMON-based operation scheme ``action`` has tried to be applied. +In each region directory, you will find five files (``start``, ``end``, +``nr_accesses``, ``age``, and ``sz_filter_passed``). Reading the files will +show the properties of the region that corresponding DAMON-based operation +scheme ``action`` has tried to be applied. Example ~~~~~~~ @@ -539,7 +540,7 @@ memory rate becomes larger than 60%, or lower than 30%". :: # echo 300 > watermarks/low Please note that it's highly recommended to use user space tools like `damo -<https://github.com/awslabs/damo>`_ rather than manually reading and writing +<https://github.com/damonitor/damo>`_ rather than manually reading and writing the files as above. Above is only for an example. .. _tracepoint: @@ -596,306 +597,3 @@ fields are as usual. It shows the index of the DAMON context (``ctx_idx=X``) of the scheme in the list of the contexts of the context's kdamond, the index of the scheme (``scheme_idx=X``) in the list of the schemes of the context, in addition to the output of ``damon_aggregated`` tracepoint. - - -.. _debugfs_interface: - -debugfs Interface (DEPRECATED!) -=============================== - -.. note:: - - THIS IS DEPRECATED! - - DAMON debugfs interface is deprecated, so users should move to the - :ref:`sysfs interface <sysfs_interface>`. If you depend on this and cannot - move, please report your usecase to damon@lists.linux.dev and - linux-mm@kvack.org. - -DAMON exports nine files, ``DEPRECATED``, ``attrs``, ``target_ids``, -``init_regions``, ``schemes``, ``monitor_on_DEPRECATED``, ``kdamond_pid``, -``mk_contexts`` and ``rm_contexts`` under its debugfs directory, -``<debugfs>/damon/``. - - -``DEPRECATED`` is a read-only file for the DAMON debugfs interface deprecation -notice. Reading it returns the deprecation notice, as below:: - - # cat DEPRECATED - DAMON debugfs interface is deprecated, so users should move to DAMON_SYSFS. If you cannot, please report your usecase to damon@lists.linux.dev and linux-mm@kvack.org. - - -Attributes ----------- - -Users can get and set the ``sampling interval``, ``aggregation interval``, -``update interval``, and min/max number of monitoring target regions by -reading from and writing to the ``attrs`` file. To know about the monitoring -attributes in detail, please refer to the :doc:`/mm/damon/design`. For -example, below commands set those values to 5 ms, 100 ms, 1,000 ms, 10 and -1000, and then check it again:: - - # cd <debugfs>/damon - # echo 5000 100000 1000000 10 1000 > attrs - # cat attrs - 5000 100000 1000000 10 1000 - - -Target IDs ----------- - -Some types of address spaces supports multiple monitoring target. For example, -the virtual memory address spaces monitoring can have multiple processes as the -monitoring targets. Users can set the targets by writing relevant id values of -the targets to, and get the ids of the current targets by reading from the -``target_ids`` file. In case of the virtual address spaces monitoring, the -values should be pids of the monitoring target processes. For example, below -commands set processes having pids 42 and 4242 as the monitoring targets and -check it again:: - - # cd <debugfs>/damon - # echo 42 4242 > target_ids - # cat target_ids - 42 4242 - -Users can also monitor the physical memory address space of the system by -writing a special keyword, "``paddr\n``" to the file. Because physical address -space monitoring doesn't support multiple targets, reading the file will show a -fake value, ``42``, as below:: - - # cd <debugfs>/damon - # echo paddr > target_ids - # cat target_ids - 42 - -Note that setting the target ids doesn't start the monitoring. - - -Initial Monitoring Target Regions ---------------------------------- - -In case of the virtual address space monitoring, DAMON automatically sets and -updates the monitoring target regions so that entire memory mappings of target -processes can be covered. However, users can want to limit the monitoring -region to specific address ranges, such as the heap, the stack, or specific -file-mapped area. Or, some users can know the initial access pattern of their -workloads and therefore want to set optimal initial regions for the 'adaptive -regions adjustment'. - -In contrast, DAMON do not automatically sets and updates the monitoring target -regions in case of physical memory monitoring. Therefore, users should set the -monitoring target regions by themselves. - -In such cases, users can explicitly set the initial monitoring target regions -as they want, by writing proper values to the ``init_regions`` file. The input -should be a sequence of three integers separated by white spaces that represent -one region in below form.:: - - <target idx> <start address> <end address> - -The ``target idx`` should be the index of the target in ``target_ids`` file, -starting from ``0``, and the regions should be passed in address order. For -example, below commands will set a couple of address ranges, ``1-100`` and -``100-200`` as the initial monitoring target region of pid 42, which is the -first one (index ``0``) in ``target_ids``, and another couple of address -ranges, ``20-40`` and ``50-100`` as that of pid 4242, which is the second one -(index ``1``) in ``target_ids``.:: - - # cd <debugfs>/damon - # cat target_ids - 42 4242 - # echo "0 1 100 \ - 0 100 200 \ - 1 20 40 \ - 1 50 100" > init_regions - -Note that this sets the initial monitoring target regions only. In case of -virtual memory monitoring, DAMON will automatically updates the boundary of the -regions after one ``update interval``. Therefore, users should set the -``update interval`` large enough in this case, if they don't want the -update. - - -Schemes -------- - -Users can get and set the DAMON-based operation :ref:`schemes -<damon_design_damos>` by reading from and writing to ``schemes`` debugfs file. -Reading the file also shows the statistics of each scheme. To the file, each -of the schemes should be represented in each line in below form:: - - <target access pattern> <action> <quota> <watermarks> - -You can disable schemes by simply writing an empty string to the file. - -Target Access Pattern -~~~~~~~~~~~~~~~~~~~~~ - -The target access :ref:`pattern <damon_design_damos_access_pattern>` of the -scheme. The ``<target access pattern>`` is constructed with three ranges in -below form:: - - min-size max-size min-acc max-acc min-age max-age - -Specifically, bytes for the size of regions (``min-size`` and ``max-size``), -number of monitored accesses per aggregate interval for access frequency -(``min-acc`` and ``max-acc``), number of aggregate intervals for the age of -regions (``min-age`` and ``max-age``) are specified. Note that the ranges are -closed interval. - -Action -~~~~~~ - -The ``<action>`` is a predefined integer for memory management :ref:`actions -<damon_design_damos_action>`. The mapping between the ``<action>`` values and -the memory management actions is as below. For the detailed meaning of the -action and DAMON operations set supporting each action, please refer to the -list on :ref:`design doc <damon_design_damos_action>`. - - - 0: ``willneed`` - - 1: ``cold`` - - 2: ``pageout`` - - 3: ``hugepage`` - - 4: ``nohugepage`` - - 5: ``stat`` - -Quota -~~~~~ - -Users can set the :ref:`quotas <damon_design_damos_quotas>` of the given scheme -via the ``<quota>`` in below form:: - - <ms> <sz> <reset interval> <priority weights> - -This makes DAMON to try to use only up to ``<ms>`` milliseconds for applying -the action to memory regions of the ``target access pattern`` within the -``<reset interval>`` milliseconds, and to apply the action to only up to -``<sz>`` bytes of memory regions within the ``<reset interval>``. Setting both -``<ms>`` and ``<sz>`` zero disables the quota limits. - -For the :ref:`prioritization <damon_design_damos_quotas_prioritization>`, users -can set the weights for the three properties in ``<priority weights>`` in below -form:: - - <size weight> <access frequency weight> <age weight> - -Watermarks -~~~~~~~~~~ - -Users can specify :ref:`watermarks <damon_design_damos_watermarks>` of the -given scheme via ``<watermarks>`` in below form:: - - <metric> <check interval> <high mark> <middle mark> <low mark> - -``<metric>`` is a predefined integer for the metric to be checked. The -supported numbers and their meanings are as below. - - - 0: Ignore the watermarks - - 1: System's free memory rate (per thousand) - -The value of the metric is checked every ``<check interval>`` microseconds. - -If the value is higher than ``<high mark>`` or lower than ``<low mark>``, the -scheme is deactivated. If the value is lower than ``<mid mark>``, the scheme -is activated. - -.. _damos_stats: - -Statistics -~~~~~~~~~~ - -It also counts the total number and bytes of regions that each scheme is tried -to be applied, the two numbers for the regions that each scheme is successfully -applied, and the total number of the quota limit exceeds. This statistics can -be used for online analysis or tuning of the schemes. - -The statistics can be shown by reading the ``schemes`` file. Reading the file -will show each scheme you entered in each line, and the five numbers for the -statistics will be added at the end of each line. - -Example -~~~~~~~ - -Below commands applies a scheme saying "If a memory region of size in [4KiB, -8KiB] is showing accesses per aggregate interval in [0, 5] for aggregate -interval in [10, 20], page out the region. For the paging out, use only up to -10ms per second, and also don't page out more than 1GiB per second. Under the -limitation, page out memory regions having longer age first. Also, check the -free memory rate of the system every 5 seconds, start the monitoring and paging -out when the free memory rate becomes lower than 50%, but stop it if the free -memory rate becomes larger than 60%, or lower than 30%".:: - - # cd <debugfs>/damon - # scheme="4096 8192 0 5 10 20 2" # target access pattern and action - # scheme+=" 10 $((1024*1024*1024)) 1000" # quotas - # scheme+=" 0 0 100" # prioritization weights - # scheme+=" 1 5000000 600 500 300" # watermarks - # echo "$scheme" > schemes - - -Turning On/Off --------------- - -Setting the files as described above doesn't incur effect unless you explicitly -start the monitoring. You can start, stop, and check the current status of the -monitoring by writing to and reading from the ``monitor_on_DEPRECATED`` file. -Writing ``on`` to the file starts the monitoring of the targets with the -attributes. Writing ``off`` to the file stops those. DAMON also stops if -every target process is terminated. Below example commands turn on, off, and -check the status of DAMON:: - - # cd <debugfs>/damon - # echo on > monitor_on_DEPRECATED - # echo off > monitor_on_DEPRECATED - # cat monitor_on_DEPRECATED - off - -Please note that you cannot write to the above-mentioned debugfs files while -the monitoring is turned on. If you write to the files while DAMON is running, -an error code such as ``-EBUSY`` will be returned. - - -Monitoring Thread PID ---------------------- - -DAMON does requested monitoring with a kernel thread called ``kdamond``. You -can get the pid of the thread by reading the ``kdamond_pid`` file. When the -monitoring is turned off, reading the file returns ``none``. :: - - # cd <debugfs>/damon - # cat monitor_on_DEPRECATED - off - # cat kdamond_pid - none - # echo on > monitor_on_DEPRECATED - # cat kdamond_pid - 18594 - - -Using Multiple Monitoring Threads ---------------------------------- - -One ``kdamond`` thread is created for each monitoring context. You can create -and remove monitoring contexts for multiple ``kdamond`` required use case using -the ``mk_contexts`` and ``rm_contexts`` files. - -Writing the name of the new context to the ``mk_contexts`` file creates a -directory of the name on the DAMON debugfs directory. The directory will have -DAMON debugfs files for the context. :: - - # cd <debugfs>/damon - # ls foo - # ls: cannot access 'foo': No such file or directory - # echo foo > mk_contexts - # ls foo - # attrs init_regions kdamond_pid schemes target_ids - -If the context is not needed anymore, you can remove it and the corresponding -directory by putting the name of the context to the ``rm_contexts`` file. :: - - # echo foo > rm_contexts - # ls foo - # ls: cannot access 'foo': No such file or directory - -Note that ``mk_contexts``, ``rm_contexts``, and ``monitor_on_DEPRECATED`` files -are in the root directory only. diff --git a/Documentation/admin-guide/mm/hugetlbpage.rst b/Documentation/admin-guide/mm/hugetlbpage.rst index e4d4b4a8dc97..f34a0d798d5b 100644 --- a/Documentation/admin-guide/mm/hugetlbpage.rst +++ b/Documentation/admin-guide/mm/hugetlbpage.rst @@ -376,6 +376,13 @@ Note that the number of overcommit and reserve pages remain global quantities, as we don't know until fault time, when the faulting task's mempolicy is applied, from which node the huge page allocation will be attempted. +The hugetlb may be migrated between the per-node hugepages pool in the following +scenarios: memory offline, memory failure, longterm pinning, syscalls(mbind, +migrate_pages and move_pages), alloc_contig_range() and alloc_contig_pages(). +Now only memory offline, memory failure and syscalls allow fallbacking to allocate +a new hugetlb on a different node if the current node is unable to allocate during +hugetlb migration, that means these 3 cases can break the per-node hugepages pool. + .. _using_huge_pages: Using Huge Pages diff --git a/Documentation/admin-guide/mm/index.rst b/Documentation/admin-guide/mm/index.rst index 1f883abf3f00..8b35795b664b 100644 --- a/Documentation/admin-guide/mm/index.rst +++ b/Documentation/admin-guide/mm/index.rst @@ -10,7 +10,7 @@ processes address space and many other cool things. Linux memory management is a complex system with many configurable settings. Most of these settings are available via ``/proc`` -filesystem and can be quired and adjusted using ``sysctl``. These APIs +filesystem and can be queried and adjusted using ``sysctl``. These APIs are described in Documentation/admin-guide/sysctl/vm.rst and in `man 5 proc`_. .. _man 5 proc: http://man7.org/linux/man-pages/man5/proc.5.html diff --git a/Documentation/admin-guide/mm/ksm.rst b/Documentation/admin-guide/mm/ksm.rst index a639cac12477..ad8e7a41f3b5 100644 --- a/Documentation/admin-guide/mm/ksm.rst +++ b/Documentation/admin-guide/mm/ksm.rst @@ -308,7 +308,7 @@ limited by the ``advisor_max_cpu`` parameter. In addition there is also the ``advisor_target_scan_time`` parameter. This parameter sets the target time to scan all the KSM candidate pages. The parameter ``advisor_target_scan_time`` decides how aggressive the scan time advisor scans candidate pages. Lower -values make the scan time advisor to scan more aggresively. This is the most +values make the scan time advisor to scan more aggressively. This is the most important parameter for the configuration of the scan time advisor. The initial value and the maximum value can be changed with diff --git a/Documentation/admin-guide/mm/memory-hotplug.rst b/Documentation/admin-guide/mm/memory-hotplug.rst index 098f14d83e99..33c886f3d198 100644 --- a/Documentation/admin-guide/mm/memory-hotplug.rst +++ b/Documentation/admin-guide/mm/memory-hotplug.rst @@ -280,8 +280,8 @@ The following files are currently defined: blocks; configure auto-onlining. The default value depends on the - CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE kernel configuration - option. + CONFIG_MHP_DEFAULT_ONLINE_TYPE kernel configuration + options. See the ``state`` property of memory blocks for details. ``block_size_bytes`` read-only: the size in bytes of a memory block. @@ -294,8 +294,9 @@ The following files are currently defined: ``crash_hotplug`` read-only: when changes to the system memory map occur due to hot un/plug of memory, this file contains '1' if the kernel updates the kdump capture kernel memory - map itself (via elfcorehdr), or '0' if userspace must update - the kdump capture kernel memory map. + map itself (via elfcorehdr and other relevant kexec + segments), or '0' if userspace must update the kdump + capture kernel memory map. Availability depends on the CONFIG_MEMORY_HOTPLUG kernel configuration option. diff --git a/Documentation/admin-guide/mm/pagemap.rst b/Documentation/admin-guide/mm/pagemap.rst index f5f065c67615..caba0f52dd36 100644 --- a/Documentation/admin-guide/mm/pagemap.rst +++ b/Documentation/admin-guide/mm/pagemap.rst @@ -118,7 +118,7 @@ Short descriptions to the page flags 21 - KSM Identical memory pages dynamically shared between one or more processes. 22 - THP - Contiguous pages which construct transparent hugepages. + Contiguous pages which construct THP of any size and mapped by any granularity. 23 - OFFLINE The page is logically offline. 24 - ZERO_PAGE @@ -173,27 +173,6 @@ LRU related page flags The page-types tool in the tools/mm directory can be used to query the above flags. -Using pagemap to do something useful -==================================== - -The general procedure for using pagemap to find out about a process' memory -usage goes like this: - - 1. Read ``/proc/pid/maps`` to determine which parts of the memory space are - mapped to what. - 2. Select the maps you are interested in -- all of them, or a particular - library, or the stack or the heap, etc. - 3. Open ``/proc/pid/pagemap`` and seek to the pages you would like to examine. - 4. Read a u64 for each page from pagemap. - 5. Open ``/proc/kpagecount`` and/or ``/proc/kpageflags``. For each PFN you - just read, seek to that entry in the file, and read the data you want. - -For example, to find the "unique set size" (USS), which is the amount of -memory that a process is using that is not shared with any other process, -you can go through every map in the process, find the PFNs, look those up -in kpagecount, and tally up the number of pages that are only referenced -once. - Exceptions for Shared Memory ============================ @@ -252,7 +231,7 @@ Following flags about pages are currently supported: - ``PAGE_IS_PRESENT`` - Page is present in the memory - ``PAGE_IS_SWAPPED`` - Page is in swapped - ``PAGE_IS_PFNZERO`` - Page has zero PFN -- ``PAGE_IS_HUGE`` - Page is THP or Hugetlb backed +- ``PAGE_IS_HUGE`` - Page is PMD-mapped THP or Hugetlb backed - ``PAGE_IS_SOFT_DIRTY`` - Page is soft-dirty The ``struct pm_scan_arg`` is used as the argument of the IOCTL. diff --git a/Documentation/admin-guide/mm/transhuge.rst b/Documentation/admin-guide/mm/transhuge.rst index 04eb45a2f940..dff8d5985f0f 100644 --- a/Documentation/admin-guide/mm/transhuge.rst +++ b/Documentation/admin-guide/mm/transhuge.rst @@ -202,12 +202,21 @@ PMD-mappable transparent hugepage:: cat /sys/kernel/mm/transparent_hugepage/hpage_pmd_size -khugepaged will be automatically started when one or more hugepage -sizes are enabled (either by directly setting "always" or "madvise", -or by setting "inherit" while the top-level enabled is set to "always" -or "madvise"), and it'll be automatically shutdown when the last -hugepage size is disabled (either by directly setting "never", or by -setting "inherit" while the top-level enabled is set to "never"). +All THPs at fault and collapse time will be added to _deferred_list, +and will therefore be split under memory presure if they are considered +"underused". A THP is underused if the number of zero-filled pages in +the THP is above max_ptes_none (see below). It is possible to disable +this behaviour by writing 0 to shrink_underused, and enable it by writing +1 to it:: + + echo 0 > /sys/kernel/mm/transparent_hugepage/shrink_underused + echo 1 > /sys/kernel/mm/transparent_hugepage/shrink_underused + +khugepaged will be automatically started when PMD-sized THP is enabled +(either of the per-size anon control or the top-level control are set +to "always" or "madvise"), and it'll be automatically shutdown when +PMD-sized THP is disabled (when both the per-size anon control and the +top-level control are "never") Khugepaged controls ------------------- @@ -278,25 +287,92 @@ collapsed, resulting fewer pages being collapsed into THPs, and lower memory access performance. ``max_ptes_shared`` specifies how many pages can be shared across multiple -processes. Exceeding the number would block the collapse:: +processes. khugepaged might treat pages of THPs as shared if any page of +that THP is shared. Exceeding the number would block the collapse:: /sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_shared A higher value may increase memory footprint for some workloads. -Boot parameter -============== - -You can change the sysfs boot time defaults of Transparent Hugepage -Support by passing the parameter ``transparent_hugepage=always`` or -``transparent_hugepage=madvise`` or ``transparent_hugepage=never`` -to the kernel command line. +Boot parameters +=============== + +You can change the sysfs boot time default for the top-level "enabled" +control by passing the parameter ``transparent_hugepage=always`` or +``transparent_hugepage=madvise`` or ``transparent_hugepage=never`` to the +kernel command line. + +Alternatively, each supported anonymous THP size can be controlled by +passing ``thp_anon=<size>[KMG],<size>[KMG]:<state>;<size>[KMG]-<size>[KMG]:<state>``, +where ``<size>`` is the THP size (must be a power of 2 of PAGE_SIZE and +supported anonymous THP) and ``<state>`` is one of ``always``, ``madvise``, +``never`` or ``inherit``. + +For example, the following will set 16K, 32K, 64K THP to ``always``, +set 128K, 512K to ``inherit``, set 256K to ``madvise`` and 1M, 2M +to ``never``:: + + thp_anon=16K-64K:always;128K,512K:inherit;256K:madvise;1M-2M:never + +``thp_anon=`` may be specified multiple times to configure all THP sizes as +required. If ``thp_anon=`` is specified at least once, any anon THP sizes +not explicitly configured on the command line are implicitly set to +``never``. + +``transparent_hugepage`` setting only affects the global toggle. If +``thp_anon`` is not specified, PMD_ORDER THP will default to ``inherit``. +However, if a valid ``thp_anon`` setting is provided by the user, the +PMD_ORDER THP policy will be overridden. If the policy for PMD_ORDER +is not defined within a valid ``thp_anon``, its policy will default to +``never``. + +Similarly to ``transparent_hugepage``, you can control the hugepage +allocation policy for the internal shmem mount by using the kernel parameter +``transparent_hugepage_shmem=<policy>``, where ``<policy>`` is one of the +seven valid policies for shmem (``always``, ``within_size``, ``advise``, +``never``, ``deny``, and ``force``). + +Similarly to ``transparent_hugepage_shmem``, you can control the default +hugepage allocation policy for the tmpfs mount by using the kernel parameter +``transparent_hugepage_tmpfs=<policy>``, where ``<policy>`` is one of the +four valid policies for tmpfs (``always``, ``within_size``, ``advise``, +``never``). The tmpfs mount default policy is ``never``. + +In the same manner as ``thp_anon`` controls each supported anonymous THP +size, ``thp_shmem`` controls each supported shmem THP size. ``thp_shmem`` +has the same format as ``thp_anon``, but also supports the policy +``within_size``. + +``thp_shmem=`` may be specified multiple times to configure all THP sizes +as required. If ``thp_shmem=`` is specified at least once, any shmem THP +sizes not explicitly configured on the command line are implicitly set to +``never``. + +``transparent_hugepage_shmem`` setting only affects the global toggle. If +``thp_shmem`` is not specified, PMD_ORDER hugepage will default to +``inherit``. However, if a valid ``thp_shmem`` setting is provided by the +user, the PMD_ORDER hugepage policy will be overridden. If the policy for +PMD_ORDER is not defined within a valid ``thp_shmem``, its policy will +default to ``never``. Hugepages in tmpfs/shmem ======================== -You can control hugepage allocation policy in tmpfs with mount option -``huge=``. It can have following values: +Traditionally, tmpfs only supported a single huge page size ("PMD"). Today, +it also supports smaller sizes just like anonymous memory, often referred +to as "multi-size THP" (mTHP). Huge pages of any size are commonly +represented in the kernel as "large folios". + +While there is fine control over the huge page sizes to use for the internal +shmem mount (see below), ordinary tmpfs mounts will make use of all available +huge page sizes without any control over the exact sizes, behaving more like +other file systems. + +tmpfs mounts +------------ + +The THP allocation policy for tmpfs mounts can be adjusted using the mount +option: ``huge=``. It can have following values: always Attempt to allocate huge pages every time we need a new page; @@ -306,24 +382,24 @@ never within_size Only allocate huge page if it will be fully within i_size. - Also respect fadvise()/madvise() hints; + Also respect madvise() hints; advise - Only allocate huge pages if requested with fadvise()/madvise(); + Only allocate huge pages if requested with madvise(); -The default policy is ``never``. +Remember, that the kernel may use huge pages of all available sizes, and +that no fine control as for the internal tmpfs mount is available. + +The default policy in the past was ``never``, but it can now be adjusted +using the kernel parameter ``transparent_hugepage_tmpfs=<policy>``. ``mount -o remount,huge= /mountpoint`` works fine after mount: remounting ``huge=never`` will not attempt to break up huge pages at all, just stop more from being allocated. -There's also sysfs knob to control hugepage allocation policy for internal -shmem mount: /sys/kernel/mm/transparent_hugepage/shmem_enabled. The mount -is used for SysV SHM, memfds, shared anonymous mmaps (of /dev/zero or -MAP_ANONYMOUS), GPU drivers' DRM objects, Ashmem. - -In addition to policies listed above, shmem_enabled allows two further -values: +In addition to policies listed above, the sysfs knob +/sys/kernel/mm/transparent_hugepage/shmem_enabled will affect the +allocation policy of tmpfs mounts, when set to the following values: deny For use in emergencies, to force the huge option off from @@ -331,6 +407,42 @@ deny force Force the huge option on for all - very useful for testing; +shmem / internal tmpfs +---------------------- +The mount internal tmpfs mount is used for SysV SHM, memfds, shared anonymous +mmaps (of /dev/zero or MAP_ANONYMOUS), GPU drivers' DRM objects, Ashmem. + +To control the THP allocation policy for this internal tmpfs mount, the +sysfs knob /sys/kernel/mm/transparent_hugepage/shmem_enabled and the knobs +per THP size in +'/sys/kernel/mm/transparent_hugepage/hugepages-<size>kB/shmem_enabled' +can be used. + +The global knob has the same semantics as the ``huge=`` mount options +for tmpfs mounts, except that the different huge page sizes can be controlled +individually, and will only use the setting of the global knob when the +per-size knob is set to 'inherit'. + +The options 'force' and 'deny' are dropped for the individual sizes, which +are rather testing artifacts from the old ages. + +always + Attempt to allocate <size> huge pages every time we need a new page; + +inherit + Inherit the top-level "shmem_enabled" value. By default, PMD-sized hugepages + have enabled="inherit" and all other hugepage sizes have enabled="never"; + +never + Do not allocate <size> huge pages; + +within_size + Only allocate <size> huge page if it will be fully within i_size. + Also respect madvise() hints; + +advise + Only allocate <size> huge pages if requested with madvise(); + Need of application restart =========================== @@ -343,10 +455,6 @@ also applies to the regions registered in khugepaged. Monitoring usage ================ -.. note:: - Currently the below counters only record events relating to - PMD-sized THP. Events relating to other THP sizes are not included. - The number of PMD-sized anonymous transparent huge pages currently used by the system is available by reading the AnonHugePages field in ``/proc/meminfo``. To identify what applications are using PMD-sized anonymous transparent huge @@ -358,7 +466,7 @@ AnonHugePmdMapped). The number of file transparent huge pages mapped to userspace is available by reading ShmemPmdMapped and ShmemHugePages fields in ``/proc/meminfo``. To identify what applications are mapping file transparent huge pages, it -is necessary to read ``/proc/PID/smaps`` and count the FileHugeMapped fields +is necessary to read ``/proc/PID/smaps`` and count the FilePmdMapped fields for each mapping. Note that reading the smaps file is expensive and reading it @@ -369,7 +477,7 @@ monitor how successfully the system is providing huge pages for use. thp_fault_alloc is incremented every time a huge page is successfully - allocated to handle a page fault. + allocated and charged to handle a page fault. thp_collapse_alloc is incremented by khugepaged when it has found @@ -377,7 +485,7 @@ thp_collapse_alloc successfully allocated a new huge page to store the data. thp_fault_fallback - is incremented if a page fault fails to allocate + is incremented if a page fault fails to allocate or charge a huge page and instead falls back to using small pages. thp_fault_fallback_charge @@ -391,20 +499,23 @@ thp_collapse_alloc_failed the allocation. thp_file_alloc - is incremented every time a file huge page is successfully - allocated. + is incremented every time a shmem huge page is successfully + allocated (Note that despite being named after "file", the counter + measures only shmem). thp_file_fallback - is incremented if a file huge page is attempted to be allocated - but fails and instead falls back to using small pages. + is incremented if a shmem huge page is attempted to be allocated + but fails and instead falls back to using small pages. (Note that + despite being named after "file", the counter measures only shmem). thp_file_fallback_charge - is incremented if a file huge page cannot be charged and instead + is incremented if a shmem huge page cannot be charged and instead falls back to using small pages even though the allocation was - successful. + successful. (Note that despite being named after "file", the + counter measures only shmem). thp_file_mapped - is incremented every time a file huge page is mapped into + is incremented every time a file or shmem huge page is mapped into user address space. thp_split_page @@ -423,6 +534,12 @@ thp_deferred_split_page splitting it would free up some memory. Pages on split queue are going to be split under memory pressure. +thp_underused_split_page + is incremented when a huge page on the split queue was split + because it was underused. A THP is underused if the number of + zero pages in the THP is above a certain threshold + (/sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_none). + thp_split_pmd is incremented every time a PMD split into table of PTEs. This can happen, for instance, when application calls mprotect() or @@ -447,6 +564,92 @@ thp_swpout_fallback Usually because failed to allocate some continuous swap space for the huge page. +In /sys/kernel/mm/transparent_hugepage/hugepages-<size>kB/stats, There are +also individual counters for each huge page size, which can be utilized to +monitor the system's effectiveness in providing huge pages for usage. Each +counter has its own corresponding file. + +anon_fault_alloc + is incremented every time a huge page is successfully + allocated and charged to handle a page fault. + +anon_fault_fallback + is incremented if a page fault fails to allocate or charge + a huge page and instead falls back to using huge pages with + lower orders or small pages. + +anon_fault_fallback_charge + is incremented if a page fault fails to charge a huge page and + instead falls back to using huge pages with lower orders or + small pages even though the allocation was successful. + +zswpout + is incremented every time a huge page is swapped out to zswap in one + piece without splitting. + +swpin + is incremented every time a huge page is swapped in from a non-zswap + swap device in one piece. + +swpin_fallback + is incremented if swapin fails to allocate or charge a huge page + and instead falls back to using huge pages with lower orders or + small pages. + +swpin_fallback_charge + is incremented if swapin fails to charge a huge page and instead + falls back to using huge pages with lower orders or small pages + even though the allocation was successful. + +swpout + is incremented every time a huge page is swapped out to a non-zswap + swap device in one piece without splitting. + +swpout_fallback + is incremented if a huge page has to be split before swapout. + Usually because failed to allocate some continuous swap space + for the huge page. + +shmem_alloc + is incremented every time a shmem huge page is successfully + allocated. + +shmem_fallback + is incremented if a shmem huge page is attempted to be allocated + but fails and instead falls back to using small pages. + +shmem_fallback_charge + is incremented if a shmem huge page cannot be charged and instead + falls back to using small pages even though the allocation was + successful. + +split + is incremented every time a huge page is successfully split into + smaller orders. This can happen for a variety of reasons but a + common reason is that a huge page is old and is being reclaimed. + +split_failed + is incremented if kernel fails to split huge + page. This can happen if the page was pinned by somebody. + +split_deferred + is incremented when a huge page is put onto split queue. + This happens when a huge page is partially unmapped and splitting + it would free up some memory. Pages on split queue are going to + be split under memory pressure, if splitting is possible. + +nr_anon + the number of anonymous THP we have in the whole system. These THPs + might be currently entirely mapped or have partially unmapped/unused + subpages. + +nr_anon_partially_mapped + the number of anonymous THP which are likely partially mapped, possibly + wasting memory, and have been queued for deferred memory reclamation. + Note that in corner some cases (e.g., failed migration), we might detect + an anonymous THP as "partially mapped" and count it here, even though it + is not actually partially mapped anymore. + As the system ages, allocating huge pages may be expensive as the system uses memory compaction to copy data around memory to free a huge page for use. There are some counters in ``/proc/vmstat`` to help diff --git a/Documentation/admin-guide/mm/zswap.rst b/Documentation/admin-guide/mm/zswap.rst index 13632671adae..3598dcd7dbe7 100644 --- a/Documentation/admin-guide/mm/zswap.rst +++ b/Documentation/admin-guide/mm/zswap.rst @@ -111,35 +111,6 @@ checked if it is a same-value filled page before compressing it. If true, the compressed length of the page is set to zero and the pattern or same-filled value is stored. -Same-value filled pages identification feature is enabled by default and can be -disabled at boot time by setting the ``same_filled_pages_enabled`` attribute -to 0, e.g. ``zswap.same_filled_pages_enabled=0``. It can also be enabled and -disabled at runtime using the sysfs ``same_filled_pages_enabled`` -attribute, e.g.:: - - echo 1 > /sys/module/zswap/parameters/same_filled_pages_enabled - -When zswap same-filled page identification is disabled at runtime, it will stop -checking for the same-value filled pages during store operation. -In other words, every page will be then considered non-same-value filled. -However, the existing pages which are marked as same-value filled pages remain -stored unchanged in zswap until they are either loaded or invalidated. - -In some circumstances it might be advantageous to make use of just the zswap -ability to efficiently store same-filled pages without enabling the whole -compressed page storage. -In this case the handling of non-same-value pages by zswap (enabled by default) -can be disabled by setting the ``non_same_filled_pages_enabled`` attribute -to 0, e.g. ``zswap.non_same_filled_pages_enabled=0``. -It can also be enabled and disabled at runtime using the sysfs -``non_same_filled_pages_enabled`` attribute, e.g.:: - - echo 1 > /sys/module/zswap/parameters/non_same_filled_pages_enabled - -Disabling both ``zswap.same_filled_pages_enabled`` and -``zswap.non_same_filled_pages_enabled`` effectively disables accepting any new -pages by zswap. - To prevent zswap from shrinking pool when zswap is full and there's a high pressure on swap (this will result in flipping pages in and out zswap pool without any real benefit but with a performance drop for the system), a diff --git a/Documentation/admin-guide/nvme-multipath.rst b/Documentation/admin-guide/nvme-multipath.rst new file mode 100644 index 000000000000..97ca1ccef459 --- /dev/null +++ b/Documentation/admin-guide/nvme-multipath.rst @@ -0,0 +1,72 @@ +.. SPDX-License-Identifier: GPL-2.0 + +==================== +Linux NVMe multipath +==================== + +This document describes NVMe multipath and its path selection policies supported +by the Linux NVMe host driver. + + +Introduction +============ + +The NVMe multipath feature in Linux integrates namespaces with the same +identifier into a single block device. Using multipath enhances the reliability +and stability of I/O access while improving bandwidth performance. When a user +sends I/O to this merged block device, the multipath mechanism selects one of +the underlying block devices (paths) according to the configured policy. +Different policies result in different path selections. + + +Policies +======== + +All policies follow the ANA (Asymmetric Namespace Access) mechanism, meaning +that when an optimized path is available, it will be chosen over a non-optimized +one. Current the NVMe multipath policies include numa(default), round-robin and +queue-depth. + +To set the desired policy (e.g., round-robin), use one of the following methods: + 1. echo -n "round-robin" > /sys/module/nvme_core/parameters/iopolicy + 2. or add the "nvme_core.iopolicy=round-robin" to cmdline. + + +NUMA +---- + +The NUMA policy selects the path closest to the NUMA node of the current CPU for +I/O distribution. This policy maintains the nearest paths to each NUMA node +based on network interface connections. + +When to use the NUMA policy: + 1. Multi-core Systems: Optimizes memory access in multi-core and + multi-processor systems, especially under NUMA architecture. + 2. High Affinity Workloads: Binds I/O processing to the CPU to reduce + communication and data transfer delays across nodes. + + +Round-Robin +----------- + +The round-robin policy distributes I/O requests evenly across all paths to +enhance throughput and resource utilization. Each I/O operation is sent to the +next path in sequence. + +When to use the round-robin policy: + 1. Balanced Workloads: Effective for balanced and predictable workloads with + similar I/O size and type. + 2. Homogeneous Path Performance: Utilizes all paths efficiently when + performance characteristics (e.g., latency, bandwidth) are similar. + + +Queue-Depth +----------- + +The queue-depth policy manages I/O requests based on the current queue depth +of each path, selecting the path with the least number of in-flight I/Os. + +When to use the queue-depth policy: + 1. High load with small I/Os: Effectively balances load across paths when + the load is high, and I/O operations consist of small, relatively + fixed-sized requests. diff --git a/Documentation/admin-guide/perf/arm-ni.rst b/Documentation/admin-guide/perf/arm-ni.rst new file mode 100644 index 000000000000..d26a8f697c36 --- /dev/null +++ b/Documentation/admin-guide/perf/arm-ni.rst @@ -0,0 +1,17 @@ +==================================== +Arm Network-on Chip Interconnect PMU +==================================== + +NI-700 and friends implement a distinct PMU for each clock domain within the +interconnect. Correspondingly, the driver exposes multiple PMU devices named +arm_ni_<x>_cd_<y>, where <x> is an (arbitrary) instance identifier and <y> is +the clock domain ID within that particular instance. If multiple NI instances +exist within a system, the PMU devices can be correlated with the underlying +hardware instance via sysfs parentage. + +Each PMU exposes base event aliases for the interface types present in its clock +domain. These require qualifying with the "eventid" and "nodeid" parameters +to specify the event code to count and the interface at which to count it +(per the configured hardware ID as reflected in the xxNI_NODE_INFO register). +The exception is the "cycles" alias for the PMU cycle counter, which is encoded +with the PMU node type and needs no further qualification. diff --git a/Documentation/admin-guide/perf/dwc_pcie_pmu.rst b/Documentation/admin-guide/perf/dwc_pcie_pmu.rst index d47cd229d710..cb376f335f40 100644 --- a/Documentation/admin-guide/perf/dwc_pcie_pmu.rst +++ b/Documentation/admin-guide/perf/dwc_pcie_pmu.rst @@ -46,41 +46,41 @@ Some of the events only exist for specific configurations. DesignWare Cores (DWC) PCIe PMU Driver ======================================= -This driver adds PMU devices for each PCIe Root Port named based on the BDF of +This driver adds PMU devices for each PCIe Root Port named based on the SBDF of the Root Port. For example, - 30:03.0 PCI bridge: Device 1ded:8000 (rev 01) + 0001:30:03.0 PCI bridge: Device 1ded:8000 (rev 01) -the PMU device name for this Root Port is dwc_rootport_3018. +the PMU device name for this Root Port is dwc_rootport_13018. The DWC PCIe PMU driver registers a perf PMU driver, which provides description of available events and configuration options in sysfs, see -/sys/bus/event_source/devices/dwc_rootport_{bdf}. +/sys/bus/event_source/devices/dwc_rootport_{sbdf}. The "format" directory describes format of the config fields of the perf_event_attr structure. The "events" directory provides configuration templates for all documented events. For example, -"Rx_PCIe_TLP_Data_Payload" is an equivalent of "eventid=0x22,type=0x1". +"rx_pcie_tlp_data_payload" is an equivalent of "eventid=0x21,type=0x0". The "perf list" command shall list the available events from sysfs, e.g.:: $# perf list | grep dwc_rootport <...> - dwc_rootport_3018/Rx_PCIe_TLP_Data_Payload/ [Kernel PMU event] + dwc_rootport_13018/Rx_PCIe_TLP_Data_Payload/ [Kernel PMU event] <...> - dwc_rootport_3018/rx_memory_read,lane=?/ [Kernel PMU event] + dwc_rootport_13018/rx_memory_read,lane=?/ [Kernel PMU event] Time Based Analysis Event Usage ------------------------------- Example usage of counting PCIe RX TLP data payload (Units of bytes):: - $# perf stat -a -e dwc_rootport_3018/Rx_PCIe_TLP_Data_Payload/ + $# perf stat -a -e dwc_rootport_13018/Rx_PCIe_TLP_Data_Payload/ The average RX/TX bandwidth can be calculated using the following formula: - PCIe RX Bandwidth = Rx_PCIe_TLP_Data_Payload / Measure_Time_Window - PCIe TX Bandwidth = Tx_PCIe_TLP_Data_Payload / Measure_Time_Window + PCIe RX Bandwidth = rx_pcie_tlp_data_payload / Measure_Time_Window + PCIe TX Bandwidth = tx_pcie_tlp_data_payload / Measure_Time_Window Lane Event Usage ------------------------------- @@ -88,7 +88,7 @@ Lane Event Usage Each lane has the same event set and to avoid generating a list of hundreds of events, the user need to specify the lane ID explicitly, e.g.:: - $# perf stat -a -e dwc_rootport_3018/rx_memory_read,lane=4/ + $# perf stat -a -e dwc_rootport_13018/rx_memory_read,lane=4/ The driver does not support sampling, therefore "perf record" will not work. Per-task (without "-a") perf sessions are not supported. diff --git a/Documentation/admin-guide/perf/hisi-pcie-pmu.rst b/Documentation/admin-guide/perf/hisi-pcie-pmu.rst index 5541ff40e06a..083ca50de896 100644 --- a/Documentation/admin-guide/perf/hisi-pcie-pmu.rst +++ b/Documentation/admin-guide/perf/hisi-pcie-pmu.rst @@ -28,7 +28,9 @@ The "identifier" sysfs file allows users to identify the version of the PMU hardware device. The "bus" sysfs file allows users to get the bus number of Root Ports -monitored by PMU. +monitored by PMU. Furthermore users can get the Root Ports range in +[bdf_min, bdf_max] from "bdf_min" and "bdf_max" sysfs attributes +respectively. Example usage of perf:: diff --git a/Documentation/admin-guide/perf/hisi-pmu.rst b/Documentation/admin-guide/perf/hisi-pmu.rst index e0174d20809a..48992a0b8e94 100644 --- a/Documentation/admin-guide/perf/hisi-pmu.rst +++ b/Documentation/admin-guide/perf/hisi-pmu.rst @@ -20,7 +20,6 @@ interrupt, and the PMU driver shall register perf PMU drivers like L3C, HHA and DDRC etc. The available events and configuration options shall be described in the sysfs, see: -/sys/devices/hisi_sccl{X}_<l3c{Y}/hha{Y}/ddrc{Y}>/, or /sys/bus/event_source/devices/hisi_sccl{X}_<l3c{Y}/hha{Y}/ddrc{Y}>. The "perf list" command shall list the available events from sysfs. @@ -36,7 +35,10 @@ e.g. hisi_sccl1_hha0/rx_operations is RX_OPERATIONS event of HHA index #0 in SCCL ID #1. The driver also provides a "cpumask" sysfs attribute, which shows the CPU core -ID used to count the uncore PMU event. +ID used to count the uncore PMU event. An "associated_cpus" sysfs attribute is +also provided to show the CPUs associated with this PMU. The "cpumask" indicates +the CPUs to open the events, usually as a hint for userspaces tools like perf. +It only contains one associated CPU from the "associated_cpus". Example usage of perf:: diff --git a/Documentation/admin-guide/perf/hns3-pmu.rst b/Documentation/admin-guide/perf/hns3-pmu.rst index 75a40846d47f..1195e570f2d6 100644 --- a/Documentation/admin-guide/perf/hns3-pmu.rst +++ b/Documentation/admin-guide/perf/hns3-pmu.rst @@ -16,7 +16,7 @@ HNS3 PMU driver The HNS3 PMU driver registers a perf PMU with the name of its sicl id.:: - /sys/devices/hns3_pmu_sicl_<sicl_id> + /sys/bus/event_source/devices/hns3_pmu_sicl_<sicl_id> PMU driver provides description of available events, filter modes, format, identifier and cpumask in sysfs. @@ -40,9 +40,9 @@ device. Example usage of checking event code and subevent code:: - $# cat /sys/devices/hns3_pmu_sicl_0/events/dly_tx_normal_to_mac_time + $# cat /sys/bus/event_source/devices/hns3_pmu_sicl_0/events/dly_tx_normal_to_mac_time config=0x00204 - $# cat /sys/devices/hns3_pmu_sicl_0/events/dly_tx_normal_to_mac_packet_num + $# cat /sys/bus/event_source/devices/hns3_pmu_sicl_0/events/dly_tx_normal_to_mac_packet_num config=0x10204 Each performance statistic has a pair of events to get two values to @@ -60,7 +60,7 @@ computation to calculate real performance data is::: Example usage of checking supported filter mode:: - $# cat /sys/devices/hns3_pmu_sicl_0/filtermode/bw_ssu_rpu_byte_num + $# cat /sys/bus/event_source/devices/hns3_pmu_sicl_0/filtermode/bw_ssu_rpu_byte_num filter mode supported: global/port/port-tc/func/func-queue/ Example usage of perf:: diff --git a/Documentation/admin-guide/perf/index.rst b/Documentation/admin-guide/perf/index.rst index 7eb3dcd6f4da..072b510385c4 100644 --- a/Documentation/admin-guide/perf/index.rst +++ b/Documentation/admin-guide/perf/index.rst @@ -14,8 +14,11 @@ Performance monitor support qcom_l2_pmu qcom_l3_pmu starfive_starlink_pmu + mrvl-odyssey-ddr-pmu + mrvl-odyssey-tad-pmu arm-ccn arm-cmn + arm-ni xgene-pmu arm_dsu_pmu thunderx2-pmu @@ -25,3 +28,4 @@ Performance monitor support meson-ddr-pmu cxl ampere_cspmu + mrvl-pem-pmu diff --git a/Documentation/admin-guide/perf/mrvl-odyssey-ddr-pmu.rst b/Documentation/admin-guide/perf/mrvl-odyssey-ddr-pmu.rst new file mode 100644 index 000000000000..2e817593a4d9 --- /dev/null +++ b/Documentation/admin-guide/perf/mrvl-odyssey-ddr-pmu.rst @@ -0,0 +1,80 @@ +=================================================================== +Marvell Odyssey DDR PMU Performance Monitoring Unit (PMU UNCORE) +=================================================================== + +Odyssey DRAM Subsystem supports eight counters for monitoring performance +and software can program those counters to monitor any of the defined +performance events. Supported performance events include those counted +at the interface between the DDR controller and the PHY, interface between +the DDR Controller and the CHI interconnect, or within the DDR Controller. + +Additionally DSS also supports two fixed performance event counters, one +for ddr reads and the other for ddr writes. + +The counter will be operating in either manual or auto mode. + +The PMU driver exposes the available events and format options under sysfs:: + + /sys/bus/event_source/devices/mrvl_ddr_pmu_<>/events/ + /sys/bus/event_source/devices/mrvl_ddr_pmu_<>/format/ + +Examples:: + + $ perf list | grep ddr + mrvl_ddr_pmu_<>/ddr_act_bypass_access/ [Kernel PMU event] + mrvl_ddr_pmu_<>/ddr_bsm_alloc/ [Kernel PMU event] + mrvl_ddr_pmu_<>/ddr_bsm_starvation/ [Kernel PMU event] + mrvl_ddr_pmu_<>/ddr_cam_active_access/ [Kernel PMU event] + mrvl_ddr_pmu_<>/ddr_cam_mwr/ [Kernel PMU event] + mrvl_ddr_pmu_<>/ddr_cam_rd_active_access/ [Kernel PMU event] + mrvl_ddr_pmu_<>/ddr_cam_rd_or_wr_access/ [Kernel PMU event] + mrvl_ddr_pmu_<>/ddr_cam_read/ [Kernel PMU event] + mrvl_ddr_pmu_<>/ddr_cam_wr_access/ [Kernel PMU event] + mrvl_ddr_pmu_<>/ddr_cam_write/ [Kernel PMU event] + mrvl_ddr_pmu_<>/ddr_capar_error/ [Kernel PMU event] + mrvl_ddr_pmu_<>/ddr_crit_ref/ [Kernel PMU event] + mrvl_ddr_pmu_<>/ddr_ddr_reads/ [Kernel PMU event] + mrvl_ddr_pmu_<>/ddr_ddr_writes/ [Kernel PMU event] + mrvl_ddr_pmu_<>/ddr_dfi_cmd_is_retry/ [Kernel PMU event] + mrvl_ddr_pmu_<>/ddr_dfi_cycles/ [Kernel PMU event] + mrvl_ddr_pmu_<>/ddr_dfi_parity_poison/ [Kernel PMU event] + mrvl_ddr_pmu_<>/ddr_dfi_rd_data_access/ [Kernel PMU event] + mrvl_ddr_pmu_<>/ddr_dfi_wr_data_access/ [Kernel PMU event] + mrvl_ddr_pmu_<>/ddr_dqsosc_mpc/ [Kernel PMU event] + mrvl_ddr_pmu_<>/ddr_dqsosc_mrr/ [Kernel PMU event] + mrvl_ddr_pmu_<>/ddr_enter_mpsm/ [Kernel PMU event] + mrvl_ddr_pmu_<>/ddr_enter_powerdown/ [Kernel PMU event] + mrvl_ddr_pmu_<>/ddr_enter_selfref/ [Kernel PMU event] + mrvl_ddr_pmu_<>/ddr_hif_pri_rdaccess/ [Kernel PMU event] + mrvl_ddr_pmu_<>/ddr_hif_rd_access/ [Kernel PMU event] + mrvl_ddr_pmu_<>/ddr_hif_rd_or_wr_access/ [Kernel PMU event] + mrvl_ddr_pmu_<>/ddr_hif_rmw_access/ [Kernel PMU event] + mrvl_ddr_pmu_<>/ddr_hif_wr_access/ [Kernel PMU event] + mrvl_ddr_pmu_<>/ddr_hpri_sched_rd_crit_access/ [Kernel PMU event] + mrvl_ddr_pmu_<>/ddr_load_mode/ [Kernel PMU event] + mrvl_ddr_pmu_<>/ddr_lpri_sched_rd_crit_access/ [Kernel PMU event] + mrvl_ddr_pmu_<>/ddr_precharge/ [Kernel PMU event] + mrvl_ddr_pmu_<>/ddr_precharge_for_other/ [Kernel PMU event] + mrvl_ddr_pmu_<>/ddr_precharge_for_rdwr/ [Kernel PMU event] + mrvl_ddr_pmu_<>/ddr_raw_hazard/ [Kernel PMU event] + mrvl_ddr_pmu_<>/ddr_rd_bypass_access/ [Kernel PMU event] + mrvl_ddr_pmu_<>/ddr_rd_crc_error/ [Kernel PMU event] + mrvl_ddr_pmu_<>/ddr_rd_uc_ecc_error/ [Kernel PMU event] + mrvl_ddr_pmu_<>/ddr_rdwr_transitions/ [Kernel PMU event] + mrvl_ddr_pmu_<>/ddr_refresh/ [Kernel PMU event] + mrvl_ddr_pmu_<>/ddr_retry_fifo_full/ [Kernel PMU event] + mrvl_ddr_pmu_<>/ddr_spec_ref/ [Kernel PMU event] + mrvl_ddr_pmu_<>/ddr_tcr_mrr/ [Kernel PMU event] + mrvl_ddr_pmu_<>/ddr_war_hazard/ [Kernel PMU event] + mrvl_ddr_pmu_<>/ddr_waw_hazard/ [Kernel PMU event] + mrvl_ddr_pmu_<>/ddr_win_limit_reached_rd/ [Kernel PMU event] + mrvl_ddr_pmu_<>/ddr_win_limit_reached_wr/ [Kernel PMU event] + mrvl_ddr_pmu_<>/ddr_wr_crc_error/ [Kernel PMU event] + mrvl_ddr_pmu_<>/ddr_wr_trxn_crit_access/ [Kernel PMU event] + mrvl_ddr_pmu_<>/ddr_write_combine/ [Kernel PMU event] + mrvl_ddr_pmu_<>/ddr_zqcl/ [Kernel PMU event] + mrvl_ddr_pmu_<>/ddr_zqlatch/ [Kernel PMU event] + mrvl_ddr_pmu_<>/ddr_zqstart/ [Kernel PMU event] + + $ perf stat -e ddr_cam_read,ddr_cam_write,ddr_cam_active_access,ddr_cam + rd_or_wr_access,ddr_cam_rd_active_access,ddr_cam_mwr <workload> diff --git a/Documentation/admin-guide/perf/mrvl-odyssey-tad-pmu.rst b/Documentation/admin-guide/perf/mrvl-odyssey-tad-pmu.rst new file mode 100644 index 000000000000..ad1975b14087 --- /dev/null +++ b/Documentation/admin-guide/perf/mrvl-odyssey-tad-pmu.rst @@ -0,0 +1,37 @@ +==================================================================== +Marvell Odyssey LLC-TAD Performance Monitoring Unit (PMU UNCORE) +==================================================================== + +Each TAD provides eight 64-bit counters for monitoring +cache behavior.The driver always configures the same counter for +all the TADs. The user would end up effectively reserving one of +eight counters in every TAD to look across all TADs. +The occurrences of events are aggregated and presented to the user +at the end of running the workload. The driver does not provide a +way for the user to partition TADs so that different TADs are used for +different applications. + +The performance events reflect various internal or interface activities. +By combining the values from multiple performance counters, cache +performance can be measured in terms such as: cache miss rate, cache +allocations, interface retry rate, internal resource occupancy, etc. + +The PMU driver exposes the available events and format options under sysfs:: + + /sys/bus/event_source/devices/tad/events/ + /sys/bus/event_source/devices/tad/format/ + +Examples:: + + $ perf list | grep tad + tad/tad_alloc_any/ [Kernel PMU event] + tad/tad_alloc_dtg/ [Kernel PMU event] + tad/tad_alloc_ltg/ [Kernel PMU event] + tad/tad_hit_any/ [Kernel PMU event] + tad/tad_hit_dtg/ [Kernel PMU event] + tad/tad_hit_ltg/ [Kernel PMU event] + tad/tad_req_msh_in_exlmn/ [Kernel PMU event] + tad/tad_tag_rd/ [Kernel PMU event] + tad/tad_tot_cycle/ [Kernel PMU event] + + $ perf stat -e tad_alloc_dtg,tad_alloc_ltg,tad_alloc_any,tad_hit_dtg,tad_hit_ltg,tad_hit_any,tad_tag_rd <workload> diff --git a/Documentation/admin-guide/perf/mrvl-pem-pmu.rst b/Documentation/admin-guide/perf/mrvl-pem-pmu.rst new file mode 100644 index 000000000000..c39007149b97 --- /dev/null +++ b/Documentation/admin-guide/perf/mrvl-pem-pmu.rst @@ -0,0 +1,56 @@ +================================================================= +Marvell Odyssey PEM Performance Monitoring Unit (PMU UNCORE) +================================================================= + +The PCI Express Interface Units(PEM) are associated with a corresponding +monitoring unit. This includes performance counters to track various +characteristics of the data that is transmitted over the PCIe link. + +The counters track inbound and outbound transactions which +includes separate counters for posted/non-posted/completion TLPs. +Also, inbound and outbound memory read requests along with their +latencies can also be monitored. Address Translation Services(ATS)events +such as ATS Translation, ATS Page Request, ATS Invalidation along with +their corresponding latencies are also tracked. + +There are separate 64 bit counters to measure posted/non-posted/completion +tlps in inbound and outbound transactions. ATS events are measured by +different counters. + +The PMU driver exposes the available events and format options under sysfs, +/sys/bus/event_source/devices/mrvl_pcie_rc_pmu_<>/events/ +/sys/bus/event_source/devices/mrvl_pcie_rc_pmu_<>/format/ + +Examples:: + + # perf list | grep mrvl_pcie_rc_pmu + mrvl_pcie_rc_pmu_<>/ats_inv/ [Kernel PMU event] + mrvl_pcie_rc_pmu_<>/ats_inv_latency/ [Kernel PMU event] + mrvl_pcie_rc_pmu_<>/ats_pri/ [Kernel PMU event] + mrvl_pcie_rc_pmu_<>/ats_pri_latency/ [Kernel PMU event] + mrvl_pcie_rc_pmu_<>/ats_trans/ [Kernel PMU event] + mrvl_pcie_rc_pmu_<>/ats_trans_latency/ [Kernel PMU event] + mrvl_pcie_rc_pmu_<>/ib_inflight/ [Kernel PMU event] + mrvl_pcie_rc_pmu_<>/ib_reads/ [Kernel PMU event] + mrvl_pcie_rc_pmu_<>/ib_req_no_ro_ebus/ [Kernel PMU event] + mrvl_pcie_rc_pmu_<>/ib_req_no_ro_ncb/ [Kernel PMU event] + mrvl_pcie_rc_pmu_<>/ib_tlp_cpl_partid/ [Kernel PMU event] + mrvl_pcie_rc_pmu_<>/ib_tlp_dwords_cpl_partid/ [Kernel PMU event] + mrvl_pcie_rc_pmu_<>/ib_tlp_dwords_npr/ [Kernel PMU event] + mrvl_pcie_rc_pmu_<>/ib_tlp_dwords_pr/ [Kernel PMU event] + mrvl_pcie_rc_pmu_<>/ib_tlp_npr/ [Kernel PMU event] + mrvl_pcie_rc_pmu_<>/ib_tlp_pr/ [Kernel PMU event] + mrvl_pcie_rc_pmu_<>/ob_inflight_partid/ [Kernel PMU event] + mrvl_pcie_rc_pmu_<>/ob_merges_cpl_partid/ [Kernel PMU event] + mrvl_pcie_rc_pmu_<>/ob_merges_npr_partid/ [Kernel PMU event] + mrvl_pcie_rc_pmu_<>/ob_merges_pr_partid/ [Kernel PMU event] + mrvl_pcie_rc_pmu_<>/ob_reads_partid/ [Kernel PMU event] + mrvl_pcie_rc_pmu_<>/ob_tlp_cpl_partid/ [Kernel PMU event] + mrvl_pcie_rc_pmu_<>/ob_tlp_dwords_cpl_partid/ [Kernel PMU event] + mrvl_pcie_rc_pmu_<>/ob_tlp_dwords_npr_partid/ [Kernel PMU event] + mrvl_pcie_rc_pmu_<>/ob_tlp_dwords_pr_partid/ [Kernel PMU event] + mrvl_pcie_rc_pmu_<>/ob_tlp_npr_partid/ [Kernel PMU event] + mrvl_pcie_rc_pmu_<>/ob_tlp_pr_partid/ [Kernel PMU event] + + + # perf stat -e ib_inflight,ib_reads,ib_req_no_ro_ebus,ib_req_no_ro_ncb <workload> diff --git a/Documentation/admin-guide/perf/nvidia-pmu.rst b/Documentation/admin-guide/perf/nvidia-pmu.rst index 2e0d47cfe7ea..f538ef67e0e8 100644 --- a/Documentation/admin-guide/perf/nvidia-pmu.rst +++ b/Documentation/admin-guide/perf/nvidia-pmu.rst @@ -34,7 +34,7 @@ strongly-ordered (SO) PCIE write traffic to local/remote memory. Please see traffic coverage. The events and configuration options of this PMU device are described in sysfs, -see /sys/bus/event_sources/devices/nvidia_scf_pmu_<socket-id>. +see /sys/bus/event_source/devices/nvidia_scf_pmu_<socket-id>. Example usage: @@ -66,7 +66,7 @@ Please see :ref:`NVIDIA_Uncore_PMU_Traffic_Coverage_Section` for more info about the PMU traffic coverage. The events and configuration options of this PMU device are described in sysfs, -see /sys/bus/event_sources/devices/nvidia_nvlink_c2c0_pmu_<socket-id>. +see /sys/bus/event_source/devices/nvidia_nvlink_c2c0_pmu_<socket-id>. Example usage: @@ -86,6 +86,22 @@ Example usage: perf stat -a -e nvidia_nvlink_c2c0_pmu_3/event=0x0/ +The NVLink-C2C has two ports that can be connected to one GPU (occupying both +ports) or to two GPUs (one GPU per port). The user can use "port" bitmap +parameter to select the port(s) to monitor. Each bit represents the port number, +e.g. "port=0x1" corresponds to port 0 and "port=0x3" is for port 0 and 1. The +PMU will monitor both ports by default if not specified. + +Example for port filtering: + +* Count event id 0x0 from the GPU connected with socket 0 on port 0:: + + perf stat -a -e nvidia_nvlink_c2c0_pmu_0/event=0x0,port=0x1/ + +* Count event id 0x0 from the GPUs connected with socket 0 on port 0 and port 1:: + + perf stat -a -e nvidia_nvlink_c2c0_pmu_0/event=0x0,port=0x3/ + NVLink-C2C1 PMU ------------------- @@ -96,7 +112,7 @@ Please see :ref:`NVIDIA_Uncore_PMU_Traffic_Coverage_Section` for more info about the PMU traffic coverage. The events and configuration options of this PMU device are described in sysfs, -see /sys/bus/event_sources/devices/nvidia_nvlink_c2c1_pmu_<socket-id>. +see /sys/bus/event_source/devices/nvidia_nvlink_c2c1_pmu_<socket-id>. Example usage: @@ -116,6 +132,22 @@ Example usage: perf stat -a -e nvidia_nvlink_c2c1_pmu_3/event=0x0/ +The NVLink-C2C has two ports that can be connected to one GPU (occupying both +ports) or to two GPUs (one GPU per port). The user can use "port" bitmap +parameter to select the port(s) to monitor. Each bit represents the port number, +e.g. "port=0x1" corresponds to port 0 and "port=0x3" is for port 0 and 1. The +PMU will monitor both ports by default if not specified. + +Example for port filtering: + +* Count event id 0x0 from the GPU connected with socket 0 on port 0:: + + perf stat -a -e nvidia_nvlink_c2c1_pmu_0/event=0x0,port=0x1/ + +* Count event id 0x0 from the GPUs connected with socket 0 on port 0 and port 1:: + + perf stat -a -e nvidia_nvlink_c2c1_pmu_0/event=0x0,port=0x3/ + CNVLink PMU --------------- @@ -125,13 +157,14 @@ to local memory. For PCIE traffic, this PMU captures read and relaxed ordered for more info about the PMU traffic coverage. The events and configuration options of this PMU device are described in sysfs, -see /sys/bus/event_sources/devices/nvidia_cnvlink_pmu_<socket-id>. +see /sys/bus/event_source/devices/nvidia_cnvlink_pmu_<socket-id>. Each SoC socket can be connected to one or more sockets via CNVLink. The user can use "rem_socket" bitmap parameter to select the remote socket(s) to monitor. Each bit represents the socket number, e.g. "rem_socket=0xE" corresponds to -socket 1 to 3. -/sys/bus/event_sources/devices/nvidia_cnvlink_pmu_<socket-id>/format/rem_socket +socket 1 to 3. The PMU will monitor all remote sockets by default if not +specified. +/sys/bus/event_source/devices/nvidia_cnvlink_pmu_<socket-id>/format/rem_socket shows the valid bits that can be set in the "rem_socket" parameter. The PMU can not distinguish the remote traffic initiator, therefore it does not @@ -165,12 +198,13 @@ local/remote memory. Please see :ref:`NVIDIA_Uncore_PMU_Traffic_Coverage_Section for more info about the PMU traffic coverage. The events and configuration options of this PMU device are described in sysfs, -see /sys/bus/event_sources/devices/nvidia_pcie_pmu_<socket-id>. +see /sys/bus/event_source/devices/nvidia_pcie_pmu_<socket-id>. Each SoC socket can support multiple root ports. The user can use "root_port" bitmap parameter to select the port(s) to monitor, i.e. -"root_port=0xF" corresponds to root port 0 to 3. -/sys/bus/event_sources/devices/nvidia_pcie_pmu_<socket-id>/format/root_port +"root_port=0xF" corresponds to root port 0 to 3. The PMU will monitor all root +ports by default if not specified. +/sys/bus/event_source/devices/nvidia_pcie_pmu_<socket-id>/format/root_port shows the valid bits that can be set in the "root_port" parameter. Example usage: diff --git a/Documentation/admin-guide/perf/qcom_l2_pmu.rst b/Documentation/admin-guide/perf/qcom_l2_pmu.rst index c130178a4a55..c37c6be9b8d8 100644 --- a/Documentation/admin-guide/perf/qcom_l2_pmu.rst +++ b/Documentation/admin-guide/perf/qcom_l2_pmu.rst @@ -10,7 +10,7 @@ There is one logical L2 PMU exposed, which aggregates the results from the physical PMUs. The driver provides a description of its available events and configuration -options in sysfs, see /sys/devices/l2cache_0. +options in sysfs, see /sys/bus/event_source/devices/l2cache_0. The "format" directory describes the format of the events. diff --git a/Documentation/admin-guide/perf/qcom_l3_pmu.rst b/Documentation/admin-guide/perf/qcom_l3_pmu.rst index a3d014a46bfd..a66556b7e985 100644 --- a/Documentation/admin-guide/perf/qcom_l3_pmu.rst +++ b/Documentation/admin-guide/perf/qcom_l3_pmu.rst @@ -9,7 +9,7 @@ PMU with device name l3cache_<socket>_<instance>. User space is responsible for aggregating across slices. The driver provides a description of its available events and configuration -options in sysfs, see /sys/devices/l3cache*. Given that these are uncore PMUs +options in sysfs, see /sys/bus/event_source/devices/l3cache*. Given that these are uncore PMUs the driver also exposes a "cpumask" sysfs attribute which contains a mask consisting of one CPU per socket which will be used to handle all the PMU events on that socket. diff --git a/Documentation/admin-guide/perf/thunderx2-pmu.rst b/Documentation/admin-guide/perf/thunderx2-pmu.rst index 01f158238ae1..9255f7bf9452 100644 --- a/Documentation/admin-guide/perf/thunderx2-pmu.rst +++ b/Documentation/admin-guide/perf/thunderx2-pmu.rst @@ -22,7 +22,7 @@ The thunderx2_pmu driver registers per-socket perf PMUs for the DMC and L3C devices. Each PMU can be used to count up to 4 (DMC/L3C) or up to 8 (CCPI2) events simultaneously. The PMUs provide a description of their available events and configuration options under sysfs, see -/sys/devices/uncore_<l3c_S/dmc_S/ccpi2_S/>; S is the socket id. +/sys/bus/event_source/devices/uncore_<l3c_S/dmc_S/ccpi2_S/>; S is the socket id. The driver does not support sampling, therefore "perf record" will not work. Per-task perf sessions are also not supported. diff --git a/Documentation/admin-guide/perf/xgene-pmu.rst b/Documentation/admin-guide/perf/xgene-pmu.rst index 644f8ed89152..98ccb8e777c4 100644 --- a/Documentation/admin-guide/perf/xgene-pmu.rst +++ b/Documentation/admin-guide/perf/xgene-pmu.rst @@ -13,7 +13,7 @@ PMU (perf) driver The xgene-pmu driver registers several perf PMU drivers. Each of the perf driver provides description of its available events and configuration options -in sysfs, see /sys/devices/<l3cX/iobX/mcbX/mcX>/. +in sysfs, see /sys/bus/event_source/devices/<l3cX/iobX/mcbX/mcX>/. The "format" directory describes format of the config (event ID), config1 (agent ID) fields of the perf_event_attr structure. The "events" diff --git a/Documentation/admin-guide/pm/amd-pstate.rst b/Documentation/admin-guide/pm/amd-pstate.rst index 1e0d101b020a..412423c54f25 100644 --- a/Documentation/admin-guide/pm/amd-pstate.rst +++ b/Documentation/admin-guide/pm/amd-pstate.rst @@ -262,6 +262,17 @@ lowest non-linear performance in `AMD CPPC Performance Capability <perf_cap_>`_.) This attribute is read-only. +``amd_pstate_hw_prefcore`` + +Whether the platform supports the preferred core feature and it has been +enabled. This attribute is read-only. + +``amd_pstate_prefcore_ranking`` + +The performance ranking of the core. This number doesn't have any unit, but +larger numbers are preferred at the time of reading. This can change at +runtime based on platform conditions. This attribute is read-only. + ``energy_performance_available_preferences`` A list of all the supported EPP preferences that could be used for @@ -281,6 +292,22 @@ integer values defined between 0 to 255 when EPP feature is enabled by platform firmware, if EPP feature is disabled, driver will ignore the written value This attribute is read-write. +``boost`` +The `boost` sysfs attribute provides control over the CPU core +performance boost, allowing users to manage the maximum frequency limitation +of the CPU. This attribute can be used to enable or disable the boost feature +on individual CPUs. + +When the boost feature is enabled, the CPU can dynamically increase its frequency +beyond the base frequency, providing enhanced performance for demanding workloads. +On the other hand, disabling the boost feature restricts the CPU to operate at the +base frequency, which may be desirable in certain scenarios to prioritize power +efficiency or manage temperature. + +To manipulate the `boost` attribute, users can write a value of `0` to disable the +boost or `1` to enable it, for the respective CPU using the sysfs path +`/sys/devices/system/cpu/cpuX/cpufreq/boost`, where `X` represents the CPU number. + Other performance and frequency values can be read back from ``/sys/devices/system/cpu/cpuX/acpi_cppc/``, see :ref:`cppc_sysfs`. @@ -406,7 +433,7 @@ control its functionality at the system level. They are located in the ``/sys/devices/system/cpu/amd_pstate/`` directory and affect all CPUs. ``status`` - Operation mode of the driver: "active", "passive" or "disable". + Operation mode of the driver: "active", "passive", "guided" or "disable". "active" The driver is functional and in the ``active mode`` diff --git a/Documentation/admin-guide/pm/cpufreq.rst b/Documentation/admin-guide/pm/cpufreq.rst index 6adb7988e0eb..a21369eba034 100644 --- a/Documentation/admin-guide/pm/cpufreq.rst +++ b/Documentation/admin-guide/pm/cpufreq.rst @@ -267,6 +267,10 @@ are the following: ``related_cpus`` List of all (online and offline) CPUs belonging to this policy. +``scaling_available_frequencies`` + List of available frequencies of the CPUs belonging to this policy + (in kHz). + ``scaling_available_governors`` List of ``CPUFreq`` scaling governors present in the kernel that can be attached to this policy or (if the |intel_pstate| scaling driver is @@ -421,8 +425,8 @@ This governor exposes only one tunable: ``rate_limit_us`` Minimum time (in microseconds) that has to pass between two consecutive - runs of governor computations (default: 1000 times the scaling driver's - transition latency). + runs of governor computations (default: 1.5 times the scaling driver's + transition latency or the maximum 2ms). The purpose of this tunable is to reduce the scheduler context overhead of the governor which might be excessive without it. @@ -470,17 +474,17 @@ This governor exposes the following tunables: This is how often the governor's worker routine should run, in microseconds. - Typically, it is set to values of the order of 10000 (10 ms). Its - default value is equal to the value of ``cpuinfo_transition_latency`` - for each policy this governor is attached to (but since the unit here - is greater by 1000, this means that the time represented by - ``sampling_rate`` is 1000 times greater than the transition latency by - default). + Typically, it is set to values of the order of 2000 (2 ms). Its + default value is to add a 50% breathing room + to ``cpuinfo_transition_latency`` on each policy this governor is + attached to. The minimum is typically the length of two scheduler + ticks. If this tunable is per-policy, the following shell command sets the time - represented by it to be 750 times as high as the transition latency:: + represented by it to be 1.5 times as high as the transition latency + (the default):: - # echo `$(($(cat cpuinfo_transition_latency) * 750 / 1000)) > ondemand/sampling_rate + # echo `$(($(cat cpuinfo_transition_latency) * 3 / 2)) > ondemand/sampling_rate ``up_threshold`` If the estimated CPU load is above this value (in percent), the governor diff --git a/Documentation/admin-guide/pm/cpuidle.rst b/Documentation/admin-guide/pm/cpuidle.rst index 19754beb5a4e..eb58d7a5affd 100644 --- a/Documentation/admin-guide/pm/cpuidle.rst +++ b/Documentation/admin-guide/pm/cpuidle.rst @@ -269,27 +269,7 @@ Namely, when invoked to select an idle state for a CPU (i.e. an idle state that the CPU will ask the processor hardware to enter), it attempts to predict the idle duration and uses the predicted value for idle state selection. -It first obtains the time until the closest timer event with the assumption -that the scheduler tick will be stopped. That time, referred to as the *sleep -length* in what follows, is the upper bound on the time before the next CPU -wakeup. It is used to determine the sleep length range, which in turn is needed -to get the sleep length correction factor. - -The ``menu`` governor maintains two arrays of sleep length correction factors. -One of them is used when tasks previously running on the given CPU are waiting -for some I/O operations to complete and the other one is used when that is not -the case. Each array contains several correction factor values that correspond -to different sleep length ranges organized so that each range represented in the -array is approximately 10 times wider than the previous one. - -The correction factor for the given sleep length range (determined before -selecting the idle state for the CPU) is updated after the CPU has been woken -up and the closer the sleep length is to the observed idle duration, the closer -to 1 the correction factor becomes (it must fall between 0 and 1 inclusive). -The sleep length is multiplied by the correction factor for the range that it -falls into to obtain the first approximation of the predicted idle duration. - -Next, the governor uses a simple pattern recognition algorithm to refine its +It first uses a simple pattern recognition algorithm to obtain a preliminary idle duration prediction. Namely, it saves the last 8 observed idle duration values and, when predicting the idle duration next time, it computes the average and variance of them. If the variance is small (smaller than 400 square @@ -301,29 +281,39 @@ Again, if the variance of them is small (in the above sense), the average is taken as the "typical interval" value and so on, until either the "typical interval" is determined or too many data points are disregarded, in which case the "typical interval" is assumed to equal "infinity" (the maximum unsigned -integer value). The "typical interval" computed this way is compared with the -sleep length multiplied by the correction factor and the minimum of the two is -taken as the predicted idle duration. - -Then, the governor computes an extra latency limit to help "interactive" -workloads. It uses the observation that if the exit latency of the selected -idle state is comparable with the predicted idle duration, the total time spent -in that state probably will be very short and the amount of energy to save by -entering it will be relatively small, so likely it is better to avoid the -overhead related to entering that state and exiting it. Thus selecting a -shallower state is likely to be a better option then. The first approximation -of the extra latency limit is the predicted idle duration itself which -additionally is divided by a value depending on the number of tasks that -previously ran on the given CPU and now they are waiting for I/O operations to -complete. The result of that division is compared with the latency limit coming -from the power management quality of service, or `PM QoS <cpu-pm-qos_>`_, -framework and the minimum of the two is taken as the limit for the idle states' -exit latency. +integer value). + +If the "typical interval" computed this way is long enough, the governor obtains +the time until the closest timer event with the assumption that the scheduler +tick will be stopped. That time, referred to as the *sleep length* in what follows, +is the upper bound on the time before the next CPU wakeup. It is used to determine +the sleep length range, which in turn is needed to get the sleep length correction +factor. + +The ``menu`` governor maintains an array containing several correction factor +values that correspond to different sleep length ranges organized so that each +range represented in the array is approximately 10 times wider than the previous +one. + +The correction factor for the given sleep length range (determined before +selecting the idle state for the CPU) is updated after the CPU has been woken +up and the closer the sleep length is to the observed idle duration, the closer +to 1 the correction factor becomes (it must fall between 0 and 1 inclusive). +The sleep length is multiplied by the correction factor for the range that it +falls into to obtain an approximation of the predicted idle duration that is +compared to the "typical interval" determined previously and the minimum of +the two is taken as the idle duration prediction. + +If the "typical interval" value is small, which means that the CPU is likely +to be woken up soon enough, the sleep length computation is skipped as it may +be costly and the idle duration is simply predicted to equal the "typical +interval" value. Now, the governor is ready to walk the list of idle states and choose one of them. For this purpose, it compares the target residency of each state with -the predicted idle duration and the exit latency of it with the computed latency -limit. It selects the state with the target residency closest to the predicted +the predicted idle duration and the exit latency of it with the with the latency +limit coming from the power management quality of service, or `PM QoS <cpu-pm-qos_>`_, +framework. It selects the state with the target residency closest to the predicted idle duration, but still below it, and exit latency that does not exceed the limit. diff --git a/Documentation/admin-guide/pm/intel_uncore_frequency_scaling.rst b/Documentation/admin-guide/pm/intel_uncore_frequency_scaling.rst index 5ab3440e6cee..5151ec312dc0 100644 --- a/Documentation/admin-guide/pm/intel_uncore_frequency_scaling.rst +++ b/Documentation/admin-guide/pm/intel_uncore_frequency_scaling.rst @@ -113,3 +113,62 @@ to apply at each uncore* level. Support for "current_freq_khz" is available only at each fabric cluster level (i.e., in uncore* directory). + +Efficiency vs. Latency Tradeoff +------------------------------- + +The Efficiency Latency Control (ELC) feature improves performance +per watt. With this feature hardware power management algorithms +optimize trade-off between latency and power consumption. For some +latency sensitive workloads further tuning can be done by SW to +get desired performance. + +The hardware monitors the average CPU utilization across all cores +in a power domain at regular intervals and decides an uncore frequency. +While this may result in the best performance per watt, workload may be +expecting higher performance at the expense of power. Consider an +application that intermittently wakes up to perform memory reads on an +otherwise idle system. In such cases, if hardware lowers uncore +frequency, then there may be delay in ramp up of frequency to meet +target performance. + +The ELC control defines some parameters which can be changed from SW. +If the average CPU utilization is below a user-defined threshold +(elc_low_threshold_percent attribute below), the user-defined uncore +floor frequency will be used (elc_floor_freq_khz attribute below) +instead of hardware calculated minimum. + +Similarly in high load scenario where the CPU utilization goes above +the high threshold value (elc_high_threshold_percent attribute below) +instead of jumping to maximum uncore frequency, frequency is increased +in 100MHz steps. This avoids consuming unnecessarily high power +immediately with CPU utilization spikes. + +Attributes for efficiency latency control: + +``elc_floor_freq_khz`` + This attribute is used to get/set the efficiency latency floor frequency. + If this variable is lower than the 'min_freq_khz', it is ignored by + the firmware. + +``elc_low_threshold_percent`` + This attribute is used to get/set the efficiency latency control low + threshold. This attribute is in percentages of CPU utilization. + +``elc_high_threshold_percent`` + This attribute is used to get/set the efficiency latency control high + threshold. This attribute is in percentages of CPU utilization. + +``elc_high_threshold_enable`` + This attribute is used to enable/disable the efficiency latency control + high threshold. Write '1' to enable, '0' to disable. + +Example system configuration below, which does following: + * when CPU utilization is less than 10%: sets uncore frequency to 800MHz + * when CPU utilization is higher than 95%: increases uncore frequency in + 100MHz steps, until power limit is reached + + elc_floor_freq_khz:800000 + elc_high_threshold_percent:95 + elc_high_threshold_enable:1 + elc_low_threshold_percent:10 diff --git a/Documentation/admin-guide/pmf.rst b/Documentation/admin-guide/pmf.rst deleted file mode 100644 index 9ee729ffc19b..000000000000 --- a/Documentation/admin-guide/pmf.rst +++ /dev/null @@ -1,24 +0,0 @@ -.. SPDX-License-Identifier: GPL-2.0 - -Set udev rules for PMF Smart PC Builder ---------------------------------------- - -AMD PMF(Platform Management Framework) Smart PC Solution builder has to set the system states -like S0i3, Screen lock, hibernate etc, based on the output actions provided by the PMF -TA (Trusted Application). - -In order for this to work the PMF driver generates a uevent for userspace to react to. Below are -sample udev rules that can facilitate this experience when a machine has PMF Smart PC solution builder -enabled. - -Please add the following line(s) to -``/etc/udev/rules.d/99-local.rules``:: - - DRIVERS=="amd-pmf", ACTION=="change", ENV{EVENT_ID}=="0", RUN+="/usr/bin/systemctl suspend" - DRIVERS=="amd-pmf", ACTION=="change", ENV{EVENT_ID}=="1", RUN+="/usr/bin/systemctl hibernate" - DRIVERS=="amd-pmf", ACTION=="change", ENV{EVENT_ID}=="2", RUN+="/bin/loginctl lock-sessions" - -EVENT_ID values: -0= Put the system to S0i3/S2Idle -1= Put the system to hibernate -2= Lock the screen diff --git a/Documentation/admin-guide/quickly-build-trimmed-linux.rst b/Documentation/admin-guide/quickly-build-trimmed-linux.rst index f08149bc53f8..07cfd8863b46 100644 --- a/Documentation/admin-guide/quickly-build-trimmed-linux.rst +++ b/Documentation/admin-guide/quickly-build-trimmed-linux.rst @@ -733,7 +733,7 @@ can easily happen that your self-built kernel will lack modules for tasks you did not perform before utilizing this make target. That's because those tasks require kernel modules that are normally autoloaded when you perform that task for the first time; if you didn't perform that task at least once before using -localmodonfig, the latter will thus assume these modules are superfluous and +localmodconfig, the latter will thus assume these modules are superfluous and disable them. You can try to avoid this by performing typical tasks that often will autoload diff --git a/Documentation/admin-guide/ramoops.rst b/Documentation/admin-guide/ramoops.rst index e9f85142182d..2eabef31220d 100644 --- a/Documentation/admin-guide/ramoops.rst +++ b/Documentation/admin-guide/ramoops.rst @@ -23,6 +23,8 @@ and type of the memory area are set using three variables: * ``mem_size`` for the size. The memory size will be rounded down to a power of two. * ``mem_type`` to specify if the memory type (default is pgprot_writecombine). + * ``mem_name`` to specify a memory region defined by ``reserve_mem`` command + line parameter. Typically the default value of ``mem_type=0`` should be used as that sets the pstore mapping to pgprot_writecombine. Setting ``mem_type=1`` attempts to use @@ -118,6 +120,17 @@ Setting the ramoops parameters can be done in several different manners: return ret; } + D. Using a region of memory reserved via ``reserve_mem`` command line + parameter. The address and size will be defined by the ``reserve_mem`` + parameter. Note, that ``reserve_mem`` may not always allocate memory + in the same location, and cannot be relied upon. Testing will need + to be done, and it may not work on every machine, nor every kernel. + Consider this a "best effort" approach. The ``reserve_mem`` option + takes a size, alignment and name as arguments. The name is used + to map the memory to a label that can be retrieved by ramoops. + + reserve_mem=2M:4096:oops ramoops.mem_name=oops + You can specify either RAM memory or peripheral devices' memory. However, when specifying RAM, be sure to reserve the memory by issuing memblock_reserve() very early in the architecture code, e.g.:: diff --git a/Documentation/admin-guide/reporting-regressions.rst b/Documentation/admin-guide/reporting-regressions.rst index 76b246ecf21b..946518355a2c 100644 --- a/Documentation/admin-guide/reporting-regressions.rst +++ b/Documentation/admin-guide/reporting-regressions.rst @@ -42,12 +42,12 @@ The important basics -------------------- -What is a "regression" and what is the "no regressions rule"? +What is a "regression" and what is the "no regressions" rule? ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ It's a regression if some application or practical use case running fine with one Linux kernel works worse or not at all with a newer version compiled using a -similar configuration. The "no regressions rule" forbids this to take place; if +similar configuration. The "no regressions" rule forbids this to take place; if it happens by accident, developers that caused it are expected to quickly fix the issue. @@ -173,7 +173,7 @@ Additional details about regressions ------------------------------------ -What is the goal of the "no regressions rule"? +What is the goal of the "no regressions" rule? ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Users should feel safe when updating kernel versions and not have to worry @@ -199,8 +199,8 @@ Exceptions to this rule are extremely rare; in the past developers almost always turned out to be wrong when they assumed a particular situation was warranting an exception. -Who ensures the "no regressions" is actually followed? -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +Who ensures the "no regressions" rule is actually followed? +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ The subsystem maintainers should take care of that, which are watched and supported by the tree maintainers -- e.g. Linus Torvalds for mainline and diff --git a/Documentation/admin-guide/sysctl/fs.rst b/Documentation/admin-guide/sysctl/fs.rst index 47499a1742bd..08e89e031714 100644 --- a/Documentation/admin-guide/sysctl/fs.rst +++ b/Documentation/admin-guide/sysctl/fs.rst @@ -38,6 +38,11 @@ requests. ``aio-max-nr`` allows you to change the maximum value ``aio-max-nr`` does not result in the pre-allocation or re-sizing of any kernel data structures. +dentry-negative +---------------------------- + +Policy for negative dentries. Set to 1 to always delete the dentry when a +file is removed, and 0 to disable it. By default, this behavior is disabled. dentry-state ------------ @@ -332,3 +337,13 @@ Each "watch" costs roughly 90 bytes on a 32-bit kernel, and roughly 160 bytes on a 64-bit one. The current default value for ``max_user_watches`` is 4% of the available low memory, divided by the "watch" cost in bytes. + +5. /proc/sys/fs/fuse - Configuration options for FUSE filesystems +===================================================================== + +This directory contains the following configuration options for FUSE +filesystems: + +``/proc/sys/fs/fuse/max_pages_limit`` is a read/write file for +setting/getting the maximum number of pages that can be used for servicing +requests in FUSE. diff --git a/Documentation/admin-guide/sysctl/kernel.rst b/Documentation/admin-guide/sysctl/kernel.rst index 7fd43947832f..a43b78b4b646 100644 --- a/Documentation/admin-guide/sysctl/kernel.rst +++ b/Documentation/admin-guide/sysctl/kernel.rst @@ -401,6 +401,15 @@ The upper bound on the number of tasks that are checked. This file shows up if ``CONFIG_DETECT_HUNG_TASK`` is enabled. +hung_task_detect_count +====================== + +Indicates the total number of tasks that have been detected as hung since +the system boot. + +This file shows up if ``CONFIG_DETECT_HUNG_TASK`` is enabled. + + hung_task_timeout_secs ====================== @@ -454,7 +463,7 @@ ignore-unaligned-usertrap On architectures where unaligned accesses cause traps, and where this feature is supported (``CONFIG_SYSCTL_ARCH_UNALIGN_NO_WARN``; -currently, ``arc`` and ``loongarch``), controls whether all +currently, ``arc``, ``parisc`` and ``loongarch``), controls whether all unaligned traps are logged. = ============================================================= @@ -1535,6 +1544,13 @@ constant ``FUTEX_TID_MASK`` (0x3fffffff). If a value outside of this range is written to ``threads-max`` an ``EINVAL`` error occurs. +timer_migration +=============== + +When set to a non-zero value, attempt to migrate timers away from idle cpus to +allow them to remain in low power states longer. + +Default is set (1). traceoff_on_warning =================== diff --git a/Documentation/admin-guide/sysctl/net.rst b/Documentation/admin-guide/sysctl/net.rst index 7250c0542828..7b0c4291c686 100644 --- a/Documentation/admin-guide/sysctl/net.rst +++ b/Documentation/admin-guide/sysctl/net.rst @@ -72,6 +72,7 @@ two flavors of JITs, the newer eBPF JIT currently supported on: - riscv64 - riscv32 - loongarch64 + - arc And the older cBPF JIT supported on the following archs: diff --git a/Documentation/admin-guide/sysctl/vm.rst b/Documentation/admin-guide/sysctl/vm.rst index c59889de122b..f48eaa98d22d 100644 --- a/Documentation/admin-guide/sysctl/vm.rst +++ b/Documentation/admin-guide/sysctl/vm.rst @@ -36,6 +36,7 @@ Currently, these files are in /proc/sys/vm: - dirtytime_expire_seconds - dirty_writeback_centisecs - drop_caches +- enable_soft_offline - extfrag_threshold - highmem_is_dirtyable - hugetlb_shm_group @@ -43,6 +44,7 @@ Currently, these files are in /proc/sys/vm: - legacy_va_layout - lowmem_reserve_ratio - max_map_count +- mem_profiling (only if CONFIG_MEM_ALLOC_PROFILING=y) - memory_failure_early_kill - memory_failure_recovery - min_free_kbytes @@ -266,6 +268,43 @@ used:: These are informational only. They do not mean that anything is wrong with your system. To disable them, echo 4 (bit 2) into drop_caches. +enable_soft_offline +=================== +Correctable memory errors are very common on servers. Soft-offline is kernel's +solution for memory pages having (excessive) corrected memory errors. + +For different types of page, soft-offline has different behaviors / costs. + +- For a raw error page, soft-offline migrates the in-use page's content to + a new raw page. + +- For a page that is part of a transparent hugepage, soft-offline splits the + transparent hugepage into raw pages, then migrates only the raw error page. + As a result, user is transparently backed by 1 less hugepage, impacting + memory access performance. + +- For a page that is part of a HugeTLB hugepage, soft-offline first migrates + the entire HugeTLB hugepage, during which a free hugepage will be consumed + as migration target. Then the original hugepage is dissolved into raw + pages without compensation, reducing the capacity of the HugeTLB pool by 1. + +It is user's call to choose between reliability (staying away from fragile +physical memory) vs performance / capacity implications in transparent and +HugeTLB cases. + +For all architectures, enable_soft_offline controls whether to soft offline +memory pages. When set to 1, kernel attempts to soft offline the pages +whenever it thinks needed. When set to 0, kernel returns EOPNOTSUPP to +the request to soft offline the pages. Its default value is 1. + +It is worth mentioning that after setting enable_soft_offline to 0, the +following requests to soft offline pages will not be performed: + +- Request to soft offline pages from RAS Correctable Errors Collector. + +- On ARM, the request to soft offline pages from GHES driver. + +- On PARISC, the request to soft offline pages from Page Deallocation Table. extfrag_threshold ================= @@ -425,6 +464,21 @@ e.g., up to one or two maps per allocation. The default value is 65530. +mem_profiling +============== + +Enable memory profiling (when CONFIG_MEM_ALLOC_PROFILING=y) + +1: Enable memory profiling. + +0: Disable memory profiling. + +Enabling memory profiling introduces a small performance overhead for all +memory allocations. + +The default value depends on CONFIG_MEM_ALLOC_PROFILING_ENABLED_BY_DEFAULT. + + memory_failure_early_kill: ========================== diff --git a/Documentation/admin-guide/sysrq.rst b/Documentation/admin-guide/sysrq.rst index 2f2e5bd440f9..9c7aa817adc7 100644 --- a/Documentation/admin-guide/sysrq.rst +++ b/Documentation/admin-guide/sysrq.rst @@ -49,26 +49,26 @@ How do I use the magic SysRq key? ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ On x86 - You press the key combo :kbd:`ALT-SysRq-<command key>`. + You press the key combo `ALT-SysRq-<command key>`. .. note:: Some keyboards may not have a key labeled 'SysRq'. The 'SysRq' key is also known as the 'Print Screen' key. Also some keyboards cannot handle so many keys being pressed at the same time, so you might - have better luck with press :kbd:`Alt`, press :kbd:`SysRq`, - release :kbd:`SysRq`, press :kbd:`<command key>`, release everything. + have better luck with press `Alt`, press `SysRq`, + release `SysRq`, press `<command key>`, release everything. On SPARC - You press :kbd:`ALT-STOP-<command key>`, I believe. + You press `ALT-STOP-<command key>`, I believe. On the serial console (PC style standard serial ports only) You send a ``BREAK``, then within 5 seconds a command key. Sending ``BREAK`` twice is interpreted as a normal BREAK. On PowerPC - Press :kbd:`ALT - Print Screen` (or :kbd:`F13`) - :kbd:`<command key>`. - :kbd:`Print Screen` (or :kbd:`F13`) - :kbd:`<command key>` may suffice. + Press `ALT - Print Screen` (or `F13`) - `<command key>`. + `Print Screen` (or `F13`) - `<command key>` may suffice. On other If you know of the key combos for other architectures, please @@ -88,7 +88,7 @@ On all echo _reisub > /proc/sysrq-trigger -The :kbd:`<command key>` is case sensitive. +The `<command key>` is case sensitive. What are the 'command' keys? ~~~~~~~~~~~~~~~~~~~~~~~~~~~~ @@ -161,6 +161,8 @@ Command Function will be printed to your console. (``0``, for example would make it so that only emergency messages like PANICs or OOPSes would make it to your console.) + +``R`` Replay the kernel log messages on consoles. =========== =================================================================== Okay, so what can I use them for? @@ -211,14 +213,21 @@ processes. "just thaw ``it(j)``" is useful if your system becomes unresponsive due to a frozen (probably root) filesystem via the FIFREEZE ioctl. +``Replay logs(R)`` is useful to view the kernel log messages when system is hung +or you are not able to use dmesg command to view the messages in printk buffer. +User may have to press the key combination multiple times if console system is +busy. If it is completely locked up, then messages won't be printed. Output +messages depend on current console loglevel, which can be modified using +sysrq[0-9] (see above). + Sometimes SysRq seems to get 'stuck' after using it, what can I do? ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ When this happens, try tapping shift, alt and control on both sides of the keyboard, and hitting an invalid sysrq sequence again. (i.e., something like -:kbd:`alt-sysrq-z`). +`alt-sysrq-z`). -Switching to another virtual console (:kbd:`ALT+Fn`) and then back again +Switching to another virtual console (`ALT+Fn`) and then back again should also help. I hit SysRq, but nothing seems to happen, what's wrong? @@ -281,7 +290,7 @@ exception the header line from the sysrq command is passed to all console consumers as if the current loglevel was maximum. If only the header is emitted it is almost certain that the kernel loglevel is too low. Should you require the output on the console channel then you will need -to temporarily up the console loglevel using :kbd:`alt-sysrq-8` or:: +to temporarily up the console loglevel using `alt-sysrq-8` or:: echo 8 > /proc/sysrq-trigger diff --git a/Documentation/admin-guide/tainted-kernels.rst b/Documentation/admin-guide/tainted-kernels.rst index f92551539e8a..700aa72eecb1 100644 --- a/Documentation/admin-guide/tainted-kernels.rst +++ b/Documentation/admin-guide/tainted-kernels.rst @@ -182,3 +182,5 @@ More detailed explanation for tainting produce extremely unusual kernel structure layouts (even performance pathological ones), which is important to know when debugging. Set at build time. + + 18) ``N`` if an in-kernel test, such as a KUnit test, has been run. diff --git a/Documentation/admin-guide/verify-bugs-and-bisect-regressions.rst b/Documentation/admin-guide/verify-bugs-and-bisect-regressions.rst index c389d4fd7599..03c55151346c 100644 --- a/Documentation/admin-guide/verify-bugs-and-bisect-regressions.rst +++ b/Documentation/admin-guide/verify-bugs-and-bisect-regressions.rst @@ -23,7 +23,7 @@ mistakes occasionally made even by experienced developers. up in the reference section, then jump back to where you left off. .. Find the latest rendered version of this text here: - https://docs.kernel.org/admin-guide/verify-bugs-and-bisect-regressions.rst.html + https://docs.kernel.org/admin-guide/verify-bugs-and-bisect-regressions.html The essence of the process (aka 'TL;DR') ======================================== @@ -1431,7 +1431,7 @@ can easily happen that your self-built kernels will lack modules for tasks you did not perform at least once before utilizing this make target. That happens when a task requires kernel modules which are only autoloaded when you execute it for the first time. So when you never performed that task since starting your -kernel the modules will not have been loaded -- and from localmodonfig's point +kernel the modules will not have been loaded -- and from localmodconfig's point of view look superfluous, which thus disables them to reduce the amount of code to be compiled. diff --git a/Documentation/admin-guide/workload-tracing.rst b/Documentation/admin-guide/workload-tracing.rst index b2e254ec8ee8..6be38c1b9c5b 100644 --- a/Documentation/admin-guide/workload-tracing.rst +++ b/Documentation/admin-guide/workload-tracing.rst @@ -83,7 +83,7 @@ scripts/ver_linux is a good way to check if your system already has the necessary tools:: sudo apt-get build-essentials flex bison yacc - sudo apt install libelf-dev systemtap-sdt-dev libaudit-dev libslang2-dev libperl-dev libdw-dev + sudo apt install libelf-dev systemtap-sdt-dev libslang2-dev libperl-dev libdw-dev cscope is a good tool to browse kernel sources. Let's install it now:: |