Diffstat (limited to 'Documentation/admin-guide')
-rw-r--r-- | Documentation/admin-guide/LSM/SELinux.rst | 11
-rw-r--r-- | Documentation/admin-guide/gpio/gpio-sim.rst | 7
-rw-r--r-- | Documentation/admin-guide/hw-vuln/attack_vector_controls.rst | 238
-rw-r--r-- | Documentation/admin-guide/hw-vuln/index.rst | 1
-rw-r--r-- | Documentation/admin-guide/kdump/vmcoreinfo.rst | 8
-rw-r--r-- | Documentation/admin-guide/kernel-parameters.txt | 51
-rw-r--r-- | Documentation/admin-guide/mm/index.rst | 1
-rw-r--r-- | Documentation/admin-guide/mm/slab.rst | 469
-rw-r--r-- | Documentation/admin-guide/pm/amd-pstate.rst | 2
-rw-r--r-- | Documentation/admin-guide/pm/cpufreq.rst | 4
-rw-r--r-- | Documentation/admin-guide/syscall-user-dispatch.rst | 23
-rw-r--r-- | Documentation/admin-guide/sysctl/kernel.rst | 36
-rw-r--r-- | Documentation/admin-guide/sysctl/vm.rst | 8
-rw-r--r-- | Documentation/admin-guide/thunderbolt.rst | 9
14 files changed, 814 insertions, 54 deletions
diff --git a/Documentation/admin-guide/LSM/SELinux.rst b/Documentation/admin-guide/LSM/SELinux.rst index 520a1c2c6fd2..cdd65164ca96 100644 --- a/Documentation/admin-guide/LSM/SELinux.rst +++ b/Documentation/admin-guide/LSM/SELinux.rst @@ -2,6 +2,17 @@ SELinux ======= +Information about the SELinux kernel subsystem can be found at the +following links: + + https://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/selinux.git/tree/README.md + + https://github.com/selinuxproject/selinux-kernel/wiki + +Information about the SELinux userspace can be found at: + + https://github.com/SELinuxProject/selinux/wiki + If you want to use SELinux, chances are you will want to use the distro-provided policies, or install the latest reference policy release from diff --git a/Documentation/admin-guide/gpio/gpio-sim.rst b/Documentation/admin-guide/gpio/gpio-sim.rst index 35d49ccd49e0..f5135a14ef2e 100644 --- a/Documentation/admin-guide/gpio/gpio-sim.rst +++ b/Documentation/admin-guide/gpio/gpio-sim.rst @@ -50,8 +50,11 @@ the number of lines exposed by this bank. **Attribute:** ``/config/gpio-sim/gpio-device/gpio-bankX/lineY/name`` -This group represents a single line at the offset Y. The 'name' attribute -allows to set the line name as represented by the 'gpio-line-names' property. +**Attribute:** ``/config/gpio-sim/gpio-device/gpio-bankX/lineY/valid`` + +This group represents a single line at the offset Y. The ``valid`` attribute +indicates whether the line can be used as GPIO. The ``name`` attribute allows +to set the line name as represented by the 'gpio-line-names' property. **Item:** ``/config/gpio-sim/gpio-device/gpio-bankX/lineY/hog`` diff --git a/Documentation/admin-guide/hw-vuln/attack_vector_controls.rst b/Documentation/admin-guide/hw-vuln/attack_vector_controls.rst new file mode 100644 index 000000000000..b4de16f5ec44 --- /dev/null +++ b/Documentation/admin-guide/hw-vuln/attack_vector_controls.rst @@ -0,0 +1,238 @@ +.. 
SPDX-License-Identifier: GPL-2.0 + +Attack Vector Controls +====================== + +Attack vector controls provide a simple method to configure only the mitigations +for CPU vulnerabilities which are relevant given the intended use of a system. +Administrators are encouraged to consider which attack vectors are relevant and +disable all others in order to recoup system performance. + +When new relevant CPU vulnerabilities are found, they will be added to these +attack vector controls so administrators will likely not need to reconfigure +their command line parameters as mitigations will continue to be correctly +applied based on the chosen attack vector controls. + +Attack Vectors +-------------- + +There are 5 sets of attack-vector mitigations currently supported by the kernel: + +#. :ref:`user_kernel` +#. :ref:`user_user` +#. :ref:`guest_host` +#. :ref:`guest_guest` +#. :ref:`smt` + +To control the enabled attack vectors, see :ref:`cmdline`. + +.. _user_kernel: + +User-to-Kernel +^^^^^^^^^^^^^^ + +The user-to-kernel attack vector involves a malicious userspace program +attempting to leak kernel data into userspace by exploiting a CPU vulnerability. +The kernel data involved might be limited to certain kernel memory, or include +all memory in the system, depending on the vulnerability exploited. + +If no untrusted userspace applications are being run, such as with single-user +systems, consider disabling user-to-kernel mitigations. + +Note that the CPU vulnerabilities mitigated by Linux have generally not been +shown to be exploitable from browser-based sandboxes. User-to-kernel +mitigations are therefore mostly relevant if unknown userspace applications may +be run by untrusted users. + +*user-to-kernel mitigations are enabled by default* + +.. 
_user_user: + +User-to-User +^^^^^^^^^^^^ + +The user-to-user attack vector involves a malicious userspace program attempting +to influence the behavior of another unsuspecting userspace program in order to +exfiltrate data. The vulnerability of a userspace program is based on the +program itself and the interfaces it provides. + +If no untrusted userspace applications are being run, consider disabling +user-to-user mitigations. + +Note that because the Linux kernel contains a mapping of all physical memory, +preventing a malicious userspace program from leaking data from another +userspace program requires mitigating user-to-kernel attacks as well for +complete protection. + +*user-to-user mitigations are enabled by default* + +.. _guest_host: + +Guest-to-Host +^^^^^^^^^^^^^ + +The guest-to-host attack vector involves a malicious VM attempting to leak +hypervisor data into the VM. The data involved may be limited, or may +potentially include all memory in the system, depending on the vulnerability +exploited. + +If no untrusted VMs are being run, consider disabling guest-to-host mitigations. + +*guest-to-host mitigations are enabled by default if KVM support is present* + +.. _guest_guest: + +Guest-to-Guest +^^^^^^^^^^^^^^ + +The guest-to-guest attack vector involves a malicious VM attempting to influence +the behavior of another unsuspecting VM in order to exfiltrate data. The +vulnerability of a VM is based on the code inside the VM itself and the +interfaces it provides. + +If no untrusted VMs, or only a single VM is being run, consider disabling +guest-to-guest mitigations. + +Similar to the user-to-user attack vector, preventing a malicious VM from +leaking data from another VM requires mitigating guest-to-host attacks as well +due to the Linux kernel phys map. + +*guest-to-guest mitigations are enabled by default if KVM support is present* + +.. 
_smt: + +Cross-Thread +^^^^^^^^^^^^ + +The cross-thread attack vector involves a malicious userspace program or +malicious VM either observing or attempting to influence the behavior of code +running on the SMT sibling thread in order to exfiltrate data. + +Many cross-thread attacks can only be mitigated if SMT is disabled, which will +result in reduced CPU core count and reduced performance. + +If cross-thread mitigations are fully enabled ('auto,nosmt'), all mitigations +for cross-thread attacks will be enabled. SMT may be disabled depending on +which vulnerabilities are present in the CPU. + +If cross-thread mitigations are partially enabled ('auto'), mitigations for +cross-thread attacks will be enabled but SMT will not be disabled. + +If cross-thread mitigations are disabled, no mitigations for cross-thread +attacks will be enabled. + +Cross-thread mitigation may not be required if core-scheduling or similar +techniques are used to prevent untrusted workloads from running on SMT siblings. + +*cross-thread mitigations default to partially enabled* + +.. _cmdline: + +Command Line Controls +--------------------- + +Attack vectors are controlled through the mitigations= command line option. The +value provided begins with a global option and then may optionally include one +or more options to disable various attack vectors. + +Format: + | ``mitigations=[global]`` + | ``mitigations=[global],[attack vectors]`` + +Global options: + +============ ============================================================= +Option Description +============ ============================================================= +'off' All attack vectors disabled. +'auto' All attack vectors enabled, partial cross-thread mitigations. +'auto,nosmt' All attack vectors enabled, full cross-thread mitigations. 
+============ ============================================================= + +Attack vector options: + +================= ======================================= +Option Description +================= ======================================= +'no_user_kernel' Disables user-to-kernel mitigations. +'no_user_user' Disables user-to-user mitigations. +'no_guest_host' Disables guest-to-host mitigations. +'no_guest_guest' Disables guest-to-guest mitigations +'no_cross_thread' Disables all cross-thread mitigations. +================= ======================================= + +Multiple attack vector options may be specified in a comma-separated list. If +the global option is not specified, it defaults to 'auto'. The global option +'off' is equivalent to disabling all attack vectors. + +Examples: + | ``mitigations=auto,no_user_kernel`` + + Enable all attack vectors except user-to-kernel. Partial cross-thread + mitigations. + + | ``mitigations=auto,nosmt,no_guest_host,no_guest_guest`` + + Enable all attack vectors and cross-thread mitigations except for + guest-to-host and guest-to-guest mitigations. + + | ``mitigations=,no_cross_thread`` + + Enable all attack vectors but not cross-thread mitigations. + +Interactions with command-line options +-------------------------------------- + +Vulnerability-specific controls (e.g. "retbleed=off") take precedence over all +attack vector controls. Mitigations for individual vulnerabilities may be +turned on or off via their command-line options regardless of the attack vector +controls. + +Summary of attack-vector mitigations +------------------------------------ + +When a vulnerability is mitigated due to an attack-vector control, the default +mitigation option for that particular vulnerability is used. To use a different +mitigation, please use the vulnerability-specific command line option. + +The table below summarizes which vulnerabilities are mitigated when different +attack vectors are enabled and assuming the CPU is vulnerable. 
+ +=============== ============== ============ ============= ============== ============ ======== +Vulnerability User-to-Kernel User-to-User Guest-to-Host Guest-to-Guest Cross-Thread Notes +=============== ============== ============ ============= ============== ============ ======== +BHI X X +ITS X X +GDS X X X X * (Note 1) +L1TF X X * (Note 2) +MDS X X X X * (Note 2) +MMIO X X X X * (Note 2) +Meltdown X +Retbleed X X * (Note 3) +RFDS X X X X +Spectre_v1 X +Spectre_v2 X X +Spectre_v2_user X X * (Note 1) +SRBDS X X X X +SRSO X X +SSB (Note 4) +TAA X X X X * (Note 2) +TSA X X X X +=============== ============== ============ ============= ============== ============ ======== + +Notes: + 1 -- Can be mitigated without disabling SMT. + + 2 -- Disables SMT if cross-thread mitigations are fully enabled and the CPU + is vulnerable + + 3 -- Disables SMT if cross-thread mitigations are fully enabled, the CPU is + vulnerable, and STIBP is not supported + + 4 -- Speculative store bypass is always enabled by default (no kernel + mitigation applied) unless overridden with spec_store_bypass_disable option + +When an attack-vector is disabled, all mitigations for the vulnerabilities +listed in the above table are disabled, unless mitigation is required for a +different enabled attack-vector or a mitigation is explicitly selected via a +vulnerability-specific command line option. diff --git a/Documentation/admin-guide/hw-vuln/index.rst b/Documentation/admin-guide/hw-vuln/index.rst index 09890a8f3ee9..89ca636081b7 100644 --- a/Documentation/admin-guide/hw-vuln/index.rst +++ b/Documentation/admin-guide/hw-vuln/index.rst @@ -9,6 +9,7 @@ are configurable at compile, boot or run time. .. 
toctree:: :maxdepth: 1 + attack_vector_controls spectre l1tf mds diff --git a/Documentation/admin-guide/kdump/vmcoreinfo.rst b/Documentation/admin-guide/kdump/vmcoreinfo.rst index 8cf4614385b7..404a15f6782c 100644 --- a/Documentation/admin-guide/kdump/vmcoreinfo.rst +++ b/Documentation/admin-guide/kdump/vmcoreinfo.rst @@ -325,14 +325,14 @@ NR_FREE_PAGES On linux-2.6.21 or later, the number of free pages is in vm_stat[NR_FREE_PAGES]. Used to get the number of free pages. -PG_lru|PG_private|PG_swapcache|PG_swapbacked|PG_slab|PG_hwpoision|PG_head_mask|PG_hugetlb ------------------------------------------------------------------------------------------ +PG_lru|PG_private|PG_swapcache|PG_swapbacked|PG_hwpoison|PG_head_mask +-------------------------------------------------------------------------- Page attributes. These flags are used to filter various unnecessary for dumping pages. -PAGE_BUDDY_MAPCOUNT_VALUE(~PG_buddy)|PAGE_OFFLINE_MAPCOUNT_VALUE(~PG_offline)|PAGE_OFFLINE_MAPCOUNT_VALUE(~PG_unaccepted) -------------------------------------------------------------------------------------------------------------------------- +PAGE_SLAB_MAPCOUNT_VALUE|PAGE_BUDDY_MAPCOUNT_VALUE|PAGE_OFFLINE_MAPCOUNT_VALUE|PAGE_HUGETLB_MAPCOUNT_VALUE|PAGE_UNACCEPTED_MAPCOUNT_VALUE +------------------------------------------------------------------------------------------------------------------------------------------ More page attributes. These flags are used to filter various unnecessary for dumping pages. diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt index 07e22ba5bfe3..4943fc845a15 100644 --- a/Documentation/admin-guide/kernel-parameters.txt +++ b/Documentation/admin-guide/kernel-parameters.txt @@ -2538,6 +2538,13 @@ requires the kernel to be built with CONFIG_ARM64_PSEUDO_NMI. + irqchip.riscv_imsic_noipi + [RISC-V,EARLY] + Force the kernel to not use IMSIC software injected MSIs + as IPIs. 
Intended for system where IMSIC is trap-n-emulated, + and thus want to reduce MMIO traps when triggering IPIs + to multiple harts. + irqfixup [HW] When an interrupt is not handled search all handlers for it. Intended to get systems with badly broken @@ -3790,6 +3797,10 @@ mmio_stale_data=full,nosmt [X86] retbleed=auto,nosmt [X86] + [X86] After one of the above options, additionally + supports attack-vector based controls as documented in + Documentation/admin-guide/hw-vuln/attack_vector_controls.rst + mminit_loglevel= [KNL,EARLY] When CONFIG_DEBUG_MEMORY_INIT is set, this parameter allows control of the logging verbosity for @@ -5000,6 +5011,18 @@ that number, otherwise (e.g., 'pmu_override=on'), MMCR1 remains 0. + pm_async= [PM] + Format: off + This parameter sets the initial value of the + /sys/power/pm_async sysfs knob at boot time. + If set to "off", disables asynchronous suspend and + resume of devices during system-wide power transitions. + This can be useful on platforms where device + dependencies are not well-defined, or for debugging + power management issues. Asynchronous operations are + enabled by default. + + pm_debug_messages [SUSPEND,KNL] Enable suspend/resume debug messages during boot up. @@ -5485,7 +5508,8 @@ echo 1 > /sys/module/rcutree/parameters/rcu_normal_wake_from_gp or pass a boot parameter "rcutree.rcu_normal_wake_from_gp=1" - Default is 0. + Default is 1 if num_possible_cpus() <= 16 and it is not explicitly + disabled by the boot parameter passing 0. rcuscale.gp_async= [KNL] Measure performance of asynchronous @@ -6387,6 +6411,11 @@ sa1100ir [NET] See drivers/net/irda/sa1100_ir.c. + sched_proxy_exec= [KNL] + Enables or disables "proxy execution" style + solution to mutex-based priority inversion. + Format: <bool> + sched_verbose [KNL,EARLY] Enables verbose scheduler debug messages. schedstats= [KNL,X86] Enable or disable scheduled statistics. 
@@ -6558,14 +6587,14 @@ slab_debug can create guard zones around objects and may poison objects when not in use. Also tracks the last alloc / free. For more information see - Documentation/mm/slub.rst. + Documentation/admin-guide/mm/slab.rst. (slub_debug legacy name also accepted for now) slab_max_order= [MM] Determines the maximum allowed order for slabs. A high setting may cause OOMs due to memory fragmentation. For more information see - Documentation/mm/slub.rst. + Documentation/admin-guide/mm/slab.rst. (slub_max_order legacy name also accepted for now) slab_merge [MM] @@ -6580,13 +6609,14 @@ the number of objects indicated. The higher the number of objects the smaller the overhead of tracking slabs and the less frequently locks need to be acquired. - For more information see Documentation/mm/slub.rst. + For more information see + Documentation/admin-guide/mm/slab.rst. (slub_min_objects legacy name also accepted for now) slab_min_order= [MM] Determines the minimum page order for slabs. Must be lower or equal to slab_max_order. For more information see - Documentation/mm/slub.rst. + Documentation/admin-guide/mm/slab.rst. (slub_min_order legacy name also accepted for now) slab_nomerge [MM] @@ -6600,7 +6630,8 @@ cache (risks via metadata attacks are mostly unchanged). Debug options disable merging on their own. - For more information see Documentation/mm/slub.rst. + For more information see + Documentation/admin-guide/mm/slab.rst. (slub_nomerge legacy name also accepted for now) slab_strict_numa [MM] @@ -7214,6 +7245,14 @@ causing a major performance hit, and the space where machines are deployed is by other means guarded. + tpm_crb_ffa.busy_timeout_ms= [ARM64,TPM] + Maximum time in milliseconds to retry sending a message + to the TPM service before giving up. This parameter controls + how long the system will continue retrying when the TPM + service is busy. 
+ Format: <unsigned int> + Default: 2000 (2 seconds) + tpm_suspend_pcr=[HW,TPM] Format: integer pcr id Specify that at suspend time, the tpm driver diff --git a/Documentation/admin-guide/mm/index.rst b/Documentation/admin-guide/mm/index.rst index 2d2f6c222308..ebc83ca20fdc 100644 --- a/Documentation/admin-guide/mm/index.rst +++ b/Documentation/admin-guide/mm/index.rst @@ -37,6 +37,7 @@ the Linux memory management. numaperf pagemap shrinker_debugfs + slab soft-dirty swap_numa transhuge diff --git a/Documentation/admin-guide/mm/slab.rst b/Documentation/admin-guide/mm/slab.rst new file mode 100644 index 000000000000..14429ab90611 --- /dev/null +++ b/Documentation/admin-guide/mm/slab.rst @@ -0,0 +1,469 @@ +======================================== +Short users guide for the slab allocator +======================================== + +The slab allocator includes full debugging support (when built with +CONFIG_SLUB_DEBUG=y) but it is off by default (unless built with +CONFIG_SLUB_DEBUG_ON=y). You can enable debugging only for selected +slabs in order to avoid an impact on overall system performance which +may make a bug more difficult to find. + +In order to switch debugging on one can add an option ``slab_debug`` +to the kernel command line. That will enable full debugging for +all slabs. + +Typically one would then use the ``slabinfo`` command to get statistical +data and perform operation on the slabs. By default ``slabinfo`` only lists +slabs that have data in them. See "slabinfo -h" for more options when +running the command. ``slabinfo`` can be compiled with +:: + + gcc -o slabinfo tools/mm/slabinfo.c + +Some of the modes of operation of ``slabinfo`` require that slub debugging +be enabled on the command line. F.e. no tracking information will be +available without debugging on and validation can only partially +be performed if debugging was not switched on. 
+ +Some more sophisticated uses of slab_debug: +------------------------------------------- + +Parameters may be given to ``slab_debug``. If none is specified then full +debugging is enabled. Format: + +slab_debug=<Debug-Options> + Enable options for all slabs + +slab_debug=<Debug-Options>,<slab name1>,<slab name2>,... + Enable options only for select slabs (no spaces + after a comma) + +Multiple blocks of options for all slabs or selected slabs can be given, with +blocks of options delimited by ';'. The last of "all slabs" blocks is applied +to all slabs except those that match one of the "select slabs" block. Options +of the first "select slabs" blocks that matches the slab's name are applied. + +Possible debug options are:: + + F Sanity checks on (enables SLAB_DEBUG_CONSISTENCY_CHECKS + Sorry SLAB legacy issues) + Z Red zoning + P Poisoning (object and padding) + U User tracking (free and alloc) + T Trace (please only use on single slabs) + A Enable failslab filter mark for the cache + O Switch debugging off for caches that would have + caused higher minimum slab orders + - Switch all debugging off (useful if the kernel is + configured with CONFIG_SLUB_DEBUG_ON) + +F.e. in order to boot just with sanity checks and red zoning one would specify:: + + slab_debug=FZ + +Trying to find an issue in the dentry cache? Try:: + + slab_debug=,dentry + +to only enable debugging on the dentry cache. You may use an asterisk at the +end of the slab name, in order to cover all slabs with the same prefix. For +example, here's how you can poison the dentry cache as well as all kmalloc +slabs:: + + slab_debug=P,kmalloc-*,dentry + +Red zoning and tracking may realign the slab. We can just apply sanity checks +to the dentry cache with:: + + slab_debug=F,dentry + +Debugging options may require the minimum possible slab order to increase as +a result of storing the metadata (for example, caches with PAGE_SIZE object +sizes). 
This has a higher likelihood of resulting in slab allocation errors +in low memory situations or if there's high fragmentation of memory. To +switch off debugging for such caches by default, use:: + + slab_debug=O + +You can apply different options to different list of slab names, using blocks +of options. This will enable red zoning for dentry and user tracking for +kmalloc. All other slabs will not get any debugging enabled:: + + slab_debug=Z,dentry;U,kmalloc-* + +You can also enable options (e.g. sanity checks and poisoning) for all caches +except some that are deemed too performance critical and don't need to be +debugged by specifying global debug options followed by a list of slab names +with "-" as options:: + + slab_debug=FZ;-,zs_handle,zspage + +The state of each debug option for a slab can be found in the respective files +under:: + + /sys/kernel/slab/<slab name>/ + +If the file contains 1, the option is enabled, 0 means disabled. The debug +options from the ``slab_debug`` parameter translate to the following files:: + + F sanity_checks + Z red_zone + P poison + U store_user + T trace + A failslab + +failslab file is writable, so writing 1 or 0 will enable or disable +the option at runtime. Write returns -EINVAL if cache is an alias. +Careful with tracing: It may spew out lots of information and never stop if +used on the wrong slab. + +Slab merging +============ + +If no debug options are specified then SLUB may merge similar slabs together +in order to reduce overhead and increase cache hotness of objects. +``slabinfo -a`` displays which slabs were merged together. + +Slab validation +=============== + +SLUB can validate all object if the kernel was booted with slab_debug. In +order to do so you must have the ``slabinfo`` tool. Then you can do +:: + + slabinfo -v + +which will test all objects. Output will be generated to the syslog. + +This also works in a more limited way if boot was without slab debug. 
+In that case ``slabinfo -v`` simply tests all reachable objects. Usually +these are in the cpu slabs and the partial slabs. Full slabs are not +tracked by SLUB in a non debug situation. + +Getting more performance +======================== + +To some degree SLUB's performance is limited by the need to take the +list_lock once in a while to deal with partial slabs. That overhead is +governed by the order of the allocation for each slab. The allocations +can be influenced by kernel parameters: + +.. slab_min_objects=x (default: automatically scaled by number of cpus) +.. slab_min_order=x (default 0) +.. slab_max_order=x (default 3 (PAGE_ALLOC_COSTLY_ORDER)) + +``slab_min_objects`` + allows to specify how many objects must at least fit into one + slab in order for the allocation order to be acceptable. In + general slub will be able to perform this number of + allocations on a slab without consulting centralized resources + (list_lock) where contention may occur. + +``slab_min_order`` + specifies a minimum order of slabs. A similar effect like + ``slab_min_objects``. + +``slab_max_order`` + specified the order at which ``slab_min_objects`` should no + longer be checked. This is useful to avoid SLUB trying to + generate super large order pages to fit ``slab_min_objects`` + of a slab cache with large object sizes into one high order + page. Setting command line parameter + ``debug_guardpage_minorder=N`` (N > 0), forces setting + ``slab_max_order`` to 0, what cause minimum possible order of + slabs allocation. + +``slab_strict_numa`` + Enables the application of memory policies on each + allocation. This results in more accurate placement of + objects which may result in the reduction of accesses + to remote nodes. The default is to only apply memory + policies at the folio level when a new folio is acquired + or a folio is retrieved from the lists. Enabling this + option reduces the fastpath performance of the slab allocator. 
+ +SLUB Debug output +================= + +Here is a sample of slub debug output:: + + ==================================================================== + BUG kmalloc-8: Right Redzone overwritten + -------------------------------------------------------------------- + + INFO: 0xc90f6d28-0xc90f6d2b. First byte 0x00 instead of 0xcc + INFO: Slab 0xc528c530 flags=0x400000c3 inuse=61 fp=0xc90f6d58 + INFO: Object 0xc90f6d20 @offset=3360 fp=0xc90f6d58 + INFO: Allocated in get_modalias+0x61/0xf5 age=53 cpu=1 pid=554 + + Bytes b4 (0xc90f6d10): 00 00 00 00 00 00 00 00 5a 5a 5a 5a 5a 5a 5a 5a ........ZZZZZZZZ + Object (0xc90f6d20): 31 30 31 39 2e 30 30 35 1019.005 + Redzone (0xc90f6d28): 00 cc cc cc . + Padding (0xc90f6d50): 5a 5a 5a 5a 5a 5a 5a 5a ZZZZZZZZ + + [<c010523d>] dump_trace+0x63/0x1eb + [<c01053df>] show_trace_log_lvl+0x1a/0x2f + [<c010601d>] show_trace+0x12/0x14 + [<c0106035>] dump_stack+0x16/0x18 + [<c017e0fa>] object_err+0x143/0x14b + [<c017e2cc>] check_object+0x66/0x234 + [<c017eb43>] __slab_free+0x239/0x384 + [<c017f446>] kfree+0xa6/0xc6 + [<c02e2335>] get_modalias+0xb9/0xf5 + [<c02e23b7>] dmi_dev_uevent+0x27/0x3c + [<c027866a>] dev_uevent+0x1ad/0x1da + [<c0205024>] kobject_uevent_env+0x20a/0x45b + [<c020527f>] kobject_uevent+0xa/0xf + [<c02779f1>] store_uevent+0x4f/0x58 + [<c027758e>] dev_attr_store+0x29/0x2f + [<c01bec4f>] sysfs_write_file+0x16e/0x19c + [<c0183ba7>] vfs_write+0xd1/0x15a + [<c01841d7>] sys_write+0x3d/0x72 + [<c0104112>] sysenter_past_esp+0x5f/0x99 + [<b7f7b410>] 0xb7f7b410 + ======================= + + FIX kmalloc-8: Restoring Redzone 0xc90f6d28-0xc90f6d2b=0xcc + +If SLUB encounters a corrupted object (full detection requires the kernel +to be booted with slab_debug) then the following output will be dumped +into the syslog: + +1. 
Description of the problem encountered + + This will be a message in the system log starting with:: + + =============================================== + BUG <slab cache affected>: <What went wrong> + ----------------------------------------------- + + INFO: <corruption start>-<corruption_end> <more info> + INFO: Slab <address> <slab information> + INFO: Object <address> <object information> + INFO: Allocated in <kernel function> age=<jiffies since alloc> cpu=<allocated by + cpu> pid=<pid of the process> + INFO: Freed in <kernel function> age=<jiffies since free> cpu=<freed by cpu> + pid=<pid of the process> + + (Object allocation / free information is only available if SLAB_STORE_USER is + set for the slab. slab_debug sets that option) + +2. The object contents if an object was involved. + + Various types of lines can follow the BUG SLUB line: + + Bytes b4 <address> : <bytes> + Shows a few bytes before the object where the problem was detected. + Can be useful if the corruption does not stop with the start of the + object. + + Object <address> : <bytes> + The bytes of the object. If the object is inactive then the bytes + typically contain poison values. Any non-poison value shows a + corruption by a write after free. + + Redzone <address> : <bytes> + The Redzone following the object. The Redzone is used to detect + writes after the object. All bytes should always have the same + value. If there is any deviation then it is due to a write after + the object boundary. + + (Redzone information is only available if SLAB_RED_ZONE is set. + slab_debug sets that option) + + Padding <address> : <bytes> + Unused data to fill up the space in order to get the next object + properly aligned. In the debug case we make sure that there are + at least 4 bytes of padding. This allows the detection of writes + before the object. + +3. A stackdump + + The stackdump describes the location where the error was detected. 
The cause + of the corruption is may be more likely found by looking at the function that + allocated or freed the object. + +4. Report on how the problem was dealt with in order to ensure the continued + operation of the system. + + These are messages in the system log beginning with:: + + FIX <slab cache affected>: <corrective action taken> + + In the above sample SLUB found that the Redzone of an active object has + been overwritten. Here a string of 8 characters was written into a slab that + has the length of 8 characters. However, a 8 character string needs a + terminating 0. That zero has overwritten the first byte of the Redzone field. + After reporting the details of the issue encountered the FIX SLUB message + tells us that SLUB has restored the Redzone to its proper value and then + system operations continue. + +Emergency operations +==================== + +Minimal debugging (sanity checks alone) can be enabled by booting with:: + + slab_debug=F + +This will be generally be enough to enable the resiliency features of slub +which will keep the system running even if a bad kernel component will +keep corrupting objects. This may be important for production systems. +Performance will be impacted by the sanity checks and there will be a +continual stream of error messages to the syslog but no additional memory +will be used (unlike full debugging). + +No guarantees. The kernel component still needs to be fixed. 
Performance +may be optimized further by locating the slab that experiences corruption +and enabling debugging only for that cache + +I.e.:: + + slab_debug=F,dentry + +If the corruption occurs by writing after the end of the object then it +may be advisable to enable a Redzone to avoid corrupting the beginning +of other objects:: + + slab_debug=FZ,dentry + +Extended slabinfo mode and plotting +=================================== + +The ``slabinfo`` tool has a special 'extended' ('-X') mode that includes: + - Slabcache Totals + - Slabs sorted by size (up to -N <num> slabs, default 1) + - Slabs sorted by loss (up to -N <num> slabs, default 1) + +Additionally, in this mode ``slabinfo`` does not dynamically scale +sizes (G/M/K) and reports everything in bytes (this functionality is +also available to other slabinfo modes via '-B' option) which makes +reporting more precise and accurate. Moreover, in some sense the `-X' +mode also simplifies the analysis of slabs' behaviour, because its +output can be plotted using the ``slabinfo-gnuplot.sh`` script. So it +pushes the analysis from looking through the numbers (tons of numbers) +to something easier -- visual analysis. + +To generate plots: + +a) collect slabinfo extended records, for example:: + + while [ 1 ]; do slabinfo -X >> FOO_STATS; sleep 1; done + +b) pass stats file(-s) to ``slabinfo-gnuplot.sh`` script:: + + slabinfo-gnuplot.sh FOO_STATS [FOO_STATS2 .. FOO_STATSN] + + The ``slabinfo-gnuplot.sh`` script will pre-processes the collected records + and generates 3 png files (and 3 pre-processing cache files) per STATS + file: + - Slabcache Totals: FOO_STATS-totals.png + - Slabs sorted by size: FOO_STATS-slabs-by-size.png + - Slabs sorted by loss: FOO_STATS-slabs-by-loss.png + +Another use case, when ``slabinfo-gnuplot.sh`` can be useful, is when you +need to compare slabs' behaviour "prior to" and "after" some code +modification. 
To help you out there, ``slabinfo-gnuplot.sh`` script +can 'merge' the `Slabcache Totals` sections from different +measurements. To visually compare N plots: + +a) Collect as many STATS1, STATS2, .. STATSN files as you need:: + + while [ 1 ]; do slabinfo -X >> STATS<X>; sleep 1; done + +b) Pre-process those STATS files:: + + slabinfo-gnuplot.sh STATS1 STATS2 .. STATSN + +c) Execute ``slabinfo-gnuplot.sh`` in '-t' mode, passing all of the + generated pre-processed \*-totals:: + + slabinfo-gnuplot.sh -t STATS1-totals STATS2-totals .. STATSN-totals + + This will produce a single plot (png file). + + Plots, expectedly, can be large so some fluctuations or small spikes + can go unnoticed. To deal with that, ``slabinfo-gnuplot.sh`` has two + options to 'zoom-in'/'zoom-out': + + a) ``-s %d,%d`` -- overwrites the default image width and height + b) ``-r %d,%d`` -- specifies a range of samples to use (for example, + in ``slabinfo -X >> FOO_STATS; sleep 1;`` case, using a ``-r + 40,60`` range will plot only samples collected between 40th and + 60th seconds). + + +DebugFS files for SLUB +====================== + +For more information about current state of SLUB caches with the user tracking +debug option enabled, debugfs files are available, typically under +/sys/kernel/debug/slab/<cache>/ (created only for caches with enabled user +tracking). There are 2 types of these files with the following debug +information: + +1. alloc_traces:: + + Prints information about unique allocation traces of the currently + allocated objects. The output is sorted by frequency of each trace. + + Information in the output: + Number of objects, allocating function, possible memory wastage of + kmalloc objects(total/per-object), minimal/average/maximal jiffies + since alloc, pid range of the allocating processes, cpu mask of + allocating cpus, numa node mask of origins of memory, and stack trace. 
+ + Example::: + + 338 pci_alloc_dev+0x2c/0xa0 waste=521872/1544 age=290837/291891/293509 pid=1 cpus=106 nodes=0-1 + __kmem_cache_alloc_node+0x11f/0x4e0 + kmalloc_trace+0x26/0xa0 + pci_alloc_dev+0x2c/0xa0 + pci_scan_single_device+0xd2/0x150 + pci_scan_slot+0xf7/0x2d0 + pci_scan_child_bus_extend+0x4e/0x360 + acpi_pci_root_create+0x32e/0x3b0 + pci_acpi_scan_root+0x2b9/0x2d0 + acpi_pci_root_add.cold.11+0x110/0xb0a + acpi_bus_attach+0x262/0x3f0 + device_for_each_child+0xb7/0x110 + acpi_dev_for_each_child+0x77/0xa0 + acpi_bus_attach+0x108/0x3f0 + device_for_each_child+0xb7/0x110 + acpi_dev_for_each_child+0x77/0xa0 + acpi_bus_attach+0x108/0x3f0 + +2. free_traces:: + + Prints information about unique freeing traces of the currently allocated + objects. The freeing traces thus come from the previous life-cycle of the + objects and are reported as not available for objects allocated for the first + time. The output is sorted by frequency of each trace. + + Information in the output: + Number of objects, freeing function, minimal/average/maximal jiffies since free, + pid range of the freeing processes, cpu mask of freeing cpus, and stack trace. 
+ + Example::: + + 1980 <not-available> age=4294912290 pid=0 cpus=0 + 51 acpi_ut_update_ref_count+0x6a6/0x782 age=236886/237027/237772 pid=1 cpus=1 + kfree+0x2db/0x420 + acpi_ut_update_ref_count+0x6a6/0x782 + acpi_ut_update_object_reference+0x1ad/0x234 + acpi_ut_remove_reference+0x7d/0x84 + acpi_rs_get_prt_method_data+0x97/0xd6 + acpi_get_irq_routing_table+0x82/0xc4 + acpi_pci_irq_find_prt_entry+0x8e/0x2e0 + acpi_pci_irq_lookup+0x3a/0x1e0 + acpi_pci_irq_enable+0x77/0x240 + pcibios_enable_device+0x39/0x40 + do_pci_enable_device.part.0+0x5d/0xe0 + pci_enable_device_flags+0xfc/0x120 + pci_enable_device+0x13/0x20 + virtio_pci_probe+0x9e/0x170 + local_pci_probe+0x48/0x80 + pci_device_probe+0x105/0x1c0 + +Christoph Lameter, May 30, 2007 +Sergey Senozhatsky, October 23, 2015 diff --git a/Documentation/admin-guide/pm/amd-pstate.rst b/Documentation/admin-guide/pm/amd-pstate.rst index 412423c54f25..e1771f2225d5 100644 --- a/Documentation/admin-guide/pm/amd-pstate.rst +++ b/Documentation/admin-guide/pm/amd-pstate.rst @@ -72,7 +72,7 @@ to manage each performance update behavior. :: Lowest non- | | | | linear perf ------>+-----------------------+ +-----------------------+ | | | | - | | Lowest perf ---->| | + | | Min perf ---->| | | | | | Lowest perf ------>+-----------------------+ +-----------------------+ | | | | diff --git a/Documentation/admin-guide/pm/cpufreq.rst b/Documentation/admin-guide/pm/cpufreq.rst index 2d74af7f0efe..cacb9f0307dd 100644 --- a/Documentation/admin-guide/pm/cpufreq.rst +++ b/Documentation/admin-guide/pm/cpufreq.rst @@ -398,7 +398,9 @@ policy limits change after that. This governor does not do anything by itself. Instead, it allows user space to set the CPU frequency for the policy it is attached to by writing to the -``scaling_setspeed`` attribute of that policy. +``scaling_setspeed`` attribute of that policy. 
Though the intention may be to +set an exact frequency for the policy, the actual frequency may vary depending +on hardware coordination, thermal and power limits, and other factors. ``schedutil`` ------------- diff --git a/Documentation/admin-guide/syscall-user-dispatch.rst b/Documentation/admin-guide/syscall-user-dispatch.rst index e3cfffef5a63..c1768d9e80fa 100644 --- a/Documentation/admin-guide/syscall-user-dispatch.rst +++ b/Documentation/admin-guide/syscall-user-dispatch.rst @@ -53,20 +53,25 @@ following prctl: prctl(PR_SET_SYSCALL_USER_DISPATCH, <op>, <offset>, <length>, [selector]) -<op> is either PR_SYS_DISPATCH_ON or PR_SYS_DISPATCH_OFF, to enable and -disable the mechanism globally for that thread. When -PR_SYS_DISPATCH_OFF is used, the other fields must be zero. - -[<offset>, <offset>+<length>) delimit a memory region interval -from which syscalls are always executed directly, regardless of the -userspace selector. This provides a fast path for the C library, which -includes the most common syscall dispatchers in the native code -applications, and also provides a way for the signal handler to return +<op> is either PR_SYS_DISPATCH_EXCLUSIVE_ON/PR_SYS_DISPATCH_INCLUSIVE_ON +or PR_SYS_DISPATCH_OFF, to enable and disable the mechanism globally for +that thread. When PR_SYS_DISPATCH_OFF is used, the other fields must be zero. + +For PR_SYS_DISPATCH_EXCLUSIVE_ON [<offset>, <offset>+<length>) delimit +a memory region interval from which syscalls are always executed directly, +regardless of the userspace selector. This provides a fast path for the +C library, which includes the most common syscall dispatchers in the native +code applications, and also provides a way for the signal handler to return without triggering a nested SIGSYS on (rt\_)sigreturn. Users of this interface should make sure that at least the signal trampoline code is included in this region. 
In addition, for syscalls that implement the trampoline code on the vDSO, that trampoline is never intercepted. +For PR_SYS_DISPATCH_INCLUSIVE_ON [<offset>, <offset>+<length>) delimit +a memory region interval from which syscalls are dispatched based on +the userspace selector. Syscalls from outside of the range are always +executed directly. + [selector] is a pointer to a char-sized region in the process memory region, that provides a quick way to enable disable syscall redirection thread-wide, without the need to invoke the kernel directly. selector diff --git a/Documentation/admin-guide/sysctl/kernel.rst b/Documentation/admin-guide/sysctl/kernel.rst index dd49a89a62d3..c04e6b8eb2b1 100644 --- a/Documentation/admin-guide/sysctl/kernel.rst +++ b/Documentation/admin-guide/sysctl/kernel.rst @@ -1014,30 +1014,26 @@ perf_user_access (arm64 and riscv only) Controls user space access for reading perf event counters. -arm64 -===== - -The default value is 0 (access disabled). +* for arm64 + The default value is 0 (access disabled). -When set to 1, user space can read performance monitor counter registers -directly. + When set to 1, user space can read performance monitor counter registers + directly. -See Documentation/arch/arm64/perf.rst for more information. - -riscv -===== + See Documentation/arch/arm64/perf.rst for more information. -When set to 0, user space access is disabled. +* for riscv + When set to 0, user space access is disabled. -The default value is 1, user space can read performance monitor counter -registers through perf, any direct access without perf intervention will trigger -an illegal instruction. + The default value is 1, user space can read performance monitor counter + registers through perf, any direct access without perf intervention will trigger + an illegal instruction. -When set to 2, which enables legacy mode (user space has direct access to cycle -and insret CSRs only). 
Note that this legacy value is deprecated and will be -removed once all user space applications are fixed. + When set to 2, which enables legacy mode (user space has direct access to cycle + and insret CSRs only). Note that this legacy value is deprecated and will be + removed once all user space applications are fixed. -Note that the time CSR is always directly accessible to all modes. + Note that the time CSR is always directly accessible to all modes. pid_max ======= @@ -1465,7 +1461,7 @@ stack_erasing ============= This parameter can be used to control kernel stack erasing at the end -of syscalls for kernels built with ``CONFIG_GCC_PLUGIN_STACKLEAK``. +of syscalls for kernels built with ``CONFIG_KSTACK_ERASE``. That erasing reduces the information which kernel stack leak bugs can reveal and blocks some uninitialized stack variable attacks. @@ -1473,7 +1469,7 @@ The tradeoff is the performance impact: on a single CPU system kernel compilation sees a 1% slowdown, other systems and workloads may vary. = ==================================================================== -0 Kernel stack erasing is disabled, STACKLEAK_METRICS are not updated. +0 Kernel stack erasing is disabled, KSTACK_ERASE_METRICS are not updated. 1 Kernel stack erasing is enabled (default), it is performed before returning to the userspace at the end of syscalls. = ==================================================================== diff --git a/Documentation/admin-guide/sysctl/vm.rst b/Documentation/admin-guide/sysctl/vm.rst index 9bef46151d53..4d71211fdad8 100644 --- a/Documentation/admin-guide/sysctl/vm.rst +++ b/Documentation/admin-guide/sysctl/vm.rst @@ -465,8 +465,8 @@ The minimum value is 1 (1/1 -> 100%). The value less than 1 completely disables protection of the pages. -max_map_count: -============== +max_map_count +============= This file contains the maximum number of memory map areas a process may have. 
Memory map areas are used as a side-effect of calling @@ -495,8 +495,8 @@ memory allocations. The default value depends on CONFIG_MEM_ALLOC_PROFILING_ENABLED_BY_DEFAULT. -memory_failure_early_kill: -========================== +memory_failure_early_kill +========================= Control how to kill processes when uncorrected memory error (typically a 2bit error in a memory module) is detected in the background by hardware diff --git a/Documentation/admin-guide/thunderbolt.rst b/Documentation/admin-guide/thunderbolt.rst index 240fee618e06..102c693c8f81 100644 --- a/Documentation/admin-guide/thunderbolt.rst +++ b/Documentation/admin-guide/thunderbolt.rst @@ -358,12 +358,7 @@ Forcing power Many OEMs include a method that can be used to force the power of a Thunderbolt controller to an "On" state even if nothing is connected. If supported by your machine this will be exposed by the WMI bus with -a sysfs attribute called "force_power". - -For example the intel-wmi-thunderbolt driver exposes this attribute in: - /sys/bus/wmi/devices/86CCFD48-205E-4A77-9C48-2021CBEDE341/force_power - - To force the power to on, write 1 to this attribute file. - To disable force power, write 0 to this attribute file. +a sysfs attribute called "force_power", see +Documentation/ABI/testing/sysfs-platform-intel-wmi-thunderbolt for details. Note: it's currently not possible to query the force power state of a platform. |