diff options
Diffstat (limited to 'Documentation/admin-guide')
-rw-r--r-- | Documentation/admin-guide/bootconfig.rst | 2 | ||||
-rw-r--r-- | Documentation/admin-guide/cgroup-v2.rst | 9 | ||||
-rw-r--r-- | Documentation/admin-guide/device-mapper/thin-provisioning.rst | 16 | ||||
-rw-r--r-- | Documentation/admin-guide/kdump/kdump.rst | 21 | ||||
-rw-r--r-- | Documentation/admin-guide/kdump/vmcoreinfo.rst | 8 | ||||
-rw-r--r-- | Documentation/admin-guide/kernel-parameters.txt | 114 | ||||
-rw-r--r-- | Documentation/admin-guide/mm/damon/index.rst | 1 | ||||
-rw-r--r-- | Documentation/admin-guide/mm/damon/stat.rst | 69 | ||||
-rw-r--r-- | Documentation/admin-guide/mm/damon/usage.rst | 46 | ||||
-rw-r--r-- | Documentation/admin-guide/mm/index.rst | 1 | ||||
-rw-r--r-- | Documentation/admin-guide/mm/slab.rst | 469 | ||||
-rw-r--r-- | Documentation/admin-guide/mm/transhuge.rst | 19 | ||||
-rw-r--r-- | Documentation/admin-guide/sysctl/kernel.rst | 24 |
13 files changed, 754 insertions, 45 deletions
diff --git a/Documentation/admin-guide/bootconfig.rst b/Documentation/admin-guide/bootconfig.rst index 91339efdcb54..7a86042c9b6d 100644 --- a/Documentation/admin-guide/bootconfig.rst +++ b/Documentation/admin-guide/bootconfig.rst @@ -265,7 +265,7 @@ The final kernel cmdline will be the following:: Config File Limitation ====================== -Currently the maximum config size size is 32KB and the total key-words (not +Currently the maximum config size is 32KB and the total key-words (not key-value entries) must be under 1024 nodes. Note: this is not the number of entries but nodes, an entry must consume more than 2 nodes (a key-word and a value). So theoretically, it will be diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst index bd98ea3175ec..d9d3cc7df348 100644 --- a/Documentation/admin-guide/cgroup-v2.rst +++ b/Documentation/admin-guide/cgroup-v2.rst @@ -435,6 +435,15 @@ both cgroups. Controlling Controllers ----------------------- +Availablity +~~~~~~~~~~~ + +A controller is available in a cgroup when it is supported by the kernel (i.e., +compiled in, not disabled and not attached to a v1 hierarchy) and listed in the +"cgroup.controllers" file. Availability means the controller's interface files +are exposed in the cgroup’s directory, allowing the distribution of the target +resource to be observed or controlled within that cgroup. + Enabling and Disabling ~~~~~~~~~~~~~~~~~~~~~~ diff --git a/Documentation/admin-guide/device-mapper/thin-provisioning.rst b/Documentation/admin-guide/device-mapper/thin-provisioning.rst index bafebf79da4b..b2fa49a5608a 100644 --- a/Documentation/admin-guide/device-mapper/thin-provisioning.rst +++ b/Documentation/admin-guide/device-mapper/thin-provisioning.rst @@ -80,11 +80,11 @@ less sharing than average you'll need a larger-than-average metadata device. As a guide, we suggest you calculate the number of bytes to use in the metadata device as 48 * $data_dev_size / $data_block_size but round it up -to 2MB if the answer is smaller. If you're creating large numbers of +to 2MiB if the answer is smaller. If you're creating large numbers of snapshots which are recording large amounts of change, you may find you need to increase this. -The largest size supported is 16GB: If the device is larger, +The largest size supported is 16GiB: If the device is larger, a warning will be issued and the excess space will not be used. Reloading a pool table @@ -107,13 +107,13 @@ Using an existing pool device $data_block_size gives the smallest unit of disk space that can be allocated at a time expressed in units of 512-byte sectors. -$data_block_size must be between 128 (64KB) and 2097152 (1GB) and a -multiple of 128 (64KB). $data_block_size cannot be changed after the +$data_block_size must be between 128 (64KiB) and 2097152 (1GiB) and a +multiple of 128 (64KiB). $data_block_size cannot be changed after the thin-pool is created. People primarily interested in thin provisioning -may want to use a value such as 1024 (512KB). People doing lots of -snapshotting may want a smaller value such as 128 (64KB). If you are +may want to use a value such as 1024 (512KiB). People doing lots of +snapshotting may want a smaller value such as 128 (64KiB). If you are not zeroing newly-allocated data, a larger $data_block_size in the -region of 256000 (128MB) is suggested. +region of 262144 (128MiB) is suggested. $low_water_mark is expressed in blocks of size $data_block_size. If free space on the data device drops below this level then a dm event @@ -291,7 +291,7 @@ i) Constructor error_if_no_space: Error IOs, instead of queueing, if no space. - Data block size must be between 64KB (128 sectors) and 1GB + Data block size must be between 64KiB (128 sectors) and 1GiB (2097152 sectors) inclusive. diff --git a/Documentation/admin-guide/kdump/kdump.rst b/Documentation/admin-guide/kdump/kdump.rst index 20fabdf6567e..9c6cd52f69cf 100644 --- a/Documentation/admin-guide/kdump/kdump.rst +++ b/Documentation/admin-guide/kdump/kdump.rst @@ -311,6 +311,27 @@ crashkernel syntax crashkernel=0,low +4) crashkernel=size,cma + + Reserve additional crash kernel memory from CMA. This reservation is + usable by the first system's userspace memory and kernel movable + allocations (memory balloon, zswap). Pages allocated from this memory + range will not be included in the vmcore so this should not be used if + dumping of userspace memory is intended and it has to be expected that + some movable kernel pages may be missing from the dump. + + A standard crashkernel reservation, as described above, is still needed + to hold the crash kernel and initrd. + + This option increases the risk of a kdump failure: DMA transfers + configured by the first kernel may end up corrupting the second + kernel's memory. + + This reservation method is intended for systems that can't afford to + sacrifice enough memory for standard crashkernel reservation and where + less reliable and possibly incomplete kdump is preferable to no kdump at + all. + Boot into System Kernel ----------------------- 1) Update the boot loader (such as grub, yaboot, or lilo) configuration diff --git a/Documentation/admin-guide/kdump/vmcoreinfo.rst b/Documentation/admin-guide/kdump/vmcoreinfo.rst index 8cf4614385b7..404a15f6782c 100644 --- a/Documentation/admin-guide/kdump/vmcoreinfo.rst +++ b/Documentation/admin-guide/kdump/vmcoreinfo.rst @@ -325,14 +325,14 @@ NR_FREE_PAGES On linux-2.6.21 or later, the number of free pages is in vm_stat[NR_FREE_PAGES]. Used to get the number of free pages. -PG_lru|PG_private|PG_swapcache|PG_swapbacked|PG_slab|PG_hwpoision|PG_head_mask|PG_hugetlb ------------------------------------------------------------------------------------------ +PG_lru|PG_private|PG_swapcache|PG_swapbacked|PG_hwpoison|PG_head_mask +-------------------------------------------------------------------------- Page attributes. These flags are used to filter various unnecessary for dumping pages. -PAGE_BUDDY_MAPCOUNT_VALUE(~PG_buddy)|PAGE_OFFLINE_MAPCOUNT_VALUE(~PG_offline)|PAGE_OFFLINE_MAPCOUNT_VALUE(~PG_unaccepted) -------------------------------------------------------------------------------------------------------------------------- +PAGE_SLAB_MAPCOUNT_VALUE|PAGE_BUDDY_MAPCOUNT_VALUE|PAGE_OFFLINE_MAPCOUNT_VALUE|PAGE_HUGETLB_MAPCOUNT_VALUE|PAGE_UNACCEPTED_MAPCOUNT_VALUE +------------------------------------------------------------------------------------------------------------------------------------------ More page attributes. These flags are used to filter various unnecessary for dumping pages. diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt index 0d2ea9a60145..747a55abf494 100644 --- a/Documentation/admin-guide/kernel-parameters.txt +++ b/Documentation/admin-guide/kernel-parameters.txt @@ -633,6 +633,14 @@ named mounts. Specifying both "all" and "named" disables all v1 hierarchies. + cgroup_v1_proc= [KNL] Show also missing controllers in /proc/cgroups + Format: { "true" | "false" } + /proc/cgroups lists only v1 controllers by default. + This compatibility option enables listing also v2 + controllers (whose v1 code is not compiled!), so that + semi-legacy software can check this file to decide + about usage of v2 (sic) controllers. + cgroup_favordynmods= [KNL] Enable or Disable favordynmods. Format: { "true" | "false" } Defaults to the value of CONFIG_CGROUP_FAVOR_DYNMODS. @@ -986,6 +994,28 @@ 0: to disable low allocation. It will be ignored when crashkernel=X,high is not used or memory reserved is below 4G. + crashkernel=size[KMG],cma + [KNL, X86] Reserve additional crash kernel memory from + CMA. This reservation is usable by the first system's + userspace memory and kernel movable allocations (memory + balloon, zswap). Pages allocated from this memory range + will not be included in the vmcore so this should not + be used if dumping of userspace memory is intended and + it has to be expected that some movable kernel pages + may be missing from the dump. + + A standard crashkernel reservation, as described above, + is still needed to hold the crash kernel and initrd. + + This option increases the risk of a kdump failure: DMA + transfers configured by the first kernel may end up + corrupting the second kernel's memory. + + This reservation method is intended for systems that + can't afford to sacrifice enough memory for standard + crashkernel reservation and where less reliable and + possibly incomplete kdump is preferable to no kdump at + all. cryptomgr.notests [KNL] Disable crypto self-tests @@ -1798,6 +1828,27 @@ backtraces on all cpus. Format: 0 | 1 + hash_pointers= + [KNL,EARLY] + By default, when pointers are printed to the console + or buffers via the %p format string, that pointer is + "hashed", i.e. obscured by hashing the pointer value. + This is a security feature that hides actual kernel + addresses from unprivileged users, but it also makes + debugging the kernel more difficult since unequal + pointers can no longer be compared. The choices are: + Format: { auto | always | never } + Default: auto + + auto - Hash pointers unless slab_debug is enabled. + always - Always hash pointers (even if slab_debug is + enabled). + never - Never hash pointers. This option should only + be specified when debugging the kernel. Do + not use on production kernels. The boot + param "no_hash_pointers" is an alias for + this mode. + hashdist= [KNL,NUMA] Large hashes allocated during boot are distributed across NUMA nodes. Defaults on for 64-bit NUMA, off otherwise. @@ -2212,6 +2263,11 @@ different crypto accelerators. This option can be used to achieve best performance for particular HW. + ima= [IMA] Enable or disable IMA + Format: { "off" | "on" } + Default: "on" + Note that disabling IMA is limited to kdump kernel. + indirect_target_selection= [X86,Intel] Mitigation control for Indirect Target Selection(ITS) bug in Intel CPUs. Updated microcode is also required for a fix in IBPB. @@ -4181,18 +4237,7 @@ no_hash_pointers [KNL,EARLY] - Force pointers printed to the console or buffers to be - unhashed. By default, when a pointer is printed via %p - format string, that pointer is "hashed", i.e. obscured - by hashing the pointer value. This is a security feature - that hides actual kernel addresses from unprivileged - users, but it also makes debugging the kernel more - difficult since unequal pointers can no longer be - compared. However, if this command-line option is - specified, then all normal pointers will have their true - value printed. This option should only be specified when - debugging the kernel. Please do not use on production - kernels. + Alias for "hash_pointers=never". nohibernate [HIBERNATION] Disable hibernation and resume. @@ -4544,7 +4589,7 @@ bit 2: print timer info bit 3: print locks info if CONFIG_LOCKDEP is on bit 4: print ftrace buffer - bit 5: print all printk messages in buffer + bit 5: replay all messages on consoles at the end of panic bit 6: print all CPUs backtrace (if available in the arch) bit 7: print only tasks in uninterruptible (blocked) state *Be aware* that this option may print a _lot_ of lines, @@ -4552,6 +4597,25 @@ Use this option carefully, maybe worth to setup a bigger log buffer with "log_buf_len" along with this. + panic_sys_info= A comma separated list of extra information to be dumped + on panic. + Format: val[,val...] + Where @val can be any of the following: + + tasks: print all tasks info + mem: print system memory info + timers: print timers info + locks: print locks info if CONFIG_LOCKDEP is on + ftrace: print ftrace buffer + all_bt: print all CPUs backtrace (if available in the arch) + blocked_tasks: print only tasks in uninterruptible (blocked) state + + This is a human readable alternative to the 'panic_print' option. + + panic_console_replay + When panic happens, replay all kernel messages on + consoles at the end of panic. + parkbd.port= [HW] Parallel port number the keyboard adapter is connected to, default is 0. Format: <parport#> @@ -5508,7 +5572,8 @@ echo 1 > /sys/module/rcutree/parameters/rcu_normal_wake_from_gp or pass a boot parameter "rcutree.rcu_normal_wake_from_gp=1" - Default is 0. + Default is 1 if num_possible_cpus() <= 16 and it is not explicitly + disabled by the boot parameter passing 0. rcuscale.gp_async= [KNL] Measure performance of asynchronous @@ -6586,14 +6651,18 @@ slab_debug can create guard zones around objects and may poison objects when not in use. Also tracks the last alloc / free. For more information see - Documentation/mm/slub.rst. + Documentation/admin-guide/mm/slab.rst. (slub_debug legacy name also accepted for now) + Using this option implies the "no_hash_pointers" + option which can be undone by adding the + "hash_pointers=always" option. + slab_max_order= [MM] Determines the maximum allowed order for slabs. A high setting may cause OOMs due to memory fragmentation. For more information see - Documentation/mm/slub.rst. + Documentation/admin-guide/mm/slab.rst. (slub_max_order legacy name also accepted for now) slab_merge [MM] @@ -6608,13 +6677,14 @@ the number of objects indicated. The higher the number of objects the smaller the overhead of tracking slabs and the less frequently locks need to be acquired. - For more information see Documentation/mm/slub.rst. + For more information see + Documentation/admin-guide/mm/slab.rst. (slub_min_objects legacy name also accepted for now) slab_min_order= [MM] Determines the minimum page order for slabs. Must be lower or equal to slab_max_order. For more information see - Documentation/mm/slub.rst. + Documentation/admin-guide/mm/slab.rst. (slub_min_order legacy name also accepted for now) slab_nomerge [MM] @@ -6628,7 +6698,8 @@ cache (risks via metadata attacks are mostly unchanged). Debug options disable merging on their own. - For more information see Documentation/mm/slub.rst. + For more information see + Documentation/admin-guide/mm/slab.rst. (slub_nomerge legacy name also accepted for now) slab_strict_numa [MM] @@ -7016,6 +7087,11 @@ consumed by the stack hash table. By default this is set to false. + stack_depot_max_pools= [KNL,EARLY] + Specify the maximum number of pools to use for storing + stack traces. Pools are allocated on-demand up to this + limit. Default value is 8191 pools. + stacktrace [FTRACE] Enabled the stack tracer on boot up. diff --git a/Documentation/admin-guide/mm/damon/index.rst b/Documentation/admin-guide/mm/damon/index.rst index bc7e976120e0..3ce3164480c7 100644 --- a/Documentation/admin-guide/mm/damon/index.rst +++ b/Documentation/admin-guide/mm/damon/index.rst @@ -14,3 +14,4 @@ access monitoring and access-aware system operations. usage reclaim lru_sort + stat diff --git a/Documentation/admin-guide/mm/damon/stat.rst b/Documentation/admin-guide/mm/damon/stat.rst new file mode 100644 index 000000000000..4c517c2c219a --- /dev/null +++ b/Documentation/admin-guide/mm/damon/stat.rst @@ -0,0 +1,69 @@ +.. SPDX-License-Identifier: GPL-2.0 + +=================================== +Data Access Monitoring Results Stat +=================================== + +Data Access Monitoring Results Stat (DAMON_STAT) is a static kernel module that +is aimed to be used for simple access pattern monitoring. It monitors accesses +on the system's entire physical memory using DAMON, and provides simplified +access monitoring results statistics, namely idle time percentiles and +estimated memory bandwidth. + +Monitoring Accuracy and Overhead +================================ + +DAMON_STAT uses monitoring intervals :ref:`auto-tuning +<damon_design_monitoring_intervals_autotuning>` to make its accuracy high and +overhead minimum. It auto-tunes the intervals aiming 4 % of observable access +events to be captured in each snapshot, while limiting the resulting sampling +events to be 5 milliseconds in minimum and 10 seconds in maximum. On a few +production server systems, it resulted in consuming only 0.x % single CPU time, +while capturing reasonable quality of access patterns. + +Interface: Module Parameters +============================ + +To use this feature, you should first ensure your system is running on a kernel +that is built with ``CONFIG_DAMON_STAT=y``. The feature can be enabled by +default at build time, by setting ``CONFIG_DAMON_STAT_ENABLED_DEFAULT`` true. + +To let sysadmins enable or disable it at boot and/or runtime, and read the +monitoring results, DAMON_STAT provides module parameters. Following +sections are descriptions of the parameters. + +enabled +------- + +Enable or disable DAMON_STAT. + +You can enable DAMON_STAT by setting the value of this parameter as ``Y``. +Setting it as ``N`` disables DAMON_STAT. The default value is set by +``CONFIG_DAMON_STAT_ENABLED_DEFAULT`` build config option. + +estimated_memory_bandwidth +-------------------------- + +Estimated memory bandwidth consumption (bytes per second) of the system. + +DAMON_STAT reads observed access events on the current DAMON results snapshot +and converts it to memory bandwidth consumption estimation in bytes per second. +The resulting metric is exposed to user via this read-only parameter. Because +DAMON uses sampling, this is only an estimation of the access intensity rather +than accurate memory bandwidth. + +memory_idle_ms_percentiles +-------------------------- + +Per-byte idle time (milliseconds) percentiles of the system. + +DAMON_STAT calculates how long each byte of the memory was not accessed until +now (idle time), based on the current DAMON results snapshot. If DAMON found a +region of access frequency (nr_accesses) larger than zero, every byte of the +region gets zero idle time. If a region has zero access frequency +(nr_accesses), how long the region was keeping the zero access frequency (age) +becomes the idle time of every byte of the region. Then, DAMON_STAT exposes +the percentiles of the idle time values via this read-only parameter. Reading +the parameter returns 101 idle time values in milliseconds, separated by comma. +Each value represents 0-th, 1st, 2nd, 3rd, ..., 99th and 100th percentile idle +times. diff --git a/Documentation/admin-guide/mm/damon/usage.rst b/Documentation/admin-guide/mm/damon/usage.rst index d960aba72b82..ff3a2dda1f02 100644 --- a/Documentation/admin-guide/mm/damon/usage.rst +++ b/Documentation/admin-guide/mm/damon/usage.rst @@ -59,7 +59,7 @@ comma (","). :ref:`/sys/kernel/mm/damon <sysfs_root>`/admin │ :ref:`kdamonds <sysfs_kdamonds>`/nr_kdamonds - │ │ :ref:`0 <sysfs_kdamond>`/state,pid + │ │ :ref:`0 <sysfs_kdamond>`/state,pid,refresh_ms │ │ │ :ref:`contexts <sysfs_contexts>`/nr_contexts │ │ │ │ :ref:`0 <sysfs_context>`/avail_operations,operations │ │ │ │ │ :ref:`monitoring_attrs <sysfs_monitoring_attrs>`/ @@ -85,6 +85,8 @@ comma (","). │ │ │ │ │ │ │ :ref:`watermarks <sysfs_watermarks>`/metric,interval_us,high,mid,low │ │ │ │ │ │ │ :ref:`{core_,ops_,}filters <sysfs_filters>`/nr_filters │ │ │ │ │ │ │ │ 0/type,matching,allow,memcg_path,addr_start,addr_end,target_idx,min,max + │ │ │ │ │ │ │ :ref:`dests <damon_sysfs_dests>`/nr_dests + │ │ │ │ │ │ │ │ 0/id,weight │ │ │ │ │ │ │ :ref:`stats <sysfs_schemes_stats>`/nr_tried,sz_tried,nr_applied,sz_applied,sz_ops_filter_passed,qt_exceeds │ │ │ │ │ │ │ :ref:`tried_regions <sysfs_schemes_tried_regions>`/total_bytes │ │ │ │ │ │ │ │ 0/start,end,nr_accesses,age,sz_filter_passed @@ -121,8 +123,8 @@ kdamond. kdamonds/<N>/ ------------- -In each kdamond directory, two files (``state`` and ``pid``) and one directory -(``contexts``) exist. +In each kdamond directory, three files (``state``, ``pid`` and ``refresh_ms``) +and one directory (``contexts``) exist. Reading ``state`` returns ``on`` if the kdamond is currently running, or ``off`` if it is not running. @@ -159,6 +161,13 @@ Users can write below commands for the kdamond to the ``state`` file. If the state is ``on``, reading ``pid`` shows the pid of the kdamond thread. +Users can ask the kernel to periodically update files showing auto-tuned +parameters and DAMOS stats instead of manually writing +``update_tuned_intervals`` like keywords to ``state`` file. For this, users +should write the desired update time interval in milliseconds to ``refresh_ms`` +file. If the interval is zero, the periodic update is disabled. Reading the +file shows currently set time interval. + ``contexts`` directory contains files for controlling the monitoring contexts that this kdamond will execute. @@ -307,10 +316,10 @@ to ``N-1``. Each directory represents each DAMON-based operation scheme. schemes/<N>/ ------------ -In each scheme directory, seven directories (``access_pattern``, ``quotas``, -``watermarks``, ``core_filters``, ``ops_filters``, ``filters``, ``stats``, and -``tried_regions``) and three files (``action``, ``target_nid`` and -``apply_interval``) exist. +In each scheme directory, eight directories (``access_pattern``, ``quotas``, +``watermarks``, ``core_filters``, ``ops_filters``, ``filters``, ``dests``, +``stats``, and ``tried_regions``) and three files (``action``, ``target_nid`` +and ``apply_interval``) exist. The ``action`` file is for setting and getting the scheme's :ref:`action <damon_design_damos_action>`. The keywords that can be written to and read @@ -484,6 +493,29 @@ Refer to the :ref:`DAMOS filters design documentation of different ``allow`` works, when each of the filters are supported, and differences on stats. +.. _damon_sysfs_dests: + +schemes/<N>/dests/ +------------------ + +Directory for specifying the destinations of given DAMON-based operation +scheme's action. This directory is ignored if the action of the given scheme +is not supporting multiple destinations. Only ``DAMOS_MIGRATE_{HOT,COLD}`` +actions are supporting multiple destinations. + +In the beginning, the directory has only one file, ``nr_dests``. Writing a +number (``N``) to the file creates the number of child directories named ``0`` +to ``N-1``. Each directory represents each action destination. + +Each destination directory contains two files, namely ``id`` and ``weight``. +Users can write and read the identifier of the destination to ``id`` file. +For ``DAMOS_MIGRATE_{HOT,COLD}`` actions, the migrate destination node's node +id should be written to ``id`` file. Users can write and read the weight of +the destination among the given destinations to the ``weight`` file. The +weight can be an arbitrary integer. When DAMOS apply the action to each entity +of the memory region, it will select the destination of the action based on the +relative weights of the destinations. + .. _sysfs_schemes_stats: schemes/<N>/stats/ diff --git a/Documentation/admin-guide/mm/index.rst b/Documentation/admin-guide/mm/index.rst index 2d2f6c222308..ebc83ca20fdc 100644 --- a/Documentation/admin-guide/mm/index.rst +++ b/Documentation/admin-guide/mm/index.rst @@ -37,6 +37,7 @@ the Linux memory management. numaperf pagemap shrinker_debugfs + slab soft-dirty swap_numa transhuge diff --git a/Documentation/admin-guide/mm/slab.rst b/Documentation/admin-guide/mm/slab.rst new file mode 100644 index 000000000000..14429ab90611 --- /dev/null +++ b/Documentation/admin-guide/mm/slab.rst @@ -0,0 +1,469 @@ +======================================== +Short users guide for the slab allocator +======================================== + +The slab allocator includes full debugging support (when built with +CONFIG_SLUB_DEBUG=y) but it is off by default (unless built with +CONFIG_SLUB_DEBUG_ON=y). You can enable debugging only for selected +slabs in order to avoid an impact on overall system performance which +may make a bug more difficult to find. + +In order to switch debugging on one can add an option ``slab_debug`` +to the kernel command line. That will enable full debugging for +all slabs. + +Typically one would then use the ``slabinfo`` command to get statistical +data and perform operation on the slabs. By default ``slabinfo`` only lists +slabs that have data in them. See "slabinfo -h" for more options when +running the command. ``slabinfo`` can be compiled with +:: + + gcc -o slabinfo tools/mm/slabinfo.c + +Some of the modes of operation of ``slabinfo`` require that slub debugging +be enabled on the command line. F.e. no tracking information will be +available without debugging on and validation can only partially +be performed if debugging was not switched on. + +Some more sophisticated uses of slab_debug: +------------------------------------------- + +Parameters may be given to ``slab_debug``. If none is specified then full +debugging is enabled. Format: + +slab_debug=<Debug-Options> + Enable options for all slabs + +slab_debug=<Debug-Options>,<slab name1>,<slab name2>,... + Enable options only for select slabs (no spaces + after a comma) + +Multiple blocks of options for all slabs or selected slabs can be given, with +blocks of options delimited by ';'. The last of "all slabs" blocks is applied +to all slabs except those that match one of the "select slabs" block. Options +of the first "select slabs" blocks that matches the slab's name are applied. + +Possible debug options are:: + + F Sanity checks on (enables SLAB_DEBUG_CONSISTENCY_CHECKS + Sorry SLAB legacy issues) + Z Red zoning + P Poisoning (object and padding) + U User tracking (free and alloc) + T Trace (please only use on single slabs) + A Enable failslab filter mark for the cache + O Switch debugging off for caches that would have + caused higher minimum slab orders + - Switch all debugging off (useful if the kernel is + configured with CONFIG_SLUB_DEBUG_ON) + +F.e. in order to boot just with sanity checks and red zoning one would specify:: + + slab_debug=FZ + +Trying to find an issue in the dentry cache? Try:: + + slab_debug=,dentry + +to only enable debugging on the dentry cache. You may use an asterisk at the +end of the slab name, in order to cover all slabs with the same prefix. For +example, here's how you can poison the dentry cache as well as all kmalloc +slabs:: + + slab_debug=P,kmalloc-*,dentry + +Red zoning and tracking may realign the slab. We can just apply sanity checks +to the dentry cache with:: + + slab_debug=F,dentry + +Debugging options may require the minimum possible slab order to increase as +a result of storing the metadata (for example, caches with PAGE_SIZE object +sizes). This has a higher likelihood of resulting in slab allocation errors +in low memory situations or if there's high fragmentation of memory. To +switch off debugging for such caches by default, use:: + + slab_debug=O + +You can apply different options to different list of slab names, using blocks +of options. This will enable red zoning for dentry and user tracking for +kmalloc. All other slabs will not get any debugging enabled:: + + slab_debug=Z,dentry;U,kmalloc-* + +You can also enable options (e.g. sanity checks and poisoning) for all caches +except some that are deemed too performance critical and don't need to be +debugged by specifying global debug options followed by a list of slab names +with "-" as options:: + + slab_debug=FZ;-,zs_handle,zspage + +The state of each debug option for a slab can be found in the respective files +under:: + + /sys/kernel/slab/<slab name>/ + +If the file contains 1, the option is enabled, 0 means disabled. The debug +options from the ``slab_debug`` parameter translate to the following files:: + + F sanity_checks + Z red_zone + P poison + U store_user + T trace + A failslab + +failslab file is writable, so writing 1 or 0 will enable or disable +the option at runtime. Write returns -EINVAL if cache is an alias. +Careful with tracing: It may spew out lots of information and never stop if +used on the wrong slab. + +Slab merging +============ + +If no debug options are specified then SLUB may merge similar slabs together +in order to reduce overhead and increase cache hotness of objects. +``slabinfo -a`` displays which slabs were merged together. + +Slab validation +=============== + +SLUB can validate all object if the kernel was booted with slab_debug. In +order to do so you must have the ``slabinfo`` tool. Then you can do +:: + + slabinfo -v + +which will test all objects. Output will be generated to the syslog. + +This also works in a more limited way if boot was without slab debug. +In that case ``slabinfo -v`` simply tests all reachable objects. Usually +these are in the cpu slabs and the partial slabs. Full slabs are not +tracked by SLUB in a non debug situation. + +Getting more performance +======================== + +To some degree SLUB's performance is limited by the need to take the +list_lock once in a while to deal with partial slabs. That overhead is +governed by the order of the allocation for each slab. The allocations +can be influenced by kernel parameters: + +.. slab_min_objects=x (default: automatically scaled by number of cpus) +.. slab_min_order=x (default 0) +.. slab_max_order=x (default 3 (PAGE_ALLOC_COSTLY_ORDER)) + +``slab_min_objects`` + allows to specify how many objects must at least fit into one + slab in order for the allocation order to be acceptable. In + general slub will be able to perform this number of + allocations on a slab without consulting centralized resources + (list_lock) where contention may occur. + +``slab_min_order`` + specifies a minimum order of slabs. A similar effect like + ``slab_min_objects``. + +``slab_max_order`` + specified the order at which ``slab_min_objects`` should no + longer be checked. This is useful to avoid SLUB trying to + generate super large order pages to fit ``slab_min_objects`` + of a slab cache with large object sizes into one high order + page. Setting command line parameter + ``debug_guardpage_minorder=N`` (N > 0), forces setting + ``slab_max_order`` to 0, what cause minimum possible order of + slabs allocation. + +``slab_strict_numa`` + Enables the application of memory policies on each + allocation. This results in more accurate placement of + objects which may result in the reduction of accesses + to remote nodes. The default is to only apply memory + policies at the folio level when a new folio is acquired + or a folio is retrieved from the lists. Enabling this + option reduces the fastpath performance of the slab allocator. + +SLUB Debug output +================= + +Here is a sample of slub debug output:: + + ==================================================================== + BUG kmalloc-8: Right Redzone overwritten + -------------------------------------------------------------------- + + INFO: 0xc90f6d28-0xc90f6d2b. First byte 0x00 instead of 0xcc + INFO: Slab 0xc528c530 flags=0x400000c3 inuse=61 fp=0xc90f6d58 + INFO: Object 0xc90f6d20 @offset=3360 fp=0xc90f6d58 + INFO: Allocated in get_modalias+0x61/0xf5 age=53 cpu=1 pid=554 + + Bytes b4 (0xc90f6d10): 00 00 00 00 00 00 00 00 5a 5a 5a 5a 5a 5a 5a 5a ........ZZZZZZZZ + Object (0xc90f6d20): 31 30 31 39 2e 30 30 35 1019.005 + Redzone (0xc90f6d28): 00 cc cc cc . + Padding (0xc90f6d50): 5a 5a 5a 5a 5a 5a 5a 5a ZZZZZZZZ + + [<c010523d>] dump_trace+0x63/0x1eb + [<c01053df>] show_trace_log_lvl+0x1a/0x2f + [<c010601d>] show_trace+0x12/0x14 + [<c0106035>] dump_stack+0x16/0x18 + [<c017e0fa>] object_err+0x143/0x14b + [<c017e2cc>] check_object+0x66/0x234 + [<c017eb43>] __slab_free+0x239/0x384 + [<c017f446>] kfree+0xa6/0xc6 + [<c02e2335>] get_modalias+0xb9/0xf5 + [<c02e23b7>] dmi_dev_uevent+0x27/0x3c + [<c027866a>] dev_uevent+0x1ad/0x1da + [<c0205024>] kobject_uevent_env+0x20a/0x45b + [<c020527f>] kobject_uevent+0xa/0xf + [<c02779f1>] store_uevent+0x4f/0x58 + [<c027758e>] dev_attr_store+0x29/0x2f + [<c01bec4f>] sysfs_write_file+0x16e/0x19c + [<c0183ba7>] vfs_write+0xd1/0x15a + [<c01841d7>] sys_write+0x3d/0x72 + [<c0104112>] sysenter_past_esp+0x5f/0x99 + [<b7f7b410>] 0xb7f7b410 + ======================= + + FIX kmalloc-8: Restoring Redzone 0xc90f6d28-0xc90f6d2b=0xcc + +If SLUB encounters a corrupted object (full detection requires the kernel +to be booted with slab_debug) then the following output will be dumped +into the syslog: + +1. Description of the problem encountered + + This will be a message in the system log starting with:: + + =============================================== + BUG <slab cache affected>: <What went wrong> + ----------------------------------------------- + + INFO: <corruption start>-<corruption_end> <more info> + INFO: Slab <address> <slab information> + INFO: Object <address> <object information> + INFO: Allocated in <kernel function> age=<jiffies since alloc> cpu=<allocated by + cpu> pid=<pid of the process> + INFO: Freed in <kernel function> age=<jiffies since free> cpu=<freed by cpu> + pid=<pid of the process> + + (Object allocation / free information is only available if SLAB_STORE_USER is + set for the slab. slab_debug sets that option) + +2. The object contents if an object was involved. + + Various types of lines can follow the BUG SLUB line: + + Bytes b4 <address> : <bytes> + Shows a few bytes before the object where the problem was detected. + Can be useful if the corruption does not stop with the start of the + object. + + Object <address> : <bytes> + The bytes of the object. If the object is inactive then the bytes + typically contain poison values. Any non-poison value shows a + corruption by a write after free. + + Redzone <address> : <bytes> + The Redzone following the object. The Redzone is used to detect + writes after the object. All bytes should always have the same + value. If there is any deviation then it is due to a write after + the object boundary. + + (Redzone information is only available if SLAB_RED_ZONE is set. + slab_debug sets that option) + + Padding <address> : <bytes> + Unused data to fill up the space in order to get the next object + properly aligned. In the debug case we make sure that there are + at least 4 bytes of padding. This allows the detection of writes + before the object. + +3. A stackdump + + The stackdump describes the location where the error was detected. The cause + of the corruption is may be more likely found by looking at the function that + allocated or freed the object. + +4. Report on how the problem was dealt with in order to ensure the continued + operation of the system. + + These are messages in the system log beginning with:: + + FIX <slab cache affected>: <corrective action taken> + + In the above sample SLUB found that the Redzone of an active object has + been overwritten. Here a string of 8 characters was written into a slab that + has the length of 8 characters. However, a 8 character string needs a + terminating 0. That zero has overwritten the first byte of the Redzone field. + After reporting the details of the issue encountered the FIX SLUB message + tells us that SLUB has restored the Redzone to its proper value and then + system operations continue. + +Emergency operations +==================== + +Minimal debugging (sanity checks alone) can be enabled by booting with:: + + slab_debug=F + +This will be generally be enough to enable the resiliency features of slub +which will keep the system running even if a bad kernel component will +keep corrupting objects. This may be important for production systems. +Performance will be impacted by the sanity checks and there will be a +continual stream of error messages to the syslog but no additional memory +will be used (unlike full debugging). + +No guarantees. The kernel component still needs to be fixed. Performance +may be optimized further by locating the slab that experiences corruption +and enabling debugging only for that cache + +I.e.:: + + slab_debug=F,dentry + +If the corruption occurs by writing after the end of the object then it +may be advisable to enable a Redzone to avoid corrupting the beginning +of other objects:: + + slab_debug=FZ,dentry + +Extended slabinfo mode and plotting +=================================== + +The ``slabinfo`` tool has a special 'extended' ('-X') mode that includes: + - Slabcache Totals + - Slabs sorted by size (up to -N <num> slabs, default 1) + - Slabs sorted by loss (up to -N <num> slabs, default 1) + +Additionally, in this mode ``slabinfo`` does not dynamically scale +sizes (G/M/K) and reports everything in bytes (this functionality is +also available to other slabinfo modes via '-B' option) which makes +reporting more precise and accurate. Moreover, in some sense the `-X' +mode also simplifies the analysis of slabs' behaviour, because its +output can be plotted using the ``slabinfo-gnuplot.sh`` script. So it +pushes the analysis from looking through the numbers (tons of numbers) +to something easier -- visual analysis. + +To generate plots: + +a) collect slabinfo extended records, for example:: + + while [ 1 ]; do slabinfo -X >> FOO_STATS; sleep 1; done + +b) pass stats file(-s) to ``slabinfo-gnuplot.sh`` script:: + + slabinfo-gnuplot.sh FOO_STATS [FOO_STATS2 .. FOO_STATSN] + + The ``slabinfo-gnuplot.sh`` script will pre-processes the collected records + and generates 3 png files (and 3 pre-processing cache files) per STATS + file: + - Slabcache Totals: FOO_STATS-totals.png + - Slabs sorted by size: FOO_STATS-slabs-by-size.png + - Slabs sorted by loss: FOO_STATS-slabs-by-loss.png + +Another use case, when ``slabinfo-gnuplot.sh`` can be useful, is when you +need to compare slabs' behaviour "prior to" and "after" some code +modification. To help you out there, ``slabinfo-gnuplot.sh`` script +can 'merge' the `Slabcache Totals` sections from different +measurements. To visually compare N plots: + +a) Collect as many STATS1, STATS2, .. STATSN files as you need:: + + while [ 1 ]; do slabinfo -X >> STATS<X>; sleep 1; done + +b) Pre-process those STATS files:: + + slabinfo-gnuplot.sh STATS1 STATS2 .. STATSN + +c) Execute ``slabinfo-gnuplot.sh`` in '-t' mode, passing all of the + generated pre-processed \*-totals:: + + slabinfo-gnuplot.sh -t STATS1-totals STATS2-totals .. STATSN-totals + + This will produce a single plot (png file). + + Plots, expectedly, can be large so some fluctuations or small spikes + can go unnoticed. To deal with that, ``slabinfo-gnuplot.sh`` has two + options to 'zoom-in'/'zoom-out': + + a) ``-s %d,%d`` -- overwrites the default image width and height + b) ``-r %d,%d`` -- specifies a range of samples to use (for example, + in ``slabinfo -X >> FOO_STATS; sleep 1;`` case, using a ``-r + 40,60`` range will plot only samples collected between 40th and + 60th seconds). + + +DebugFS files for SLUB +====================== + +For more information about current state of SLUB caches with the user tracking +debug option enabled, debugfs files are available, typically under +/sys/kernel/debug/slab/<cache>/ (created only for caches with enabled user +tracking). There are 2 types of these files with the following debug +information: + +1. alloc_traces:: + + Prints information about unique allocation traces of the currently + allocated objects. The output is sorted by frequency of each trace. + + Information in the output: + Number of objects, allocating function, possible memory wastage of + kmalloc objects(total/per-object), minimal/average/maximal jiffies + since alloc, pid range of the allocating processes, cpu mask of + allocating cpus, numa node mask of origins of memory, and stack trace. + + Example::: + + 338 pci_alloc_dev+0x2c/0xa0 waste=521872/1544 age=290837/291891/293509 pid=1 cpus=106 nodes=0-1 + __kmem_cache_alloc_node+0x11f/0x4e0 + kmalloc_trace+0x26/0xa0 + pci_alloc_dev+0x2c/0xa0 + pci_scan_single_device+0xd2/0x150 + pci_scan_slot+0xf7/0x2d0 + pci_scan_child_bus_extend+0x4e/0x360 + acpi_pci_root_create+0x32e/0x3b0 + pci_acpi_scan_root+0x2b9/0x2d0 + acpi_pci_root_add.cold.11+0x110/0xb0a + acpi_bus_attach+0x262/0x3f0 + device_for_each_child+0xb7/0x110 + acpi_dev_for_each_child+0x77/0xa0 + acpi_bus_attach+0x108/0x3f0 + device_for_each_child+0xb7/0x110 + acpi_dev_for_each_child+0x77/0xa0 + acpi_bus_attach+0x108/0x3f0 + +2. free_traces:: + + Prints information about unique freeing traces of the currently allocated + objects. The freeing traces thus come from the previous life-cycle of the + objects and are reported as not available for objects allocated for the first + time. The output is sorted by frequency of each trace. + + Information in the output: + Number of objects, freeing function, minimal/average/maximal jiffies since free, + pid range of the freeing processes, cpu mask of freeing cpus, and stack trace. + + Example::: + + 1980 <not-available> age=4294912290 pid=0 cpus=0 + 51 acpi_ut_update_ref_count+0x6a6/0x782 age=236886/237027/237772 pid=1 cpus=1 + kfree+0x2db/0x420 + acpi_ut_update_ref_count+0x6a6/0x782 + acpi_ut_update_object_reference+0x1ad/0x234 + acpi_ut_remove_reference+0x7d/0x84 + acpi_rs_get_prt_method_data+0x97/0xd6 + acpi_get_irq_routing_table+0x82/0xc4 + acpi_pci_irq_find_prt_entry+0x8e/0x2e0 + acpi_pci_irq_lookup+0x3a/0x1e0 + acpi_pci_irq_enable+0x77/0x240 + pcibios_enable_device+0x39/0x40 + do_pci_enable_device.part.0+0x5d/0xe0 + pci_enable_device_flags+0xfc/0x120 + pci_enable_device+0x13/0x20 + virtio_pci_probe+0x9e/0x170 + local_pci_probe+0x48/0x80 + pci_device_probe+0x105/0x1c0 + +Christoph Lameter, May 30, 2007 +Sergey Senozhatsky, October 23, 2015 diff --git a/Documentation/admin-guide/mm/transhuge.rst b/Documentation/admin-guide/mm/transhuge.rst index dff8d5985f0f..370fba113460 100644 --- a/Documentation/admin-guide/mm/transhuge.rst +++ b/Documentation/admin-guide/mm/transhuge.rst @@ -107,7 +107,7 @@ sysfs Global THP controls ------------------- -Transparent Hugepage Support for anonymous memory can be entirely disabled +Transparent Hugepage Support for anonymous memory can be disabled (mostly for debugging purposes) or only enabled inside MADV_HUGEPAGE regions (to avoid the risk of consuming more memory resources) or enabled system wide. This can be achieved per-supported-THP-size with one of:: @@ -119,6 +119,11 @@ system wide. This can be achieved per-supported-THP-size with one of:: where <size> is the hugepage size being addressed, the available sizes for which vary by system. +.. note:: Setting "never" in all sysfs THP controls does **not** disable + Transparent Huge Pages globally. This is because ``madvise(..., + MADV_COLLAPSE)`` ignores these settings and collapses ranges to + PMD-sized huge pages unconditionally. + For example:: echo always >/sys/kernel/mm/transparent_hugepage/hugepages-2048kB/enabled @@ -187,7 +192,9 @@ madvise behaviour. never - should be self-explanatory. + should be self-explanatory. Note that ``madvise(..., + MADV_COLLAPSE)`` can still cause transparent huge pages to be + obtained even if this mode is specified everywhere. By default kernel tries to use huge, PMD-mappable zero page on read page fault to anonymous mapping. It's possible to disable huge zero @@ -378,7 +385,9 @@ always Attempt to allocate huge pages every time we need a new page; never - Do not allocate huge pages; + Do not allocate huge pages. Note that ``madvise(..., MADV_COLLAPSE)`` + can still cause transparent huge pages to be obtained even if this mode + is specified everywhere; within_size Only allocate huge page if it will be fully within i_size. @@ -434,7 +443,9 @@ inherit have enabled="inherit" and all other hugepage sizes have enabled="never"; never - Do not allocate <size> huge pages; + Do not allocate <size> huge pages. Note that ``madvise(..., + MADV_COLLAPSE)`` can still cause transparent huge pages to be obtained + even if this mode is specified everywhere; within_size Only allocate <size> huge page if it will be fully within i_size. diff --git a/Documentation/admin-guide/sysctl/kernel.rst b/Documentation/admin-guide/sysctl/kernel.rst index c04e6b8eb2b1..8b49eab937d0 100644 --- a/Documentation/admin-guide/sysctl/kernel.rst +++ b/Documentation/admin-guide/sysctl/kernel.rst @@ -177,6 +177,7 @@ core_pattern %E executable path %c maximum size of core file by resource limit RLIMIT_CORE %C CPU the task ran on + %F pidfd number %<OTHER> both are dropped ======== ========================================== @@ -889,7 +890,7 @@ bit 1 print system memory info bit 2 print timer info bit 3 print locks info if ``CONFIG_LOCKDEP`` is on bit 4 print ftrace buffer -bit 5 print all printk messages in buffer +bit 5 replay all messages on consoles at the end of panic bit 6 print all CPUs backtrace (if available in the arch) bit 7 print only tasks in uninterruptible (blocked) state ===== ============================================ @@ -899,6 +900,24 @@ So for example to print tasks and memory info on panic, user can:: echo 3 > /proc/sys/kernel/panic_print +panic_sys_info +============== + +A comma separated list of extra information to be dumped on panic, +for example, "tasks,mem,timers,...". It is a human readable alternative +to 'panic_print'. Possible values are: + +============= =================================================== +tasks print all tasks info +mem print system memory info +timer print timers info +lock print locks info if CONFIG_LOCKDEP is on +ftrace print ftrace buffer +all_bt print all CPUs backtrace (if available in the arch) +blocked_tasks print only tasks in uninterruptible (blocked) state +============= =================================================== + + panic_on_rcu_stall ================== @@ -1106,7 +1125,8 @@ printk_ratelimit_burst While long term we enforce one message per `printk_ratelimit`_ seconds, we do allow a burst of messages to pass through. ``printk_ratelimit_burst`` specifies the number of messages we can -send before ratelimiting kicks in. +send before ratelimiting kicks in. After `printk_ratelimit`_ seconds +have elapsed, another burst of messages may be sent. The default value is 10 messages. |