From 570432470275c3da15b85362bc1461945b9c1919 Mon Sep 17 00:00:00 2001 From: Mauro Carvalho Chehab Date: Mon, 22 Apr 2019 16:48:00 -0300 Subject: docs: admin-guide: move sysctl directory to it The stuff under sysctl describes /sys interface from userspace point of view. So, add it to the admin-guide and remove the :orphan: from its index file. Signed-off-by: Mauro Carvalho Chehab --- CREDITS | 2 +- Documentation/admin-guide/index.rst | 1 + Documentation/admin-guide/kernel-parameters.txt | 2 +- Documentation/admin-guide/mm/index.rst | 2 +- Documentation/admin-guide/mm/ksm.rst | 2 +- Documentation/admin-guide/sysctl/abi.rst | 67 ++ Documentation/admin-guide/sysctl/fs.rst | 384 ++++++++ Documentation/admin-guide/sysctl/index.rst | 98 ++ Documentation/admin-guide/sysctl/kernel.rst | 1177 +++++++++++++++++++++++ Documentation/admin-guide/sysctl/net.rst | 461 +++++++++ Documentation/admin-guide/sysctl/sunrpc.rst | 25 + Documentation/admin-guide/sysctl/user.rst | 78 ++ Documentation/admin-guide/sysctl/vm.rst | 964 +++++++++++++++++++ Documentation/core-api/printk-formats.rst | 2 +- Documentation/filesystems/proc.txt | 2 +- Documentation/networking/ip-sysctl.txt | 2 +- Documentation/sysctl/abi.rst | 67 -- Documentation/sysctl/fs.rst | 384 -------- Documentation/sysctl/index.rst | 100 -- Documentation/sysctl/kernel.rst | 1177 ----------------------- Documentation/sysctl/net.rst | 461 --------- Documentation/sysctl/sunrpc.rst | 25 - Documentation/sysctl/user.rst | 78 -- Documentation/sysctl/vm.rst | 964 ------------------- Documentation/vm/unevictable-lru.rst | 2 +- fs/proc/Kconfig | 2 +- kernel/panic.c | 2 +- mm/swap.c | 2 +- 28 files changed, 3266 insertions(+), 3267 deletions(-) create mode 100644 Documentation/admin-guide/sysctl/abi.rst create mode 100644 Documentation/admin-guide/sysctl/fs.rst create mode 100644 Documentation/admin-guide/sysctl/index.rst create mode 100644 Documentation/admin-guide/sysctl/kernel.rst create mode 100644 
Documentation/admin-guide/sysctl/net.rst create mode 100644 Documentation/admin-guide/sysctl/sunrpc.rst create mode 100644 Documentation/admin-guide/sysctl/user.rst create mode 100644 Documentation/admin-guide/sysctl/vm.rst delete mode 100644 Documentation/sysctl/abi.rst delete mode 100644 Documentation/sysctl/fs.rst delete mode 100644 Documentation/sysctl/index.rst delete mode 100644 Documentation/sysctl/kernel.rst delete mode 100644 Documentation/sysctl/net.rst delete mode 100644 Documentation/sysctl/sunrpc.rst delete mode 100644 Documentation/sysctl/user.rst delete mode 100644 Documentation/sysctl/vm.rst diff --git a/CREDITS b/CREDITS index beac0c81d081..401c5092bbf9 100644 --- a/CREDITS +++ b/CREDITS @@ -3120,7 +3120,7 @@ S: France N: Rik van Riel E: riel@redhat.com W: http://www.surriel.com/ -D: Linux-MM site, Documentation/sysctl/*, swap/mm readaround +D: Linux-MM site, Documentation/admin-guide/sysctl/*, swap/mm readaround D: kswapd fixes, random kernel hacker, rmap VM, D: nl.linux.org administrator, minor scheduler additions S: Red Hat Boston diff --git a/Documentation/admin-guide/index.rst b/Documentation/admin-guide/index.rst index 64e97a969857..5c6ae1ccee1a 100644 --- a/Documentation/admin-guide/index.rst +++ b/Documentation/admin-guide/index.rst @@ -16,6 +16,7 @@ etc. README kernel-parameters devices + sysctl/index This section describes CPU vulnerabilities and their mitigations. diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt index e8e28cac32a3..b323f5d4366a 100644 --- a/Documentation/admin-guide/kernel-parameters.txt +++ b/Documentation/admin-guide/kernel-parameters.txt @@ -3144,7 +3144,7 @@ numa_zonelist_order= [KNL, BOOT] Select zonelist order for NUMA. 'node', 'default' can be specified This can be set from sysctl after boot. - See Documentation/sysctl/vm.rst for details. + See Documentation/admin-guide/sysctl/vm.rst for details. 
ohci1394_dma=early [HW] enable debugging via the ohci1394 driver. See Documentation/debugging-via-ohci1394.txt for more diff --git a/Documentation/admin-guide/mm/index.rst b/Documentation/admin-guide/mm/index.rst index f5e92f33f96e..5f61a6c429e0 100644 --- a/Documentation/admin-guide/mm/index.rst +++ b/Documentation/admin-guide/mm/index.rst @@ -11,7 +11,7 @@ processes address space and many other cool things. Linux memory management is a complex system with many configurable settings. Most of these settings are available via ``/proc`` filesystem and can be quired and adjusted using ``sysctl``. These APIs -are described in Documentation/sysctl/vm.rst and in `man 5 proc`_. +are described in Documentation/admin-guide/sysctl/vm.rst and in `man 5 proc`_. .. _man 5 proc: http://man7.org/linux/man-pages/man5/proc.5.html diff --git a/Documentation/admin-guide/mm/ksm.rst b/Documentation/admin-guide/mm/ksm.rst index 7b2b8767c0b4..874eb0c77d34 100644 --- a/Documentation/admin-guide/mm/ksm.rst +++ b/Documentation/admin-guide/mm/ksm.rst @@ -59,7 +59,7 @@ MADV_UNMERGEABLE is applied to a range which was never MADV_MERGEABLE. If a region of memory must be split into at least one new MADV_MERGEABLE or MADV_UNMERGEABLE region, the madvise may return ENOMEM if the process -will exceed ``vm.max_map_count`` (see Documentation/sysctl/vm.rst). +will exceed ``vm.max_map_count`` (see Documentation/admin-guide/sysctl/vm.rst). Like other madvise calls, they are intended for use on mapped areas of the user address space: they will report ENOMEM if the specified range diff --git a/Documentation/admin-guide/sysctl/abi.rst b/Documentation/admin-guide/sysctl/abi.rst new file mode 100644 index 000000000000..599bcde7f0b7 --- /dev/null +++ b/Documentation/admin-guide/sysctl/abi.rst @@ -0,0 +1,67 @@ +================================ +Documentation for /proc/sys/abi/ +================================ + +kernel version 2.6.0.test2 + +Copyright (c) 2003, Fabian Frederick + +For general info: index.rst. 
+ +------------------------------------------------------------------------------ + +This path is binary emulation relevant aka personality types aka abi. +When a process is executed, it's linked to an exec_domain whose +personality is defined using values available from /proc/sys/abi. +You can find further details about abi in include/linux/personality.h. + +Here are the files featuring in 2.6 kernel: + +- defhandler_coff +- defhandler_elf +- defhandler_lcall7 +- defhandler_libcso +- fake_utsname +- trace + +defhandler_coff +--------------- + +defined value: + PER_SCOSVR3:: + + 0x0003 | STICKY_TIMEOUTS | WHOLE_SECONDS | SHORT_INODE + +defhandler_elf +-------------- + +defined value: + PER_LINUX:: + + 0 + +defhandler_lcall7 +----------------- + +defined value : + PER_SVR4:: + + 0x0001 | STICKY_TIMEOUTS | MMAP_PAGE_ZERO, + +defhandler_libsco +----------------- + +defined value: + PER_SVR4:: + + 0x0001 | STICKY_TIMEOUTS | MMAP_PAGE_ZERO, + +fake_utsname +------------ + +Unused + +trace +----- + +Unused diff --git a/Documentation/admin-guide/sysctl/fs.rst b/Documentation/admin-guide/sysctl/fs.rst new file mode 100644 index 000000000000..2a45119e3331 --- /dev/null +++ b/Documentation/admin-guide/sysctl/fs.rst @@ -0,0 +1,384 @@ +=============================== +Documentation for /proc/sys/fs/ +=============================== + +kernel version 2.2.10 + +Copyright (c) 1998, 1999, Rik van Riel + +Copyright (c) 2009, Shen Feng + +For general info and legal blurb, please look in intro.rst. + +------------------------------------------------------------------------------ + +This file contains documentation for the sysctl files in +/proc/sys/fs/ and is valid for Linux kernel version 2.2. + +The files in this directory can be used to tune and monitor +miscellaneous and general things in the operation of the Linux +kernel. Since some of the files _can_ be used to screw up your +system, it is advisable to read both documentation and source +before actually making adjustments. 
+ +1. /proc/sys/fs +=============== + +Currently, these files are in /proc/sys/fs: + +- aio-max-nr +- aio-nr +- dentry-state +- dquot-max +- dquot-nr +- file-max +- file-nr +- inode-max +- inode-nr +- inode-state +- nr_open +- overflowuid +- overflowgid +- pipe-user-pages-hard +- pipe-user-pages-soft +- protected_fifos +- protected_hardlinks +- protected_regular +- protected_symlinks +- suid_dumpable +- super-max +- super-nr + + +aio-nr & aio-max-nr +------------------- + +aio-nr is the running total of the number of events specified on the +io_setup system call for all currently active aio contexts. If aio-nr +reaches aio-max-nr then io_setup will fail with EAGAIN. Note that +raising aio-max-nr does not result in the pre-allocation or re-sizing +of any kernel data structures. + + +dentry-state +------------ + +From linux/include/linux/dcache.h:: + + struct dentry_stat_t dentry_stat { + int nr_dentry; + int nr_unused; + int age_limit; /* age in seconds */ + int want_pages; /* pages requested by system */ + int nr_negative; /* # of unused negative dentries */ + int dummy; /* Reserved for future use */ + }; + +Dentries are dynamically allocated and deallocated. + +nr_dentry shows the total number of dentries allocated (active ++ unused). nr_unused shows the number of dentries that are not +actively used, but are saved in the LRU list for future reuse. + +Age_limit is the age in seconds after which dcache entries +can be reclaimed when memory is short and want_pages is +nonzero when shrink_dcache_pages() has been called and the +dcache isn't pruned yet. + +nr_negative shows the number of unused dentries that are also +negative dentries which do not map to any files. Instead, +they help speeding up rejection of non-existing files provided +by the users. + + +dquot-max & dquot-nr +-------------------- + +The file dquot-max shows the maximum number of cached disk +quota entries. 
+ +The file dquot-nr shows the number of allocated disk quota +entries and the number of free disk quota entries. + +If the number of free cached disk quotas is very low and +you have some awesome number of simultaneous system users, +you might want to raise the limit. + + +file-max & file-nr +------------------ + +The value in file-max denotes the maximum number of file- +handles that the Linux kernel will allocate. When you get lots +of error messages about running out of file handles, you might +want to increase this limit. + +Historically,the kernel was able to allocate file handles +dynamically, but not to free them again. The three values in +file-nr denote the number of allocated file handles, the number +of allocated but unused file handles, and the maximum number of +file handles. Linux 2.6 always reports 0 as the number of free +file handles -- this is not an error, it just means that the +number of allocated file handles exactly matches the number of +used file handles. + +Attempts to allocate more file descriptors than file-max are +reported with printk, look for "VFS: file-max limit +reached". + + +nr_open +------- + +This denotes the maximum number of file-handles a process can +allocate. Default value is 1024*1024 (1048576) which should be +enough for most machines. Actual limit depends on RLIMIT_NOFILE +resource limit. + + +inode-max, inode-nr & inode-state +--------------------------------- + +As with file handles, the kernel allocates the inode structures +dynamically, but can't free them yet. + +The value in inode-max denotes the maximum number of inode +handlers. This value should be 3-4 times larger than the value +in file-max, since stdin, stdout and network sockets also +need an inode struct to handle them. When you regularly run +out of inodes, you need to increase this value. + +The file inode-nr contains the first two items from +inode-state, so we'll skip to that file... + +Inode-state contains three actual numbers and four dummies. 
+The actual numbers are, in order of appearance, nr_inodes, +nr_free_inodes and preshrink. + +Nr_inodes stands for the number of inodes the system has +allocated, this can be slightly more than inode-max because +Linux allocates them one pageful at a time. + +Nr_free_inodes represents the number of free inodes (?) and +preshrink is nonzero when the nr_inodes > inode-max and the +system needs to prune the inode list instead of allocating +more. + + +overflowgid & overflowuid +------------------------- + +Some filesystems only support 16-bit UIDs and GIDs, although in Linux +UIDs and GIDs are 32 bits. When one of these filesystems is mounted +with writes enabled, any UID or GID that would exceed 65535 is translated +to a fixed value before being written to disk. + +These sysctls allow you to change the value of the fixed UID and GID. +The default is 65534. + + +pipe-user-pages-hard +-------------------- + +Maximum total number of pages a non-privileged user may allocate for pipes. +Once this limit is reached, no new pipes may be allocated until usage goes +below the limit again. When set to 0, no limit is applied, which is the default +setting. + + +pipe-user-pages-soft +-------------------- + +Maximum total number of pages a non-privileged user may allocate for pipes +before the pipe size gets limited to a single page. Once this limit is reached, +new pipes will be limited to a single page in size for this user in order to +limit total memory usage, and trying to increase them using fcntl() will be +denied until usage goes below the limit again. The default value allows to +allocate up to 1024 pipes at their default size. When set to 0, no limit is +applied. + + +protected_fifos +--------------- + +The intent of this protection is to avoid unintentional writes to +an attacker-controlled FIFO, where a program expected to create a regular +file. + +When set to "0", writing to FIFOs is unrestricted. 
+ +When set to "1" don't allow O_CREAT open on FIFOs that we don't own +in world writable sticky directories, unless they are owned by the +owner of the directory. + +When set to "2" it also applies to group writable sticky directories. + +This protection is based on the restrictions in Openwall. + + +protected_hardlinks +-------------------- + +A long-standing class of security issues is the hardlink-based +time-of-check-time-of-use race, most commonly seen in world-writable +directories like /tmp. The common method of exploitation of this flaw +is to cross privilege boundaries when following a given hardlink (i.e. a +root process follows a hardlink created by another user). Additionally, +on systems without separated partitions, this stops unauthorized users +from "pinning" vulnerable setuid/setgid files against being upgraded by +the administrator, or linking to special files. + +When set to "0", hardlink creation behavior is unrestricted. + +When set to "1" hardlinks cannot be created by users if they do not +already own the source file, or do not have read/write access to it. + +This protection is based on the restrictions in Openwall and grsecurity. + + +protected_regular +----------------- + +This protection is similar to protected_fifos, but it +avoids writes to an attacker-controlled regular file, where a program +expected to create one. + +When set to "0", writing to regular files is unrestricted. + +When set to "1" don't allow O_CREAT open on regular files that we +don't own in world writable sticky directories, unless they are +owned by the owner of the directory. + +When set to "2" it also applies to group writable sticky directories. + + +protected_symlinks +------------------ + +A long-standing class of security issues is the symlink-based +time-of-check-time-of-use race, most commonly seen in world-writable +directories like /tmp. The common method of exploitation of this flaw +is to cross privilege boundaries when following a given symlink (i.e. 
a +root process follows a symlink belonging to another user). For a likely +incomplete list of hundreds of examples across the years, please see: +http://cve.mitre.org/cgi-bin/cvekey.cgi?keyword=/tmp + +When set to "0", symlink following behavior is unrestricted. + +When set to "1" symlinks are permitted to be followed only when outside +a sticky world-writable directory, or when the uid of the symlink and +follower match, or when the directory owner matches the symlink's owner. + +This protection is based on the restrictions in Openwall and grsecurity. + + +suid_dumpable: +-------------- + +This value can be used to query and set the core dump mode for setuid +or otherwise protected/tainted binaries. The modes are + += ========== =============================================================== +0 (default) traditional behaviour. Any process which has changed + privilege levels or is execute only will not be dumped. +1 (debug) all processes dump core when possible. The core dump is + owned by the current user and no security is applied. This is + intended for system debugging situations only. + Ptrace is unchecked. + This is insecure as it allows regular users to examine the + memory contents of privileged processes. +2 (suidsafe) any binary which normally would not be dumped is dumped + anyway, but only if the "core_pattern" kernel sysctl is set to + either a pipe handler or a fully qualified path. (For more + details on this limitation, see CVE-2006-2451.) This mode is + appropriate when administrators are attempting to debug + problems in a normal environment, and either have a core dump + pipe handler that knows to treat privileged core dumps with + care, or specific directory defined for catching core dumps. + If a core dump happens without a pipe handler or fully + qualified path, a message will be emitted to syslog warning + about the lack of a correct setting. 
+= ========== =============================================================== + + +super-max & super-nr +-------------------- + +These numbers control the maximum number of superblocks, and +thus the maximum number of mounted filesystems the kernel +can have. You only need to increase super-max if you need to +mount more filesystems than the current value in super-max +allows you to. + + +aio-nr & aio-max-nr +------------------- + +aio-nr shows the current system-wide number of asynchronous io +requests. aio-max-nr allows you to change the maximum value +aio-nr can grow to. + + +mount-max +--------- + +This denotes the maximum number of mounts that may exist +in a mount namespace. + + + +2. /proc/sys/fs/binfmt_misc +=========================== + +Documentation for the files in /proc/sys/fs/binfmt_misc is +in Documentation/admin-guide/binfmt-misc.rst. + + +3. /proc/sys/fs/mqueue - POSIX message queues filesystem +======================================================== + + +The "mqueue" filesystem provides the necessary kernel features to enable the +creation of a user space library that implements the POSIX message queues +API (as noted by the MSG tag in the POSIX 1003.1-2001 version of the System +Interfaces specification.) + +The "mqueue" filesystem contains values for determining/setting the amount of +resources used by the file system. + +/proc/sys/fs/mqueue/queues_max is a read/write file for setting/getting the +maximum number of message queues allowed on the system. + +/proc/sys/fs/mqueue/msg_max is a read/write file for setting/getting the +maximum number of messages in a queue value. In fact it is the limiting value +for another (user) limit which is set in mq_open invocation. This attribute of +a queue must be less or equal then msg_max. + +/proc/sys/fs/mqueue/msgsize_max is a read/write file for setting/getting the +maximum message size value (it is every message queue's attribute set during +its creation). 
+ +/proc/sys/fs/mqueue/msg_default is a read/write file for setting/getting the +default number of messages in a queue value if attr parameter of mq_open(2) is +NULL. If it exceed msg_max, the default value is initialized msg_max. + +/proc/sys/fs/mqueue/msgsize_default is a read/write file for setting/getting +the default message size value if attr parameter of mq_open(2) is NULL. If it +exceed msgsize_max, the default value is initialized msgsize_max. + +4. /proc/sys/fs/epoll - Configuration options for the epoll interface +===================================================================== + +This directory contains configuration options for the epoll(7) interface. + +max_user_watches +---------------- + +Every epoll file descriptor can store a number of files to be monitored +for event readiness. Each one of these monitored files constitutes a "watch". +This configuration option sets the maximum number of "watches" that are +allowed for each user. +Each "watch" costs roughly 90 bytes on a 32bit kernel, and roughly 160 bytes +on a 64bit one. +The current default value for max_user_watches is the 1/32 of the available +low memory, divided for the "watch" cost in bytes. diff --git a/Documentation/admin-guide/sysctl/index.rst b/Documentation/admin-guide/sysctl/index.rst new file mode 100644 index 000000000000..03346f98c7b9 --- /dev/null +++ b/Documentation/admin-guide/sysctl/index.rst @@ -0,0 +1,98 @@ +=========================== +Documentation for /proc/sys +=========================== + +Copyright (c) 1998, 1999, Rik van Riel + +------------------------------------------------------------------------------ + +'Why', I hear you ask, 'would anyone even _want_ documentation +for them sysctl files? If anybody really needs it, it's all in +the source...' + +Well, this documentation is written because some people either +don't know they need to tweak something, or because they don't +have the time or knowledge to read the source code. 
+ +Furthermore, the programmers who built sysctl have built it to +be actually used, not just for the fun of programming it :-) + +------------------------------------------------------------------------------ + +Legal blurb: + +As usual, there are two main things to consider: + +1. you get what you pay for +2. it's free + +The consequences are that I won't guarantee the correctness of +this document, and if you come to me complaining about how you +screwed up your system because of wrong documentation, I won't +feel sorry for you. I might even laugh at you... + +But of course, if you _do_ manage to screw up your system using +only the sysctl options used in this file, I'd like to hear of +it. Not only to have a great laugh, but also to make sure that +you're the last RTFMing person to screw up. + +In short, e-mail your suggestions, corrections and / or horror +stories to: + +Rik van Riel. + +-------------------------------------------------------------- + +Introduction +============ + +Sysctl is a means of configuring certain aspects of the kernel +at run-time, and the /proc/sys/ directory is there so that you +don't even need special tools to do it! +In fact, there are only four things needed to use these config +facilities: + +- a running Linux system +- root access +- common sense (this is especially hard to come by these days) +- knowledge of what all those values mean + +As a quick 'ls /proc/sys' will show, the directory consists of +several (arch-dependent?) subdirs. Each subdir is mainly about +one part of the kernel, so you can do configuration on a piece +by piece basis, or just some 'thematic frobbing'. 
+ +This documentation is about: + +=============== =============================================================== +abi/ execution domains & personalities +debug/ +dev/ device specific information (eg dev/cdrom/info) +fs/ specific filesystems + filehandle, inode, dentry and quota tuning + binfmt_misc +kernel/ global kernel info / tuning + miscellaneous stuff +net/ networking stuff, for documentation look in: + +proc/ +sunrpc/ SUN Remote Procedure Call (NFS) +vm/ memory management tuning + buffer and cache management +user/ Per user per user namespace limits +=============== =============================================================== + +These are the subdirs I have on my system. There might be more +or other subdirs in another setup. If you see another dir, I'd +really like to hear about it :-) + +.. toctree:: + :maxdepth: 1 + + abi + fs + kernel + net + sunrpc + user + vm diff --git a/Documentation/admin-guide/sysctl/kernel.rst b/Documentation/admin-guide/sysctl/kernel.rst new file mode 100644 index 000000000000..a0c1d4ce403a --- /dev/null +++ b/Documentation/admin-guide/sysctl/kernel.rst @@ -0,0 +1,1177 @@ +=================================== +Documentation for /proc/sys/kernel/ +=================================== + +kernel version 2.2.10 + +Copyright (c) 1998, 1999, Rik van Riel + +Copyright (c) 2009, Shen Feng + +For general info and legal blurb, please look in index.rst. + +------------------------------------------------------------------------------ + +This file contains documentation for the sysctl files in +/proc/sys/kernel/ and is valid for Linux kernel version 2.2. + +The files in this directory can be used to tune and monitor +miscellaneous and general things in the operation of the Linux +kernel. Since some of the files _can_ be used to screw up your +system, it is advisable to read both documentation and source +before actually making adjustments. 
+ +Currently, these files might (depending on your configuration) +show up in /proc/sys/kernel: + +- acct +- acpi_video_flags +- auto_msgmni +- bootloader_type [ X86 only ] +- bootloader_version [ X86 only ] +- cap_last_cap +- core_pattern +- core_pipe_limit +- core_uses_pid +- ctrl-alt-del +- dmesg_restrict +- domainname +- hostname +- hotplug +- hardlockup_all_cpu_backtrace +- hardlockup_panic +- hung_task_panic +- hung_task_check_count +- hung_task_timeout_secs +- hung_task_check_interval_secs +- hung_task_warnings +- hyperv_record_panic_msg +- kexec_load_disabled +- kptr_restrict +- l2cr [ PPC only ] +- modprobe ==> Documentation/debugging-modules.txt +- modules_disabled +- msg_next_id [ sysv ipc ] +- msgmax +- msgmnb +- msgmni +- nmi_watchdog +- osrelease +- ostype +- overflowgid +- overflowuid +- panic +- panic_on_oops +- panic_on_stackoverflow +- panic_on_unrecovered_nmi +- panic_on_warn +- panic_print +- panic_on_rcu_stall +- perf_cpu_time_max_percent +- perf_event_paranoid +- perf_event_max_stack +- perf_event_mlock_kb +- perf_event_max_contexts_per_stack +- pid_max +- powersave-nap [ PPC only ] +- printk +- printk_delay +- printk_ratelimit +- printk_ratelimit_burst +- pty ==> Documentation/filesystems/devpts.txt +- randomize_va_space +- real-root-dev ==> Documentation/admin-guide/initrd.rst +- reboot-cmd [ SPARC only ] +- rtsig-max +- rtsig-nr +- sched_energy_aware +- seccomp/ ==> Documentation/userspace-api/seccomp_filter.rst +- sem +- sem_next_id [ sysv ipc ] +- sg-big-buff [ generic SCSI device (sg) ] +- shm_next_id [ sysv ipc ] +- shm_rmid_forced +- shmall +- shmmax [ sysv ipc ] +- shmmni +- softlockup_all_cpu_backtrace +- soft_watchdog +- stack_erasing +- stop-a [ SPARC only ] +- sysrq ==> Documentation/admin-guide/sysrq.rst +- sysctl_writes_strict +- tainted ==> Documentation/admin-guide/tainted-kernels.rst +- threads-max +- unknown_nmi_panic +- watchdog +- watchdog_thresh +- version + + +acct: +===== + +highwater lowwater frequency + +If BSD-style 
process accounting is enabled these values control +its behaviour. If free space on filesystem where the log lives +goes below <lowwater>% accounting suspends. If free space gets +above <highwater>% accounting resumes. <frequency> determines +how often do we check the amount of free space (value is in +seconds). Default: +4 2 30 +That is, suspend accounting if there left <= 2% free; resume it +if we got >=4%; consider information about amount of free space +valid for 30 seconds. + + +acpi_video_flags: +================= + +flags + +See Doc*/kernel/power/video.txt, it allows mode of video boot to be +set during run time. + + +auto_msgmni: +============ + +This variable has no effect and may be removed in future kernel +releases. Reading it always returns 0. +Up to Linux 3.17, it enabled/disabled automatic recomputing of msgmni +upon memory add/remove or upon ipc namespace creation/removal. +Echoing "1" into this file enabled msgmni automatic recomputing. +Echoing "0" turned it off. auto_msgmni default value was 1. + + +bootloader_type: +================ + +x86 bootloader identification + +This gives the bootloader type number as indicated by the bootloader, +shifted left by 4, and OR'd with the low four bits of the bootloader +version. The reason for this encoding is that this used to match the +type_of_loader field in the kernel header; the encoding is kept for +backwards compatibility. That is, if the full bootloader type number +is 0x15 and the full version number is 0x234, this file will contain +the value 340 = 0x154. + +See the type_of_loader and ext_loader_type fields in +Documentation/x86/boot.rst for additional information. + + +bootloader_version: +=================== + +x86 bootloader version + +The complete bootloader version number. In the example above, this +file will contain the value 564 = 0x234. + +See the type_of_loader and ext_loader_ver fields in +Documentation/x86/boot.rst for additional information. 
+ + +cap_last_cap: +============= + +Highest valid capability of the running kernel. Exports +CAP_LAST_CAP from the kernel. + + +core_pattern: +============= + +core_pattern is used to specify a core dumpfile pattern name. + +* max length 127 characters; default value is "core" +* core_pattern is used as a pattern template for the output filename; + certain string patterns (beginning with '%') are substituted with + their actual values. +* backward compatibility with core_uses_pid: + + If core_pattern does not include "%p" (default does not) + and core_uses_pid is set, then .PID will be appended to + the filename. + +* corename format specifiers:: + + %<NUL> '%' is dropped + %% output one '%' + %p pid + %P global pid (init PID namespace) + %i tid + %I global tid (init PID namespace) + %u uid (in initial user namespace) + %g gid (in initial user namespace) + %d dump mode, matches PR_SET_DUMPABLE and + /proc/sys/fs/suid_dumpable + %s signal number + %t UNIX time of dump + %h hostname + %e executable filename (may be shortened) + %E executable path + %<OTHER> both are dropped + +* If the first character of the pattern is a '|', the kernel will treat + the rest of the pattern as a command to run. The core dump will be + written to the standard input of that program instead of to a file. + + +core_pipe_limit: +================ + +This sysctl is only applicable when core_pattern is configured to pipe +core files to a user space helper (when the first character of +core_pattern is a '|', see above). When collecting cores via a pipe +to an application, it is occasionally useful for the collecting +application to gather data about the crashing process from its +/proc/pid directory. In order to do this safely, the kernel must wait +for the collecting process to exit, so as not to remove the crashing +processes proc files prematurely. This in turn creates the +possibility that a misbehaving userspace collecting process can block +the reaping of a crashed process simply by never exiting. 
This sysctl +defends against that. It defines how many concurrent crashing +processes may be piped to user space applications in parallel. If +this value is exceeded, then those crashing processes above that value +are noted via the kernel log and their cores are skipped. 0 is a +special value, indicating that unlimited processes may be captured in +parallel, but that no waiting will take place (i.e. the collecting +process is not guaranteed access to /proc/<crashing pid>/). This +value defaults to 0. + + +core_uses_pid: +============== + +The default coredump filename is "core". By setting +core_uses_pid to 1, the coredump filename becomes core.PID. +If core_pattern does not include "%p" (default does not) +and core_uses_pid is set, then .PID will be appended to +the filename. + + +ctrl-alt-del: +============= + +When the value in this file is 0, ctrl-alt-del is trapped and +sent to the init(1) program to handle a graceful restart. +When, however, the value is > 0, Linux's reaction to a Vulcan +Nerve Pinch (tm) will be an immediate reboot, without even +syncing its dirty buffers. + +Note: + when a program (like dosemu) has the keyboard in 'raw' + mode, the ctrl-alt-del is intercepted by the program before it + ever reaches the kernel tty layer, and it's up to the program + to decide what to do with it. + + +dmesg_restrict: +=============== + +This toggle indicates whether unprivileged users are prevented +from using dmesg(8) to view messages from the kernel's log buffer. +When dmesg_restrict is set to (0) there are no restrictions. When +dmesg_restrict is set set to (1), users must have CAP_SYSLOG to use +dmesg(8). + +The kernel config option CONFIG_SECURITY_DMESG_RESTRICT sets the +default value of dmesg_restrict. 
+ + +domainname & hostname: +====================== + +These files can be used to set the NIS/YP domainname and the +hostname of your box in exactly the same way as the commands +domainname and hostname, i.e.:: + + # echo "darkstar" > /proc/sys/kernel/hostname + # echo "mydomain" > /proc/sys/kernel/domainname + +has the same effect as:: + + # hostname "darkstar" + # domainname "mydomain" + +Note, however, that the classic darkstar.frop.org has the +hostname "darkstar" and DNS (Internet Domain Name Server) +domainname "frop.org", not to be confused with the NIS (Network +Information Service) or YP (Yellow Pages) domainname. These two +domain names are in general different. For a detailed discussion +see the hostname(1) man page. + + +hardlockup_all_cpu_backtrace: +============================= + +This value controls the hard lockup detector behavior when a hard +lockup condition is detected as to whether or not to gather further +debug information. If enabled, arch-specific all-CPU stack dumping +will be initiated. + +0: do nothing. This is the default behavior. + +1: on detection capture more debug information. + + +hardlockup_panic: +================= + +This parameter can be used to control whether the kernel panics +when a hard lockup is detected. + + 0 - don't panic on hard lockup + 1 - panic on hard lockup + +See Documentation/lockup-watchdogs.txt for more information. This can +also be set using the nmi_watchdog kernel parameter. + + +hotplug: +======== + +Path for the hotplug policy agent. +Default value is "/sbin/hotplug". + + +hung_task_panic: +================ + +Controls the kernel's behavior when a hung task is detected. +This file shows up if CONFIG_DETECT_HUNG_TASK is enabled. + +0: continue operation. This is the default behavior. + +1: panic immediately. + + +hung_task_check_count: +====================== + +The upper bound on the number of tasks that are checked. +This file shows up if CONFIG_DETECT_HUNG_TASK is enabled. 
+ + +hung_task_timeout_secs: +======================= + +When a task in D state did not get scheduled +for more than this value report a warning. +This file shows up if CONFIG_DETECT_HUNG_TASK is enabled. + +0: means infinite timeout - no checking done. + +Possible values to set are in range {0..LONG_MAX/HZ}. + + +hung_task_check_interval_secs: +============================== + +Hung task check interval. If hung task checking is enabled +(see hung_task_timeout_secs), the check is done every +hung_task_check_interval_secs seconds. +This file shows up if CONFIG_DETECT_HUNG_TASK is enabled. + +0 (default): means use hung_task_timeout_secs as checking interval. +Possible values to set are in range {0..LONG_MAX/HZ}. + + +hung_task_warnings: +=================== + +The maximum number of warnings to report. During a check interval +if a hung task is detected, this value is decreased by 1. +When this value reaches 0, no more warnings will be reported. +This file shows up if CONFIG_DETECT_HUNG_TASK is enabled. + +-1: report an infinite number of warnings. + + +hyperv_record_panic_msg: +======================== + +Controls whether the panic kmsg data should be reported to Hyper-V. + +0: do not report panic kmsg data. + +1: report the panic kmsg data. This is the default behavior. + + +kexec_load_disabled: +==================== + +A toggle indicating if the kexec_load syscall has been disabled. This +value defaults to 0 (false: kexec_load enabled), but can be set to 1 +(true: kexec_load disabled). Once true, kexec can no longer be used, and +the toggle cannot be set back to false. This allows a kexec image to be +loaded before disabling the syscall, allowing a system to set up (and +later use) an image without it being altered. Generally used together +with the "modules_disabled" sysctl. + + +kptr_restrict: +============== + +This toggle indicates whether restrictions are placed on +exposing kernel addresses via /proc and other interfaces. 
+ +When kptr_restrict is set to 0 (the default) the address is hashed before +printing. (This is the equivalent of %p.) + +When kptr_restrict is set to (1), kernel pointers printed using the %pK +format specifier will be replaced with 0's unless the user has CAP_SYSLOG +and effective user and group ids are equal to the real ids. This is +because %pK checks are done at read() time rather than open() time, so +if permissions are elevated between the open() and the read() (e.g. via +a setuid binary) then %pK will not leak kernel pointers to unprivileged +users. Note, this is a temporary solution only. The correct long-term +solution is to do the permission checks at open() time. Consider removing +world read permissions from files that use %pK, and using dmesg_restrict +to protect against uses of %pK in dmesg(8) if leaking kernel pointer +values to unprivileged users is a concern. + +When kptr_restrict is set to (2), kernel pointers printed using +%pK will be replaced with 0's regardless of privileges. + + +l2cr: (PPC only) +================ + +This flag controls the L2 cache of G3 processor boards. If +0, the cache is disabled. Enabled if nonzero. + + +modules_disabled: +================= + +A toggle value indicating if modules are allowed to be loaded +in an otherwise modular kernel. This toggle defaults to off +(0), but can be set true (1). Once true, modules can be +neither loaded nor unloaded, and the toggle cannot be set back +to false. Generally used with the "kexec_load_disabled" toggle. + + +msg_next_id, sem_next_id, and shm_next_id: +========================================== + +These three toggles allow specifying the desired id for the next allocated IPC +object: message, semaphore or shared memory respectively. + +By default they are equal to -1, which means generic allocation logic. +Possible values to set are in range {0..INT_MAX}. + +Notes: + 1) The kernel doesn't guarantee that the new object will have the desired id.
So, + it's up to userspace, how to handle an object with "wrong" id. + 2) Toggle with non-default value will be set back to -1 by kernel after + successful IPC object allocation. If an IPC object allocation syscall + fails, it is undefined if the value remains unmodified or is reset to -1. + + +nmi_watchdog: +============= + +This parameter can be used to control the NMI watchdog +(i.e. the hard lockup detector) on x86 systems. + +0 - disable the hard lockup detector + +1 - enable the hard lockup detector + +The hard lockup detector monitors each CPU for its ability to respond to +timer interrupts. The mechanism utilizes CPU performance counter registers +that are programmed to generate Non-Maskable Interrupts (NMIs) periodically +while a CPU is busy. Hence, the alternative name 'NMI watchdog'. + +The NMI watchdog is disabled by default if the kernel is running as a guest +in a KVM virtual machine. This default can be overridden by adding:: + + nmi_watchdog=1 + +to the guest kernel command line (see Documentation/admin-guide/kernel-parameters.rst). + + +numa_balancing: +=============== + +Enables/disables automatic page fault based NUMA memory +balancing. Memory is moved automatically to nodes +that access it often. + +Enables/disables automatic NUMA memory balancing. On NUMA machines, there +is a performance penalty if remote memory is accessed by a CPU. When this +feature is enabled the kernel samples what task thread is accessing memory +by periodically unmapping pages and later trapping a page fault. At the +time of the page fault, it is determined if the data being accessed should +be migrated to a local memory node. + +The unmapping of pages and trapping faults incur additional overhead that +ideally is offset by improved memory locality but there is no universal +guarantee. If the target workload is already bound to NUMA nodes then this +feature should be disabled. 
Otherwise, if the system overhead from the +feature is too high then the rate the kernel samples for NUMA hinting +faults may be controlled by the numa_balancing_scan_period_min_ms, +numa_balancing_scan_delay_ms, numa_balancing_scan_period_max_ms, +numa_balancing_scan_size_mb, and numa_balancing_settle_count sysctls. + +numa_balancing_scan_period_min_ms, numa_balancing_scan_delay_ms, numa_balancing_scan_period_max_ms, numa_balancing_scan_size_mb +=============================================================================================================================== + + +Automatic NUMA balancing scans tasks address space and unmaps pages to +detect if pages are properly placed or if the data should be migrated to a +memory node local to where the task is running. Every "scan delay" the task +scans the next "scan size" number of pages in its address space. When the +end of the address space is reached the scanner restarts from the beginning. + +In combination, the "scan delay" and "scan size" determine the scan rate. +When "scan delay" decreases, the scan rate increases. The scan delay and +hence the scan rate of every task is adaptive and depends on historical +behaviour. If pages are properly placed then the scan delay increases, +otherwise the scan delay decreases. The "scan size" is not adaptive but +the higher the "scan size", the higher the scan rate. + +Higher scan rates incur higher system overhead as page faults must be +trapped and potentially data must be migrated. However, the higher the scan +rate, the more quickly a tasks memory is migrated to a local node if the +workload pattern changes and minimises performance impact due to remote +memory accesses. These sysctls control the thresholds for scan delays and +the number of pages scanned. + +numa_balancing_scan_period_min_ms is the minimum time in milliseconds to +scan a tasks virtual memory. It effectively controls the maximum scanning +rate for each task. 
+ +numa_balancing_scan_delay_ms is the starting "scan delay" used for a task +when it initially forks. + +numa_balancing_scan_period_max_ms is the maximum time in milliseconds to +scan a tasks virtual memory. It effectively controls the minimum scanning +rate for each task. + +numa_balancing_scan_size_mb is how many megabytes worth of pages are +scanned for a given scan. + + +osrelease, ostype & version: +============================ + +:: + + # cat osrelease + 2.1.88 + # cat ostype + Linux + # cat version + #5 Wed Feb 25 21:49:24 MET 1998 + +The files osrelease and ostype should be clear enough. Version +needs a little more clarification however. The '#5' means that +this is the fifth kernel built from this source base and the +date behind it indicates the time the kernel was built. +The only way to tune these values is to rebuild the kernel :-) + + +overflowgid & overflowuid: +========================== + +if your architecture did not always support 32-bit UIDs (i.e. arm, +i386, m68k, sh, and sparc32), a fixed UID and GID will be returned to +applications that use the old 16-bit UID/GID system calls, if the +actual UID or GID would exceed 65535. + +These sysctls allow you to change the value of the fixed UID and GID. +The default is 65534. + + +panic: +====== + +The value in this file represents the number of seconds the kernel +waits before rebooting on a panic. When you use the software watchdog, +the recommended setting is 60. + + +panic_on_io_nmi: +================ + +Controls the kernel's behavior when a CPU receives an NMI caused by +an IO error. + +0: try to continue operation (default) + +1: panic immediately. The IO error triggered an NMI. This indicates a + serious system condition which could result in IO data corruption. + Rather than continuing, panicking might be a better choice. Some + servers issue this sort of NMI when the dump button is pushed, + and you can use this option to take a crash dump. 
+ + +panic_on_oops: +============== + +Controls the kernel's behaviour when an oops or BUG is encountered. + +0: try to continue operation + +1: panic immediately. If the `panic` sysctl is also non-zero then the + machine will be rebooted. + + +panic_on_stackoverflow: +======================= + +Controls the kernel's behavior when detecting the overflows of +kernel, IRQ and exception stacks except a user stack. +This file shows up if CONFIG_DEBUG_STACKOVERFLOW is enabled. + +0: try to continue operation. + +1: panic immediately. + + +panic_on_unrecovered_nmi: +========================= + +The default Linux behaviour on an NMI of either memory or unknown is +to continue operation. For many environments such as scientific +computing it is preferable that the box is taken out and the error +dealt with than an uncorrected parity/ECC error get propagated. + +A small number of systems do generate NMI's for bizarre random reasons +such as power management so the default is off. That sysctl works like +the existing panic controls already in that directory. + + +panic_on_warn: +============== + +Calls panic() in the WARN() path when set to 1. This is useful to avoid +a kernel rebuild when attempting to kdump at the location of a WARN(). + +0: only WARN(), default behaviour. + +1: call panic() after printing out WARN() location. + + +panic_print: +============ + +Bitmask for printing system info when panic happens. The user can choose +any combination of the following bits: + +===== ======================================== +bit 0 print all tasks info +bit 1 print system memory info +bit 2 print timer info +bit 3 print locks info if CONFIG_LOCKDEP is on +bit 4 print ftrace buffer +===== ======================================== + +So for example to print tasks and memory info on panic, user can:: + + echo 3 > /proc/sys/kernel/panic_print + + +panic_on_rcu_stall: +=================== + +When set to 1, calls panic() after RCU stall detection messages.
This +is useful to define the root cause of RCU stalls using a vmcore. + +0: do not panic() when RCU stall takes place, default behavior. + +1: panic() after printing RCU stall messages. + + +perf_cpu_time_max_percent: +========================== + +Hints to the kernel how much CPU time it should be allowed to +use to handle perf sampling events. If the perf subsystem +is informed that its samples are exceeding this limit, it +will drop its sampling frequency to attempt to reduce its CPU +usage. + +Some perf sampling happens in NMIs. If these samples +unexpectedly take too long to execute, the NMIs can become +stacked up next to each other so much that nothing else is +allowed to execute. + +0: + disable the mechanism. Do not monitor or correct perf's + sampling rate no matter how much CPU time it takes. + +1-100: + attempt to throttle perf's sample rate to this + percentage of CPU. Note: the kernel calculates an + "expected" length of each sample event. 100 here means + 100% of that expected length. Even if this is set to + 100, you may still see sample throttling if this + length is exceeded. Set to 0 if you truly do not care + how much CPU is consumed. + + +perf_event_paranoid: +==================== + +Controls use of the performance events system by unprivileged +users (without CAP_SYS_ADMIN). The default value is 2.
+ +=== ================================================================== + -1 Allow use of (almost) all events by all users + + Ignore mlock limit after perf_event_mlock_kb without CAP_IPC_LOCK + +>=0 Disallow ftrace function tracepoint by users without CAP_SYS_ADMIN + + Disallow raw tracepoint access by users without CAP_SYS_ADMIN + +>=1 Disallow CPU event access by users without CAP_SYS_ADMIN + +>=2 Disallow kernel profiling by users without CAP_SYS_ADMIN +=== ================================================================== + + +perf_event_max_stack: +===================== + +Controls maximum number of stack frames to copy for (attr.sample_type & +PERF_SAMPLE_CALLCHAIN) configured events, for instance, when using +'perf record -g' or 'perf trace --call-graph fp'. + +This can only be done when no events are in use that have callchains +enabled, otherwise writing to this file will return -EBUSY. + +The default value is 127. + + +perf_event_mlock_kb: +==================== + +Controls the size of the per-cpu ring buffer not counted against the mlock limit. + +The default value is 512 + 1 page. + + +perf_event_max_contexts_per_stack: +================================== + +Controls maximum number of stack frame context entries for +(attr.sample_type & PERF_SAMPLE_CALLCHAIN) configured events, for +instance, when using 'perf record -g' or 'perf trace --call-graph fp'. + +This can only be done when no events are in use that have callchains +enabled, otherwise writing to this file will return -EBUSY. + +The default value is 8. + + +pid_max: +======== + +PID allocation wrap value. When the kernel's next PID value +reaches this value, it wraps back to a minimum PID value. +PIDs of value pid_max or larger are not allocated. + + +ns_last_pid: +============ + +The last pid allocated in the current (the one task using this sysctl +lives in) pid namespace. When selecting a pid for the next task on fork, +the kernel tries to allocate a number starting from this one.
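The perf_event_paranoid levels described earlier read as cumulative: each level keeps the restrictions of the lower ones. A minimal Python sketch of that reading of the table follows; the names are hypothetical and the list covers only the "Disallow" rows.

```python
# (minimum level, restricted feature) pairs taken from the
# perf_event_paranoid table.
RESTRICTIONS = [
    (0, "ftrace function tracepoint"),
    (0, "raw tracepoint access"),
    (1, "CPU event access"),
    (2, "kernel profiling"),
]


def disallowed_without_cap_sys_admin(level):
    """Return what users without CAP_SYS_ADMIN may not use at this level."""
    return [what for minimum, what in RESTRICTIONS if level >= minimum]
```

At the default value of 2 all four entries are returned, while -1 returns an empty list.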
+ + +powersave-nap: (PPC only) +========================= + +If set, Linux-PPC will use the 'nap' mode of powersaving, +otherwise the 'doze' mode will be used. + +============================================================== + +printk: +======= + +The four values in printk denote: console_loglevel, +default_message_loglevel, minimum_console_loglevel and +default_console_loglevel respectively. + +These values influence printk() behavior when printing or +logging error messages. See 'man 2 syslog' for more info on +the different loglevels. + +- console_loglevel: + messages with a higher priority than + this will be printed to the console +- default_message_loglevel: + messages without an explicit priority + will be printed with this priority +- minimum_console_loglevel: + minimum (highest) value to which + console_loglevel can be set +- default_console_loglevel: + default value for console_loglevel + + +printk_delay: +============= + +Delay each printk message by printk_delay milliseconds. + +Values from 0 to 10000 are allowed. + + +printk_ratelimit: +================= + +Some warning messages are rate limited. printk_ratelimit specifies +the minimum length of time between these messages (in seconds); by +default we allow one every 5 seconds. + +A value of 0 will disable rate limiting. + + +printk_ratelimit_burst: +======================= + +While long term we enforce one message per printk_ratelimit +seconds, we do allow a burst of messages to pass through. +printk_ratelimit_burst specifies the number of messages we can +send before ratelimiting kicks in. + + +printk_devkmsg: +=============== + +Control the logging to /dev/kmsg from userspace: + +ratelimit: + default, ratelimited + +on: unlimited logging to /dev/kmsg from userspace + +off: logging to /dev/kmsg disabled + +The kernel command line parameter printk.devkmsg= overrides this and is +a one-time setting until next reboot: once set, it cannot be changed by +this sysctl interface anymore.
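To make the four printk values described earlier more concrete, the following Python sketch parses the whitespace-separated content of the printk file and applies the console_loglevel rule. The class and function names are hypothetical, and the text is passed in as a string rather than read from /proc, so the sketch makes no claim about any particular system. It assumes that "higher priority" means a numerically lower level, as in syslog(2).

```python
from typing import NamedTuple, Optional


class PrintkLevels(NamedTuple):
    console_loglevel: int
    default_message_loglevel: int
    minimum_console_loglevel: int
    default_console_loglevel: int


def parse_printk(text: str) -> PrintkLevels:
    """Parse the four whitespace-separated integers of the printk sysctl."""
    fields = [int(field) for field in text.split()]
    if len(fields) != 4:
        raise ValueError("expected four values, got %d" % len(fields))
    return PrintkLevels(*fields)


def reaches_console(levels: PrintkLevels, msg_level: Optional[int] = None) -> bool:
    """True if a message of this level would be printed to the console.

    A message without an explicit priority (msg_level=None) gets
    default_message_loglevel. Higher priority means a lower number.
    """
    if msg_level is None:
        msg_level = levels.default_message_loglevel
    return msg_level < levels.console_loglevel
```

For example, with the content ``4 4 1 7`` a level 3 message reaches the console, while a message without an explicit priority (level 4) does not.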
+ + +randomize_va_space: +=================== + +This option can be used to select the type of process address +space randomization that is used in the system, for architectures +that support this feature. + +== =========================================================================== +0 Turn the process address space randomization off. This is the + default for architectures that do not support this feature anyways, + and kernels that are booted with the "norandmaps" parameter. + +1 Make the addresses of mmap base, stack and VDSO page randomized. + This, among other things, implies that shared libraries will be + loaded to random addresses. Also for PIE-linked binaries, the + location of code start is randomized. This is the default if the + CONFIG_COMPAT_BRK option is enabled. + +2 Additionally enable heap randomization. This is the default if + CONFIG_COMPAT_BRK is disabled. + + There are a few legacy applications out there (such as some ancient + versions of libc.so.5 from 1996) that assume that brk area starts + just after the end of the code+bss. These applications break when + start of the brk area is randomized. There are however no known + non-legacy applications that would be broken this way, so for most + systems it is safe to choose full randomization. + + Systems with ancient and/or broken binaries should be configured + with CONFIG_COMPAT_BRK enabled, which excludes the heap from process + address space randomization. +== =========================================================================== + + +reboot-cmd: (Sparc only) +======================== + +??? This seems to be a way to give an argument to the Sparc +ROM/Flash boot loader. Maybe to tell it what to do after +rebooting. ??? + + +rtsig-max & rtsig-nr: +===================== + +The file rtsig-max can be used to tune the maximum number +of POSIX realtime (queued) signals that can be outstanding +in the system. + +rtsig-nr shows the number of RT signals currently queued. 
+ + +sched_energy_aware: +=================== + +Enables/disables Energy Aware Scheduling (EAS). EAS starts +automatically on platforms where it can run (that is, +platforms with asymmetric CPU topologies and having an Energy +Model available). If your platform happens to meet the +requirements for EAS but you do not want to use it, change +this value to 0. + + +sched_schedstats: +================= + +Enables/disables scheduler statistics. Enabling this feature +incurs a small amount of overhead in the scheduler but is +useful for debugging and performance tuning. + + +sg-big-buff: +============ + +This file shows the size of the generic SCSI (sg) buffer. +You can't tune it just yet, but you could change it on +compile time by editing include/scsi/sg.h and changing +the value of SG_BIG_BUFF. + +There shouldn't be any reason to change this value. If +you can come up with one, you probably know what you +are doing anyway :) + + +shmall: +======= + +This parameter sets the total amount of shared memory pages that +can be used system wide. Hence, SHMALL should always be at least +ceil(shmmax/PAGE_SIZE). + +If you are not sure what the default PAGE_SIZE is on your Linux +system, you can run the following command: + + # getconf PAGE_SIZE + + +shmmax: +======= + +This value can be used to query and set the run time limit +on the maximum shared memory segment size that can be created. +Shared memory segments up to 1Gb are now supported in the +kernel. This value defaults to SHMMAX. + + +shm_rmid_forced: +================ + +Linux lets you set resource limits, including how much memory one +process can consume, via setrlimit(2). Unfortunately, shared memory +segments are allowed to exist without association with any process, and +thus might not be counted against any resource limits. If enabled, +shared memory segments are automatically destroyed when their attach +count becomes zero after a detach or a process termination. 
It will +also destroy segments that were created, but never attached to, on exit +from the process. The only use left for IPC_RMID is to immediately +destroy an unattached segment. Of course, this breaks the way things are +defined, so some applications might stop working. Note that this +feature will do you no good unless you also configure your resource +limits (in particular, RLIMIT_AS and RLIMIT_NPROC). Most systems don't +need this. + +Note that if you change this from 0 to 1, already created segments +without users and with a dead originative process will be destroyed. + + +sysctl_writes_strict: +===================== + +Control how file position affects the behavior of updating sysctl values +via the /proc/sys interface: + + == ====================================================================== + -1 Legacy per-write sysctl value handling, with no printk warnings. + Each write syscall must fully contain the sysctl value to be + written, and multiple writes on the same sysctl file descriptor + will rewrite the sysctl value, regardless of file position. + 0 Same behavior as above, but warn about processes that perform writes + to a sysctl file descriptor when the file position is not 0. + 1 (default) Respect file position when writing sysctl strings. Multiple + writes will append to the sysctl value buffer. Anything past the max + length of the sysctl value buffer will be ignored. Writes to numeric + sysctl entries must always be at file position 0 and the value must + be fully contained in the buffer sent in the write syscall. + == ====================================================================== + + +softlockup_all_cpu_backtrace: +============================= + +This value controls the soft lockup detector thread's behavior +when a soft lockup condition is detected as to whether or not +to gather further debug information. If enabled, each cpu will +be issued an NMI and instructed to capture stack trace. 
+ +This feature is only applicable for architectures which support +NMI. + +0: do nothing. This is the default behavior. + +1: on detection capture more debug information. + + +soft_watchdog: +============== + +This parameter can be used to control the soft lockup detector. + + 0 - disable the soft lockup detector + + 1 - enable the soft lockup detector + +The soft lockup detector monitors CPUs for threads that are hogging the CPUs +without rescheduling voluntarily, and thus prevent the 'watchdog/N' threads +from running. The mechanism depends on the CPUs' ability to respond to timer +interrupts which are needed for the 'watchdog/N' threads to be woken up by +the watchdog timer function, otherwise the NMI watchdog - if enabled - can +detect a hard lockup condition. + + +stack_erasing: +============== + +This parameter can be used to control kernel stack erasing at the end +of syscalls for kernels built with CONFIG_GCC_PLUGIN_STACKLEAK. + +That erasing reduces the information which kernel stack leak bugs +can reveal and blocks some uninitialized stack variable attacks. +The tradeoff is the performance impact: on a single CPU system kernel +compilation sees a 1% slowdown, other systems and workloads may vary. + + 0: kernel stack erasing is disabled, STACKLEAK_METRICS are not updated. + + 1: kernel stack erasing is enabled (default), it is performed before + returning to the userspace at the end of syscalls. + + +tainted +======= + +Non-zero if the kernel has been tainted. Numeric values, which can be +ORed together. The letters are seen in the "Tainted" line of Oops reports.
+ +====== ===== ============================================================== + 1 `(P)` proprietary module was loaded + 2 `(F)` module was force loaded + 4 `(S)` SMP kernel oops on an officially SMP incapable processor + 8 `(R)` module was force unloaded + 16 `(M)` processor reported a Machine Check Exception (MCE) + 32 `(B)` bad page referenced or some unexpected page flags + 64 `(U)` taint requested by userspace application + 128 `(D)` kernel died recently, i.e. there was an OOPS or BUG + 256 `(A)` an ACPI table was overridden by user + 512 `(W)` kernel issued warning + 1024 `(C)` staging driver was loaded + 2048 `(I)` workaround for bug in platform firmware applied + 4096 `(O)` externally-built ("out-of-tree") module was loaded + 8192 `(E)` unsigned module was loaded + 16384 `(L)` soft lockup occurred + 32768 `(K)` kernel has been live patched + 65536 `(X)` Auxiliary taint, defined for and used by distros +131072 `(T)` The kernel was built with the struct randomization plugin +====== ===== ============================================================== + +See Documentation/admin-guide/tainted-kernels.rst for more information. + + +threads-max: +============ + +This value controls the maximum number of threads that can be created +using fork(). + +During initialization the kernel sets this value such that even if the +maximum number of threads is created, the thread structures occupy only +a part (1/8th) of the available RAM pages. + +The minimum value that can be written to threads-max is 20. + +The maximum value that can be written to threads-max is given by the +constant FUTEX_TID_MASK (0x3fffffff). + +If a value outside of this range is written to threads-max an error +EINVAL occurs. + +The value written is checked against the available RAM pages. If the +thread structures would occupy too much (more than 1/8th) of the +available RAM pages threads-max is reduced accordingly.
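Because the tainted value described earlier is an OR of the listed bits, it can be decoded by testing each bit in turn. The Python sketch below does this for a value passed in as an integer; the table is copied from the tainted section, while the names ``TAINT_FLAGS`` and ``decode_taint`` are hypothetical.

```python
# (bit value, letter, meaning) as listed in the tainted table.
TAINT_FLAGS = [
    (1, "P", "proprietary module was loaded"),
    (2, "F", "module was force loaded"),
    (4, "S", "SMP kernel oops on an officially SMP incapable processor"),
    (8, "R", "module was force unloaded"),
    (16, "M", "processor reported a Machine Check Exception (MCE)"),
    (32, "B", "bad page referenced or some unexpected page flags"),
    (64, "U", "taint requested by userspace application"),
    (128, "D", "kernel died recently, i.e. there was an OOPS or BUG"),
    (256, "A", "an ACPI table was overridden by user"),
    (512, "W", "kernel issued warning"),
    (1024, "C", "staging driver was loaded"),
    (2048, "I", "workaround for bug in platform firmware applied"),
    (4096, "O", "externally-built (out-of-tree) module was loaded"),
    (8192, "E", "unsigned module was loaded"),
    (16384, "L", "soft lockup occurred"),
    (32768, "K", "kernel has been live patched"),
    (65536, "X", "auxiliary taint, defined for and used by distros"),
    (131072, "T", "kernel was built with the struct randomization plugin"),
]


def decode_taint(value):
    """Return (letter, meaning) pairs for every taint bit set in value."""
    return [(letter, meaning)
            for bit, letter, meaning in TAINT_FLAGS
            if value & bit]
```

For instance, a value of 4097 (4096 + 1) decodes to the P and O flags, and 0 decodes to an empty list.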
+ + +unknown_nmi_panic: +================== + +The value in this file affects behavior of handling NMI. When the +value is non-zero, unknown NMI is trapped and then panic occurs. At +that time, kernel debugging information is displayed on console. + +NMI switch that most IA32 servers have fires unknown NMI up, for +example. If a system hangs up, try pressing the NMI switch. + + +watchdog: +========= + +This parameter can be used to disable or enable the soft lockup detector +_and_ the NMI watchdog (i.e. the hard lockup detector) at the same time. + + 0 - disable both lockup detectors + + 1 - enable both lockup detectors + +The soft lockup detector and the NMI watchdog can also be disabled or +enabled individually, using the soft_watchdog and nmi_watchdog parameters. +If the watchdog parameter is read, for example by executing:: + + cat /proc/sys/kernel/watchdog + +the output of this command (0 or 1) shows the logical OR of soft_watchdog +and nmi_watchdog. + + +watchdog_cpumask: +================= + +This value can be used to control on which cpus the watchdog may run. +The default cpumask is all possible cores, but if NO_HZ_FULL is +enabled in the kernel config, and cores are specified with the +nohz_full= boot argument, those cores are excluded by default. +Offline cores can be included in this mask, and if the core is later +brought online, the watchdog will be started based on the mask value. + +Typically this value would only be touched in the nohz_full case +to re-enable cores that by default were not running the watchdog, +if a kernel lockup was suspected on those cores. + +The argument value is the standard cpulist format for cpumasks, +so for example to enable the watchdog on cores 0, 2, 3, and 4 you +might say:: + + echo 0,2-4 > /proc/sys/kernel/watchdog_cpumask + + +watchdog_thresh: +================ + +This value can be used to control the frequency of hrtimer and NMI +events and the soft and hard lockup thresholds. The default threshold +is 10 seconds. 
+ +The softlockup threshold is (2 * watchdog_thresh). Setting this +tunable to zero will disable lockup detection altogether. diff --git a/Documentation/admin-guide/sysctl/net.rst b/Documentation/admin-guide/sysctl/net.rst new file mode 100644 index 000000000000..a7d44e71019d --- /dev/null +++ b/Documentation/admin-guide/sysctl/net.rst @@ -0,0 +1,461 @@ +================================ +Documentation for /proc/sys/net/ +================================ + +Copyright + +Copyright (c) 1999 + + - Terrehon Bowden + - Bodo Bauer + +Copyright (c) 2000 + + - Jorge Nerin + +Copyright (c) 2009 + + - Shen Feng + +For general info and legal blurb, please look in index.rst. + +------------------------------------------------------------------------------ + +This file contains the documentation for the sysctl files in +/proc/sys/net + +The interface to the networking parts of the kernel is located in +/proc/sys/net. The following table shows all possible subdirectories. You may +see only some of them, depending on your kernel's configuration. + + +Table : Subdirectories in /proc/sys/net + + ========= =================== = ========== ================== + Directory Content Directory Content + ========= =================== = ========== ================== + core General parameter appletalk Appletalk protocol + unix Unix domain sockets netrom NET/ROM + 802 E802 protocol ax25 AX25 + ethernet Ethernet protocol rose X.25 PLP layer + ipv4 IP version 4 x25 X.25 protocol + ipx IPX token-ring IBM token ring + bridge Bridging decnet DEC net + ipv6 IP version 6 tipc TIPC + ========= =================== = ========== ================== + +1. /proc/sys/net/core - Network core options +============================================ + +bpf_jit_enable +-------------- + +This enables the BPF Just in Time (JIT) compiler. BPF is a flexible +and efficient infrastructure allowing to execute bytecode at various +hook points. It is used in a number of Linux kernel subsystems such +as networking (e.g. 
XDP, tc), tracing (e.g. kprobes, uprobes, tracepoints) +and security (e.g. seccomp). LLVM has a BPF back end that can compile +restricted C into a sequence of BPF instructions. After program load +through bpf(2) and passing a verifier in the kernel, a JIT will then +translate these BPF proglets into native CPU instructions. There are +two flavors of JITs, the newer eBPF JIT currently supported on: + + - x86_64 + - x86_32 + - arm64 + - arm32 + - ppc64 + - sparc64 + - mips64 + - s390x + - riscv + +And the older cBPF JIT supported on the following archs: + + - mips + - ppc + - sparc + +eBPF JITs are a superset of cBPF JITs, meaning the kernel will +migrate cBPF instructions into eBPF instructions and then JIT +compile them transparently. Older cBPF JITs can only translate +tcpdump filters, seccomp rules, etc, but not mentioned eBPF +programs loaded through bpf(2). + +Values: + + - 0 - disable the JIT (default value) + - 1 - enable the JIT + - 2 - enable the JIT and ask the compiler to emit traces on kernel log. + +bpf_jit_harden +-------------- + +This enables hardening for the BPF JIT compiler. Supported are eBPF +JIT backends. Enabling hardening trades off performance, but can +mitigate JIT spraying. + +Values: + + - 0 - disable JIT hardening (default value) + - 1 - enable JIT hardening for unprivileged users only + - 2 - enable JIT hardening for all users + +bpf_jit_kallsyms +---------------- + +When BPF JIT compiler is enabled, then compiled images are unknown +addresses to the kernel, meaning they neither show up in traces nor +in /proc/kallsyms. This enables export of these addresses, which can +be used for debugging/tracing. If bpf_jit_harden is enabled, this +feature is disabled. 
+ +Values : + + - 0 - disable JIT kallsyms export (default value) + - 1 - enable JIT kallsyms export for privileged users only + +bpf_jit_limit +------------- + +This enforces a global limit for memory allocations to the BPF JIT +compiler in order to reject unprivileged JIT requests once it has +been surpassed. bpf_jit_limit contains the value of the global limit +in bytes. + +dev_weight +---------- + +The maximum number of packets that kernel can handle on a NAPI interrupt, +it's a Per-CPU variable. For drivers that support LRO or GRO_HW, a hardware +aggregated packet is counted as one packet in this context. + +Default: 64 + +dev_weight_rx_bias +------------------ + +RPS (e.g. RFS, aRFS) processing is competing with the registered NAPI poll function +of the driver for the per softirq cycle netdev_budget. This parameter influences +the proportion of the configured netdev_budget that is spent on RPS based packet +processing during RX softirq cycles. It is further meant for making current +dev_weight adaptable for asymmetric CPU needs on RX/TX side of the network stack. +(see dev_weight_tx_bias) It is effective on a per CPU basis. Determination is based +on dev_weight and is calculated multiplicative (dev_weight * dev_weight_rx_bias). + +Default: 1 + +dev_weight_tx_bias +------------------ + +Scales the maximum number of packets that can be processed during a TX softirq cycle. +Effective on a per CPU basis. Allows scaling of current dev_weight for asymmetric +net stack processing needs. Be careful to avoid making TX softirq processing a CPU hog. + +Calculation is based on dev_weight (dev_weight * dev_weight_tx_bias). + +Default: 1 + +default_qdisc +------------- + +The default queuing discipline to use for network devices. This allows +overriding the default of pfifo_fast with an alternative. 
Since the default +queuing discipline is created without additional parameters, it is best suited +to queuing disciplines that work well without configuration, like stochastic +fair queue (sfq), CoDel (codel) or fair queue CoDel (fq_codel). Don't use +queuing disciplines like Hierarchical Token Bucket or Deficit Round Robin +which require setting up classes and bandwidths. Note that physical multiqueue +interfaces still use mq as root qdisc, which in turn uses this default for its +leaves. Virtual devices (e.g. lo or veth) ignore this setting and instead +default to noqueue. + +Default: pfifo_fast + +busy_read +--------- + +Low latency busy poll timeout for socket reads. (needs CONFIG_NET_RX_BUSY_POLL) +Approximate time in us to busy loop waiting for packets on the device queue. +This sets the default value of the SO_BUSY_POLL socket option. +Can be set or overridden per socket by setting socket option SO_BUSY_POLL, +which is the preferred method of enabling. If you need to enable the feature +globally via sysctl, a value of 50 is recommended. + +Will increase power usage. + +Default: 0 (off) + +busy_poll +--------- +Low latency busy poll timeout for poll and select. (needs CONFIG_NET_RX_BUSY_POLL) +Approximate time in us to busy loop waiting for events. +Recommended value depends on the number of sockets you poll on. +For several sockets 50, for several hundreds 100. +For more than that you probably want to use epoll. +Note that only sockets with SO_BUSY_POLL set will be busy polled, +so you want to either selectively set SO_BUSY_POLL on those sockets or set +sysctl.net.busy_read globally. + +Will increase power usage. + +Default: 0 (off) + +rmem_default +------------ + +The default setting of the socket receive buffer in bytes. + +rmem_max +-------- + +The maximum receive socket buffer size in bytes. + +tstamp_allow_data +----------------- +Allow processes to receive tx timestamps looped together with the original +packet contents.
If disabled, transmit timestamp requests from unprivileged +processes are dropped unless socket option SOF_TIMESTAMPING_OPT_TSONLY is set. + +Default: 1 (on) + + +wmem_default +------------ + +The default setting (in bytes) of the socket send buffer. + +wmem_max +-------- + +The maximum send socket buffer size in bytes. + +message_burst and message_cost +------------------------------ + +These parameters are used to limit the warning messages written to the kernel +log from the networking code. They enforce a rate limit to make a +denial-of-service attack impossible. A higher message_cost factor results in +fewer messages being written. Message_burst controls when messages will +be dropped. The default settings limit warning messages to one every five +seconds. + +warnings +-------- + +This sysctl is now unused. + +This was used to control console messages from the networking stack that +occur because of problems on the network like duplicate addresses or bad +checksums. + +These messages are now emitted at KERN_DEBUG and can generally be enabled +and controlled by the dynamic_debug facility. + +netdev_budget +------------- + +Maximum number of packets taken from all interfaces in one polling cycle (NAPI +poll). In one polling cycle, interfaces which are registered for polling are +probed in a round-robin manner. Also, a polling cycle may not exceed +netdev_budget_usecs microseconds, even if netdev_budget has not been +exhausted. + +netdev_budget_usecs +------------------- + +Maximum number of microseconds in one NAPI polling cycle. Polling +will exit when either netdev_budget_usecs have elapsed during the +poll cycle or the number of packets processed reaches netdev_budget. + +netdev_max_backlog +------------------ + +Maximum number of packets, queued on the INPUT side, when the interface +receives packets faster than the kernel can process them.
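
The two exit conditions described for netdev_budget and netdev_budget_usecs can be illustrated with a small simulation. This is only a sketch under stated assumptions, not the kernel's implementation: the function name simulate_poll_cycle, the fixed per-packet cost usecs_per_packet and the per-visit cap weight are hypothetical, and elapsed time is modelled rather than measured.

```python
def simulate_poll_cycle(pending, budget, budget_usecs, weight, usecs_per_packet):
    """Simulate one polling cycle over several interfaces.

    pending          -- queued packet count per interface (hypothetical input)
    budget           -- plays the role of netdev_budget
    budget_usecs     -- plays the role of netdev_budget_usecs
    weight           -- assumed cap on packets taken per interface visit
    usecs_per_packet -- assumed constant processing cost (a modelling shortcut)

    Returns (packets_processed, usecs_elapsed, reason_for_exit).
    """
    pending = list(pending)  # work on a copy
    processed = 0
    elapsed = 0
    while any(pending):
        # Interfaces are probed in a round-robin manner.
        for i, queued in enumerate(pending):
            if queued == 0:
                continue
            take = min(queued, weight, budget - processed)
            pending[i] -= take
            processed += take
            elapsed += take * usecs_per_packet
            if processed >= budget:
                return processed, elapsed, "budget"
            if elapsed >= budget_usecs:
                return processed, elapsed, "time"
    return processed, elapsed, "idle"


if __name__ == "__main__":
    # Packet budget is hit first.
    print(simulate_poll_cycle([100, 100], 50, 1000, 16, 1))
    # Time budget is hit first.
    print(simulate_poll_cycle([100], 1000, 30, 16, 1))
```

In this model the cycle ends on whichever limit is reached first, or earlier when every queue is empty.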
+ +netdev_rss_key +-------------- + +RSS (Receive Side Scaling) enabled drivers use a 40 bytes host key that is +randomly generated. +Some user space might need to gather its content even if drivers do not +provide ethtool -x support yet. + +:: + + myhost:~# cat /proc/sys/net/core/netdev_rss_key + 84:50:f4:00:a8:15:d1:a7:e9:7f:1d:60:35:c7:47:25:42:97:74:ca:56:bb:b6:a1:d8: ... (52 bytes total) + +File contains nul bytes if no driver ever called netdev_rss_key_fill() function. + +Note: + /proc/sys/net/core/netdev_rss_key contains 52 bytes of key, + but most drivers only use 40 bytes of it. + +:: + + myhost:~# ethtool -x eth0 + RX flow hash indirection table for eth0 with 8 RX ring(s): + 0: 0 1 2 3 4 5 6 7 + RSS hash key: + 84:50:f4:00:a8:15:d1:a7:e9:7f:1d:60:35:c7:47:25:42:97:74:ca:56:bb:b6:a1:d8:43:e3:c9:0c:fd:17:55:c2:3a:4d:69:ed:f1:42:89 + +netdev_tstamp_prequeue +---------------------- + +If set to 0, RX packet timestamps can be sampled after RPS processing, when +the target CPU processes packets. It might give some delay on timestamps, but +permit to distribute the load on several cpus. + +If set to 1 (default), timestamps are sampled as soon as possible, before +queueing. + +optmem_max +---------- + +Maximum ancillary buffer size allowed per socket. Ancillary data is a sequence +of struct cmsghdr structures with appended data. + +fb_tunnels_only_for_init_net +---------------------------- + +Controls if fallback tunnels (like tunl0, gre0, gretap0, erspan0, +sit0, ip6tnl0, ip6gre0) are automatically created when a new +network namespace is created, if corresponding tunnel is present +in initial network namespace. +If set to 1, these devices are not automatically created, and +user space is responsible for creating them if needed. + +Default : 0 (for compatibility reasons) + +devconf_inherit_init_net +------------------------ + +Controls if a new network namespace should inherit all current +settings under /proc/sys/net/{ipv4,ipv6}/conf/{all,default}/. 
By +default, we keep the current behavior: for IPv4 we inherit all current +settings from init_net and for IPv6 we reset all settings to default. + +If set to 1, both IPv4 and IPv6 settings are forced to inherit from +current ones in init_net. If set to 2, both IPv4 and IPv6 settings are +forced to reset to their default values. + +Default : 0 (for compatibility reasons) + +2. /proc/sys/net/unix - Parameters for Unix domain sockets +---------------------------------------------------------- + +There is only one file in this directory. +unix_dgram_qlen limits the max number of datagrams queued in Unix domain +socket's buffer. It will not take effect unless PF_UNIX flag is specified. + + +3. /proc/sys/net/ipv4 - IPV4 settings +------------------------------------- +Please see: Documentation/networking/ip-sysctl.txt and ipvs-sysctl.txt for +descriptions of these entries. + + +4. Appletalk +------------ + +The /proc/sys/net/appletalk directory holds the Appletalk configuration data +when Appletalk is loaded. The configurable parameters are: + +aarp-expiry-time +---------------- + +The amount of time we keep an ARP entry before expiring it. Used to age out +old hosts. + +aarp-resolve-time +----------------- + +The amount of time we will spend trying to resolve an Appletalk address. + +aarp-retransmit-limit +--------------------- + +The number of times we will retransmit a query before giving up. + +aarp-tick-time +-------------- + +Controls the rate at which expires are checked. + +The directory /proc/net/appletalk holds the list of active Appletalk sockets +on a machine. + +The fields indicate the DDP type, the local address (in network:node format) +the remote address, the size of the transmit pending queue, the size of the +received queue (bytes waiting for applications to read) the state and the uid +owning the socket. 
+ +/proc/net/atalk_iface lists all the interfaces configured for Appletalk. It +shows the name of the interface, its Appletalk address, the network range on +that address (or network number for phase 1 networks), and the status of the +interface. + +/proc/net/atalk_route lists each known network route. It lists the target +(network) that the route leads to, the router (may be directly connected), the +route flags, and the device the route is using. + + +5. IPX +------ + +The IPX protocol has no tunable values in /proc/sys/net. + +The IPX protocol does, however, provide /proc/net/ipx. This lists each IPX +socket giving the local and remote addresses in Novell format (that is +network:node:port). In accordance with the strange Novell tradition, +everything but the port is in hex. Not_Connected is displayed for sockets that +are not tied to a specific remote address. The Tx and Rx queue sizes indicate +the number of bytes pending for transmission and reception. The state +indicates the state the socket is in and the uid is the owning uid of the +socket. + +The /proc/net/ipx_interface file lists all IPX interfaces. For each interface +it gives the network number, the node number, and indicates if the network is +the primary network. It also indicates which device it is bound to (or +Internal for internal networks) and the Frame Type if appropriate. Linux +supports 802.3, 802.2, 802.2 SNAP and DIX (Blue Book) ethernet framing for +IPX. + +The /proc/net/ipx_route table holds a list of IPX routes. For each route it +gives the destination network, the router node (or Directly) and the network +address of the router (or Connected) for internal networks. + +6. TIPC +------- + +tipc_rmem +--------- + +The TIPC protocol now has a tunable for the receive memory, similar to the +tcp_rmem - i.e.
a vector of 3 INTEGERs: (min, default, max) + +:: + + # cat /proc/sys/net/tipc/tipc_rmem + 4252725 34021800 68043600 + # + +The max value is set to CONN_OVERLOAD_LIMIT, and the default and min values +are scaled (shifted) versions of that same value. Note that the min value +is not at this point in time used in any meaningful way, but the triplet is +preserved in order to be consistent with things like tcp_rmem. + +named_timeout +------------- + +TIPC name table updates are distributed asynchronously in a cluster, without +any form of transaction handling. This means that different race scenarios are +possible. One such is that a name withdrawal sent out by one node and received +by another node may arrive after a second, overlapping name publication already +has been accepted from a third node, although the conflicting updates +originally may have been issued in the correct sequential order. +If named_timeout is nonzero, failed topology updates will be placed on a defer +queue until another event arrives that clears the error, or until the timeout +expires. Value is in milliseconds. diff --git a/Documentation/admin-guide/sysctl/sunrpc.rst b/Documentation/admin-guide/sysctl/sunrpc.rst new file mode 100644 index 000000000000..09780a682afd --- /dev/null +++ b/Documentation/admin-guide/sysctl/sunrpc.rst @@ -0,0 +1,25 @@ +=================================== +Documentation for /proc/sys/sunrpc/ +=================================== + +kernel version 2.2.10 + +Copyright (c) 1998, 1999, Rik van Riel + +For general info and legal blurb, please look in index.rst. + +------------------------------------------------------------------------------ + +This file contains the documentation for the sysctl files in +/proc/sys/sunrpc and is valid for Linux kernel version 2.2. + +The files in this directory can be used to (re)set the debug +flags of the SUN Remote Procedure Call (RPC) subsystem in +the Linux kernel. 
This stuff is used for NFS, KNFSD and +maybe a few other things as well. + +The files in there are used to control the debugging flags: +rpc_debug, nfs_debug, nfsd_debug and nlm_debug. + +These flags are for kernel hackers only. You should read the +source code in net/sunrpc/ for more information. diff --git a/Documentation/admin-guide/sysctl/user.rst b/Documentation/admin-guide/sysctl/user.rst new file mode 100644 index 000000000000..650eaa03f15e --- /dev/null +++ b/Documentation/admin-guide/sysctl/user.rst @@ -0,0 +1,78 @@ +================================= +Documentation for /proc/sys/user/ +================================= + +kernel version 4.9.0 + +Copyright (c) 2016 Eric Biederman + +------------------------------------------------------------------------------ + +This file contains the documentation for the sysctl files in +/proc/sys/user. + +The files in this directory can be used to override the default +limits on the number of namespaces and other objects that have +per user per user namespace limits. + +The primary purpose of these limits is to stop programs that +malfunction and attempt to create a ridiculous number of objects, +before the malfunction becomes a system wide problem. It is the +intention that the defaults of these limits are set high enough that +no program in normal operation should run into these limits. + +The creation of per user per user namespace objects is charged to +the user in the user namespace who created the object and +verified to be below the per user limit in that user namespace. + +The creation of objects is also charged to all of the users +who created user namespaces the creation of the object happens +in (user namespaces can be nested) and verified to be below the per user +limits in the user namespaces of those users. + +This recursive counting of created objects ensures that creating a +user namespace does not allow a user to escape their current limits.
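
The recursive charging described above can be modelled in a few lines. The following is a simplified sketch, not kernel code: the UserNS class, its fields and the charge() helper are hypothetical names, a single limit per namespace stands in for one of the max_* files, and a failed charge is assumed to undo any partial charges.

```python
from collections import Counter


class UserNS:
    """Hypothetical model of a user namespace with one per-user limit."""

    def __init__(self, limit, parent=None, creator=None):
        self.limit = limit       # stands in for one max_* file in this namespace
        self.parent = parent     # enclosing user namespace, None for the initial one
        self.creator = creator   # user in the parent who created this namespace
        self.counts = Counter()  # objects charged to each user in this namespace


def charge(ns, user):
    """Charge one object to `user` in `ns` and to each ancestor's creator.

    Returns True on success. On failure every charge made so far is
    rolled back and False is returned.
    """
    charged = []
    while ns is not None:
        if ns.counts[user] >= ns.limit:
            for done_ns, done_user in charged:
                done_ns.counts[done_user] -= 1
            return False
        ns.counts[user] += 1
        charged.append((ns, user))
        # Move up: the creator of this namespace is charged in its parent.
        ns, user = ns.parent, ns.creator
    return True


if __name__ == "__main__":
    init_ns = UserNS(limit=2)
    child = UserNS(limit=10, parent=init_ns, creator="alice")
    print(charge(child, "bob"))    # charged to bob in child, alice in init_ns
    print(charge(child, "carol"))  # alice reaches her limit in init_ns
    print(charge(child, "dave"))   # refused: alice is at her limit
```

Although the child namespace has a generous limit of its own, users inside it cannot collectively exceed what alice is allowed in the parent, which is the point of the recursive accounting.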
+ +Currently, these files are in /proc/sys/user: + +max_cgroup_namespaces +===================== + + The maximum number of cgroup namespaces that any user in the current + user namespace may create. + +max_ipc_namespaces +================== + + The maximum number of ipc namespaces that any user in the current + user namespace may create. + +max_mnt_namespaces +================== + + The maximum number of mount namespaces that any user in the current + user namespace may create. + +max_net_namespaces +================== + + The maximum number of network namespaces that any user in the + current user namespace may create. + +max_pid_namespaces +================== + + The maximum number of pid namespaces that any user in the current + user namespace may create. + +max_user_namespaces +=================== + + The maximum number of user namespaces that any user in the current + user namespace may create. + +max_uts_namespaces +================== + + The maximum number of uts namespaces that any user in the current + user namespace may create. diff --git a/Documentation/admin-guide/sysctl/vm.rst b/Documentation/admin-guide/sysctl/vm.rst new file mode 100644 index 000000000000..5aceb5cd5ce7 --- /dev/null +++ b/Documentation/admin-guide/sysctl/vm.rst @@ -0,0 +1,964 @@ +=============================== +Documentation for /proc/sys/vm/ +=============================== + +kernel version 2.6.29 + +Copyright (c) 1998, 1999, Rik van Riel + +Copyright (c) 2008 Peter W. Morreale + +For general info and legal blurb, please look in index.rst. + +------------------------------------------------------------------------------ + +This file contains the documentation for the sysctl files in +/proc/sys/vm and is valid for Linux kernel version 2.6.29. + +The files in this directory can be used to tune the operation +of the virtual memory (VM) subsystem of the Linux kernel and +the writeout of dirty data to disk.
+ +Default values and initialization routines for most of these +files can be found in mm/swap.c. + +Currently, these files are in /proc/sys/vm: + +- admin_reserve_kbytes +- block_dump +- compact_memory +- compact_unevictable_allowed +- dirty_background_bytes +- dirty_background_ratio +- dirty_bytes +- dirty_expire_centisecs +- dirty_ratio +- dirtytime_expire_seconds +- dirty_writeback_centisecs +- drop_caches +- extfrag_threshold +- hugetlb_shm_group +- laptop_mode +- legacy_va_layout +- lowmem_reserve_ratio +- max_map_count +- memory_failure_early_kill +- memory_failure_recovery +- min_free_kbytes +- min_slab_ratio +- min_unmapped_ratio +- mmap_min_addr +- mmap_rnd_bits +- mmap_rnd_compat_bits +- nr_hugepages +- nr_hugepages_mempolicy +- nr_overcommit_hugepages +- nr_trim_pages (only if CONFIG_MMU=n) +- numa_zonelist_order +- oom_dump_tasks +- oom_kill_allocating_task +- overcommit_kbytes +- overcommit_memory +- overcommit_ratio +- page-cluster +- panic_on_oom +- percpu_pagelist_fraction +- stat_interval +- stat_refresh +- numa_stat +- swappiness +- unprivileged_userfaultfd +- user_reserve_kbytes +- vfs_cache_pressure +- watermark_boost_factor +- watermark_scale_factor +- zone_reclaim_mode + + +admin_reserve_kbytes +==================== + +The amount of free memory in the system that should be reserved for users +with the capability cap_sys_admin. + +admin_reserve_kbytes defaults to min(3% of free pages, 8MB) + +That should provide enough for the admin to log in and kill a process, +if necessary, under the default overcommit 'guess' mode. + +Systems running under overcommit 'never' should increase this to account +for the full Virtual Memory Size of programs used to recover. Otherwise, +root may not be able to log in to recover the system. + +How do you calculate a minimum useful reserve? + +sshd or login + bash (or some other shell) + top (or ps, kill, etc.) + +For overcommit 'guess', we can sum resident set sizes (RSS). +On x86_64 this is about 8MB. 
+ +For overcommit 'never', we can take the max of their virtual sizes (VSZ) +and add the sum of their RSS. +On x86_64 this is about 128MB. + +Changing this takes effect whenever an application requests memory. + + +block_dump +========== + +block_dump enables block I/O debugging when set to a nonzero value. More +information on block I/O debugging is in Documentation/laptops/laptop-mode.rst. + + +compact_memory +============== + +Available only when CONFIG_COMPACTION is set. When 1 is written to the file, +all zones are compacted such that free memory is available in contiguous +blocks where possible. This can be important for example in the allocation of +huge pages although processes will also directly compact memory as required. + + +compact_unevictable_allowed +=========================== + +Available only when CONFIG_COMPACTION is set. When set to 1, compaction is +allowed to examine the unevictable lru (mlocked pages) for pages to compact. +This should be used on systems where stalls for minor page faults are an +acceptable trade for large contiguous free memory. Set to 0 to prevent +compaction from moving pages that are unevictable. Default value is 1. + + +dirty_background_bytes +====================== + +Contains the amount of dirty memory at which the background kernel +flusher threads will start writeback. + +Note: + dirty_background_bytes is the counterpart of dirty_background_ratio. Only + one of them may be specified at a time. When one sysctl is written it is + immediately taken into account to evaluate the dirty memory limits and the + other appears as 0 when read. + + +dirty_background_ratio +====================== + +Contains, as a percentage of total available memory that contains free pages +and reclaimable pages, the number of pages at which the background kernel +flusher threads will start writing out dirty data. + +The total available memory is not equal to total system memory. 
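
The relationship between dirty_background_bytes and dirty_background_ratio can be sketched as follows. This is an illustrative model only: the helper names, the 4096-byte page size and the integer rounding are assumptions, and the count of available pages is supplied by the caller rather than derived from the system.

```python
PAGE_SIZE = 4096  # assumed page size, for illustration only


def set_background(settings, name, value):
    """Model a write to one of the two sysctls.

    Writing one of them makes the other read back as 0, as described
    in the text above. Returns a new settings dict.
    """
    if name not in ("dirty_background_ratio", "dirty_background_bytes"):
        raise ValueError(name)
    new = {"dirty_background_ratio": 0, "dirty_background_bytes": 0}
    new[name] = value
    return new


def background_threshold_pages(settings, available_pages, page_size=PAGE_SIZE):
    """Return the dirty page count at which background writeback would start.

    `available_pages` stands for free plus reclaimable pages, which is
    not the same as total system memory.
    """
    nbytes = settings["dirty_background_bytes"]
    if nbytes:
        return -(-nbytes // page_size)  # round up to whole pages
    return available_pages * settings["dirty_background_ratio"] // 100


if __name__ == "__main__":
    s = {"dirty_background_ratio": 10, "dirty_background_bytes": 0}
    print(background_threshold_pages(s, available_pages=1_000_000))
    s = set_background(s, "dirty_background_bytes", 64 * 1024 * 1024)
    print(s, background_threshold_pages(s, available_pages=1_000_000))
```

The byte-based setting gives a threshold that does not move with the amount of available memory, while the ratio-based one scales with it.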
+ + +dirty_bytes +=========== + +Contains the amount of dirty memory at which a process generating disk writes +will itself start writeback. + +Note: dirty_bytes is the counterpart of dirty_ratio. Only one of them may be +specified at a time. When one sysctl is written it is immediately taken into +account to evaluate the dirty memory limits and the other appears as 0 when +read. + +Note: the minimum value allowed for dirty_bytes is two pages (in bytes); any +value lower than this limit will be ignored and the old configuration will be +retained. + + +dirty_expire_centisecs +====================== + +This tunable is used to define when dirty data is old enough to be eligible +for writeout by the kernel flusher threads. It is expressed in hundredths +of a second. Data which has been dirty in-memory for longer than this +interval will be written out next time a flusher thread wakes up. + + +dirty_ratio +=========== + +Contains, as a percentage of total available memory that contains free pages +and reclaimable pages, the number of pages at which a process which is +generating disk writes will itself start writing out dirty data. + +The total available memory is not equal to total system memory. + + +dirtytime_expire_seconds +======================== + +When a lazytime inode is constantly having its pages dirtied, the inode with +an updated timestamp will never get a chance to be written out. And, if the +only thing that has happened on the file system is a dirtytime inode caused +by an atime update, a worker will be scheduled to make sure that inode +eventually gets pushed out to disk. This tunable is used to define when a dirty +inode is old enough to be eligible for writeback by the kernel flusher threads. +It is also used as the interval to wake up the dirtytime_writeback thread. + + +dirty_writeback_centisecs +========================= + +The kernel flusher threads will periodically wake up and write `old` data +out to disk.
This tunable expresses the interval between those wakeups, in +100'ths of a second. + +Setting this to zero disables periodic writeback altogether. + + +drop_caches +=========== + +Writing to this will cause the kernel to drop clean caches, as well as +reclaimable slab objects like dentries and inodes. Once dropped, their +memory becomes free. + +To free pagecache:: + + echo 1 > /proc/sys/vm/drop_caches + +To free reclaimable slab objects (includes dentries and inodes):: + + echo 2 > /proc/sys/vm/drop_caches + +To free slab objects and pagecache:: + + echo 3 > /proc/sys/vm/drop_caches + +This is a non-destructive operation and will not free any dirty objects. +To increase the number of objects freed by this operation, the user may run +`sync` prior to writing to /proc/sys/vm/drop_caches. This will minimize the +number of dirty objects on the system and create more candidates to be +dropped. + +This file is not a means to control the growth of the various kernel caches +(inodes, dentries, pagecache, etc...) These objects are automatically +reclaimed by the kernel when memory is needed elsewhere on the system. + +Use of this file can cause performance problems. Since it discards cached +objects, it may cost a significant amount of I/O and CPU to recreate the +dropped objects, especially if they were under heavy use. Because of this, +use outside of a testing or debugging environment is not recommended. + +You may see informational messages in your kernel log when this file is +used:: + + cat (1234): drop_caches: 3 + +These are informational only. They do not mean that anything is wrong +with your system. To disable them, echo 4 (bit 2) into drop_caches. + + +extfrag_threshold +================= + +This parameter affects whether the kernel will compact memory or direct +reclaim to satisfy a high-order allocation. The extfrag/extfrag_index file in +debugfs shows what the fragmentation index for each order is in each zone in +the system. 
Values tending towards 0 imply allocations would fail due to lack +of memory, values towards 1000 imply failures are due to fragmentation and -1 +implies that the allocation will succeed as long as watermarks are met. + +The kernel will not compact memory in a zone if the +fragmentation index is <= extfrag_threshold. The default value is 500. + + +highmem_is_dirtyable +==================== + +Available only for systems with CONFIG_HIGHMEM enabled (32b systems). + +This parameter controls whether the high memory is considered for dirty +writers throttling. This is not the case by default which means that +only the amount of memory directly visible/usable by the kernel can +be dirtied. As a result, on systems with a large amount of memory and +lowmem basically depleted writers might be throttled too early and +streaming writes can get very slow. + +Changing the value to non zero would allow more memory to be dirtied +and thus allow writers to write more data which can be flushed to the +storage more effectively. Note this also comes with a risk of pre-mature +OOM killer because some writers (e.g. direct block device writes) can +only use the low memory and they can fill it up with dirty data without +any throttling. + + +hugetlb_shm_group +================= + +hugetlb_shm_group contains group id that is allowed to create SysV +shared memory segment using hugetlb page. + + +laptop_mode +=========== + +laptop_mode is a knob that controls "laptop mode". All the things that are +controlled by this knob are discussed in Documentation/laptops/laptop-mode.rst. + + +legacy_va_layout +================ + +If non-zero, this sysctl disables the new 32-bit mmap layout - the kernel +will use the legacy (2.4) layout for all processes. + + +lowmem_reserve_ratio +==================== + +For some specialised workloads on highmem machines it is dangerous for +the kernel to allow process memory to be allocated from the "lowmem" +zone. 
This is because that memory could then be pinned via the mlock() +system call, or by unavailability of swapspace. + +And on large highmem machines this lack of reclaimable lowmem memory +can be fatal. + +So the Linux page allocator has a mechanism which prevents allocations +which *could* use highmem from using too much lowmem. This means that +a certain amount of lowmem is defended from the possibility of being +captured into pinned user memory. + +(The same argument applies to the old 16 megabyte ISA DMA region. This +mechanism will also defend that region from allocations which could use +highmem or lowmem). + +The `lowmem_reserve_ratio` tunable determines how aggressive the kernel is +in defending these lower zones. + +If you have a machine which uses highmem or ISA DMA and your +applications are using mlock(), or if you are running with no swap then +you probably should change the lowmem_reserve_ratio setting. + +The lowmem_reserve_ratio is an array. You can see them by reading this file:: + + % cat /proc/sys/vm/lowmem_reserve_ratio + 256 256 32 + +But, these values are not used directly. The kernel calculates # of protection +pages for each zones from them. These are shown as array of protection pages +in /proc/zoneinfo like followings. (This is an example of x86-64 box). +Each zone has an array of protection pages like this:: + + Node 0, zone DMA + pages free 1355 + min 3 + low 3 + high 4 + : + : + numa_other 0 + protection: (0, 2004, 2004, 2004) + ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + pagesets + cpu: 0 pcp: 0 + : + +These protections are added to score to judge whether this zone should be used +for page allocation or should be reclaimed. + +In this example, if normal pages (index=2) are required to this DMA zone and +watermark[WMARK_HIGH] is used for watermark, the kernel judges this zone should +not be used because pages_free(1355) is smaller than watermark + protection[2] +(4 + 2004 = 2008). 
If this protection value is 0, this zone would be used for +normal page requirement. If requirement is DMA zone(index=0), protection[0] +(=0) is used. + +zone[i]'s protection[j] is calculated by following expression:: + + (i < j): + zone[i]->protection[j] + = (total sums of managed_pages from zone[i+1] to zone[j] on the node) + / lowmem_reserve_ratio[i]; + (i = j): + (should not be protected. = 0; + (i > j): + (not necessary, but looks 0) + +The default values of lowmem_reserve_ratio[i] are + + === ==================================== + 256 (if zone[i] means DMA or DMA32 zone) + 32 (others) + === ==================================== + +As above expression, they are reciprocal number of ratio. +256 means 1/256. # of protection pages becomes about "0.39%" of total managed +pages of higher zones on the node. + +If you would like to protect more pages, smaller values are effective. +The minimum value is 1 (1/1 -> 100%). The value less than 1 completely +disables protection of the pages. + + +max_map_count: +============== + +This file contains the maximum number of memory map areas a process +may have. Memory map areas are used as a side-effect of calling +malloc, directly by mmap, mprotect, and madvise, and also when loading +shared libraries. + +While most applications need less than a thousand maps, certain +programs, particularly malloc debuggers, may consume lots of them, +e.g., up to one or two maps per allocation. + +The default value is 65536. + + +memory_failure_early_kill: +========================== + +Control how to kill processes when uncorrected memory error (typically +a 2bit error in a memory module) is detected in the background by hardware +that cannot be handled by the kernel. In some cases (like the page +still having a valid copy on disk) the kernel will handle the failure +transparently without affecting any applications. But if there is +no other uptodate copy of the data it will kill to prevent any data +corruptions from propagating. 
+ +1: Kill all processes that have the corrupted and not reloadable page mapped +as soon as the corruption is detected. Note this is not supported +for a few types of pages, like kernel internally allocated data or +the swap cache, but works for the majority of user pages. + +0: Only unmap the corrupted page from all processes and only kill a process +that tries to access it. + +The kill is done using a catchable SIGBUS with BUS_MCEERR_AO, so processes can +handle this if they want to. + +This is only active on architectures/platforms with advanced machine +check handling and depends on the hardware capabilities. + +Applications can override this setting individually with the PR_MCE_KILL prctl. + + +memory_failure_recovery +======================= + +Enable memory failure recovery (when supported by the platform). + +1: Attempt recovery. + +0: Always panic on a memory failure. + + +min_free_kbytes +=============== + +This is used to force the Linux VM to keep a minimum number +of kilobytes free. The VM uses this number to compute a +watermark[WMARK_MIN] value for each lowmem zone in the system. +Each lowmem zone gets a number of reserved free pages based +proportionally on its size. + +Some minimal amount of memory is needed to satisfy PF_MEMALLOC +allocations; if you set this to lower than 1024KB, your system will +become subtly broken, and prone to deadlock under high loads. + +Setting this too high will OOM your machine instantly. + + +min_slab_ratio +============== + +This is available only on NUMA kernels. + +A percentage of the total pages in each zone. On Zone reclaim +(fallback from the local zone occurs) slabs will be reclaimed if more +than this percentage of pages in a zone are reclaimable slab pages. +This ensures that the slab growth stays under control even in NUMA +systems that rarely perform global reclaim. + +The default is 5 percent. + +Note that slab reclaim is triggered in a per zone / node fashion.
+The process of reclaiming slab memory is currently not node specific +and may not be fast. + + +min_unmapped_ratio +================== + +This is available only on NUMA kernels. + +This is a percentage of the total pages in each zone. Zone reclaim will +only occur if more than this percentage of pages are in a state that +zone_reclaim_mode allows to be reclaimed. + +If zone_reclaim_mode has the value 4 OR'd, then the percentage is compared +against all file-backed unmapped pages including swapcache pages and tmpfs +files. Otherwise, only unmapped pages backed by normal files but not tmpfs +files and similar are considered. + +The default is 1 percent. + + +mmap_min_addr +============= + +This file indicates the amount of address space which a user process will +be restricted from mmapping. Since kernel null dereference bugs could +accidentally operate based on the information in the first couple of pages +of memory userspace processes should not be allowed to write to them. By +default this value is set to 0 and no protections will be enforced by the +security module. Setting this value to something like 64k will allow the +vast majority of applications to work correctly and provide defense in depth +against future potential kernel bugs. + + +mmap_rnd_bits +============= + +This value can be used to select the number of bits to use to +determine the random offset to the base address of vma regions +resulting from mmap allocations on architectures which support +tuning address space randomization. This value will be bounded +by the architecture's minimum and maximum supported values. 
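To get a feel for what the number of bits means, the sketch below estimates the span covered by the random offset. It assumes the offset is chosen in whole pages, so that n bits select one of 2^n page-aligned positions. The helper name and the 4 KiB page size are illustrative assumptions, not part of any kernel interface, and the real layout depends on the architecture.

```python
def mmap_random_span_bytes(rnd_bits, page_size=4096):
    """Span of the random mmap base offset, in bytes.

    Hypothetical helper: assumes the random offset is a whole number of
    pages, so rnd_bits bits select one of 2**rnd_bits page positions.
    """
    return (1 << rnd_bits) * page_size


# Print the span for a few sample bit counts (sample values only).
for bits in (8, 16, 28):
    span = mmap_random_span_bytes(bits)
    print(f"{bits} bits -> {span // (1024 * 1024)} MiB of randomization range")
```

Under these assumptions each additional bit doubles the range, which is why the architecture bounds the value on both sides.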
+ +This value can be changed after boot using the +/proc/sys/vm/mmap_rnd_bits tunable + + +mmap_rnd_compat_bits +==================== + +This value can be used to select the number of bits to use to +determine the random offset to the base address of vma regions +resulting from mmap allocations for applications run in +compatibility mode on architectures which support tuning address +space randomization. This value will be bounded by the +architecture's minimum and maximum supported values. + +This value can be changed after boot using the +/proc/sys/vm/mmap_rnd_compat_bits tunable + + +nr_hugepages +============ + +Change the minimum size of the hugepage pool. + +See Documentation/admin-guide/mm/hugetlbpage.rst + + +nr_hugepages_mempolicy +====================== + +Change the size of the hugepage pool at run-time on a specific +set of NUMA nodes. + +See Documentation/admin-guide/mm/hugetlbpage.rst + + +nr_overcommit_hugepages +======================= + +Change the maximum size of the hugepage pool. The maximum is +nr_hugepages + nr_overcommit_hugepages. + +See Documentation/admin-guide/mm/hugetlbpage.rst + + +nr_trim_pages +============= + +This is available only on NOMMU kernels. + +This value adjusts the excess page trimming behaviour of power-of-2 aligned +NOMMU mmap allocations. + +A value of 0 disables trimming of allocations entirely, while a value of 1 +trims excess pages aggressively. Any value >= 1 acts as the watermark where +trimming of allocations is initiated. + +The default value is 1. + +See Documentation/nommu-mmap.txt for more information. + + +numa_zonelist_order +=================== + +This sysctl is only for NUMA and it is deprecated. Anything but +Node order will fail! + +'where the memory is allocated from' is controlled by zonelists. + +(This documentation ignores ZONE_HIGHMEM/ZONE_DMA32 for simple explanation. +you may be able to read ZONE_DMA as ZONE_DMA32...) + +In non-NUMA case, a zonelist for GFP_KERNEL is ordered as following. 
+ZONE_NORMAL -> ZONE_DMA +This means that a memory allocation request for GFP_KERNEL will +get memory from ZONE_DMA only when ZONE_NORMAL is not available. + +In the NUMA case, there are two possible types of order. +Assume a 2 node NUMA system; below is the zonelist of Node(0)'s GFP_KERNEL:: + + (A) Node(0) ZONE_NORMAL -> Node(0) ZONE_DMA -> Node(1) ZONE_NORMAL + (B) Node(0) ZONE_NORMAL -> Node(1) ZONE_NORMAL -> Node(0) ZONE_DMA. + +Type(A) offers the best locality for processes on Node(0), but ZONE_DMA +will be used before ZONE_NORMAL exhaustion. This increases the possibility of +out-of-memory (OOM) in ZONE_DMA because ZONE_DMA tends to be small. + +Type(B) cannot offer the best locality but is more robust against OOM of +the DMA zone. + +Type(A) is called "Node" order. Type(B) is called "Zone" order. + +"Node order" orders the zonelists by node, then by zone within each node. +Specify "[Nn]ode" for node order. + +"Zone Order" orders the zonelists by zone type, then by node within each +zone. Specify "[Zz]one" for zone order. + +Specify "[Dd]efault" to request automatic configuration. + +On 32-bit, the Normal zone needs to be preserved for allocations accessible +by the kernel, so "zone" order will be selected. + +On 64-bit, devices that require DMA32/DMA are relatively rare, so "node" +order will be selected. + +Default order is recommended unless this is causing problems for your +system/application. + + +oom_dump_tasks +============== + +Enables a system-wide task dump (excluding kernel threads) to be produced +when the kernel performs an OOM kill and includes such information as +pid, uid, tgid, vm size, rss, pgtables_bytes, swapents, oom_score_adj +score, and name. This is helpful to determine why the OOM killer was +invoked, to identify the rogue task that caused it, and to determine why +the OOM killer chose the task it did to kill. + +If this is set to zero, this information is suppressed.
On very +large systems with thousands of tasks it may not be feasible to dump +the memory state information for each one. Such systems should not +be forced to incur a performance penalty in OOM conditions when the +information may not be desired. + +If this is set to non-zero, this information is shown whenever the +OOM killer actually kills a memory-hogging task. + +The default value is 1 (enabled). + + +oom_kill_allocating_task +======================== + +This enables or disables killing the OOM-triggering task in +out-of-memory situations. + +If this is set to zero, the OOM killer will scan through the entire +tasklist and select a task based on heuristics to kill. This normally +selects a rogue memory-hogging task that frees up a large amount of +memory when killed. + +If this is set to non-zero, the OOM killer simply kills the task that +triggered the out-of-memory condition. This avoids the expensive +tasklist scan. + +If panic_on_oom is selected, it takes precedence over whatever value +is used in oom_kill_allocating_task. + +The default value is 0. + + +overcommit_kbytes +================= + +When overcommit_memory is set to 2, the committed address space is not +permitted to exceed swap plus this amount of physical RAM. See below. + +Note: overcommit_kbytes is the counterpart of overcommit_ratio. Only one +of them may be specified at a time. Setting one disables the other (which +then appears as 0 when read). + + +overcommit_memory +================= + +This value contains a flag that enables memory overcommitment. + +When this flag is 0, the kernel attempts to estimate the amount +of free memory left when userspace requests more memory. + +When this flag is 1, the kernel pretends there is always enough +memory until it actually runs out. + +When this flag is 2, the kernel uses a "never overcommit" +policy that attempts to prevent any overcommit of memory. +Note that user_reserve_kbytes affects this policy. 
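As a rough illustration of the "never overcommit" policy (mode 2), the sketch below computes the commit ceiling from the descriptions of overcommit_ratio and overcommit_kbytes in this file: swap plus either a percentage of physical RAM or a fixed number of kilobytes. The function name, the ratio of 50 and the memory sizes are illustrative assumptions. The kernel's real accounting applies further adjustments (for example the user and admin reserves) that are left out here.

```python
def commit_limit_kb(ram_kb, swap_kb, overcommit_ratio=50, overcommit_kbytes=0):
    """Approximate commit ceiling in kB for overcommit_memory=2.

    Hypothetical helper, not a kernel API. Only one of overcommit_ratio
    and overcommit_kbytes is in effect at a time; a non-zero
    overcommit_kbytes is treated here as the active setting.
    """
    if overcommit_kbytes:
        allowed_ram_kb = overcommit_kbytes
    else:
        allowed_ram_kb = ram_kb * overcommit_ratio // 100
    return swap_kb + allowed_ram_kb


# Sample figures: 16 GiB of RAM, 4 GiB of swap, ratio 50.
print(commit_limit_kb(16 * 1024 * 1024, 4 * 1024 * 1024))  # 12582912 kB (12 GiB)
```

With these sample figures, allocations that would push the committed address space past about 12 GiB are refused instead of being granted and possibly failing later.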
+ +This feature can be very useful because there are a lot of +programs that malloc() huge amounts of memory "just-in-case" +and don't use much of it. + +The default value is 0. + +See Documentation/vm/overcommit-accounting.rst and +mm/util.c::__vm_enough_memory() for more information. + + +overcommit_ratio +================ + +When overcommit_memory is set to 2, the committed address +space is not permitted to exceed swap plus this percentage +of physical RAM. See above. + + +page-cluster +============ + +page-cluster controls the number of pages up to which consecutive pages +are read in from swap in a single attempt. This is the swap counterpart +to page cache readahead. +The mentioned consecutivity is not in terms of virtual/physical addresses, +but consecutive on swap space - that means they were swapped out together. + +It is a logarithmic value - setting it to zero means "1 page", setting +it to 1 means "2 pages", setting it to 2 means "4 pages", etc. +Zero disables swap readahead completely. + +The default value is three (eight pages at a time). There may be some +small benefits in tuning this to a different value if your workload is +swap-intensive. + +Lower values mean lower latencies for initial faults, but at the same time +extra faults and I/O delays for following faults if they would have been part of +that consecutive pages readahead would have brought in. + + +panic_on_oom +============ + +This enables or disables panic on out-of-memory feature. + +If this is set to 0, the kernel will kill some rogue process, +called oom_killer. Usually, oom_killer can kill rogue processes and +system will survive. + +If this is set to 1, the kernel panics when out-of-memory happens. +However, if a process limits using nodes by mempolicy/cpusets, +and those nodes become memory exhaustion status, one process +may be killed by oom-killer. No panic occurs in this case. +Because other nodes' memory may be free. This means system total status +may be not fatal yet. 
+ +If this is set to 2, the kernel panics compulsorily even on the +above-mentioned. Even oom happens under memory cgroup, the whole +system panics. + +The default value is 0. + +1 and 2 are for failover of clustering. Please select either +according to your policy of failover. + +panic_on_oom=2+kdump gives you very strong tool to investigate +why oom happens. You can get snapshot. + + +percpu_pagelist_fraction +======================== + +This is the fraction of pages at most (high mark pcp->high) in each zone that +are allocated for each per cpu page list. The min value for this is 8. It +means that we don't allow more than 1/8th of pages in each zone to be +allocated in any single per_cpu_pagelist. This entry only changes the value +of hot per cpu pagelists. User can specify a number like 100 to allocate +1/100th of each zone to each per cpu page list. + +The batch value of each per cpu pagelist is also updated as a result. It is +set to pcp->high/4. The upper limit of batch is (PAGE_SHIFT * 8) + +The initial value is zero. Kernel does not use this value at boot time to set +the high water marks for each per cpu page list. If the user writes '0' to this +sysctl, it will revert to this default behavior. + + +stat_interval +============= + +The time interval between which vm statistics are updated. The default +is 1 second. + + +stat_refresh +============ + +Any read or write (by root only) flushes all the per-cpu vm statistics +into their global totals, for more accurate reports when testing +e.g. cat /proc/sys/vm/stat_refresh /proc/meminfo + +As a side-effect, it also checks for negative totals (elsewhere reported +as 0) and "fails" with EINVAL if any are found, with a warning in dmesg. +(At time of writing, a few stats are known sometimes to be found negative, +with no ill effects: errors and warnings on these stats are suppressed.) + + +numa_stat +========= + +This interface allows runtime configuration of numa statistics. 
+ +When page allocation performance becomes a bottleneck and you can tolerate +some possible tool breakage and decreased numa counter precision, you can +do:: + + echo 0 > /proc/sys/vm/numa_stat + +When page allocation performance is not a bottleneck and you want all +tooling to work, you can do:: + + echo 1 > /proc/sys/vm/numa_stat + + +swappiness +========== + +This control is used to define how aggressively the kernel will swap +memory pages. Higher values will increase aggressiveness, lower values +decrease the amount of swap. A value of 0 instructs the kernel not to +initiate swap until the amount of free and file-backed pages is less +than the high water mark in a zone. + +The default value is 60. + + +unprivileged_userfaultfd +======================== + +This flag controls whether unprivileged users can use the userfaultfd +system calls. Set this to 1 to allow unprivileged users to use the +userfaultfd system calls, or set this to 0 to restrict userfaultfd to only +privileged users (with CAP_SYS_PTRACE capability). + +The default value is 1. + + +user_reserve_kbytes +=================== + +When overcommit_memory is set to 2, "never overcommit" mode, reserve +min(3% of current process size, user_reserve_kbytes) of free memory. +This is intended to prevent a user from starting a single memory hogging +process, such that they cannot recover (kill the hog). + +user_reserve_kbytes defaults to min(3% of the current process size, 128MB). + +If this is reduced to zero, then the user will be allowed to allocate +all free memory with a single process, minus admin_reserve_kbytes. +Any subsequent attempts to execute a command will result in +"fork: Cannot allocate memory". + +Changing this takes effect whenever an application requests memory. + + +vfs_cache_pressure +================== + +This percentage value controls the tendency of the kernel to reclaim +the memory which is used for caching of directory and inode objects.
+ +At the default value of vfs_cache_pressure=100 the kernel will attempt to +reclaim dentries and inodes at a "fair" rate with respect to pagecache and +swapcache reclaim. Decreasing vfs_cache_pressure causes the kernel to prefer +to retain dentry and inode caches. When vfs_cache_pressure=0, the kernel will +never reclaim dentries and inodes due to memory pressure and this can easily +lead to out-of-memory conditions. Increasing vfs_cache_pressure beyond 100 +causes the kernel to prefer to reclaim dentries and inodes. + +Increasing vfs_cache_pressure significantly beyond 100 may have negative +performance impact. Reclaim code needs to take various locks to find freeable +directory and inode objects. With vfs_cache_pressure=1000, it will look for +ten times more freeable objects than there are. + + +watermark_boost_factor +====================== + +This factor controls the level of reclaim when memory is being fragmented. +It defines the percentage of the high watermark of a zone that will be +reclaimed if pages of different mobility are being mixed within pageblocks. +The intent is that compaction has less work to do in the future and to +increase the success rate of future high-order allocations such as SLUB +allocations, THP and hugetlbfs pages. + +To make it sensible with respect to the watermark_scale_factor +parameter, the unit is in fractions of 10,000. The default value of +15,000 on !DISCONTIGMEM configurations means that up to 150% of the high +watermark will be reclaimed in the event of a pageblock being mixed due +to fragmentation. The level of reclaim is determined by the number of +fragmentation events that occurred in the recent past. If this value is +smaller than a pageblock then a pageblocks worth of pages will be reclaimed +(e.g. 2MB on 64-bit x86). A boost factor of 0 will disable the feature. + + +watermark_scale_factor +====================== + +This factor controls the aggressiveness of kswapd. 
It defines the +amount of memory left in a node/system before kswapd is woken up and +how much memory needs to be free before kswapd goes back to sleep. + +The unit is in fractions of 10,000. The default value of 10 means the +distances between watermarks are 0.1% of the available memory in the +node/system. The maximum value is 1000, or 10% of memory. + +A high rate of threads entering direct reclaim (allocstall) or kswapd +going to sleep prematurely (kswapd_low_wmark_hit_quickly) can indicate +that the number of free pages kswapd maintains for latency reasons is +too small for the allocation bursts occurring in the system. This knob +can then be used to tune kswapd aggressiveness accordingly. + + +zone_reclaim_mode +================= + +Zone_reclaim_mode allows someone to set more or less aggressive approaches to +reclaim memory when a zone runs out of memory. If it is set to zero then no +zone reclaim occurs. Allocations will be satisfied from other zones / nodes +in the system. + +This is value OR'ed together of + += =================================== +1 Zone reclaim on +2 Zone reclaim writes dirty pages out +4 Zone reclaim swaps pages += =================================== + +zone_reclaim_mode is disabled by default. For file servers or workloads +that benefit from having their data cached, zone_reclaim_mode should be +left disabled as the caching effect is likely to be more important than +data locality. + +zone_reclaim may be enabled if it's known that the workload is partitioned +such that each partition fits within a NUMA node and that accessing remote +memory would cause a measurable performance reduction. The page allocator +will then reclaim easily reusable pages (those page cache pages that are +currently not used) before allocating off node pages. + +Allowing zone reclaim to write out pages stops processes that are +writing large amounts of data from dirtying pages on other nodes. 
Zone +reclaim will write out dirty pages if a zone fills up and so effectively +throttle the process. This may decrease the performance of a single process +since it cannot use all of system memory to buffer the outgoing writes +anymore but it preserves the memory on other nodes so that the performance +of other processes running on other nodes will not be affected. + +Allowing regular swap effectively restricts allocations to the local +node unless explicitly overridden by memory policies or cpuset +configurations. diff --git a/Documentation/core-api/printk-formats.rst b/Documentation/core-api/printk-formats.rst index 1d8e748f909f..c6224d039bcb 100644 --- a/Documentation/core-api/printk-formats.rst +++ b/Documentation/core-api/printk-formats.rst @@ -119,7 +119,7 @@ Kernel Pointers For printing kernel pointers which should be hidden from unprivileged users. The behaviour of %pK depends on the kptr_restrict sysctl - see -Documentation/sysctl/kernel.rst for more details. +Documentation/admin-guide/sysctl/kernel.rst for more details. Unmodified Addresses -------------------- diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt index d750b6926899..fb4735fd73b0 100644 --- a/Documentation/filesystems/proc.txt +++ b/Documentation/filesystems/proc.txt @@ -1500,7 +1500,7 @@ review the kernel documentation in the directory /usr/src/linux/Documentation. This chapter is heavily based on the documentation included in the pre 2.2 kernels, and became part of it in version 2.2.1 of the Linux kernel. -Please see: Documentation/sysctl/ directory for descriptions of these +Please see: Documentation/admin-guide/sysctl/ directory for descriptions of these entries.
------------------------------------------------------------------------------ diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt index 5c3399cde1c4..df33674799b5 100644 --- a/Documentation/networking/ip-sysctl.txt +++ b/Documentation/networking/ip-sysctl.txt @@ -2287,7 +2287,7 @@ addr_scope_policy - INTEGER /proc/sys/net/core/* - Please see: Documentation/sysctl/net.rst for descriptions of these entries. + Please see: Documentation/admin-guide/sysctl/net.rst for descriptions of these entries. /proc/sys/net/unix/* diff --git a/Documentation/sysctl/abi.rst b/Documentation/sysctl/abi.rst deleted file mode 100644 index 599bcde7f0b7..000000000000 --- a/Documentation/sysctl/abi.rst +++ /dev/null @@ -1,67 +0,0 @@ -================================ -Documentation for /proc/sys/abi/ -================================ - -kernel version 2.6.0.test2 - -Copyright (c) 2003, Fabian Frederick - -For general info: index.rst. - ------------------------------------------------------------------------------- - -This path is binary emulation relevant aka personality types aka abi. -When a process is executed, it's linked to an exec_domain whose -personality is defined using values available from /proc/sys/abi. -You can find further details about abi in include/linux/personality.h. 
- -Here are the files featuring in 2.6 kernel: - -- defhandler_coff -- defhandler_elf -- defhandler_lcall7 -- defhandler_libcso -- fake_utsname -- trace - -defhandler_coff ---------------- - -defined value: - PER_SCOSVR3:: - - 0x0003 | STICKY_TIMEOUTS | WHOLE_SECONDS | SHORT_INODE - -defhandler_elf --------------- - -defined value: - PER_LINUX:: - - 0 - -defhandler_lcall7 ------------------ - -defined value : - PER_SVR4:: - - 0x0001 | STICKY_TIMEOUTS | MMAP_PAGE_ZERO, - -defhandler_libsco ------------------ - -defined value: - PER_SVR4:: - - 0x0001 | STICKY_TIMEOUTS | MMAP_PAGE_ZERO, - -fake_utsname ------------- - -Unused - -trace ------ - -Unused diff --git a/Documentation/sysctl/fs.rst b/Documentation/sysctl/fs.rst deleted file mode 100644 index 2a45119e3331..000000000000 --- a/Documentation/sysctl/fs.rst +++ /dev/null @@ -1,384 +0,0 @@ -=============================== -Documentation for /proc/sys/fs/ -=============================== - -kernel version 2.2.10 - -Copyright (c) 1998, 1999, Rik van Riel - -Copyright (c) 2009, Shen Feng - -For general info and legal blurb, please look in intro.rst. - ------------------------------------------------------------------------------- - -This file contains documentation for the sysctl files in -/proc/sys/fs/ and is valid for Linux kernel version 2.2. - -The files in this directory can be used to tune and monitor -miscellaneous and general things in the operation of the Linux -kernel. Since some of the files _can_ be used to screw up your -system, it is advisable to read both documentation and source -before actually making adjustments. - -1. 
/proc/sys/fs -=============== - -Currently, these files are in /proc/sys/fs: - -- aio-max-nr -- aio-nr -- dentry-state -- dquot-max -- dquot-nr -- file-max -- file-nr -- inode-max -- inode-nr -- inode-state -- nr_open -- overflowuid -- overflowgid -- pipe-user-pages-hard -- pipe-user-pages-soft -- protected_fifos -- protected_hardlinks -- protected_regular -- protected_symlinks -- suid_dumpable -- super-max -- super-nr - - -aio-nr & aio-max-nr -------------------- - -aio-nr is the running total of the number of events specified on the -io_setup system call for all currently active aio contexts. If aio-nr -reaches aio-max-nr then io_setup will fail with EAGAIN. Note that -raising aio-max-nr does not result in the pre-allocation or re-sizing -of any kernel data structures. - - -dentry-state ------------- - -From linux/include/linux/dcache.h:: - - struct dentry_stat_t dentry_stat { - int nr_dentry; - int nr_unused; - int age_limit; /* age in seconds */ - int want_pages; /* pages requested by system */ - int nr_negative; /* # of unused negative dentries */ - int dummy; /* Reserved for future use */ - }; - -Dentries are dynamically allocated and deallocated. - -nr_dentry shows the total number of dentries allocated (active -+ unused). nr_unused shows the number of dentries that are not -actively used, but are saved in the LRU list for future reuse. - -Age_limit is the age in seconds after which dcache entries -can be reclaimed when memory is short and want_pages is -nonzero when shrink_dcache_pages() has been called and the -dcache isn't pruned yet. - -nr_negative shows the number of unused dentries that are also -negative dentries which do not map to any files. Instead, -they help speeding up rejection of non-existing files provided -by the users. - - -dquot-max & dquot-nr --------------------- - -The file dquot-max shows the maximum number of cached disk -quota entries. 
- -The file dquot-nr shows the number of allocated disk quota -entries and the number of free disk quota entries. - -If the number of free cached disk quotas is very low and -you have some awesome number of simultaneous system users, -you might want to raise the limit. - - -file-max & file-nr ------------------- - -The value in file-max denotes the maximum number of file- -handles that the Linux kernel will allocate. When you get lots -of error messages about running out of file handles, you might -want to increase this limit. - -Historically,the kernel was able to allocate file handles -dynamically, but not to free them again. The three values in -file-nr denote the number of allocated file handles, the number -of allocated but unused file handles, and the maximum number of -file handles. Linux 2.6 always reports 0 as the number of free -file handles -- this is not an error, it just means that the -number of allocated file handles exactly matches the number of -used file handles. - -Attempts to allocate more file descriptors than file-max are -reported with printk, look for "VFS: file-max limit -reached". - - -nr_open -------- - -This denotes the maximum number of file-handles a process can -allocate. Default value is 1024*1024 (1048576) which should be -enough for most machines. Actual limit depends on RLIMIT_NOFILE -resource limit. - - -inode-max, inode-nr & inode-state ---------------------------------- - -As with file handles, the kernel allocates the inode structures -dynamically, but can't free them yet. - -The value in inode-max denotes the maximum number of inode -handlers. This value should be 3-4 times larger than the value -in file-max, since stdin, stdout and network sockets also -need an inode struct to handle them. When you regularly run -out of inodes, you need to increase this value. - -The file inode-nr contains the first two items from -inode-state, so we'll skip to that file... - -Inode-state contains three actual numbers and four dummies. 
-The actual numbers are, in order of appearance, nr_inodes, -nr_free_inodes and preshrink. - -Nr_inodes stands for the number of inodes the system has -allocated, this can be slightly more than inode-max because -Linux allocates them one pageful at a time. - -Nr_free_inodes represents the number of free inodes (?) and -preshrink is nonzero when the nr_inodes > inode-max and the -system needs to prune the inode list instead of allocating -more. - - -overflowgid & overflowuid -------------------------- - -Some filesystems only support 16-bit UIDs and GIDs, although in Linux -UIDs and GIDs are 32 bits. When one of these filesystems is mounted -with writes enabled, any UID or GID that would exceed 65535 is translated -to a fixed value before being written to disk. - -These sysctls allow you to change the value of the fixed UID and GID. -The default is 65534. - - -pipe-user-pages-hard --------------------- - -Maximum total number of pages a non-privileged user may allocate for pipes. -Once this limit is reached, no new pipes may be allocated until usage goes -below the limit again. When set to 0, no limit is applied, which is the default -setting. - - -pipe-user-pages-soft --------------------- - -Maximum total number of pages a non-privileged user may allocate for pipes -before the pipe size gets limited to a single page. Once this limit is reached, -new pipes will be limited to a single page in size for this user in order to -limit total memory usage, and trying to increase them using fcntl() will be -denied until usage goes below the limit again. The default value allows to -allocate up to 1024 pipes at their default size. When set to 0, no limit is -applied. - - -protected_fifos ---------------- - -The intent of this protection is to avoid unintentional writes to -an attacker-controlled FIFO, where a program expected to create a regular -file. - -When set to "0", writing to FIFOs is unrestricted. 
- -When set to "1" don't allow O_CREAT open on FIFOs that we don't own -in world writable sticky directories, unless they are owned by the -owner of the directory. - -When set to "2" it also applies to group writable sticky directories. - -This protection is based on the restrictions in Openwall. - - -protected_hardlinks --------------------- - -A long-standing class of security issues is the hardlink-based -time-of-check-time-of-use race, most commonly seen in world-writable -directories like /tmp. The common method of exploitation of this flaw -is to cross privilege boundaries when following a given hardlink (i.e. a -root process follows a hardlink created by another user). Additionally, -on systems without separated partitions, this stops unauthorized users -from "pinning" vulnerable setuid/setgid files against being upgraded by -the administrator, or linking to special files. - -When set to "0", hardlink creation behavior is unrestricted. - -When set to "1" hardlinks cannot be created by users if they do not -already own the source file, or do not have read/write access to it. - -This protection is based on the restrictions in Openwall and grsecurity. - - -protected_regular ------------------ - -This protection is similar to protected_fifos, but it -avoids writes to an attacker-controlled regular file, where a program -expected to create one. - -When set to "0", writing to regular files is unrestricted. - -When set to "1" don't allow O_CREAT open on regular files that we -don't own in world writable sticky directories, unless they are -owned by the owner of the directory. - -When set to "2" it also applies to group writable sticky directories. - - -protected_symlinks ------------------- - -A long-standing class of security issues is the symlink-based -time-of-check-time-of-use race, most commonly seen in world-writable -directories like /tmp. The common method of exploitation of this flaw -is to cross privilege boundaries when following a given symlink (i.e. 
a -root process follows a symlink belonging to another user). For a likely -incomplete list of hundreds of examples across the years, please see: -http://cve.mitre.org/cgi-bin/cvekey.cgi?keyword=/tmp - -When set to "0", symlink following behavior is unrestricted. - -When set to "1" symlinks are permitted to be followed only when outside -a sticky world-writable directory, or when the uid of the symlink and -follower match, or when the directory owner matches the symlink's owner. - -This protection is based on the restrictions in Openwall and grsecurity. - - -suid_dumpable: --------------- - -This value can be used to query and set the core dump mode for setuid -or otherwise protected/tainted binaries. The modes are - -= ========== =============================================================== -0 (default) traditional behaviour. Any process which has changed - privilege levels or is execute only will not be dumped. -1 (debug) all processes dump core when possible. The core dump is - owned by the current user and no security is applied. This is - intended for system debugging situations only. - Ptrace is unchecked. - This is insecure as it allows regular users to examine the - memory contents of privileged processes. -2 (suidsafe) any binary which normally would not be dumped is dumped - anyway, but only if the "core_pattern" kernel sysctl is set to - either a pipe handler or a fully qualified path. (For more - details on this limitation, see CVE-2006-2451.) This mode is - appropriate when administrators are attempting to debug - problems in a normal environment, and either have a core dump - pipe handler that knows to treat privileged core dumps with - care, or specific directory defined for catching core dumps. - If a core dump happens without a pipe handler or fully - qualified path, a message will be emitted to syslog warning - about the lack of a correct setting. 
-= ========== =============================================================== - - -super-max & super-nr --------------------- - -These numbers control the maximum number of superblocks, and -thus the maximum number of mounted filesystems the kernel -can have. You only need to increase super-max if you need to -mount more filesystems than the current value in super-max -allows you to. - - -aio-nr & aio-max-nr -------------------- - -aio-nr shows the current system-wide number of asynchronous io -requests. aio-max-nr allows you to change the maximum value -aio-nr can grow to. - - -mount-max ---------- - -This denotes the maximum number of mounts that may exist -in a mount namespace. - - - -2. /proc/sys/fs/binfmt_misc -=========================== - -Documentation for the files in /proc/sys/fs/binfmt_misc is -in Documentation/admin-guide/binfmt-misc.rst. - - -3. /proc/sys/fs/mqueue - POSIX message queues filesystem -======================================================== - - -The "mqueue" filesystem provides the necessary kernel features to enable the -creation of a user space library that implements the POSIX message queues -API (as noted by the MSG tag in the POSIX 1003.1-2001 version of the System -Interfaces specification.) - -The "mqueue" filesystem contains values for determining/setting the amount of -resources used by the file system. - -/proc/sys/fs/mqueue/queues_max is a read/write file for setting/getting the -maximum number of message queues allowed on the system. - -/proc/sys/fs/mqueue/msg_max is a read/write file for setting/getting the -maximum number of messages in a queue value. In fact it is the limiting value -for another (user) limit which is set in mq_open invocation. This attribute of -a queue must be less than or equal to msg_max. - -/proc/sys/fs/mqueue/msgsize_max is a read/write file for setting/getting the -maximum message size value (it is every message queue's attribute set during -its creation).
- -/proc/sys/fs/mqueue/msg_default is a read/write file for setting/getting the -default number of messages in a queue value if attr parameter of mq_open(2) is -NULL. If it exceeds msg_max, the default value is initialized to msg_max. - -/proc/sys/fs/mqueue/msgsize_default is a read/write file for setting/getting -the default message size value if attr parameter of mq_open(2) is NULL. If it -exceeds msgsize_max, the default value is initialized to msgsize_max. - -4. /proc/sys/fs/epoll - Configuration options for the epoll interface -===================================================================== - -This directory contains configuration options for the epoll(7) interface. - -max_user_watches ----------------- - -Every epoll file descriptor can store a number of files to be monitored -for event readiness. Each one of these monitored files constitutes a "watch". -This configuration option sets the maximum number of "watches" that are -allowed for each user. -Each "watch" costs roughly 90 bytes on a 32bit kernel, and roughly 160 bytes -on a 64bit one. -The current default value for max_user_watches is 1/32 of the available -low memory, divided by the "watch" cost in bytes. diff --git a/Documentation/sysctl/index.rst b/Documentation/sysctl/index.rst deleted file mode 100644 index efbcde8c1c9c..000000000000 --- a/Documentation/sysctl/index.rst +++ /dev/null @@ -1,100 +0,0 @@ -:orphan: - -=========================== -Documentation for /proc/sys -=========================== - -Copyright (c) 1998, 1999, Rik van Riel - ------------------------------------------------------------------------------- - -'Why', I hear you ask, 'would anyone even _want_ documentation -for them sysctl files? If anybody really needs it, it's all in -the source...' - -Well, this documentation is written because some people either -don't know they need to tweak something, or because they don't -have the time or knowledge to read the source code.
- -Furthermore, the programmers who built sysctl have built it to -be actually used, not just for the fun of programming it :-) - ------------------------------------------------------------------------------- - -Legal blurb: - -As usual, there are two main things to consider: - -1. you get what you pay for -2. it's free - -The consequences are that I won't guarantee the correctness of -this document, and if you come to me complaining about how you -screwed up your system because of wrong documentation, I won't -feel sorry for you. I might even laugh at you... - -But of course, if you _do_ manage to screw up your system using -only the sysctl options used in this file, I'd like to hear of -it. Not only to have a great laugh, but also to make sure that -you're the last RTFMing person to screw up. - -In short, e-mail your suggestions, corrections and / or horror -stories to: - -Rik van Riel. - --------------------------------------------------------------- - -Introduction -============ - -Sysctl is a means of configuring certain aspects of the kernel -at run-time, and the /proc/sys/ directory is there so that you -don't even need special tools to do it! -In fact, there are only four things needed to use these config -facilities: - -- a running Linux system -- root access -- common sense (this is especially hard to come by these days) -- knowledge of what all those values mean - -As a quick 'ls /proc/sys' will show, the directory consists of -several (arch-dependent?) subdirs. Each subdir is mainly about -one part of the kernel, so you can do configuration on a piece -by piece basis, or just some 'thematic frobbing'. 
- -This documentation is about: - -=============== =============================================================== -abi/ execution domains & personalities -debug/ -dev/ device specific information (eg dev/cdrom/info) -fs/ specific filesystems - filehandle, inode, dentry and quota tuning - binfmt_misc -kernel/ global kernel info / tuning - miscellaneous stuff -net/ networking stuff, for documentation look in: - -proc/ -sunrpc/ SUN Remote Procedure Call (NFS) -vm/ memory management tuning - buffer and cache management -user/ Per user per user namespace limits -=============== =============================================================== - -These are the subdirs I have on my system. There might be more -or other subdirs in another setup. If you see another dir, I'd -really like to hear about it :-) - -.. toctree:: - :maxdepth: 1 - - abi - fs - kernel - net - sunrpc - user - vm diff --git a/Documentation/sysctl/kernel.rst b/Documentation/sysctl/kernel.rst deleted file mode 100644 index a0c1d4ce403a..000000000000 --- a/Documentation/sysctl/kernel.rst +++ /dev/null @@ -1,1177 +0,0 @@ -=================================== -Documentation for /proc/sys/kernel/ -=================================== - -kernel version 2.2.10 - -Copyright (c) 1998, 1999, Rik van Riel - -Copyright (c) 2009, Shen Feng - -For general info and legal blurb, please look in index.rst. - ------------------------------------------------------------------------------- - -This file contains documentation for the sysctl files in -/proc/sys/kernel/ and is valid for Linux kernel version 2.2. - -The files in this directory can be used to tune and monitor -miscellaneous and general things in the operation of the Linux -kernel. Since some of the files _can_ be used to screw up your -system, it is advisable to read both documentation and source -before actually making adjustments. 
- -Currently, these files might (depending on your configuration) -show up in /proc/sys/kernel: - -- acct -- acpi_video_flags -- auto_msgmni -- bootloader_type [ X86 only ] -- bootloader_version [ X86 only ] -- cap_last_cap -- core_pattern -- core_pipe_limit -- core_uses_pid -- ctrl-alt-del -- dmesg_restrict -- domainname -- hostname -- hotplug -- hardlockup_all_cpu_backtrace -- hardlockup_panic -- hung_task_panic -- hung_task_check_count -- hung_task_timeout_secs -- hung_task_check_interval_secs -- hung_task_warnings -- hyperv_record_panic_msg -- kexec_load_disabled -- kptr_restrict -- l2cr [ PPC only ] -- modprobe ==> Documentation/debugging-modules.txt -- modules_disabled -- msg_next_id [ sysv ipc ] -- msgmax -- msgmnb -- msgmni -- nmi_watchdog -- osrelease -- ostype -- overflowgid -- overflowuid -- panic -- panic_on_oops -- panic_on_stackoverflow -- panic_on_unrecovered_nmi -- panic_on_warn -- panic_print -- panic_on_rcu_stall -- perf_cpu_time_max_percent -- perf_event_paranoid -- perf_event_max_stack -- perf_event_mlock_kb -- perf_event_max_contexts_per_stack -- pid_max -- powersave-nap [ PPC only ] -- printk -- printk_delay -- printk_ratelimit -- printk_ratelimit_burst -- pty ==> Documentation/filesystems/devpts.txt -- randomize_va_space -- real-root-dev ==> Documentation/admin-guide/initrd.rst -- reboot-cmd [ SPARC only ] -- rtsig-max -- rtsig-nr -- sched_energy_aware -- seccomp/ ==> Documentation/userspace-api/seccomp_filter.rst -- sem -- sem_next_id [ sysv ipc ] -- sg-big-buff [ generic SCSI device (sg) ] -- shm_next_id [ sysv ipc ] -- shm_rmid_forced -- shmall -- shmmax [ sysv ipc ] -- shmmni -- softlockup_all_cpu_backtrace -- soft_watchdog -- stack_erasing -- stop-a [ SPARC only ] -- sysrq ==> Documentation/admin-guide/sysrq.rst -- sysctl_writes_strict -- tainted ==> Documentation/admin-guide/tainted-kernels.rst -- threads-max -- unknown_nmi_panic -- watchdog -- watchdog_thresh -- version - - -acct: -===== - -highwater lowwater frequency - -If BSD-style 
process accounting is enabled these values control -its behaviour. If free space on filesystem where the log lives -goes below <lowwater>% accounting suspends. If free space gets -above <highwater>% accounting resumes. <Frequency> determines -how often do we check the amount of free space (value is in -seconds). Default: -4 2 30 -That is, suspend accounting if <= 2% free space is left; resume it -if we get >= 4%; consider information about amount of free space -valid for 30 seconds. - - -acpi_video_flags: -================= - -flags - -See Doc*/kernel/power/video.txt, it allows mode of video boot to be -set during run time. - - -auto_msgmni: -============ - -This variable has no effect and may be removed in future kernel -releases. Reading it always returns 0. -Up to Linux 3.17, it enabled/disabled automatic recomputing of msgmni -upon memory add/remove or upon ipc namespace creation/removal. -Echoing "1" into this file enabled msgmni automatic recomputing. -Echoing "0" turned it off. auto_msgmni default value was 1. - - -bootloader_type: -================ - -x86 bootloader identification - -This gives the bootloader type number as indicated by the bootloader, -shifted left by 4, and OR'd with the low four bits of the bootloader -version. The reason for this encoding is that this used to match the -type_of_loader field in the kernel header; the encoding is kept for -backwards compatibility. That is, if the full bootloader type number -is 0x15 and the full version number is 0x234, this file will contain -the value 340 = 0x154. - -See the type_of_loader and ext_loader_type fields in -Documentation/x86/boot.rst for additional information. - - -bootloader_version: -=================== - -x86 bootloader version - -The complete bootloader version number. In the example above, this -file will contain the value 564 = 0x234. - -See the type_of_loader and ext_loader_ver fields in -Documentation/x86/boot.rst for additional information.
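The bootloader_type encoding described above (type shifted left by 4, OR'd with the low four bits of the version) can be checked with a short calculation. This is only an illustrative sketch: the function name `encode_bootloader_type` is hypothetical and is not part of any kernel interface; the input values are the ones used in the text.

```python
def encode_bootloader_type(loader_type: int, loader_version: int) -> int:
    """Hypothetical helper mirroring the encoding described for bootloader_type."""
    # Type number shifted left by 4, OR'd with the low four bits of the version.
    return (loader_type << 4) | (loader_version & 0xF)


# Values taken from the text: full type 0x15, full version 0x234.
value = encode_bootloader_type(0x15, 0x234)
print(value, hex(value))  # 340 0x154
```

The full version number (0x234 = 564) is not recoverable from this value alone, which is why bootloader_version exists as a separate file.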
- - -cap_last_cap: -============= - -Highest valid capability of the running kernel. Exports -CAP_LAST_CAP from the kernel. - - -core_pattern: -============= - -core_pattern is used to specify a core dumpfile pattern name. - -* max length 127 characters; default value is "core" -* core_pattern is used as a pattern template for the output filename; - certain string patterns (beginning with '%') are substituted with - their actual values. -* backward compatibility with core_uses_pid: - - If core_pattern does not include "%p" (default does not) - and core_uses_pid is set, then .PID will be appended to - the filename. - -* corename format specifiers:: - - %<NUL> '%' is dropped - %% output one '%' - %p pid - %P global pid (init PID namespace) - %i tid - %I global tid (init PID namespace) - %u uid (in initial user namespace) - %g gid (in initial user namespace) - %d dump mode, matches PR_SET_DUMPABLE and - /proc/sys/fs/suid_dumpable - %s signal number - %t UNIX time of dump - %h hostname - %e executable filename (may be shortened) - %E executable path - %<OTHER> both are dropped - -* If the first character of the pattern is a '|', the kernel will treat - the rest of the pattern as a command to run. The core dump will be - written to the standard input of that program instead of to a file. - - -core_pipe_limit: -================ - -This sysctl is only applicable when core_pattern is configured to pipe -core files to a user space helper (when the first character of -core_pattern is a '|', see above). When collecting cores via a pipe -to an application, it is occasionally useful for the collecting -application to gather data about the crashing process from its -/proc/pid directory. In order to do this safely, the kernel must wait -for the collecting process to exit, so as not to remove the crashing -process's proc files prematurely. This in turn creates the -possibility that a misbehaving userspace collecting process can block -the reaping of a crashed process simply by never exiting.
This sysctl -defends against that. It defines how many concurrent crashing -processes may be piped to user space applications in parallel. If -this value is exceeded, then those crashing processes above that value -are noted via the kernel log and their cores are skipped. 0 is a -special value, indicating that unlimited processes may be captured in -parallel, but that no waiting will take place (i.e. the collecting -process is not guaranteed access to /proc//). This -value defaults to 0. - - -core_uses_pid: -============== - -The default coredump filename is "core". By setting -core_uses_pid to 1, the coredump filename becomes core.PID. -If core_pattern does not include "%p" (default does not) -and core_uses_pid is set, then .PID will be appended to -the filename. - - -ctrl-alt-del: -============= - -When the value in this file is 0, ctrl-alt-del is trapped and -sent to the init(1) program to handle a graceful restart. -When, however, the value is > 0, Linux's reaction to a Vulcan -Nerve Pinch (tm) will be an immediate reboot, without even -syncing its dirty buffers. - -Note: - when a program (like dosemu) has the keyboard in 'raw' - mode, the ctrl-alt-del is intercepted by the program before it - ever reaches the kernel tty layer, and it's up to the program - to decide what to do with it. - - -dmesg_restrict: -=============== - -This toggle indicates whether unprivileged users are prevented -from using dmesg(8) to view messages from the kernel's log buffer. -When dmesg_restrict is set to (0) there are no restrictions. When -dmesg_restrict is set to (1), users must have CAP_SYSLOG to use -dmesg(8). - -The kernel config option CONFIG_SECURITY_DMESG_RESTRICT sets the -default value of dmesg_restrict.
- - -domainname & hostname: -====================== - -These files can be used to set the NIS/YP domainname and the -hostname of your box in exactly the same way as the commands -domainname and hostname, i.e.:: - - # echo "darkstar" > /proc/sys/kernel/hostname - # echo "mydomain" > /proc/sys/kernel/domainname - -has the same effect as:: - - # hostname "darkstar" - # domainname "mydomain" - -Note, however, that the classic darkstar.frop.org has the -hostname "darkstar" and DNS (Internet Domain Name Server) -domainname "frop.org", not to be confused with the NIS (Network -Information Service) or YP (Yellow Pages) domainname. These two -domain names are in general different. For a detailed discussion -see the hostname(1) man page. - - -hardlockup_all_cpu_backtrace: -============================= - -This value controls the hard lockup detector behavior when a hard -lockup condition is detected as to whether or not to gather further -debug information. If enabled, arch-specific all-CPU stack dumping -will be initiated. - -0: do nothing. This is the default behavior. - -1: on detection capture more debug information. - - -hardlockup_panic: -================= - -This parameter can be used to control whether the kernel panics -when a hard lockup is detected. - - 0 - don't panic on hard lockup - 1 - panic on hard lockup - -See Documentation/lockup-watchdogs.txt for more information. This can -also be set using the nmi_watchdog kernel parameter. - - -hotplug: -======== - -Path for the hotplug policy agent. -Default value is "/sbin/hotplug". - - -hung_task_panic: -================ - -Controls the kernel's behavior when a hung task is detected. -This file shows up if CONFIG_DETECT_HUNG_TASK is enabled. - -0: continue operation. This is the default behavior. - -1: panic immediately. - - -hung_task_check_count: -====================== - -The upper bound on the number of tasks that are checked. -This file shows up if CONFIG_DETECT_HUNG_TASK is enabled. 
- - -hung_task_timeout_secs: -======================= - -When a task in D state did not get scheduled -for more than this value report a warning. -This file shows up if CONFIG_DETECT_HUNG_TASK is enabled. - -0: means infinite timeout - no checking done. - -Possible values to set are in range {0..LONG_MAX/HZ}. - - -hung_task_check_interval_secs: -============================== - -Hung task check interval. If hung task checking is enabled -(see hung_task_timeout_secs), the check is done every -hung_task_check_interval_secs seconds. -This file shows up if CONFIG_DETECT_HUNG_TASK is enabled. - -0 (default): means use hung_task_timeout_secs as checking interval. -Possible values to set are in range {0..LONG_MAX/HZ}. - - -hung_task_warnings: -=================== - -The maximum number of warnings to report. During a check interval -if a hung task is detected, this value is decreased by 1. -When this value reaches 0, no more warnings will be reported. -This file shows up if CONFIG_DETECT_HUNG_TASK is enabled. - --1: report an infinite number of warnings. - - -hyperv_record_panic_msg: -======================== - -Controls whether the panic kmsg data should be reported to Hyper-V. - -0: do not report panic kmsg data. - -1: report the panic kmsg data. This is the default behavior. - - -kexec_load_disabled: -==================== - -A toggle indicating if the kexec_load syscall has been disabled. This -value defaults to 0 (false: kexec_load enabled), but can be set to 1 -(true: kexec_load disabled). Once true, kexec can no longer be used, and -the toggle cannot be set back to false. This allows a kexec image to be -loaded before disabling the syscall, allowing a system to set up (and -later use) an image without it being altered. Generally used together -with the "modules_disabled" sysctl. - - -kptr_restrict: -============== - -This toggle indicates whether restrictions are placed on -exposing kernel addresses via /proc and other interfaces. 
- -When kptr_restrict is set to 0 (the default) the address is hashed before -printing. (This is the equivalent to %p.) - -When kptr_restrict is set to (1), kernel pointers printed using the %pK -format specifier will be replaced with 0's unless the user has CAP_SYSLOG -and effective user and group ids are equal to the real ids. This is -because %pK checks are done at read() time rather than open() time, so -if permissions are elevated between the open() and the read() (e.g. via -a setuid binary) then %pK will not leak kernel pointers to unprivileged -users. Note, this is a temporary solution only. The correct long-term -solution is to do the permission checks at open() time. Consider removing -world read permissions from files that use %pK, and using dmesg_restrict -to protect against uses of %pK in dmesg(8) if leaking kernel pointer -values to unprivileged users is a concern. - -When kptr_restrict is set to (2), kernel pointers printed using -%pK will be replaced with 0's regardless of privileges. - - -l2cr: (PPC only) -================ - -This flag controls the L2 cache of G3 processor boards. If -0, the cache is disabled. Enabled if nonzero. - - -modules_disabled: -================= - -A toggle value indicating if modules are allowed to be loaded -in an otherwise modular kernel. This toggle defaults to off -(0), but can be set true (1). Once true, modules can be -neither loaded nor unloaded, and the toggle cannot be set back -to false. Generally used with the "kexec_load_disabled" toggle. - - -msg_next_id, sem_next_id, and shm_next_id: -========================================== - -These three toggles allow specifying the desired id for the next allocated IPC -object: message, semaphore or shared memory respectively. - -By default they are equal to -1, which means generic allocation logic. -Possible values to set are in range {0..INT_MAX}. - -Notes: - 1) kernel doesn't guarantee, that new object will have desired id.
So, - it's up to userspace, how to handle an object with "wrong" id. - 2) Toggle with non-default value will be set back to -1 by kernel after - successful IPC object allocation. If an IPC object allocation syscall - fails, it is undefined if the value remains unmodified or is reset to -1. - - -nmi_watchdog: -============= - -This parameter can be used to control the NMI watchdog -(i.e. the hard lockup detector) on x86 systems. - -0 - disable the hard lockup detector - -1 - enable the hard lockup detector - -The hard lockup detector monitors each CPU for its ability to respond to -timer interrupts. The mechanism utilizes CPU performance counter registers -that are programmed to generate Non-Maskable Interrupts (NMIs) periodically -while a CPU is busy. Hence, the alternative name 'NMI watchdog'. - -The NMI watchdog is disabled by default if the kernel is running as a guest -in a KVM virtual machine. This default can be overridden by adding:: - - nmi_watchdog=1 - -to the guest kernel command line (see Documentation/admin-guide/kernel-parameters.rst). - - -numa_balancing: -=============== - -Enables/disables automatic page fault based NUMA memory -balancing. Memory is moved automatically to nodes -that access it often. - -Enables/disables automatic NUMA memory balancing. On NUMA machines, there -is a performance penalty if remote memory is accessed by a CPU. When this -feature is enabled the kernel samples what task thread is accessing memory -by periodically unmapping pages and later trapping a page fault. At the -time of the page fault, it is determined if the data being accessed should -be migrated to a local memory node. - -The unmapping of pages and trapping faults incur additional overhead that -ideally is offset by improved memory locality but there is no universal -guarantee. If the target workload is already bound to NUMA nodes then this -feature should be disabled. 
Otherwise, if the system overhead from the -feature is too high then the rate the kernel samples for NUMA hinting -faults may be controlled by the numa_balancing_scan_period_min_ms, -numa_balancing_scan_delay_ms, numa_balancing_scan_period_max_ms, -numa_balancing_scan_size_mb, and numa_balancing_settle_count sysctls. - -numa_balancing_scan_period_min_ms, numa_balancing_scan_delay_ms, numa_balancing_scan_period_max_ms, numa_balancing_scan_size_mb -=============================================================================================================================== - - -Automatic NUMA balancing scans tasks address space and unmaps pages to -detect if pages are properly placed or if the data should be migrated to a -memory node local to where the task is running. Every "scan delay" the task -scans the next "scan size" number of pages in its address space. When the -end of the address space is reached the scanner restarts from the beginning. - -In combination, the "scan delay" and "scan size" determine the scan rate. -When "scan delay" decreases, the scan rate increases. The scan delay and -hence the scan rate of every task is adaptive and depends on historical -behaviour. If pages are properly placed then the scan delay increases, -otherwise the scan delay decreases. The "scan size" is not adaptive but -the higher the "scan size", the higher the scan rate. - -Higher scan rates incur higher system overhead as page faults must be -trapped and potentially data must be migrated. However, the higher the scan -rate, the more quickly a tasks memory is migrated to a local node if the -workload pattern changes and minimises performance impact due to remote -memory accesses. These sysctls control the thresholds for scan delays and -the number of pages scanned. - -numa_balancing_scan_period_min_ms is the minimum time in milliseconds to -scan a tasks virtual memory. It effectively controls the maximum scanning -rate for each task. 
- -numa_balancing_scan_delay_ms is the starting "scan delay" used for a task -when it initially forks. - -numa_balancing_scan_period_max_ms is the maximum time in milliseconds to -scan a tasks virtual memory. It effectively controls the minimum scanning -rate for each task. - -numa_balancing_scan_size_mb is how many megabytes worth of pages are -scanned for a given scan. - - -osrelease, ostype & version: -============================ - -:: - - # cat osrelease - 2.1.88 - # cat ostype - Linux - # cat version - #5 Wed Feb 25 21:49:24 MET 1998 - -The files osrelease and ostype should be clear enough. Version -needs a little more clarification however. The '#5' means that -this is the fifth kernel built from this source base and the -date behind it indicates the time the kernel was built. -The only way to tune these values is to rebuild the kernel :-) - - -overflowgid & overflowuid: -========================== - -if your architecture did not always support 32-bit UIDs (i.e. arm, -i386, m68k, sh, and sparc32), a fixed UID and GID will be returned to -applications that use the old 16-bit UID/GID system calls, if the -actual UID or GID would exceed 65535. - -These sysctls allow you to change the value of the fixed UID and GID. -The default is 65534. - - -panic: -====== - -The value in this file represents the number of seconds the kernel -waits before rebooting on a panic. When you use the software watchdog, -the recommended setting is 60. - - -panic_on_io_nmi: -================ - -Controls the kernel's behavior when a CPU receives an NMI caused by -an IO error. - -0: try to continue operation (default) - -1: panic immediately. The IO error triggered an NMI. This indicates a - serious system condition which could result in IO data corruption. - Rather than continuing, panicking might be a better choice. Some - servers issue this sort of NMI when the dump button is pushed, - and you can use this option to take a crash dump. 
- - -panic_on_oops: -============== - -Controls the kernel's behaviour when an oops or BUG is encountered. - -0: try to continue operation - -1: panic immediately. If the `panic` sysctl is also non-zero then the - machine will be rebooted. - - -panic_on_stackoverflow: -======================= - -Controls the kernel's behavior when detecting the overflows of -kernel, IRQ and exception stacks except a user stack. -This file shows up if CONFIG_DEBUG_STACKOVERFLOW is enabled. - -0: try to continue operation. - -1: panic immediately. - - -panic_on_unrecovered_nmi: -========================= - -The default Linux behaviour on an NMI of either memory or unknown is -to continue operation. For many environments such as scientific -computing it is preferable that the box is taken out and the error -dealt with than an uncorrected parity/ECC error get propagated. - -A small number of systems do generate NMI's for bizarre random reasons -such as power management so the default is off. That sysctl works like -the existing panic controls already in that directory. - - -panic_on_warn: -============== - -Calls panic() in the WARN() path when set to 1. This is useful to avoid -a kernel rebuild when attempting to kdump at the location of a WARN(). - -0: only WARN(), default behaviour. - -1: call panic() after printing out WARN() location. - - -panic_print: -============ - -Bitmask for printing system info when panic happens. User can choose any -combination of the following bits: - -===== ======================================== -bit 0 print all tasks info -bit 1 print system memory info -bit 2 print timer info -bit 3 print locks info if CONFIG_LOCKDEP is on -bit 4 print ftrace buffer -===== ======================================== - -So for example to print tasks and memory info on panic, user can:: - - echo 3 > /proc/sys/kernel/panic_print - - -panic_on_rcu_stall: -=================== - -When set to 1, calls panic() after RCU stall detection messages.
This -is useful to define the root cause of RCU stalls using a vmcore. - -0: do not panic() when RCU stall takes place, default behavior. - -1: panic() after printing RCU stall messages. - - -perf_cpu_time_max_percent: -========================== - -Hints to the kernel how much CPU time it should be allowed to -use to handle perf sampling events. If the perf subsystem -is informed that its samples are exceeding this limit, it -will drop its sampling frequency to attempt to reduce its CPU -usage. - -Some perf sampling happens in NMIs. If these samples -unexpectedly take too long to execute, the NMIs can become -stacked up next to each other so much that nothing else is -allowed to execute. - -0: - disable the mechanism. Do not monitor or correct perf's - sampling rate no matter how much CPU time it takes. - -1-100: - attempt to throttle perf's sample rate to this - percentage of CPU. Note: the kernel calculates an - "expected" length of each sample event. 100 here means - 100% of that expected length. Even if this is set to - 100, you may still see sample throttling if this - length is exceeded. Set to 0 if you truly do not care - how much CPU is consumed. - - -perf_event_paranoid: -==================== - -Controls use of the performance events system by unprivileged -users (without CAP_SYS_ADMIN). The default value is 2.
- -=== ================================================================== - -1 Allow use of (almost) all events by all users - - Ignore mlock limit after perf_event_mlock_kb without CAP_IPC_LOCK - ->=0 Disallow ftrace function tracepoint by users without CAP_SYS_ADMIN - - Disallow raw tracepoint access by users without CAP_SYS_ADMIN - ->=1 Disallow CPU event access by users without CAP_SYS_ADMIN - ->=2 Disallow kernel profiling by users without CAP_SYS_ADMIN -=== ================================================================== - - -perf_event_max_stack: -===================== - -Controls maximum number of stack frames to copy for (attr.sample_type & -PERF_SAMPLE_CALLCHAIN) configured events, for instance, when using -'perf record -g' or 'perf trace --call-graph fp'. - -This can only be done when no events are in use that have callchains -enabled, otherwise writing to this file will return -EBUSY. - -The default value is 127. - - -perf_event_mlock_kb: -==================== - -Control size of per-cpu ring buffer not counted against mlock limit. - -The default value is 512 + 1 page - - -perf_event_max_contexts_per_stack: -================================== - -Controls maximum number of stack frame context entries for -(attr.sample_type & PERF_SAMPLE_CALLCHAIN) configured events, for -instance, when using 'perf record -g' or 'perf trace --call-graph fp'. - -This can only be done when no events are in use that have callchains -enabled, otherwise writing to this file will return -EBUSY. - -The default value is 8. - - -pid_max: -======== - -PID allocation wrap value. When the kernel's next PID value -reaches this value, it wraps back to a minimum PID value. -PIDs of value pid_max or larger are not allocated. - - -ns_last_pid: -============ - -The last pid allocated in the current (the one task using this sysctl -lives in) pid namespace. When selecting a pid for a next task on fork, -the kernel tries to allocate a number starting from this one.
- - -powersave-nap: (PPC only) -========================= - -If set, Linux-PPC will use the 'nap' mode of powersaving, -otherwise the 'doze' mode will be used. - -============================================================== - -printk: -======= - -The four values in printk denote: console_loglevel, -default_message_loglevel, minimum_console_loglevel and -default_console_loglevel respectively. - -These values influence printk() behavior when printing or -logging error messages. See 'man 2 syslog' for more info on -the different loglevels. - -- console_loglevel: - messages with a higher priority than - this will be printed to the console -- default_message_loglevel: - messages without an explicit priority - will be printed with this priority -- minimum_console_loglevel: - minimum (highest) value to which - console_loglevel can be set -- default_console_loglevel: - default value for console_loglevel - - -printk_delay: -============= - -Delay each printk message by printk_delay milliseconds. - -Values from 0 to 10000 are allowed. - - -printk_ratelimit: -================= - -Some warning messages are rate limited. printk_ratelimit specifies -the minimum length of time between these messages (in seconds); by -default we allow one every 5 seconds. - -A value of 0 will disable rate limiting. - - -printk_ratelimit_burst: -======================= - -While long term we enforce one message per printk_ratelimit -seconds, we do allow a burst of messages to pass through. -printk_ratelimit_burst specifies the number of messages we can -send before ratelimiting kicks in. - - -printk_devkmsg: -=============== - -Control the logging to /dev/kmsg from userspace: - -ratelimit: - default, ratelimited - -on: unlimited logging to /dev/kmsg from userspace - -off: logging to /dev/kmsg disabled - -The kernel command line parameter printk.devkmsg= overrides this and is -a one-time setting until next reboot: once set, it cannot be changed by -this sysctl interface anymore.
- - -randomize_va_space: -=================== - -This option can be used to select the type of process address -space randomization that is used in the system, for architectures -that support this feature. - -== =========================================================================== -0 Turn the process address space randomization off. This is the - default for architectures that do not support this feature anyways, - and kernels that are booted with the "norandmaps" parameter. - -1 Make the addresses of mmap base, stack and VDSO page randomized. - This, among other things, implies that shared libraries will be - loaded to random addresses. Also for PIE-linked binaries, the - location of code start is randomized. This is the default if the - CONFIG_COMPAT_BRK option is enabled. - -2 Additionally enable heap randomization. This is the default if - CONFIG_COMPAT_BRK is disabled. - - There are a few legacy applications out there (such as some ancient - versions of libc.so.5 from 1996) that assume that brk area starts - just after the end of the code+bss. These applications break when - start of the brk area is randomized. There are however no known - non-legacy applications that would be broken this way, so for most - systems it is safe to choose full randomization. - - Systems with ancient and/or broken binaries should be configured - with CONFIG_COMPAT_BRK enabled, which excludes the heap from process - address space randomization. -== =========================================================================== - - -reboot-cmd: (Sparc only) -======================== - -??? This seems to be a way to give an argument to the Sparc -ROM/Flash boot loader. Maybe to tell it what to do after -rebooting. ??? - - -rtsig-max & rtsig-nr: -===================== - -The file rtsig-max can be used to tune the maximum number -of POSIX realtime (queued) signals that can be outstanding -in the system. - -rtsig-nr shows the number of RT signals currently queued. 
- - -sched_energy_aware: -=================== - -Enables/disables Energy Aware Scheduling (EAS). EAS starts -automatically on platforms where it can run (that is, -platforms with asymmetric CPU topologies and having an Energy -Model available). If your platform happens to meet the -requirements for EAS but you do not want to use it, change -this value to 0. - - -sched_schedstats: -================= - -Enables/disables scheduler statistics. Enabling this feature -incurs a small amount of overhead in the scheduler but is -useful for debugging and performance tuning. - - -sg-big-buff: -============ - -This file shows the size of the generic SCSI (sg) buffer. -You can't tune it just yet, but you could change it on -compile time by editing include/scsi/sg.h and changing -the value of SG_BIG_BUFF. - -There shouldn't be any reason to change this value. If -you can come up with one, you probably know what you -are doing anyway :) - - -shmall: -======= - -This parameter sets the total amount of shared memory pages that -can be used system wide. Hence, SHMALL should always be at least -ceil(shmmax/PAGE_SIZE). - -If you are not sure what the default PAGE_SIZE is on your Linux -system, you can run the following command: - - # getconf PAGE_SIZE - - -shmmax: -======= - -This value can be used to query and set the run time limit -on the maximum shared memory segment size that can be created. -Shared memory segments up to 1Gb are now supported in the -kernel. This value defaults to SHMMAX. - - -shm_rmid_forced: -================ - -Linux lets you set resource limits, including how much memory one -process can consume, via setrlimit(2). Unfortunately, shared memory -segments are allowed to exist without association with any process, and -thus might not be counted against any resource limits. If enabled, -shared memory segments are automatically destroyed when their attach -count becomes zero after a detach or a process termination. 
It will -also destroy segments that were created, but never attached to, on exit -from the process. The only use left for IPC_RMID is to immediately -destroy an unattached segment. Of course, this breaks the way things are -defined, so some applications might stop working. Note that this -feature will do you no good unless you also configure your resource -limits (in particular, RLIMIT_AS and RLIMIT_NPROC). Most systems don't -need this. - -Note that if you change this from 0 to 1, already created segments -without users and with a dead originative process will be destroyed. - - -sysctl_writes_strict: -===================== - -Control how file position affects the behavior of updating sysctl values -via the /proc/sys interface: - - == ====================================================================== - -1 Legacy per-write sysctl value handling, with no printk warnings. - Each write syscall must fully contain the sysctl value to be - written, and multiple writes on the same sysctl file descriptor - will rewrite the sysctl value, regardless of file position. - 0 Same behavior as above, but warn about processes that perform writes - to a sysctl file descriptor when the file position is not 0. - 1 (default) Respect file position when writing sysctl strings. Multiple - writes will append to the sysctl value buffer. Anything past the max - length of the sysctl value buffer will be ignored. Writes to numeric - sysctl entries must always be at file position 0 and the value must - be fully contained in the buffer sent in the write syscall. - == ====================================================================== - - -softlockup_all_cpu_backtrace: -============================= - -This value controls the soft lockup detector thread's behavior -when a soft lockup condition is detected as to whether or not -to gather further debug information. If enabled, each cpu will -be issued an NMI and instructed to capture stack trace. 
- -This feature is only applicable for architectures which support -NMI. - -0: do nothing. This is the default behavior. - -1: on detection capture more debug information. - - -soft_watchdog: -============== - -This parameter can be used to control the soft lockup detector. - - 0 - disable the soft lockup detector - - 1 - enable the soft lockup detector - -The soft lockup detector monitors CPUs for threads that are hogging the CPUs -without rescheduling voluntarily, and thus prevent the 'watchdog/N' threads -from running. The mechanism depends on the CPUs ability to respond to timer -interrupts which are needed for the 'watchdog/N' threads to be woken up by -the watchdog timer function, otherwise the NMI watchdog - if enabled - can -detect a hard lockup condition. - - -stack_erasing: -============== - -This parameter can be used to control kernel stack erasing at the end -of syscalls for kernels built with CONFIG_GCC_PLUGIN_STACKLEAK. - -That erasing reduces the information which kernel stack leak bugs -can reveal and blocks some uninitialized stack variable attacks. -The tradeoff is the performance impact: on a single CPU system kernel -compilation sees a 1% slowdown, other systems and workloads may vary. - - 0: kernel stack erasing is disabled, STACKLEAK_METRICS are not updated. - - 1: kernel stack erasing is enabled (default), it is performed before - returning to the userspace at the end of syscalls. - - -tainted -======= - -Non-zero if the kernel has been tainted. Numeric values, which can be -ORed together. The letters are seen in "Tainted" line of Oops reports. 
- -====== ===== ============================================================== - 1 `(P)` proprietary module was loaded - 2 `(F)` module was force loaded - 4 `(S)` SMP kernel oops on an officially SMP incapable processor - 8 `(R)` module was force unloaded - 16 `(M)` processor reported a Machine Check Exception (MCE) - 32 `(B)` bad page referenced or some unexpected page flags - 64 `(U)` taint requested by userspace application - 128 `(D)` kernel died recently, i.e. there was an OOPS or BUG - 256 `(A)` an ACPI table was overridden by user - 512 `(W)` kernel issued warning - 1024 `(C)` staging driver was loaded - 2048 `(I)` workaround for bug in platform firmware applied - 4096 `(O)` externally-built ("out-of-tree") module was loaded - 8192 `(E)` unsigned module was loaded - 16384 `(L)` soft lockup occurred - 32768 `(K)` kernel has been live patched - 65536 `(X)` Auxiliary taint, defined for and used by distros -131072 `(T)` The kernel was built with the struct randomization plugin -====== ===== ============================================================== - -See Documentation/admin-guide/tainted-kernels.rst for more information. - - -threads-max: -============ - -This value controls the maximum number of threads that can be created -using fork(). - -During initialization the kernel sets this value such that even if the -maximum number of threads is created, the thread structures occupy only -a part (1/8th) of the available RAM pages. - -The minimum value that can be written to threads-max is 20. - -The maximum value that can be written to threads-max is given by the -constant FUTEX_TID_MASK (0x3fffffff). - -If a value outside of this range is written to threads-max an error -EINVAL occurs. - -The value written is checked against the available RAM pages. If the -thread structures would occupy too much (more than 1/8th) of the -available RAM pages threads-max is reduced accordingly.
- - -unknown_nmi_panic: -================== - -The value in this file affects behavior of handling NMI. When the -value is non-zero, unknown NMI is trapped and then panic occurs. At -that time, kernel debugging information is displayed on console. - -NMI switch that most IA32 servers have fires unknown NMI up, for -example. If a system hangs up, try pressing the NMI switch. - - -watchdog: -========= - -This parameter can be used to disable or enable the soft lockup detector -_and_ the NMI watchdog (i.e. the hard lockup detector) at the same time. - - 0 - disable both lockup detectors - - 1 - enable both lockup detectors - -The soft lockup detector and the NMI watchdog can also be disabled or -enabled individually, using the soft_watchdog and nmi_watchdog parameters. -If the watchdog parameter is read, for example by executing:: - - cat /proc/sys/kernel/watchdog - -the output of this command (0 or 1) shows the logical OR of soft_watchdog -and nmi_watchdog. - - -watchdog_cpumask: -================= - -This value can be used to control on which cpus the watchdog may run. -The default cpumask is all possible cores, but if NO_HZ_FULL is -enabled in the kernel config, and cores are specified with the -nohz_full= boot argument, those cores are excluded by default. -Offline cores can be included in this mask, and if the core is later -brought online, the watchdog will be started based on the mask value. - -Typically this value would only be touched in the nohz_full case -to re-enable cores that by default were not running the watchdog, -if a kernel lockup was suspected on those cores. - -The argument value is the standard cpulist format for cpumasks, -so for example to enable the watchdog on cores 0, 2, 3, and 4 you -might say:: - - echo 0,2-4 > /proc/sys/kernel/watchdog_cpumask - - -watchdog_thresh: -================ - -This value can be used to control the frequency of hrtimer and NMI -events and the soft and hard lockup thresholds. The default threshold -is 10 seconds. 
- -The softlockup threshold is (2 * watchdog_thresh). Setting this -tunable to zero will disable lockup detection altogether. diff --git a/Documentation/sysctl/net.rst b/Documentation/sysctl/net.rst deleted file mode 100644 index a7d44e71019d..000000000000 --- a/Documentation/sysctl/net.rst +++ /dev/null @@ -1,461 +0,0 @@ -================================ -Documentation for /proc/sys/net/ -================================ - -Copyright - -Copyright (c) 1999 - - - Terrehon Bowden - - Bodo Bauer - -Copyright (c) 2000 - - - Jorge Nerin - -Copyright (c) 2009 - - - Shen Feng - -For general info and legal blurb, please look in index.rst. - ------------------------------------------------------------------------------- - -This file contains the documentation for the sysctl files in -/proc/sys/net - -The interface to the networking parts of the kernel is located in -/proc/sys/net. The following table shows all possible subdirectories. You may -see only some of them, depending on your kernel's configuration. - - -Table : Subdirectories in /proc/sys/net - - ========= =================== = ========== ================== - Directory Content Directory Content - ========= =================== = ========== ================== - core General parameter appletalk Appletalk protocol - unix Unix domain sockets netrom NET/ROM - 802 E802 protocol ax25 AX25 - ethernet Ethernet protocol rose X.25 PLP layer - ipv4 IP version 4 x25 X.25 protocol - ipx IPX token-ring IBM token ring - bridge Bridging decnet DEC net - ipv6 IP version 6 tipc TIPC - ========= =================== = ========== ================== - -1. /proc/sys/net/core - Network core options -============================================ - -bpf_jit_enable --------------- - -This enables the BPF Just in Time (JIT) compiler. BPF is a flexible -and efficient infrastructure allowing to execute bytecode at various -hook points. It is used in a number of Linux kernel subsystems such -as networking (e.g. XDP, tc), tracing (e.g. 
kprobes, uprobes, tracepoints) -and security (e.g. seccomp). LLVM has a BPF back end that can compile -restricted C into a sequence of BPF instructions. After program load -through bpf(2) and passing a verifier in the kernel, a JIT will then -translate these BPF proglets into native CPU instructions. There are -two flavors of JITs, the newer eBPF JIT currently supported on: - - - x86_64 - - x86_32 - - arm64 - - arm32 - - ppc64 - - sparc64 - - mips64 - - s390x - - riscv - -And the older cBPF JIT supported on the following archs: - - - mips - - ppc - - sparc - -eBPF JITs are a superset of cBPF JITs, meaning the kernel will -migrate cBPF instructions into eBPF instructions and then JIT -compile them transparently. Older cBPF JITs can only translate -tcpdump filters, seccomp rules, etc, but not mentioned eBPF -programs loaded through bpf(2). - -Values: - - - 0 - disable the JIT (default value) - - 1 - enable the JIT - - 2 - enable the JIT and ask the compiler to emit traces on kernel log. - -bpf_jit_harden --------------- - -This enables hardening for the BPF JIT compiler. Supported are eBPF -JIT backends. Enabling hardening trades off performance, but can -mitigate JIT spraying. - -Values: - - - 0 - disable JIT hardening (default value) - - 1 - enable JIT hardening for unprivileged users only - - 2 - enable JIT hardening for all users - -bpf_jit_kallsyms ----------------- - -When BPF JIT compiler is enabled, then compiled images are unknown -addresses to the kernel, meaning they neither show up in traces nor -in /proc/kallsyms. This enables export of these addresses, which can -be used for debugging/tracing. If bpf_jit_harden is enabled, this -feature is disabled. 
- -Values : - - - 0 - disable JIT kallsyms export (default value) - - 1 - enable JIT kallsyms export for privileged users only - -bpf_jit_limit -------------- - -This enforces a global limit for memory allocations to the BPF JIT -compiler in order to reject unprivileged JIT requests once it has -been surpassed. bpf_jit_limit contains the value of the global limit -in bytes. - -dev_weight ----------- - -The maximum number of packets that kernel can handle on a NAPI interrupt, -it's a Per-CPU variable. For drivers that support LRO or GRO_HW, a hardware -aggregated packet is counted as one packet in this context. - -Default: 64 - -dev_weight_rx_bias ------------------- - -RPS (e.g. RFS, aRFS) processing is competing with the registered NAPI poll function -of the driver for the per softirq cycle netdev_budget. This parameter influences -the proportion of the configured netdev_budget that is spent on RPS based packet -processing during RX softirq cycles. It is further meant for making current -dev_weight adaptable for asymmetric CPU needs on RX/TX side of the network stack. -(see dev_weight_tx_bias) It is effective on a per CPU basis. Determination is based -on dev_weight and is calculated multiplicative (dev_weight * dev_weight_rx_bias). - -Default: 1 - -dev_weight_tx_bias ------------------- - -Scales the maximum number of packets that can be processed during a TX softirq cycle. -Effective on a per CPU basis. Allows scaling of current dev_weight for asymmetric -net stack processing needs. Be careful to avoid making TX softirq processing a CPU hog. - -Calculation is based on dev_weight (dev_weight * dev_weight_tx_bias). - -Default: 1 - -default_qdisc -------------- - -The default queuing discipline to use for network devices. This allows -overriding the default of pfifo_fast with an alternative. 
Since the default -queuing discipline is created without additional parameters, it is best suited -to queuing disciplines that work well without configuration like stochastic -fair queue (sfq), CoDel (codel) or fair queue CoDel (fq_codel). Don't use -queuing disciplines like Hierarchical Token Bucket or Deficit Round Robin -which require setting up classes and bandwidths. Note that physical multiqueue -interfaces still use mq as root qdisc, which in turn uses this default for its -leaves. Virtual devices (e.g. lo or veth) ignore this setting and instead -default to noqueue. - -Default: pfifo_fast - -busy_read ---------- - -Low latency busy poll timeout for socket reads. (needs CONFIG_NET_RX_BUSY_POLL) -Approximate time in us to busy loop waiting for packets on the device queue. -This sets the default value of the SO_BUSY_POLL socket option. -Can be set or overridden per socket by setting socket option SO_BUSY_POLL, -which is the preferred method of enabling. If you need to enable the feature -globally via sysctl, a value of 50 is recommended. - -Will increase power usage. - -Default: 0 (off) - -busy_poll ----------------- -Low latency busy poll timeout for poll and select. (needs CONFIG_NET_RX_BUSY_POLL) -Approximate time in us to busy loop waiting for events. -Recommended value depends on the number of sockets you poll on. -For several sockets 50, for several hundreds 100. -For more than that you probably want to use epoll. -Note that only sockets with SO_BUSY_POLL set will be busy polled, -so you want to either selectively set SO_BUSY_POLL on those sockets or set -sysctl.net.busy_read globally. - -Will increase power usage. - -Default: 0 (off) - -rmem_default ------------- - -The default setting of the socket receive buffer in bytes. - -rmem_max --------- - -The maximum receive socket buffer size in bytes. - -tstamp_allow_data ------------------ -Allow processes to receive tx timestamps looped together with the original -packet contents.
If disabled, transmit timestamp requests from unprivileged -processes are dropped unless socket option SOF_TIMESTAMPING_OPT_TSONLY is set. - -Default: 1 (on) - - -wmem_default ------------- - -The default setting (in bytes) of the socket send buffer. - -wmem_max --------- - -The maximum send socket buffer size in bytes. - -message_burst and message_cost ------------------------------- - -These parameters are used to limit the warning messages written to the kernel -log from the networking code. They enforce a rate limit to make a -denial-of-service attack impossible. A higher message_cost factor, results in -fewer messages that will be written. Message_burst controls when messages will -be dropped. The default settings limit warning messages to one every five -seconds. - -warnings --------- - -This sysctl is now unused. - -This was used to control console messages from the networking stack that -occur because of problems on the network like duplicate address or bad -checksums. - -These messages are now emitted at KERN_DEBUG and can generally be enabled -and controlled by the dynamic_debug facility. - -netdev_budget -------------- - -Maximum number of packets taken from all interfaces in one polling cycle (NAPI -poll). In one polling cycle interfaces which are registered to polling are -probed in a round-robin manner. Also, a polling cycle may not exceed -netdev_budget_usecs microseconds, even if netdev_budget has not been -exhausted. - -netdev_budget_usecs ---------------------- - -Maximum number of microseconds in one NAPI polling cycle. Polling -will exit when either netdev_budget_usecs have elapsed during the -poll cycle or the number of packets processed reaches netdev_budget. - -netdev_max_backlog ------------------- - -Maximum number of packets, queued on the INPUT side, when the interface -receives packets faster than kernel can process them. 
- -netdev_rss_key --------------- - -RSS (Receive Side Scaling) enabled drivers use a 40 bytes host key that is -randomly generated. -Some user space might need to gather its content even if drivers do not -provide ethtool -x support yet. - -:: - - myhost:~# cat /proc/sys/net/core/netdev_rss_key - 84:50:f4:00:a8:15:d1:a7:e9:7f:1d:60:35:c7:47:25:42:97:74:ca:56:bb:b6:a1:d8: ... (52 bytes total) - -File contains nul bytes if no driver ever called netdev_rss_key_fill() function. - -Note: - /proc/sys/net/core/netdev_rss_key contains 52 bytes of key, - but most drivers only use 40 bytes of it. - -:: - - myhost:~# ethtool -x eth0 - RX flow hash indirection table for eth0 with 8 RX ring(s): - 0: 0 1 2 3 4 5 6 7 - RSS hash key: - 84:50:f4:00:a8:15:d1:a7:e9:7f:1d:60:35:c7:47:25:42:97:74:ca:56:bb:b6:a1:d8:43:e3:c9:0c:fd:17:55:c2:3a:4d:69:ed:f1:42:89 - -netdev_tstamp_prequeue ----------------------- - -If set to 0, RX packet timestamps can be sampled after RPS processing, when -the target CPU processes packets. It might give some delay on timestamps, but -permit to distribute the load on several cpus. - -If set to 1 (default), timestamps are sampled as soon as possible, before -queueing. - -optmem_max ----------- - -Maximum ancillary buffer size allowed per socket. Ancillary data is a sequence -of struct cmsghdr structures with appended data. - -fb_tunnels_only_for_init_net ----------------------------- - -Controls if fallback tunnels (like tunl0, gre0, gretap0, erspan0, -sit0, ip6tnl0, ip6gre0) are automatically created when a new -network namespace is created, if corresponding tunnel is present -in initial network namespace. -If set to 1, these devices are not automatically created, and -user space is responsible for creating them if needed. - -Default : 0 (for compatibility reasons) - -devconf_inherit_init_net ------------------------- - -Controls if a new network namespace should inherit all current -settings under /proc/sys/net/{ipv4,ipv6}/conf/{all,default}/. 
By -default, we keep the current behavior: for IPv4 we inherit all current -settings from init_net and for IPv6 we reset all settings to default. - -If set to 1, both IPv4 and IPv6 settings are forced to inherit from -current ones in init_net. If set to 2, both IPv4 and IPv6 settings are -forced to reset to their default values. - -Default : 0 (for compatibility reasons) - -2. /proc/sys/net/unix - Parameters for Unix domain sockets ----------------------------------------------------------- - -There is only one file in this directory. -unix_dgram_qlen limits the max number of datagrams queued in Unix domain -socket's buffer. It will not take effect unless PF_UNIX flag is specified. - - -3. /proc/sys/net/ipv4 - IPV4 settings -------------------------------------- -Please see: Documentation/networking/ip-sysctl.txt and ipvs-sysctl.txt for -descriptions of these entries. - - -4. Appletalk ------------- - -The /proc/sys/net/appletalk directory holds the Appletalk configuration data -when Appletalk is loaded. The configurable parameters are: - -aarp-expiry-time ----------------- - -The amount of time we keep an ARP entry before expiring it. Used to age out -old hosts. - -aarp-resolve-time ------------------ - -The amount of time we will spend trying to resolve an Appletalk address. - -aarp-retransmit-limit ---------------------- - -The number of times we will retransmit a query before giving up. - -aarp-tick-time --------------- - -Controls the rate at which expires are checked. - -The directory /proc/net/appletalk holds the list of active Appletalk sockets -on a machine. - -The fields indicate the DDP type, the local address (in network:node format) -the remote address, the size of the transmit pending queue, the size of the -received queue (bytes waiting for applications to read) the state and the uid -owning the socket. 
- -/proc/net/atalk_iface lists all the interfaces configured for appletalk. It -shows the name of the interface, its Appletalk address, the network range on -that address (or network number for phase 1 networks), and the status of the -interface. - -/proc/net/atalk_route lists each known network route. It lists the target -(network) that the route leads to, the router (may be directly connected), the -route flags, and the device the route is using. - - -5. IPX ------- - -The IPX protocol has no tunable values in /proc/sys/net. - -The IPX protocol does, however, provide /proc/net/ipx. This lists each IPX -socket giving the local and remote addresses in Novell format (that is -network:node:port). In accordance with the strange Novell tradition, -everything but the port is in hex. Not_Connected is displayed for sockets that -are not tied to a specific remote address. The Tx and Rx queue sizes indicate -the number of bytes pending for transmission and reception. The state -indicates the state the socket is in and the uid is the owning uid of the -socket. - -The /proc/net/ipx_interface file lists all IPX interfaces. For each interface -it gives the network number, the node number, and indicates if the network is -the primary network. It also indicates which device it is bound to (or -Internal for internal networks) and the Frame Type if appropriate. Linux -supports 802.3, 802.2, 802.2 SNAP and DIX (Blue Book) ethernet framing for -IPX. - -The /proc/net/ipx_route table holds a list of IPX routes. For each route it -gives the destination network, the router node (or Directly) and the network -address of the router (or Connected) for internal networks. - -6. TIPC -------- - -tipc_rmem ---------- - -The TIPC protocol now has a tunable for the receive memory, similar to the -tcp_rmem - i.e.
a vector of 3 INTEGERs: (min, default, max) - -:: - - # cat /proc/sys/net/tipc/tipc_rmem - 4252725 34021800 68043600 - # - -The max value is set to CONN_OVERLOAD_LIMIT, and the default and min values -are scaled (shifted) versions of that same value. Note that the min value -is not at this point in time used in any meaningful way, but the triplet is -preserved in order to be consistent with things like tcp_rmem. - -named_timeout -------------- - -TIPC name table updates are distributed asynchronously in a cluster, without -any form of transaction handling. This means that different race scenarios are -possible. One such is that a name withdrawal sent out by one node and received -by another node may arrive after a second, overlapping name publication already -has been accepted from a third node, although the conflicting updates -originally may have been issued in the correct sequential order. -If named_timeout is nonzero, failed topology updates will be placed on a defer -queue until another event arrives that clears the error, or until the timeout -expires. Value is in milliseconds. diff --git a/Documentation/sysctl/sunrpc.rst b/Documentation/sysctl/sunrpc.rst deleted file mode 100644 index 09780a682afd..000000000000 --- a/Documentation/sysctl/sunrpc.rst +++ /dev/null @@ -1,25 +0,0 @@ -=================================== -Documentation for /proc/sys/sunrpc/ -=================================== - -kernel version 2.2.10 - -Copyright (c) 1998, 1999, Rik van Riel - -For general info and legal blurb, please look in index.rst. - ------------------------------------------------------------------------------- - -This file contains the documentation for the sysctl files in -/proc/sys/sunrpc and is valid for Linux kernel version 2.2. - -The files in this directory can be used to (re)set the debug -flags of the SUN Remote Procedure Call (RPC) subsystem in -the Linux kernel. This stuff is used for NFS, KNFSD and -maybe a few other things as well. 
- -The files in there are used to control the debugging flags: -rpc_debug, nfs_debug, nfsd_debug and nlm_debug. - -These flags are for kernel hackers only. You should read the -source code in net/sunrpc/ for more information. diff --git a/Documentation/sysctl/user.rst b/Documentation/sysctl/user.rst deleted file mode 100644 index 650eaa03f15e..000000000000 --- a/Documentation/sysctl/user.rst +++ /dev/null @@ -1,78 +0,0 @@ -================================= -Documentation for /proc/sys/user/ -================================= - -kernel version 4.9.0 - -Copyright (c) 2016 Eric Biederman - ------------------------------------------------------------------------------- - -This file contains the documentation for the sysctl files in -/proc/sys/user. - -The files in this directory can be used to override the default -limits on the number of namespaces and other objects that have -per user per user namespace limits. - -The primary purpose of these limits is to stop programs that -malfunction and attempt to create a ridiculous number of objects, -before the malfunction becomes a system wide problem. It is the -intention that the defaults of these limits are set high enough that -no program in normal operation should run into these limits. - -The creation of per user per user namespace objects are charged to -the user in the user namespace who created the object and -verified to be below the per user limit in that user namespace. - -The creation of objects is also charged to all of the users -who created user namespaces the creation of the object happens -in (user namespaces can be nested) and verified to be below the per user -limits in the user namespaces of those users. - -This recursive counting of created objects ensures that creating a -user namespace does not allow a user to escape their current limits. 
- -Currently, these files are in /proc/sys/user: - -max_cgroup_namespaces -===================== - - The maximum number of cgroup namespaces that any user in the current - user namespace may create. - -max_ipc_namespaces -================== - - The maximum number of ipc namespaces that any user in the current - user namespace may create. - -max_mnt_namespaces -================== - - The maximum number of mount namespaces that any user in the current - user namespace may create. - -max_net_namespaces -================== - - The maximum number of network namespaces that any user in the - current user namespace may create. - -max_pid_namespaces -================== - - The maximum number of pid namespaces that any user in the current - user namespace may create. - -max_user_namespaces -=================== - - The maximum number of user namespaces that any user in the current - user namespace may create. - -max_uts_namespaces -================== - - The maximum number of user namespaces that any user in the current - user namespace may create. diff --git a/Documentation/sysctl/vm.rst b/Documentation/sysctl/vm.rst deleted file mode 100644 index 5aceb5cd5ce7..000000000000 --- a/Documentation/sysctl/vm.rst +++ /dev/null @@ -1,964 +0,0 @@ -=============================== -Documentation for /proc/sys/vm/ -=============================== - -kernel version 2.6.29 - -Copyright (c) 1998, 1999, Rik van Riel - -Copyright (c) 2008 Peter W. Morreale - -For general info and legal blurb, please look in index.rst. - ------------------------------------------------------------------------------- - -This file contains the documentation for the sysctl files in -/proc/sys/vm and is valid for Linux kernel version 2.6.29. - -The files in this directory can be used to tune the operation -of the virtual memory (VM) subsystem of the Linux kernel and -the writeout of dirty data to disk. - -Default values and initialization routines for most of these -files can be found in mm/swap.c. 
- -Currently, these files are in /proc/sys/vm: - -- admin_reserve_kbytes -- block_dump -- compact_memory -- compact_unevictable_allowed -- dirty_background_bytes -- dirty_background_ratio -- dirty_bytes -- dirty_expire_centisecs -- dirty_ratio -- dirtytime_expire_seconds -- dirty_writeback_centisecs -- drop_caches -- extfrag_threshold -- hugetlb_shm_group -- laptop_mode -- legacy_va_layout -- lowmem_reserve_ratio -- max_map_count -- memory_failure_early_kill -- memory_failure_recovery -- min_free_kbytes -- min_slab_ratio -- min_unmapped_ratio -- mmap_min_addr -- mmap_rnd_bits -- mmap_rnd_compat_bits -- nr_hugepages -- nr_hugepages_mempolicy -- nr_overcommit_hugepages -- nr_trim_pages (only if CONFIG_MMU=n) -- numa_zonelist_order -- oom_dump_tasks -- oom_kill_allocating_task -- overcommit_kbytes -- overcommit_memory -- overcommit_ratio -- page-cluster -- panic_on_oom -- percpu_pagelist_fraction -- stat_interval -- stat_refresh -- numa_stat -- swappiness -- unprivileged_userfaultfd -- user_reserve_kbytes -- vfs_cache_pressure -- watermark_boost_factor -- watermark_scale_factor -- zone_reclaim_mode - - -admin_reserve_kbytes -==================== - -The amount of free memory in the system that should be reserved for users -with the capability cap_sys_admin. - -admin_reserve_kbytes defaults to min(3% of free pages, 8MB) - -That should provide enough for the admin to log in and kill a process, -if necessary, under the default overcommit 'guess' mode. - -Systems running under overcommit 'never' should increase this to account -for the full Virtual Memory Size of programs used to recover. Otherwise, -root may not be able to log in to recover the system. - -How do you calculate a minimum useful reserve? - -sshd or login + bash (or some other shell) + top (or ps, kill, etc.) - -For overcommit 'guess', we can sum resident set sizes (RSS). -On x86_64 this is about 8MB. - -For overcommit 'never', we can take the max of their virtual sizes (VSZ) -and add the sum of their RSS. 
-On x86_64 this is about 128MB. - -Changing this takes effect whenever an application requests memory. - - -block_dump -========== - -block_dump enables block I/O debugging when set to a nonzero value. More -information on block I/O debugging is in Documentation/laptops/laptop-mode.rst. - - -compact_memory -============== - -Available only when CONFIG_COMPACTION is set. When 1 is written to the file, -all zones are compacted such that free memory is available in contiguous -blocks where possible. This can be important for example in the allocation of -huge pages although processes will also directly compact memory as required. - - -compact_unevictable_allowed -=========================== - -Available only when CONFIG_COMPACTION is set. When set to 1, compaction is -allowed to examine the unevictable lru (mlocked pages) for pages to compact. -This should be used on systems where stalls for minor page faults are an -acceptable trade for large contiguous free memory. Set to 0 to prevent -compaction from moving pages that are unevictable. Default value is 1. - - -dirty_background_bytes -====================== - -Contains the amount of dirty memory at which the background kernel -flusher threads will start writeback. - -Note: - dirty_background_bytes is the counterpart of dirty_background_ratio. Only - one of them may be specified at a time. When one sysctl is written it is - immediately taken into account to evaluate the dirty memory limits and the - other appears as 0 when read. - - -dirty_background_ratio -====================== - -Contains, as a percentage of total available memory that contains free pages -and reclaimable pages, the number of pages at which the background kernel -flusher threads will start writing out dirty data. - -The total available memory is not equal to total system memory. - - -dirty_bytes -=========== - -Contains the amount of dirty memory at which a process generating disk writes -will itself start writeback. 
- -Note: dirty_bytes is the counterpart of dirty_ratio. Only one of them may be -specified at a time. When one sysctl is written it is immediately taken into -account to evaluate the dirty memory limits and the other appears as 0 when -read. - -Note: the minimum value allowed for dirty_bytes is two pages (in bytes); any -value lower than this limit will be ignored and the old configuration will be -retained. - - -dirty_expire_centisecs -====================== - -This tunable is used to define when dirty data is old enough to be eligible -for writeout by the kernel flusher threads. It is expressed in 100'ths -of a second. Data which has been dirty in-memory for longer than this -interval will be written out next time a flusher thread wakes up. - - -dirty_ratio -=========== - -Contains, as a percentage of total available memory that contains free pages -and reclaimable pages, the number of pages at which a process which is -generating disk writes will itself start writing out dirty data. - -The total available memory is not equal to total system memory. - - -dirtytime_expire_seconds -======================== - -When a lazytime inode is constantly having its pages dirtied, the inode with -an updated timestamp will never get chance to be written out. And, if the -only thing that has happened on the file system is a dirtytime inode caused -by an atime update, a worker will be scheduled to make sure that inode -eventually gets pushed out to disk. This tunable is used to define when dirty -inode is old enough to be eligible for writeback by the kernel flusher threads. -And, it is also used as the interval to wakeup dirtytime_writeback thread. - - -dirty_writeback_centisecs -========================= - -The kernel flusher threads will periodically wake up and write `old` data -out to disk. This tunable expresses the interval between those wakeups, in -100'ths of a second. - -Setting this to zero disables periodic writeback altogether. 
- - -drop_caches -=========== - -Writing to this will cause the kernel to drop clean caches, as well as -reclaimable slab objects like dentries and inodes. Once dropped, their -memory becomes free. - -To free pagecache:: - - echo 1 > /proc/sys/vm/drop_caches - -To free reclaimable slab objects (includes dentries and inodes):: - - echo 2 > /proc/sys/vm/drop_caches - -To free slab objects and pagecache:: - - echo 3 > /proc/sys/vm/drop_caches - -This is a non-destructive operation and will not free any dirty objects. -To increase the number of objects freed by this operation, the user may run -`sync` prior to writing to /proc/sys/vm/drop_caches. This will minimize the -number of dirty objects on the system and create more candidates to be -dropped. - -This file is not a means to control the growth of the various kernel caches -(inodes, dentries, pagecache, etc...) These objects are automatically -reclaimed by the kernel when memory is needed elsewhere on the system. - -Use of this file can cause performance problems. Since it discards cached -objects, it may cost a significant amount of I/O and CPU to recreate the -dropped objects, especially if they were under heavy use. Because of this, -use outside of a testing or debugging environment is not recommended. - -You may see informational messages in your kernel log when this file is -used:: - - cat (1234): drop_caches: 3 - -These are informational only. They do not mean that anything is wrong -with your system. To disable them, echo 4 (bit 2) into drop_caches. - - -extfrag_threshold -================= - -This parameter affects whether the kernel will compact memory or direct -reclaim to satisfy a high-order allocation. The extfrag/extfrag_index file in -debugfs shows what the fragmentation index for each order is in each zone in -the system. 
Values tending towards 0 imply allocations would fail due to lack -of memory, values towards 1000 imply failures are due to fragmentation and -1 -implies that the allocation will succeed as long as watermarks are met. - -The kernel will not compact memory in a zone if the -fragmentation index is <= extfrag_threshold. The default value is 500. - - -highmem_is_dirtyable -==================== - -Available only for systems with CONFIG_HIGHMEM enabled (32b systems). - -This parameter controls whether the high memory is considered for dirty -writers throttling. This is not the case by default which means that -only the amount of memory directly visible/usable by the kernel can -be dirtied. As a result, on systems with a large amount of memory and -lowmem basically depleted writers might be throttled too early and -streaming writes can get very slow. - -Changing the value to non zero would allow more memory to be dirtied -and thus allow writers to write more data which can be flushed to the -storage more effectively. Note this also comes with a risk of pre-mature -OOM killer because some writers (e.g. direct block device writes) can -only use the low memory and they can fill it up with dirty data without -any throttling. - - -hugetlb_shm_group -================= - -hugetlb_shm_group contains group id that is allowed to create SysV -shared memory segment using hugetlb page. - - -laptop_mode -=========== - -laptop_mode is a knob that controls "laptop mode". All the things that are -controlled by this knob are discussed in Documentation/laptops/laptop-mode.rst. - - -legacy_va_layout -================ - -If non-zero, this sysctl disables the new 32-bit mmap layout - the kernel -will use the legacy (2.4) layout for all processes. - - -lowmem_reserve_ratio -==================== - -For some specialised workloads on highmem machines it is dangerous for -the kernel to allow process memory to be allocated from the "lowmem" -zone. 
This is because that memory could then be pinned via the mlock() -system call, or by unavailability of swapspace. - -And on large highmem machines this lack of reclaimable lowmem memory -can be fatal. - -So the Linux page allocator has a mechanism which prevents allocations -which *could* use highmem from using too much lowmem. This means that -a certain amount of lowmem is defended from the possibility of being -captured into pinned user memory. - -(The same argument applies to the old 16 megabyte ISA DMA region. This -mechanism will also defend that region from allocations which could use -highmem or lowmem). - -The `lowmem_reserve_ratio` tunable determines how aggressive the kernel is -in defending these lower zones. - -If you have a machine which uses highmem or ISA DMA and your -applications are using mlock(), or if you are running with no swap then -you probably should change the lowmem_reserve_ratio setting. - -The lowmem_reserve_ratio is an array. You can see them by reading this file:: - - % cat /proc/sys/vm/lowmem_reserve_ratio - 256 256 32 - -But, these values are not used directly. The kernel calculates # of protection -pages for each zones from them. These are shown as array of protection pages -in /proc/zoneinfo like followings. (This is an example of x86-64 box). -Each zone has an array of protection pages like this:: - - Node 0, zone DMA - pages free 1355 - min 3 - low 3 - high 4 - : - : - numa_other 0 - protection: (0, 2004, 2004, 2004) - ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - pagesets - cpu: 0 pcp: 0 - : - -These protections are added to score to judge whether this zone should be used -for page allocation or should be reclaimed. - -In this example, if normal pages (index=2) are required to this DMA zone and -watermark[WMARK_HIGH] is used for watermark, the kernel judges this zone should -not be used because pages_free(1355) is smaller than watermark + protection[2] -(4 + 2004 = 2008). 
If this protection value is 0, this zone would be used for -normal page requirement. If requirement is DMA zone(index=0), protection[0] -(=0) is used. - -zone[i]'s protection[j] is calculated by following expression:: - - (i < j): - zone[i]->protection[j] - = (total sums of managed_pages from zone[i+1] to zone[j] on the node) - / lowmem_reserve_ratio[i]; - (i = j): - (should not be protected. = 0; - (i > j): - (not necessary, but looks 0) - -The default values of lowmem_reserve_ratio[i] are - - === ==================================== - 256 (if zone[i] means DMA or DMA32 zone) - 32 (others) - === ==================================== - -As above expression, they are reciprocal number of ratio. -256 means 1/256. # of protection pages becomes about "0.39%" of total managed -pages of higher zones on the node. - -If you would like to protect more pages, smaller values are effective. -The minimum value is 1 (1/1 -> 100%). The value less than 1 completely -disables protection of the pages. - - -max_map_count: -============== - -This file contains the maximum number of memory map areas a process -may have. Memory map areas are used as a side-effect of calling -malloc, directly by mmap, mprotect, and madvise, and also when loading -shared libraries. - -While most applications need less than a thousand maps, certain -programs, particularly malloc debuggers, may consume lots of them, -e.g., up to one or two maps per allocation. - -The default value is 65536. - - -memory_failure_early_kill: -========================== - -Control how to kill processes when uncorrected memory error (typically -a 2bit error in a memory module) is detected in the background by hardware -that cannot be handled by the kernel. In some cases (like the page -still having a valid copy on disk) the kernel will handle the failure -transparently without affecting any applications. But if there is -no other uptodate copy of the data it will kill to prevent any data -corruptions from propagating. 
- -1: Kill all processes that have the corrupted and not reloadable page mapped -as soon as the corruption is detected. Note this is not supported -for a few types of pages, like kernel internally allocated data or -the swap cache, but works for the majority of user pages. - -0: Only unmap the corrupted page from all processes and only kill a process -who tries to access it. - -The kill is done using a catchable SIGBUS with BUS_MCEERR_AO, so processes can -handle this if they want to. - -This is only active on architectures/platforms with advanced machine -check handling and depends on the hardware capabilities. - -Applications can override this setting individually with the PR_MCE_KILL prctl - - -memory_failure_recovery -======================= - -Enable memory failure recovery (when supported by the platform) - -1: Attempt recovery. - -0: Always panic on a memory failure. - - -min_free_kbytes -=============== - -This is used to force the Linux VM to keep a minimum number -of kilobytes free. The VM uses this number to compute a -watermark[WMARK_MIN] value for each lowmem zone in the system. -Each lowmem zone gets a number of reserved free pages based -proportionally on its size. - -Some minimal amount of memory is needed to satisfy PF_MEMALLOC -allocations; if you set this to lower than 1024KB, your system will -become subtly broken, and prone to deadlock under high loads. - -Setting this too high will OOM your machine instantly. - - -min_slab_ratio -============== - -This is available only on NUMA kernels. - -A percentage of the total pages in each zone. On Zone reclaim -(fallback from the local zone occurs) slabs will be reclaimed if more -than this percentage of pages in a zone are reclaimable slab pages. -This insures that the slab growth stays under control even in NUMA -systems that rarely perform global reclaim. - -The default is 5 percent. - -Note that slab reclaim is triggered in a per zone / node fashion. 
-The process of reclaiming slab memory is currently not node specific -and may not be fast. - - -min_unmapped_ratio -================== - -This is available only on NUMA kernels. - -This is a percentage of the total pages in each zone. Zone reclaim will -only occur if more than this percentage of pages are in a state that -zone_reclaim_mode allows to be reclaimed. - -If zone_reclaim_mode has the value 4 OR'd, then the percentage is compared -against all file-backed unmapped pages including swapcache pages and tmpfs -files. Otherwise, only unmapped pages backed by normal files but not tmpfs -files and similar are considered. - -The default is 1 percent. - - -mmap_min_addr -============= - -This file indicates the amount of address space which a user process will -be restricted from mmapping. Since kernel null dereference bugs could -accidentally operate based on the information in the first couple of pages -of memory userspace processes should not be allowed to write to them. By -default this value is set to 0 and no protections will be enforced by the -security module. Setting this value to something like 64k will allow the -vast majority of applications to work correctly and provide defense in depth -against future potential kernel bugs. - - -mmap_rnd_bits -============= - -This value can be used to select the number of bits to use to -determine the random offset to the base address of vma regions -resulting from mmap allocations on architectures which support -tuning address space randomization. This value will be bounded -by the architecture's minimum and maximum supported values. 
- -This value can be changed after boot using the -/proc/sys/vm/mmap_rnd_bits tunable - - -mmap_rnd_compat_bits -==================== - -This value can be used to select the number of bits to use to -determine the random offset to the base address of vma regions -resulting from mmap allocations for applications run in -compatibility mode on architectures which support tuning address -space randomization. This value will be bounded by the -architecture's minimum and maximum supported values. - -This value can be changed after boot using the -/proc/sys/vm/mmap_rnd_compat_bits tunable - - -nr_hugepages -============ - -Change the minimum size of the hugepage pool. - -See Documentation/admin-guide/mm/hugetlbpage.rst - - -nr_hugepages_mempolicy -====================== - -Change the size of the hugepage pool at run-time on a specific -set of NUMA nodes. - -See Documentation/admin-guide/mm/hugetlbpage.rst - - -nr_overcommit_hugepages -======================= - -Change the maximum size of the hugepage pool. The maximum is -nr_hugepages + nr_overcommit_hugepages. - -See Documentation/admin-guide/mm/hugetlbpage.rst - - -nr_trim_pages -============= - -This is available only on NOMMU kernels. - -This value adjusts the excess page trimming behaviour of power-of-2 aligned -NOMMU mmap allocations. - -A value of 0 disables trimming of allocations entirely, while a value of 1 -trims excess pages aggressively. Any value >= 1 acts as the watermark where -trimming of allocations is initiated. - -The default value is 1. - -See Documentation/nommu-mmap.txt for more information. - - -numa_zonelist_order -=================== - -This sysctl is only for NUMA and it is deprecated. Anything but -Node order will fail! - -'where the memory is allocated from' is controlled by zonelists. - -(This documentation ignores ZONE_HIGHMEM/ZONE_DMA32 for simple explanation. -you may be able to read ZONE_DMA as ZONE_DMA32...) - -In non-NUMA case, a zonelist for GFP_KERNEL is ordered as following. 
-ZONE_NORMAL -> ZONE_DMA -This means that a memory allocation request for GFP_KERNEL will -get memory from ZONE_DMA only when ZONE_NORMAL is not available. - -In NUMA case, you can think of following 2 types of order. -Assume 2 node NUMA and below is zonelist of Node(0)'s GFP_KERNEL:: - - (A) Node(0) ZONE_NORMAL -> Node(0) ZONE_DMA -> Node(1) ZONE_NORMAL - (B) Node(0) ZONE_NORMAL -> Node(1) ZONE_NORMAL -> Node(0) ZONE_DMA. - -Type(A) offers the best locality for processes on Node(0), but ZONE_DMA -will be used before ZONE_NORMAL exhaustion. This increases possibility of -out-of-memory(OOM) of ZONE_DMA because ZONE_DMA is tend to be small. - -Type(B) cannot offer the best locality but is more robust against OOM of -the DMA zone. - -Type(A) is called as "Node" order. Type (B) is "Zone" order. - -"Node order" orders the zonelists by node, then by zone within each node. -Specify "[Nn]ode" for node order - -"Zone Order" orders the zonelists by zone type, then by node within each -zone. Specify "[Zz]one" for zone order. - -Specify "[Dd]efault" to request automatic configuration. - -On 32-bit, the Normal zone needs to be preserved for allocations accessible -by the kernel, so "zone" order will be selected. - -On 64-bit, devices that require DMA32/DMA are relatively rare, so "node" -order will be selected. - -Default order is recommended unless this is causing problems for your -system/application. - - -oom_dump_tasks -============== - -Enables a system-wide task dump (excluding kernel threads) to be produced -when the kernel performs an OOM-killing and includes such information as -pid, uid, tgid, vm size, rss, pgtables_bytes, swapents, oom_score_adj -score, and name. This is helpful to determine why the OOM killer was -invoked, to identify the rogue task that caused it, and to determine why -the OOM killer chose the task it did to kill. - -If this is set to zero, this information is suppressed. 
On very -large systems with thousands of tasks it may not be feasible to dump -the memory state information for each one. Such systems should not -be forced to incur a performance penalty in OOM conditions when the -information may not be desired. - -If this is set to non-zero, this information is shown whenever the -OOM killer actually kills a memory-hogging task. - -The default value is 1 (enabled). - - -oom_kill_allocating_task -======================== - -This enables or disables killing the OOM-triggering task in -out-of-memory situations. - -If this is set to zero, the OOM killer will scan through the entire -tasklist and select a task based on heuristics to kill. This normally -selects a rogue memory-hogging task that frees up a large amount of -memory when killed. - -If this is set to non-zero, the OOM killer simply kills the task that -triggered the out-of-memory condition. This avoids the expensive -tasklist scan. - -If panic_on_oom is selected, it takes precedence over whatever value -is used in oom_kill_allocating_task. - -The default value is 0. - - -overcommit_kbytes -================= - -When overcommit_memory is set to 2, the committed address space is not -permitted to exceed swap plus this amount of physical RAM. See below. - -Note: overcommit_kbytes is the counterpart of overcommit_ratio. Only one -of them may be specified at a time. Setting one disables the other (which -then appears as 0 when read). - - -overcommit_memory -================= - -This value contains a flag that enables memory overcommitment. - -When this flag is 0, the kernel attempts to estimate the amount -of free memory left when userspace requests more memory. - -When this flag is 1, the kernel pretends there is always enough -memory until it actually runs out. - -When this flag is 2, the kernel uses a "never overcommit" -policy that attempts to prevent any overcommit of memory. -Note that user_reserve_kbytes affects this policy. 
- -This feature can be very useful because there are a lot of -programs that malloc() huge amounts of memory "just-in-case" -and don't use much of it. - -The default value is 0. - -See Documentation/vm/overcommit-accounting.rst and -mm/util.c::__vm_enough_memory() for more information. - - -overcommit_ratio -================ - -When overcommit_memory is set to 2, the committed address -space is not permitted to exceed swap plus this percentage -of physical RAM. See above. - - -page-cluster -============ - -page-cluster controls the number of pages up to which consecutive pages -are read in from swap in a single attempt. This is the swap counterpart -to page cache readahead. -The mentioned consecutivity is not in terms of virtual/physical addresses, -but consecutive on swap space - that means they were swapped out together. - -It is a logarithmic value - setting it to zero means "1 page", setting -it to 1 means "2 pages", setting it to 2 means "4 pages", etc. -Zero disables swap readahead completely. - -The default value is three (eight pages at a time). There may be some -small benefits in tuning this to a different value if your workload is -swap-intensive. - -Lower values mean lower latencies for initial faults, but at the same time -extra faults and I/O delays for following faults if they would have been part of -that consecutive pages readahead would have brought in. - - -panic_on_oom -============ - -This enables or disables panic on out-of-memory feature. - -If this is set to 0, the kernel will kill some rogue process, -called oom_killer. Usually, oom_killer can kill rogue processes and -system will survive. - -If this is set to 1, the kernel panics when out-of-memory happens. -However, if a process limits using nodes by mempolicy/cpusets, -and those nodes become memory exhaustion status, one process -may be killed by oom-killer. No panic occurs in this case. -Because other nodes' memory may be free. This means system total status -may be not fatal yet. 
- -If this is set to 2, the kernel panics compulsorily even on the -above-mentioned. Even oom happens under memory cgroup, the whole -system panics. - -The default value is 0. - -1 and 2 are for failover of clustering. Please select either -according to your policy of failover. - -panic_on_oom=2+kdump gives you very strong tool to investigate -why oom happens. You can get snapshot. - - -percpu_pagelist_fraction -======================== - -This is the fraction of pages at most (high mark pcp->high) in each zone that -are allocated for each per cpu page list. The min value for this is 8. It -means that we don't allow more than 1/8th of pages in each zone to be -allocated in any single per_cpu_pagelist. This entry only changes the value -of hot per cpu pagelists. User can specify a number like 100 to allocate -1/100th of each zone to each per cpu page list. - -The batch value of each per cpu pagelist is also updated as a result. It is -set to pcp->high/4. The upper limit of batch is (PAGE_SHIFT * 8) - -The initial value is zero. Kernel does not use this value at boot time to set -the high water marks for each per cpu page list. If the user writes '0' to this -sysctl, it will revert to this default behavior. - - -stat_interval -============= - -The time interval between which vm statistics are updated. The default -is 1 second. - - -stat_refresh -============ - -Any read or write (by root only) flushes all the per-cpu vm statistics -into their global totals, for more accurate reports when testing -e.g. cat /proc/sys/vm/stat_refresh /proc/meminfo - -As a side-effect, it also checks for negative totals (elsewhere reported -as 0) and "fails" with EINVAL if any are found, with a warning in dmesg. -(At time of writing, a few stats are known sometimes to be found negative, -with no ill effects: errors and warnings on these stats are suppressed.) - - -numa_stat -========= - -This interface allows runtime configuration of numa statistics. 
- -When page allocation performance becomes a bottleneck and you can tolerate -some possible tool breakage and decreased numa counter precision, you can -do:: - - echo 0 > /proc/sys/vm/numa_stat - -When page allocation performance is not a bottleneck and you want all -tooling to work, you can do:: - - echo 1 > /proc/sys/vm/numa_stat - - -swappiness -========== - -This control is used to define how aggressive the kernel will swap -memory pages. Higher values will increase aggressiveness, lower values -decrease the amount of swap. A value of 0 instructs the kernel not to -initiate swap until the amount of free and file-backed pages is less -than the high water mark in a zone. - -The default value is 60. - - -unprivileged_userfaultfd -======================== - -This flag controls whether unprivileged users can use the userfaultfd -system calls. Set this to 1 to allow unprivileged users to use the -userfaultfd system calls, or set this to 0 to restrict userfaultfd to only -privileged users (with SYS_CAP_PTRACE capability). - -The default value is 1. - - -user_reserve_kbytes -=================== - -When overcommit_memory is set to 2, "never overcommit" mode, reserve -min(3% of current process size, user_reserve_kbytes) of free memory. -This is intended to prevent a user from starting a single memory hogging -process, such that they cannot recover (kill the hog). - -user_reserve_kbytes defaults to min(3% of the current process size, 128MB). - -If this is reduced to zero, then the user will be allowed to allocate -all free memory with a single process, minus admin_reserve_kbytes. -Any subsequent attempts to execute a command will result in -"fork: Cannot allocate memory". - -Changing this takes effect whenever an application requests memory. - - -vfs_cache_pressure -================== - -This percentage value controls the tendency of the kernel to reclaim -the memory which is used for caching of directory and inode objects. 
-
-At the default value of vfs_cache_pressure=100 the kernel will attempt to
-reclaim dentries and inodes at a "fair" rate with respect to pagecache and
-swapcache reclaim. Decreasing vfs_cache_pressure causes the kernel to prefer
-to retain dentry and inode caches. When vfs_cache_pressure=0, the kernel will
-never reclaim dentries and inodes due to memory pressure and this can easily
-lead to out-of-memory conditions. Increasing vfs_cache_pressure beyond 100
-causes the kernel to prefer to reclaim dentries and inodes.
-
-Increasing vfs_cache_pressure significantly beyond 100 may have negative
-performance impact. Reclaim code needs to take various locks to find freeable
-directory and inode objects. With vfs_cache_pressure=1000, it will look for
-ten times more freeable objects than there are.
-
-
-watermark_boost_factor
-======================
-
-This factor controls the level of reclaim when memory is being fragmented.
-It defines the percentage of the high watermark of a zone that will be
-reclaimed if pages of different mobility are being mixed within pageblocks.
-The intent is that compaction has less work to do in the future and to
-increase the success rate of future high-order allocations such as SLUB
-allocations, THP and hugetlbfs pages.
-
-To make it sensible with respect to the watermark_scale_factor
-parameter, the unit is in fractions of 10,000. The default value of
-15,000 on !DISCONTIGMEM configurations means that up to 150% of the high
-watermark will be reclaimed in the event of a pageblock being mixed due
-to fragmentation. The level of reclaim is determined by the number of
-fragmentation events that occurred in the recent past. If this value is
-smaller than a pageblock then a pageblocks worth of pages will be reclaimed
-(e.g. 2MB on 64-bit x86). A boost factor of 0 will disable the feature.
-
-
-watermark_scale_factor
-======================
-
-This factor controls the aggressiveness of kswapd. It defines the
-amount of memory left in a node/system before kswapd is woken up and
-how much memory needs to be free before kswapd goes back to sleep.
-
-The unit is in fractions of 10,000. The default value of 10 means the
-distances between watermarks are 0.1% of the available memory in the
-node/system. The maximum value is 1000, or 10% of memory.
-
-A high rate of threads entering direct reclaim (allocstall) or kswapd
-going to sleep prematurely (kswapd_low_wmark_hit_quickly) can indicate
-that the number of free pages kswapd maintains for latency reasons is
-too small for the allocation bursts occurring in the system. This knob
-can then be used to tune kswapd aggressiveness accordingly.
-
-
-zone_reclaim_mode
-=================
-
-Zone_reclaim_mode allows someone to set more or less aggressive approaches to
-reclaim memory when a zone runs out of memory. If it is set to zero then no
-zone reclaim occurs. Allocations will be satisfied from other zones / nodes
-in the system.
-
-This is value OR'ed together of
-
-= ===================================
-1 Zone reclaim on
-2 Zone reclaim writes dirty pages out
-4 Zone reclaim swaps pages
-= ===================================
-
-zone_reclaim_mode is disabled by default. For file servers or workloads
-that benefit from having their data cached, zone_reclaim_mode should be
-left disabled as the caching effect is likely to be more important than
-data locality.
-
-zone_reclaim may be enabled if it's known that the workload is partitioned
-such that each partition fits within a NUMA node and that accessing remote
-memory would cause a measurable performance reduction. The page allocator
-will then reclaim easily reusable pages (those page cache pages that are
-currently not used) before allocating off node pages.
-
-Allowing zone reclaim to write out pages stops processes that are
-writing large amounts of data from dirtying pages on other nodes. Zone
-reclaim will write out dirty pages if a zone fills up and so effectively
-throttle the process. This may decrease the performance of a single process
-since it cannot use all of system memory to buffer the outgoing writes
-anymore but it preserve the memory on other nodes so that the performance
-of other processes running on other nodes will not be affected.
-
-Allowing regular swap effectively restricts allocations to the local
-node unless explicitly overridden by memory policies or cpuset
-configurations.
diff --git a/Documentation/vm/unevictable-lru.rst b/Documentation/vm/unevictable-lru.rst
index 8ba656f37cd8..109052215bce 100644
--- a/Documentation/vm/unevictable-lru.rst
+++ b/Documentation/vm/unevictable-lru.rst
@@ -439,7 +439,7 @@ Compacting MLOCKED Pages
 
 The unevictable LRU can be scanned for compactable regions and the default
 behavior is to do so. /proc/sys/vm/compact_unevictable_allowed controls
-this behavior (see Documentation/sysctl/vm.rst). Once scanning of the
+this behavior (see Documentation/admin-guide/sysctl/vm.rst). Once scanning of the
 unevictable LRU is enabled, the work of compaction is mostly handled by
 the page migration code and the same work flow as described in MIGRATING
 MLOCKED PAGES will apply.
diff --git a/fs/proc/Kconfig b/fs/proc/Kconfig
index 4c3dcb718961..47d2651fd9dc 100644
--- a/fs/proc/Kconfig
+++ b/fs/proc/Kconfig
@@ -72,7 +72,7 @@ config PROC_SYSCTL
 	  interface is through /proc/sys. If you say Y here a tree of
 	  modifiable sysctl entries will be generated beneath the
 	  /proc/sys directory. They are explained in the files
-	  in . Note that enabling this
+	  in . Note that enabling this
 	  option will enlarge the kernel by at least 8 KB.
 
 	  As it is generally a good thing, you should say Y here unless
diff --git a/kernel/panic.c b/kernel/panic.c
index e0ea74bbb41d..057540b6eee9 100644
--- a/kernel/panic.c
+++ b/kernel/panic.c
@@ -372,7 +372,7 @@ const struct taint_flag taint_flags[TAINT_FLAGS_COUNT] = {
 /**
  * print_tainted - return a string to represent the kernel taint state.
  *
- * For individual taint flag meanings, see Documentation/sysctl/kernel.rst
+ * For individual taint flag meanings, see Documentation/admin-guide/sysctl/kernel.rst
  *
  * The string is overwritten by the next call to print_tainted(),
  * but is always NULL terminated.
diff --git a/mm/swap.c b/mm/swap.c
index 83a2a15f4836..ae300397dfda 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -8,7 +8,7 @@
 /*
  * This file contains the default values for the operation of the
  * Linux VM subsystem. Fine-tuning documentation can be found in
- * Documentation/sysctl/vm.rst.
+ * Documentation/admin-guide/sysctl/vm.rst.
  * Started 18.12.91
  * Swap aging added 23.2.95, Stephen Tweedie.
  * Buffermem limits added 12.3.98, Rik van Riel.
--
cgit