diff options
Diffstat (limited to 'Documentation/admin-guide/cgroup-v2.rst')
-rw-r--r-- | Documentation/admin-guide/cgroup-v2.rst | 245 |
1 files changed, 201 insertions, 44 deletions
diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst index 17e6e9565156..cb1b4e759b7e 100644 --- a/Documentation/admin-guide/cgroup-v2.rst +++ b/Documentation/admin-guide/cgroup-v2.rst @@ -64,13 +64,14 @@ v1 is available under :ref:`Documentation/admin-guide/cgroup-v1/index.rst <cgrou 5-6. Device 5-7. RDMA 5-7-1. RDMA Interface Files - 5-8. HugeTLB - 5.8-1. HugeTLB Interface Files - 5-9. Misc - 5.9-1 Miscellaneous cgroup Interface Files - 5.9-2 Migration and Ownership - 5-10. Others - 5-10-1. perf_event + 5-8. DMEM + 5-9. HugeTLB + 5.9-1. HugeTLB Interface Files + 5-10. Misc + 5.10-1 Miscellaneous cgroup Interface Files + 5.10-2 Migration and Ownership + 5-11. Others + 5-11-1. perf_event 5-N. Non-normative information 5-N-1. CPU controller root cgroup process behaviour 5-N-2. IO controller root cgroup process behaviour @@ -239,6 +240,13 @@ cgroup v2 currently supports the following mount options. will not be tracked by the memory controller (even if cgroup v2 is remounted later on). + pids_localevents + The option restores v1-like behavior of pids.events:max, that is only + local (inside cgroup proper) fork failures are counted. Without this + option pids.events.max represents any pids.max enforcemnt across + cgroup's subtree. + + Organizing Processes and Threads -------------------------------- @@ -526,10 +534,12 @@ cgroup namespace on namespace creation. Because the resource control interface files in a given directory control the distribution of the parent's resources, the delegatee shouldn't be allowed to write to them. For the first method, this is -achieved by not granting access to these files. For the second, the -kernel rejects writes to all files other than "cgroup.procs" and -"cgroup.subtree_control" on a namespace root from inside the -namespace. +achieved by not granting access to these files. For the second, files +outside the namespace should be hidden from the delegatee by the means +of at least mount namespacing, and the kernel rejects writes to all +files on a namespace root from inside the cgroup namespace, except for +those files listed in "/sys/kernel/cgroup/delegate" (including +"cgroup.procs", "cgroup.threads", "cgroup.subtree_control", etc.). The end results are equivalent for both delegation types. Once delegated, the user can build sub-hierarchy under the directory, @@ -974,6 +984,14 @@ All cgroup core files are prefixed with "cgroup." A dying cgroup can consume system resources not exceeding limits, which were active at the moment of cgroup deletion. + nr_subsys_<cgroup_subsys> + Total number of live cgroup subsystems (e.g memory + cgroup) at and beneath the current cgroup. + + nr_dying_subsys_<cgroup_subsys> + Total number of dying cgroup subsystems (e.g. memory + cgroup) at and beneath the current cgroup. + cgroup.freeze A read-write single value file which exists on non-root cgroups. Allowed values are "0" and "1". The default is "0". @@ -1058,12 +1076,15 @@ cpufreq governor about the minimum desired frequency which should always be provided by a CPU, as well as the maximum desired frequency, which should not be exceeded by a CPU. -WARNING: cgroup2 doesn't yet support control of realtime processes and -the cpu controller can only be enabled when all RT processes are in -the root cgroup. Be aware that system management software may already -have placed RT processes into nonroot cgroups during the system boot -process, and these processes may need to be moved to the root cgroup -before the cpu controller can be enabled. +WARNING: cgroup2 doesn't yet support control of realtime processes. For +a kernel built with the CONFIG_RT_GROUP_SCHED option enabled for group +scheduling of realtime processes, the cpu controller can only be enabled +when all RT processes are in the root cgroup. This limitation does +not apply if CONFIG_RT_GROUP_SCHED is disabled. Be aware that system +management software may already have placed RT processes into nonroot +cgroups during the system boot process, and these processes may need +to be moved to the root cgroup before the cpu controller can be enabled +with a CONFIG_RT_GROUP_SCHED enabled kernel. CPU Interface Files @@ -1296,17 +1317,10 @@ PAGE_SIZE multiple when read back. This is a simple interface to trigger memory reclaim in the target cgroup. - This file accepts a single key, the number of bytes to reclaim. - No nested keys are currently supported. - Example:: echo "1G" > memory.reclaim - The interface can be later extended with nested keys to - configure the reclaim behavior. For example, specify the - type of memory to reclaim from (anon, file, ..). - Please note that the kernel can over or under reclaim from the target cgroup. If less bytes are reclaimed than the specified amount, -EAGAIN is returned. @@ -1318,12 +1332,26 @@ PAGE_SIZE multiple when read back. This means that the networking layer will not adapt based on reclaim induced by memory.reclaim. +The following nested keys are defined. + + ========== ================================ + swappiness Swappiness value to reclaim with + ========== ================================ + + Specifying a swappiness value instructs the kernel to perform + the reclaim with that swappiness value. Note that this has the + same semantics as vm.swappiness applied to memcg reclaim with + all the existing limitations and potential future extensions. + memory.peak - A read-only single value file which exists on non-root - cgroups. + A read-write single value file which exists on non-root cgroups. - The max memory usage recorded for the cgroup and its - descendants since the creation of the cgroup. + The max memory usage recorded for the cgroup and its descendants since + either the creation of the cgroup or the most recent reset for that FD. + + A write of any non-empty string to this file resets it to the + current memory usage for subsequent reads through the same + file descriptor. memory.oom.group A read-write single value file which exists on non-root @@ -1432,7 +1460,7 @@ PAGE_SIZE multiple when read back. sec_pagetables Amount of memory allocated for secondary page tables, this currently includes KVM mmu allocations on x86 - and arm64. + and arm64 and IOMMU page tables. percpu (npn) Amount of memory used for storing per-cpu kernel @@ -1572,6 +1600,24 @@ PAGE_SIZE multiple when read back. pglazyfreed (npn) Amount of reclaimed lazyfree pages + swpin_zero + Number of pages swapped into memory and filled with zero, where I/O + was optimized out because the page content was detected to be zero + during swapout. + + swpout_zero + Number of zero-filled pages swapped out with I/O skipped due to the + content being detected as zero. + + zswpin + Number of pages moved in to memory from zswap. + + zswpout + Number of pages moved out of memory to zswap. + + zswpwb + Number of pages written from zswap to swap. + thp_fault_alloc (npn) Number of transparent hugepages which were allocated to satisfy a page fault. This counter is not present when CONFIG_TRANSPARENT_HUGEPAGE @@ -1591,6 +1637,30 @@ PAGE_SIZE multiple when read back. Usually because failed to allocate some continuous swap space for the huge page. + numa_pages_migrated (npn) + Number of pages migrated by NUMA balancing. + + numa_pte_updates (npn) + Number of pages whose page table entries are modified by + NUMA balancing to produce NUMA hinting faults on access. + + numa_hint_faults (npn) + Number of NUMA hinting faults. + + pgdemote_kswapd + Number of pages demoted by kswapd. + + pgdemote_direct + Number of pages demoted directly. + + pgdemote_khugepaged + Number of pages demoted by khugepaged. + + hugetlb + Amount of memory used by hugetlb pages. This metric only shows + up if hugetlb usage is accounted for in memory.current (i.e. + cgroup is mounted with the memory_hugetlb_accounting option). + memory.numa_stat A read-only nested-keyed file which exists on non-root cgroups. @@ -1640,11 +1710,14 @@ PAGE_SIZE multiple when read back. Healthy workloads are not expected to reach this limit. memory.swap.peak - A read-only single value file which exists on non-root - cgroups. + A read-write single value file which exists on non-root cgroups. + + The max swap usage recorded for the cgroup and its descendants since + the creation of the cgroup or the most recent reset for that FD. - The max swap usage recorded for the cgroup and its - descendants since the creation of the cgroup. + A write of any non-empty string to this file resets it to the + current memory usage for subsequent reads through the same + file descriptor. memory.swap.max A read-write single value file which exists on non-root @@ -1694,9 +1767,10 @@ PAGE_SIZE multiple when read back. entries fault back in or are written out to disk. memory.zswap.writeback - A read-write single value file. The default value is "1". The - initial value of the root cgroup is 1, and when a new cgroup is - created, it inherits the current value of its parent. + A read-write single value file. The default value is "1". + Note that this setting is hierarchical, i.e. the writeback would be + implicitly disabled for child cgroups if the upper hierarchy + does so. When this is set to 0, all swapping attempts to swapping devices are disabled. This included both zswap writebacks, and swapping due @@ -1707,6 +1781,8 @@ PAGE_SIZE multiple when read back. Note that this is subtly different from setting memory.swap.max to 0, as it still allows for pages to be written to the zswap pool. + This setting has no effect if zswap is disabled, and swapping + is allowed unless memory.swap.max is set to 0. memory.pressure A read-only nested-keyed file. @@ -2181,11 +2257,31 @@ PID Interface Files Hard limit of number of processes. pids.current - A read-only single value file which exists on all cgroups. + A read-only single value file which exists on non-root cgroups. The number of processes currently in the cgroup and its descendants. + pids.peak + A read-only single value file which exists on non-root cgroups. + + The maximum value that the number of processes in the cgroup and its + descendants has ever reached. + + pids.events + A read-only flat-keyed file which exists on non-root cgroups. Unless + specified otherwise, a value change in this file generates a file + modified event. The following entries are defined. + + max + The number of times the cgroup's total number of processes hit the pids.max + limit (see also pids_localevents). + + pids.events.local + Similar to pids.events but the fields in the file are local + to the cgroup i.e. not hierarchical. The file modified event + generated on this file reflects only the local events. + Organisational operations are not blocked by cgroup policies, so it is possible to have pids.current > pids.max. This can be done by either setting the limit to be smaller than pids.current, or attaching enough @@ -2320,8 +2416,12 @@ Cpuset Interface Files is always a subset of it. Users can manually set it to a value that is different from - "cpuset.cpus". The only constraint in setting it is that the - list of CPUs must be exclusive with respect to its sibling. + "cpuset.cpus". One constraint in setting it is that the list of + CPUs must be exclusive with respect to "cpuset.cpus.exclusive" + of its sibling. If "cpuset.cpus.exclusive" of a sibling cgroup + isn't set, its "cpuset.cpus" value, if set, cannot be a subset + of it to leave at least one CPU available when the exclusive + CPUs are taken away. For a parent cgroup, any one of its exclusive CPUs can only be distributed to at most one of its child cgroups. Having an @@ -2337,8 +2437,8 @@ Cpuset Interface Files cpuset-enabled cgroups. This file shows the effective set of exclusive CPUs that - can be used to create a partition root. The content of this - file will always be a subset of "cpuset.cpus" and its parent's + can be used to create a partition root. The content + of this file will always be a subset of its parent's "cpuset.cpus.exclusive.effective" if its parent is not the root cgroup. It will also be a subset of "cpuset.cpus.exclusive" if it is set. If "cpuset.cpus.exclusive" is not set, it is @@ -2527,6 +2627,49 @@ RDMA Interface Files mlx4_0 hca_handle=1 hca_object=20 ocrdma1 hca_handle=1 hca_object=23 +DMEM +---- + +The "dmem" controller regulates the distribution and accounting of +device memory regions. Because each memory region may have its own page size, +which does not have to be equal to the system page size, the units are always bytes. + +DMEM Interface Files +~~~~~~~~~~~~~~~~~~~~ + + dmem.max, dmem.min, dmem.low + A readwrite nested-keyed file that exists for all the cgroups + except root that describes current configured resource limit + for a region. + + An example for xe follows:: + + drm/0000:03:00.0/vram0 1073741824 + drm/0000:03:00.0/stolen max + + The semantics are the same as for the memory cgroup controller, and are + calculated in the same way. + + dmem.capacity + A read-only file that describes maximum region capacity. + It only exists on the root cgroup. Not all memory can be + allocated by cgroups, as the kernel reserves some for + internal use. + + An example for xe follows:: + + drm/0000:03:00.0/vram0 8514437120 + drm/0000:03:00.0/stolen 67108864 + + dmem.current + A read-only file that describes current resource usage. + It exists for all the cgroup except root. + + An example for xe follows:: + + drm/0000:03:00.0/vram0 12550144 + drm/0000:03:00.0/stolen 8650752 + HugeTLB ------- @@ -2599,6 +2742,15 @@ Miscellaneous controller provides 3 interface files. If two misc resources (res_ res_a 3 res_b 0 + misc.peak + A read-only flat-keyed file shown in all cgroups. It shows the + historical maximum usage of the resources in the cgroup and its + children.:: + + $ cat misc.peak + res_a 10 + res_b 8 + misc.max A read-write flat-keyed file shown in the non root cgroups. Allowed maximum usage of the resources in the cgroup and its children.:: @@ -2628,6 +2780,11 @@ Miscellaneous controller provides 3 interface files. If two misc resources (res_ The number of times the cgroup's resource usage was about to go over the max boundary. + misc.events.local + Similar to misc.events but the fields in the file are local to the + cgroup i.e. not hierarchical. The file modified event generated on + this file reflects only the local events. + Migration and Ownership ~~~~~~~~~~~~~~~~~~~~~~~ @@ -2846,7 +3003,7 @@ following two functions. a queue (device) has been associated with the bio and before submission. - wbc_account_cgroup_owner(@wbc, @page, @bytes) + wbc_account_cgroup_owner(@wbc, @folio, @bytes) Should be called for each data segment being written out. While this function doesn't care exactly when it's called during the writeback session, it's the easiest and most @@ -2878,8 +3035,8 @@ Deprecated v1 Core Features - "cgroup.clone_children" is removed. -- /proc/cgroups is meaningless for v2. Use "cgroup.controllers" file - at the root instead. +- /proc/cgroups is meaningless for v2. Use "cgroup.controllers" or + "cgroup.stat" files at the root instead. Issues with v1 and Rationales for v2 |