Diffstat (limited to 'Documentation/admin-guide/cgroup-v2.rst')
 Documentation/admin-guide/cgroup-v2.rst | 913 +++++++++++++++++++++--------
 1 file changed, 746 insertions(+), 167 deletions(-)
diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
index 0636bcb60b5a..17e6e9565156 100644
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -1,3 +1,5 @@
+.. _cgroup-v2:
+
================
Control Group v2
================
@@ -9,7 +11,7 @@ This is the authoritative documentation on the design, interface and
conventions of cgroup v2. It describes all userland-visible aspects
of cgroup including core and specific controller behaviors. All
future changes must be reflected in this document. Documentation for
-v1 is available under Documentation/admin-guide/cgroup-v1/.
+v1 is available under :ref:`Documentation/admin-guide/cgroup-v1/index.rst <cgroup-v1>`.
.. CONTENTS
@@ -54,6 +56,7 @@ v1 is available under Documentation/admin-guide/cgroup-v1/.
5-3-3. IO Latency
5-3-3-1. How IO Latency Throttling Works
5-3-3-2. IO Latency Interface Files
+ 5-3-4. IO Priority
5-4. PID
5-4-1. PID Interface Files
5-5. Cpuset
@@ -61,8 +64,13 @@ v1 is available under Documentation/admin-guide/cgroup-v1/.
5-6. Device
5-7. RDMA
5-7-1. RDMA Interface Files
- 5-8. Misc
- 5-8-1. perf_event
+ 5-8. HugeTLB
+ 5-8-1. HugeTLB Interface Files
+ 5-9. Misc
+ 5-9-1. Miscellaneous cgroup Interface Files
+ 5-9-2. Migration and Ownership
+ 5-10. Others
+ 5-10-1. perf_event
5-N. Non-normative information
5-N-1. CPU controller root cgroup process behaviour
5-N-2. IO controller root cgroup process behaviour
@@ -170,15 +178,21 @@ disabling controllers in v1 and make them always available in v2.
cgroup v2 currently supports the following mount options.
nsdelegate
-
Consider cgroup namespaces as delegation boundaries. This
option is system wide and can only be set on mount or modified
through remount from the init namespace. The mount option is
ignored on non-init namespace mounts. Please refer to the
Delegation section for details.
- memory_localevents
+ favordynmods
+ Reduce the latencies of dynamic cgroup modifications such as
+ task migrations and controller on/offs at the cost of making
+ hot path operations such as forks and exits more expensive.
+ The static usage pattern of creating a cgroup, enabling
+ controllers, and then seeding it with CLONE_INTO_CGROUP is
+ not affected by this option.
+ memory_localevents
Only populate memory.events with data for the current cgroup,
and not any subtrees. This is legacy behaviour, the default
behaviour without this option is to include subtree counts.
@@ -186,6 +200,45 @@ cgroup v2 currently supports the following mount options.
modified through remount from the init namespace. The mount
option is ignored on non-init namespace mounts.
+ memory_recursiveprot
+ Recursively apply memory.min and memory.low protection to
+ entire subtrees, without requiring explicit downward
+ propagation into leaf cgroups. This allows protecting entire
+ subtrees from one another, while retaining free competition
+ within those subtrees. This should have been the default
+ behavior but is a mount-option to avoid regressing setups
+ relying on the original semantics (e.g. specifying bogusly
+ high 'bypass' protection values at higher tree levels).
+
+ memory_hugetlb_accounting
+ Count HugeTLB memory usage towards the cgroup's overall
+ memory usage for the memory controller (for the purpose of
+ statistics reporting and memory protection). This is a new
+ behavior that could regress existing setups, so it must be
+ explicitly opted in with this mount option.
+
+ A few caveats to keep in mind:
+
+ * There is no HugeTLB pool management involved in the memory
+ controller. The pre-allocated pool does not belong to anyone.
+ Specifically, when a new HugeTLB folio is allocated to
+ the pool, it is not accounted for from the perspective of the
+ memory controller. It is only charged to a cgroup when it is
+ actually used (e.g. at page fault time). Host memory
+ overcommit management has to consider this when configuring
+ hard limits. In general, HugeTLB pool management should be
+ done via other mechanisms (such as the HugeTLB controller).
+ * Failure to charge a HugeTLB folio to the memory controller
+ results in SIGBUS. This could happen even if the HugeTLB pool
+ still has pages available (but the cgroup limit is hit and
+ the reclaim attempt fails).
+ * Charging HugeTLB memory towards the memory controller affects
+ memory protection and reclaim dynamics. Any userspace tuning
+ (e.g. of the low and min limits) needs to take this into account.
+ * HugeTLB pages utilized while this option is not selected
+ will not be tracked by the memory controller (even if cgroup
+ v2 is remounted later on).
+
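+For example, a possible mount invocation enabling some of the options
+described above (the option list is illustrative)::
+
+  # mount -t cgroup2 -o nsdelegate,memory_recursiveprot none /sys/fs/cgroup
+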
Organizing Processes and Threads
--------------------------------
@@ -340,6 +393,13 @@ constraint, a threaded controller must be able to handle competition
between threads in a non-leaf cgroup and its child cgroups. Each
threaded controller defines how such competitions are handled.
+Currently, the following controllers are threaded and can be enabled
+in a threaded cgroup:
+
+- cpu
+- cpuset
+- perf_event
+- pids
[Un]populated Notification
--------------------------
@@ -595,10 +655,12 @@ process migrations.
and is an example of this type.
+.. _cgroupv2-limits-distributor:
+
Limits
------
-A child can only consume upto the configured amount of the resource.
+A child can only consume up to the configured amount of the resource.
Limits can be over-committed - the sum of the limits of children can
exceed the amount of resource available to the parent.
@@ -611,15 +673,16 @@ process migrations.
"io.max" limits the maximum BPS and/or IOPS that a cgroup can consume
on an IO device and is an example of this type.
+.. _cgroupv2-protections-distributor:
Protections
-----------
-A cgroup is protected upto the configured amount of the resource
+A cgroup is protected up to the configured amount of the resource
as long as the usages of all its ancestors are under their
protected levels. Protections can be hard guarantees or best effort
soft boundaries. Protections can also be over-committed in which case
-only upto the amount available to the parent is protected among
+only up to the amount available to the parent is protected among
children.
Protections are in the range [0, max] and defaults to 0, which is
@@ -701,9 +764,7 @@ Conventions
- Settings for a single feature should be contained in a single file.
- The root cgroup should be exempt from resource control and thus
- shouldn't have resource control interface files. Also,
- informational files on the root cgroup which end up showing global
- information available elsewhere shouldn't exist.
+ shouldn't have resource control interface files.
- The default time unit is microseconds. If a different unit is ever
used, an explicit unit suffix must be present.
@@ -775,7 +836,6 @@ Core Interface Files
All cgroup core files are prefixed with "cgroup."
cgroup.type
-
A read-write single value file which exists on non-root
cgroups.
@@ -940,9 +1000,49 @@ All cgroup core files are prefixed with "cgroup."
it's possible to delete a frozen (and empty) cgroup, as well as
create new sub-cgroups.
+ cgroup.kill
+ A write-only single value file which exists in non-root cgroups.
+ The only allowed value is "1".
+
+ Writing "1" to the file causes the cgroup and all descendant cgroups to
+ be killed. This means that all processes located in the affected cgroup
+ tree will be killed via SIGKILL.
+
+ Killing a cgroup tree will deal with concurrent forks appropriately and
+ is protected against migrations.
+
+ In a threaded cgroup, writing this file fails with EOPNOTSUPP as
+ killing cgroups is a process directed operation, i.e. it affects
+ the whole thread-group.
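+
+ A minimal usage sketch (the cgroup path is illustrative)::
+
+   # echo 1 > /sys/fs/cgroup/batchjobs/cgroup.kill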
+
+ cgroup.pressure
+ A read-write single value file whose allowed values are "0" and "1".
+ The default is "1".
+
+ Writing "0" to the file will disable the cgroup PSI accounting.
+ Writing "1" to the file will re-enable the cgroup PSI accounting.
+
+ This control attribute is not hierarchical, so disabling or enabling
+ PSI accounting in a cgroup does not affect PSI accounting in its
+ descendants, and enablement does not need to be passed down from the root.
+
+ The reason this control attribute exists is that PSI accounts stalls for
+ each cgroup separately and aggregates it at each level of the hierarchy.
+ This may cause non-negligible overhead for some workloads deep in
+ the hierarchy, in which case this control attribute can
+ be used to disable PSI accounting in the non-leaf cgroups.
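+
+ For example, disabling PSI accounting in an intermediate cgroup
+ (the path is illustrative)::
+
+   # echo 0 > /sys/fs/cgroup/workload/cgroup.pressure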
+
+ irq.pressure
+ A read-write nested-keyed file.
+
+ Shows pressure stall information for IRQ/SOFTIRQ. See
+ :ref:`Documentation/accounting/psi.rst <psi>` for details.
+
Controllers
===========
+.. _cgroup-v2-cpu:
+
CPU
---
@@ -972,7 +1072,7 @@ CPU Interface Files
All time durations are in microseconds.
cpu.stat
- A read-only flat-keyed file which exists on non-root cgroups.
+ A read-only flat-keyed file.
This file exists whether the controller is enabled or not.
It always reports the following three stats:
@@ -981,17 +1081,23 @@ All time durations are in microseconds.
- user_usec
- system_usec
- and the following three when the controller is enabled:
+ and the following five when the controller is enabled:
- nr_periods
- nr_throttled
- throttled_usec
+ - nr_bursts
+ - burst_usec
cpu.weight
A read-write single value file which exists on non-root
cgroups. The default is "100".
- The weight in the range [1, 10000].
+ For non idle groups (cpu.idle = 0), the weight is in the
+ range [1, 10000].
+
+ If the cgroup has been configured to be SCHED_IDLE (cpu.idle = 1),
+ then the weight will show as a 0.
cpu.weight.nice
A read-write single value file which exists on non-root
@@ -1013,15 +1119,21 @@ All time durations are in microseconds.
$MAX $PERIOD
- which indicates that the group may consume upto $MAX in each
+ which indicates that the group may consume up to $MAX in each
$PERIOD duration. "max" for $MAX indicates no limit. If only
one number is written, $MAX is updated.
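+
+ For example, to allow the cgroup to consume at most 50ms of CPU time
+ in every 100ms period (the values are illustrative)::
+
+   # echo "50000 100000" > cpu.max
+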
+ cpu.max.burst
+ A read-write single value file which exists on non-root
+ cgroups. The default is "0".
+
+ The burst in the range [0, $MAX].
+
cpu.pressure
- A read-only nested-key file which exists on non-root cgroups.
+ A read-write nested-keyed file.
Shows pressure stall information for CPU. See
- Documentation/accounting/psi.rst for details.
+ :ref:`Documentation/accounting/psi.rst <psi>` for details.
cpu.uclamp.min
A read-write single value file which exists on non-root cgroups.
@@ -1049,6 +1161,16 @@ All time durations are in microseconds.
values similar to the sched_setattr(2). This maximum utilization
value is used to clamp the task specific maximum utilization clamp.
+ cpu.idle
+ A read-write single value file which exists on non-root cgroups.
+ The default is 0.
+
+ This is the cgroup analog of the per-task SCHED_IDLE sched policy.
+ Setting this value to a 1 will make the scheduling policy of the
+ cgroup SCHED_IDLE. The threads inside the cgroup will retain their
+ own relative priorities, but the cgroup itself will be treated as
+ very low priority relative to its peers.
+
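+ A minimal sketch; note that the weight reads back as 0 for an
+ idle cgroup::
+
+   # echo 1 > cpu.idle
+   # cat cpu.weight
+   0
+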
Memory
@@ -1101,7 +1223,7 @@ PAGE_SIZE multiple when read back.
proportionally to the overage, reducing reclaim pressure for
smaller overages.
- Effective min boundary is limited by memory.min values of
+ Effective min boundary is limited by memory.min values of
all ancestor cgroups. If there is memory.min overcommitment
(child cgroup or cgroups are requiring more protected memory
than parent will allow), then each child cgroup will get
@@ -1141,27 +1263,67 @@ PAGE_SIZE multiple when read back.
A read-write single value file which exists on non-root
cgroups. The default is "max".
- Memory usage throttle limit. This is the main mechanism to
- control memory usage of a cgroup. If a cgroup's usage goes
+ Memory usage throttle limit. If a cgroup's usage goes
over the high boundary, the processes of the cgroup are
throttled and put under heavy reclaim pressure.
Going over the high limit never invokes the OOM killer and
- under extreme conditions the limit may be breached.
+ under extreme conditions the limit may be breached. The high
+ limit should be used in scenarios where an external process
+ monitors the limited cgroup to alleviate heavy reclaim
+ pressure.
memory.max
A read-write single value file which exists on non-root
cgroups. The default is "max".
- Memory usage hard limit. This is the final protection
- mechanism. If a cgroup's memory usage reaches this limit and
- can't be reduced, the OOM killer is invoked in the cgroup.
- Under certain circumstances, the usage may go over the limit
- temporarily.
+ Memory usage hard limit. This is the main mechanism to limit
+ memory usage of a cgroup. If a cgroup's memory usage reaches
+ this limit and can't be reduced, the OOM killer is invoked in
+ the cgroup. Under certain circumstances, the usage may go
+ over the limit temporarily.
+
+ In the default configuration, regular 0-order allocations always
+ succeed unless the OOM killer chooses the current task as a victim.
+
+ Some kinds of allocations don't invoke the OOM killer.
+ The caller could retry them differently, return -ENOMEM to
+ userspace, or silently ignore them in cases like disk readahead.
+
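+ For example, a setup that throttles at "memory.high" and keeps
+ "memory.max" as a fail-safe could look like this (the values are
+ illustrative)::
+
+   # echo 4G > memory.high
+   # echo 5G > memory.max
+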
+ memory.reclaim
+ A write-only nested-keyed file which exists for all cgroups.
+
+ This is a simple interface to trigger memory reclaim in the
+ target cgroup.
- This is the ultimate protection mechanism. As long as the
- high limit is used and monitored properly, this limit's
- utility is limited to providing the final safety net.
+ This file accepts a single key, the number of bytes to reclaim.
+ No nested keys are currently supported.
+
+ Example::
+
+ echo "1G" > memory.reclaim
+
+ The interface can be later extended with nested keys to
+ configure the reclaim behavior. For example, specify the
+ type of memory to reclaim from (anon, file, ..).
+
+ Please note that the kernel can over- or under-reclaim from
+ the target cgroup. If fewer bytes are reclaimed than the
+ specified amount, -EAGAIN is returned.
+
+ Please note that proactive reclaim (triggered by this
+ interface) is not meant to indicate memory pressure on the
+ memory cgroup. Therefore, the socket memory balancing that is
+ normally triggered by memory reclaim is not exercised in this
+ case: the networking layer will not adapt based on reclaim
+ induced by memory.reclaim.
+
+ memory.peak
+ A read-only single value file which exists on non-root
+ cgroups.
+
+ The max memory usage recorded for the cgroup and its
+ descendants since the creation of the cgroup.
memory.oom.group
A read-write single value file which exists on non-root
@@ -1189,7 +1351,7 @@ PAGE_SIZE multiple when read back.
Note that all fields in this file are hierarchical and the
file modified event can be generated due to an event down the
- hierarchy. For for the local events at the cgroup level see
+ hierarchy. For the local events at the cgroup level see
memory.events.local.
low
@@ -1215,22 +1377,17 @@ PAGE_SIZE multiple when read back.
The number of times the cgroup's memory usage
reached the limit and allocation was about to fail.
- Depending on context result could be invocation of OOM
- killer and retrying allocation or failing allocation.
-
- Failed allocation in its turn could be returned into
- userspace as -ENOMEM or silently ignored in cases like
- disk readahead. For now OOM in memory cgroup kills
- tasks iff shortage has happened inside page fault.
-
This event is not raised if the OOM killer is not
considered as an option, e.g. for failed high-order
- allocations.
+ allocations or if the caller asked not to retry attempts.
oom_kill
The number of processes belonging to this cgroup
killed by any kind of OOM killer.
+ oom_group_kill
+ The number of times a group OOM has occurred.
+
memory.events.local
Similar to memory.events but the fields in the file are local
to the cgroup i.e. not hierarchical. The file modified event
@@ -1249,6 +1406,10 @@ PAGE_SIZE multiple when read back.
can show up in the middle. Don't rely on items remaining in a
fixed position; use the keys to look up specific values!
+ If an entry has no per-node counter (and hence does not show up
+ in memory.numa_stat), 'npn' (non-per-node) is used as the tag
+ to indicate that it will not appear in memory.numa_stat.
+
anon
Amount of memory used in anonymous mappings such as
brk(), sbrk(), and mmap(MAP_ANONYMOUS)
@@ -1257,20 +1418,42 @@ PAGE_SIZE multiple when read back.
Amount of memory used to cache filesystem data,
including tmpfs and shared memory.
+ kernel (npn)
+ Amount of total kernel memory, including
+ (kernel_stack, pagetables, percpu, vmalloc, slab) in
+ addition to other kernel memory use cases.
+
kernel_stack
Amount of memory allocated to kernel stacks.
- slab
- Amount of memory used for storing in-kernel data
- structures.
+ pagetables
+ Amount of memory allocated for page tables.
- sock
+ sec_pagetables
+ Amount of memory allocated for secondary page tables,
+ this currently includes KVM mmu allocations on x86
+ and arm64.
+
+ percpu (npn)
+ Amount of memory used for storing per-cpu kernel
+ data structures.
+
+ sock (npn)
Amount of memory used in network transmission buffers
+ vmalloc (npn)
+ Amount of memory used for vmap backed memory.
+
shmem
Amount of cached filesystem data that is swap-backed,
such as tmpfs, shm segments, shared anonymous mmap()s
+ zswap
+ Amount of memory consumed by the zswap compression backend.
+
+ zswapped
+ Amount of application memory swapped out to zswap.
+
file_mapped
Amount of cached filesystem data mapped with mmap()
@@ -1282,10 +1465,22 @@ PAGE_SIZE multiple when read back.
Amount of cached filesystem data that was modified and
is currently being written back to disk
+ swapcached
+ Amount of swap cached in memory. The swapcache is accounted
+ against both memory and swap usage.
+
anon_thp
Amount of memory used in anonymous mappings backed by
transparent hugepages
+ file_thp
+ Amount of cached filesystem data backed by transparent
+ hugepages
+
+ shmem_thp
+ Amount of shm, tmpfs, shared anonymous mmap()s backed by
+ transparent hugepages
+
inactive_anon, active_anon, inactive_file, active_file, unevictable
Amount of memory, swap-backed and filesystem-backed,
on the internal memory management lists used by the
@@ -1304,64 +1499,123 @@ PAGE_SIZE multiple when read back.
Part of "slab" that cannot be reclaimed on memory
pressure.
- pgfault
- Total number of page faults incurred
+ slab (npn)
+ Amount of memory used for storing in-kernel data
+ structures.
- pgmajfault
- Number of major page faults incurred
+ workingset_refault_anon
+ Number of refaults of previously evicted anonymous pages.
- workingset_refault
+ workingset_refault_file
+ Number of refaults of previously evicted file pages.
- Number of refaults of previously evicted pages
+ workingset_activate_anon
+ Number of refaulted anonymous pages that were immediately
+ activated.
- workingset_activate
+ workingset_activate_file
+ Number of refaulted file pages that were immediately activated.
- Number of refaulted pages that were immediately activated
+ workingset_restore_anon
+ Number of restored anonymous pages which have been detected as
+ an active workingset before they got reclaimed.
- workingset_nodereclaim
+ workingset_restore_file
+ Number of restored file pages which have been detected as an
+ active workingset before they got reclaimed.
+ workingset_nodereclaim
Number of times a shadow node has been reclaimed
- pgrefill
+ pgscan (npn)
+ Amount of scanned pages (in an inactive LRU list)
- Amount of scanned pages (in an active LRU list)
+ pgsteal (npn)
+ Amount of reclaimed pages
- pgscan
+ pgscan_kswapd (npn)
+ Amount of scanned pages by kswapd (in an inactive LRU list)
- Amount of scanned pages (in an inactive LRU list)
+ pgscan_direct (npn)
+ Amount of scanned pages directly (in an inactive LRU list)
- pgsteal
+ pgscan_khugepaged (npn)
+ Amount of scanned pages by khugepaged (in an inactive LRU list)
- Amount of reclaimed pages
+ pgsteal_kswapd (npn)
+ Amount of reclaimed pages by kswapd
- pgactivate
+ pgsteal_direct (npn)
+ Amount of reclaimed pages directly
- Amount of pages moved to the active LRU list
+ pgsteal_khugepaged (npn)
+ Amount of reclaimed pages by khugepaged
- pgdeactivate
+ pgfault (npn)
+ Total number of page faults incurred
- Amount of pages moved to the inactive LRU list
+ pgmajfault (npn)
+ Number of major page faults incurred
- pglazyfree
+ pgrefill (npn)
+ Amount of scanned pages (in an active LRU list)
- Amount of pages postponed to be freed under memory pressure
+ pgactivate (npn)
+ Amount of pages moved to the active LRU list
- pglazyfreed
+ pgdeactivate (npn)
+ Amount of pages moved to the inactive LRU list
- Amount of reclaimed lazyfree pages
+ pglazyfree (npn)
+ Amount of pages postponed to be freed under memory pressure
- thp_fault_alloc
+ pglazyfreed (npn)
+ Amount of reclaimed lazyfree pages
+ thp_fault_alloc (npn)
Number of transparent hugepages which were allocated to satisfy
- a page fault, including COW faults. This counter is not present
- when CONFIG_TRANSPARENT_HUGEPAGE is not set.
-
- thp_collapse_alloc
+ a page fault. This counter is not present when CONFIG_TRANSPARENT_HUGEPAGE
+ is not set.
+ thp_collapse_alloc (npn)
Number of transparent hugepages which were allocated to allow
collapsing an existing range of pages. This counter is not
present when CONFIG_TRANSPARENT_HUGEPAGE is not set.
+ thp_swpout (npn)
+ Number of transparent hugepages which were swapped out in one piece
+ without splitting.
+
+ thp_swpout_fallback (npn)
+ Number of transparent hugepages which were split before swapout.
+ Usually because of a failure to allocate some contiguous swap space
+ for the huge page.
+
+ memory.numa_stat
+ A read-only nested-keyed file which exists on non-root cgroups.
+
+ This breaks down the cgroup's memory footprint into different
+ types of memory, type-specific details, and other information
+ per node on the state of the memory management system.
+
+ This is useful for providing visibility into the NUMA locality
+ information within a memcg since the pages are allowed to be
+ allocated from any physical node. One use case is evaluating
+ application performance by combining this information with the
+ application's CPU allocation.
+
+ All memory amounts are in bytes.
+
+ The output format of memory.numa_stat is::
+
+ type N0=<bytes in node 0> N1=<bytes in node 1> ...
+
+ The entries are ordered to be human readable, and new entries
+ can show up in the middle. Don't rely on items remaining in a
+ fixed position; use the keys to look up specific values!
+
+ The entries correspond to those in memory.stat.
+
memory.swap.current
A read-only single value file which exists on non-root
cgroups.
@@ -1369,6 +1623,29 @@ PAGE_SIZE multiple when read back.
The total amount of swap currently being used by the cgroup
and its descendants.
+ memory.swap.high
+ A read-write single value file which exists on non-root
+ cgroups. The default is "max".
+
+ Swap usage throttle limit. If a cgroup's swap usage exceeds
+ this limit, all its further allocations will be throttled to
+ allow userspace to implement custom out-of-memory procedures.
+
+ This limit marks a point of no return for the cgroup. It is NOT
+ designed to manage the amount of swapping a workload does
+ during regular operation. Compare to memory.swap.max, which
+ prohibits swapping past a set amount, but lets the cgroup
+ continue unimpeded as long as other memory can be reclaimed.
+
+ Healthy workloads are not expected to reach this limit.
+
+ memory.swap.peak
+ A read-only single value file which exists on non-root
+ cgroups.
+
+ The max swap usage recorded for the cgroup and its
+ descendants since the creation of the cgroup.
+
memory.swap.max
A read-write single value file which exists on non-root
cgroups. The default is "max".
@@ -1382,6 +1659,10 @@ PAGE_SIZE multiple when read back.
otherwise, a value change in this file generates a file
modified event.
+ high
+ The number of times the cgroup's swap usage was over
+ the high threshold.
+
max
The number of times the cgroup's swap usage was about
to go over the max boundary and swap allocation
@@ -1397,11 +1678,41 @@ PAGE_SIZE multiple when read back.
higher than the limit for an extended period of time. This
reduces the impact on the workload and memory management.
+ memory.zswap.current
+ A read-only single value file which exists on non-root
+ cgroups.
+
+ The total amount of memory consumed by the zswap compression
+ backend.
+
+ memory.zswap.max
+ A read-write single value file which exists on non-root
+ cgroups. The default is "max".
+
+ Zswap usage hard limit. If a cgroup's zswap pool reaches this
+ limit, it will refuse to take any more stores until existing
+ entries fault back in or are written out to disk.
+
+ memory.zswap.writeback
+ A read-write single value file. The default value is "1". The
+ initial value of the root cgroup is 1, and when a new cgroup is
+ created, it inherits the current value of its parent.
+
+ When this is set to 0, all swapping attempts to swapping devices
+ are disabled. This includes both zswap writeback and swapping due
+ to zswap store failures. If zswap store failures are recurring
+ (e.g. if the pages are incompressible), users can observe
+ reclaim inefficiency after disabling writeback (because the same
+ pages might be rejected again and again).
+
+ Note that this is subtly different from setting memory.swap.max to
+ 0, as it still allows for pages to be written to the zswap pool.
+
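+ A minimal sketch disabling writeback for a cgroup::
+
+   # echo 0 > memory.zswap.writeback
+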
memory.pressure
- A read-only nested-key file which exists on non-root cgroups.
+ A read-only nested-keyed file.
Shows pressure stall information for memory. See
- Documentation/accounting/psi.rst for details.
+ :ref:`Documentation/accounting/psi.rst <psi>` for details.
Usage Guidelines
@@ -1461,8 +1772,7 @@ IO Interface Files
~~~~~~~~~~~~~~~~~~
io.stat
- A read-only nested-keyed file which exists on non-root
- cgroups.
+ A read-only nested-keyed file.
Lines are keyed by $MAJ:$MIN device numbers and not ordered.
The following nested keys are defined.
@@ -1476,13 +1786,13 @@ IO Interface Files
dios Number of discard IOs
====== =====================
- An example read output follows:
+ An example read output follows::
8:16 rbytes=1459200 wbytes=314773504 rios=192 wios=353 dbytes=0 dios=0
8:0 rbytes=90430464 wbytes=299008000 rios=8950 wios=1252 dbytes=50331648 dios=3021
io.cost.qos
- A read-write nested-keyed file with exists only on the root
+ A read-write nested-keyed file which exists only on the root
cgroup.
This file configures the Quality of Service of the IO cost
@@ -1537,7 +1847,7 @@ IO Interface Files
automatic mode can be restored by setting "ctrl" to "auto".
io.cost.model
- A read-write nested-keyed file with exists only on the root
+ A read-write nested-keyed file which exists only on the root
cgroup.
This file configures the cost model of the IO cost model based
@@ -1638,10 +1948,10 @@ IO Interface Files
8:16 rbps=2097152 wbps=max riops=max wiops=max
io.pressure
- A read-only nested-key file which exists on non-root cgroups.
+ A read-only nested-keyed file.
Shows pressure stall information for IO. See
- Documentation/accounting/psi.rst for details.
+ :ref:`Documentation/accounting/psi.rst <psi>` for details.
Writeback
@@ -1662,9 +1972,9 @@ per-cgroup dirty memory states are examined and the more restrictive
of the two is enforced.
cgroup writeback requires explicit support from the underlying
-filesystem. Currently, cgroup writeback is implemented on ext2, ext4
-and btrfs. On other filesystems, all writeback IOs are attributed to
-the root cgroup.
+filesystem. Currently, cgroup writeback is implemented on ext2, ext4,
+btrfs, f2fs, and xfs. On other filesystems, all writeback IOs are
+attributed to the root cgroup.
There are inherent differences in memory and writeback management
which affects how cgroup ownership is tracked. Memory is tracked per
@@ -1763,7 +2073,7 @@ IO Latency Interface Files
io.latency
This takes a similar format as the other controllers.
- "MAJOR:MINOR target=<target time in microseconds"
+ "MAJOR:MINOR target=<target time in microseconds>"
io.stat
If the controller is enabled you will see extra stats in io.stat in
@@ -1783,6 +2093,68 @@ IO Latency Interface Files
duration of time between evaluation events. Windows only elapse
with IO activity. Idle periods extend the most recent window.
+IO Priority
+~~~~~~~~~~~
+
+A single attribute controls the behavior of the I/O priority cgroup policy,
+namely the io.prio.class attribute. The following values are accepted for
+that attribute:
+
+ no-change
+ Do not modify the I/O priority class.
+
+ promote-to-rt
+ For requests that have a non-RT I/O priority class, change it into RT.
+ Also change the priority level of these requests to 4. Do not modify
+ the I/O priority of requests that have priority class RT.
+
+ restrict-to-be
+ For requests that do not have an I/O priority class or that have I/O
+ priority class RT, change it into BE. Also change the priority level
+ of these requests to 0. Do not modify the I/O priority class of
+ requests that have priority class IDLE.
+
+ idle
+ Change the I/O priority class of all requests into IDLE, the lowest
+ I/O priority class.
+
+ none-to-rt
+ Deprecated. Just an alias for promote-to-rt.
+
+The following numerical values are associated with the I/O priority policies:
+
++----------------+---+
+| no-change | 0 |
++----------------+---+
+| promote-to-rt | 1 |
++----------------+---+
+| restrict-to-be | 2 |
++----------------+---+
+| idle | 3 |
++----------------+---+
+
+The numerical value that corresponds to each I/O priority class is as follows:
+
++-------------------------------+---+
+| IOPRIO_CLASS_NONE | 0 |
++-------------------------------+---+
+| IOPRIO_CLASS_RT (real-time) | 1 |
++-------------------------------+---+
+| IOPRIO_CLASS_BE (best effort) | 2 |
++-------------------------------+---+
+| IOPRIO_CLASS_IDLE | 3 |
++-------------------------------+---+
+
+The algorithm to set the I/O priority class for a request is as follows:
+
+- If I/O priority class policy is promote-to-rt, change the request I/O
+ priority class to IOPRIO_CLASS_RT and change the request I/O priority
+ level to 4.
+- If I/O priority class policy is not promote-to-rt, translate the I/O priority
+ class policy into a number, then change the request I/O priority class
+ into the maximum of the I/O priority class policy number and the numerical
+ I/O priority class.
+
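+For example, with the restrict-to-be policy (numerical value 2), a
+request with I/O priority class IOPRIO_CLASS_RT (1) is changed to
+max(2, 1) = 2, i.e. IOPRIO_CLASS_BE, while an IOPRIO_CLASS_IDLE (3)
+request keeps max(2, 3) = 3, i.e. IDLE. A usage sketch (the cgroup
+path is illustrative)::
+
+  # echo restrict-to-be > /sys/fs/cgroup/batchjobs/io.prio.class
+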
PID
---
@@ -1851,7 +2223,7 @@ Cpuset Interface Files
from the requested CPUs.
The CPU numbers are comma-separated numbers or ranges.
- For example:
+ For example::
# cat cpuset.cpus
0-4,6,8-10
@@ -1890,7 +2262,7 @@ Cpuset Interface Files
from the requested memory nodes.
The memory node numbers are comma-separated numbers or ranges.
- For example:
+ For example::
# cat cpuset.mems
0-1,3
@@ -1903,6 +2275,17 @@ Cpuset Interface Files
The value of "cpuset.mems" stays constant until the next update
and won't be affected by any memory nodes hotplug events.
+ Setting a non-empty value to "cpuset.mems" causes memory of
+ tasks within the cgroup to be migrated to the designated nodes if
+ they are currently using memory outside of the designated nodes.
+
+ There is a cost for this memory migration. The migration
+ may not be complete and some memory pages may be left behind.
+ So it is recommended that "cpuset.mems" be set properly
+ before spawning new tasks into the cpuset. Even if there is
+ a need to change "cpuset.mems" with active tasks, it shouldn't
+ be done frequently.
+
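+ For example, configuring the memory nodes before moving tasks in
+ avoids migration costs (the node numbers are illustrative)::
+
+   # echo 0-1 > cpuset.mems
+   # echo $$ > cgroup.procs
+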
cpuset.mems.effective
A read-only multiple values file which exists on all
cpuset-enabled cgroups.
@@ -1919,78 +2302,166 @@ Cpuset Interface Files
Its value will be affected by memory nodes hotplug events.
+ cpuset.cpus.exclusive
+ A read-write multiple values file which exists on non-root
+ cpuset-enabled cgroups.
+
+ It lists all the exclusive CPUs that are allowed to be used
+ to create a new cpuset partition. Its value is not used
+ unless the cgroup becomes a valid partition root. See the
+ "cpuset.cpus.partition" section below for a description of what
+ a cpuset partition is.
+
+ When the cgroup becomes a partition root, the actual exclusive
+ CPUs that are allocated to that partition are listed in
+ "cpuset.cpus.exclusive.effective" which may be different
+ from "cpuset.cpus.exclusive". If "cpuset.cpus.exclusive"
+ has previously been set, "cpuset.cpus.exclusive.effective"
+ is always a subset of it.
+
+ Users can manually set it to a value that is different from
+ "cpuset.cpus". The only constraint in setting it is that the
+ list of CPUs must be exclusive with respect to its siblings.
+
+ For a parent cgroup, any one of its exclusive CPUs can only
+ be distributed to at most one of its child cgroups. Having an
+ exclusive CPU appearing in two or more of its child cgroups is
+ not allowed (the exclusivity rule). A value that violates the
+ exclusivity rule will be rejected with a write error.
+
+ The root cgroup is a partition root and all its available CPUs
+ are in its exclusive CPU set.
+
+ cpuset.cpus.exclusive.effective
+ A read-only multiple values file which exists on all non-root
+ cpuset-enabled cgroups.
+
+ This file shows the effective set of exclusive CPUs that
+ can be used to create a partition root. The content of this
+ file will always be a subset of "cpuset.cpus" and its parent's
+ "cpuset.cpus.exclusive.effective" if its parent is not the root
+ cgroup. It will also be a subset of "cpuset.cpus.exclusive"
+ if it is set. If "cpuset.cpus.exclusive" is not set, it is
+ treated as having an implicit value of "cpuset.cpus" in the
+ formation of a local partition.
+
+ cpuset.cpus.isolated
+ A read-only multiple values file which exists only on the root cgroup.
+
+ This file shows the set of all isolated CPUs used in existing
+ isolated partitions. It will be empty if no isolated partition
+ is created.
+
cpuset.cpus.partition
A read-write single value file which exists on non-root
cpuset-enabled cgroups. This flag is owned by the parent cgroup
and is not delegatable.
- It accepts only the following input values when written to.
-
- "root" - a partition root
- "member" - a non-root member of a partition
-
- When set to be a partition root, the current cgroup is the
- root of a new partition or scheduling domain that comprises
- itself and all its descendants except those that are separate
- partition roots themselves and their descendants. The root
- cgroup is always a partition root.
-
- There are constraints on where a partition root can be set.
- It can only be set in a cgroup if all the following conditions
- are true.
-
- 1) The "cpuset.cpus" is not empty and the list of CPUs are
- exclusive, i.e. they are not shared by any of its siblings.
- 2) The parent cgroup is a partition root.
- 3) The "cpuset.cpus" is also a proper subset of the parent's
- "cpuset.cpus.effective".
- 4) There is no child cgroups with cpuset enabled. This is for
- eliminating corner cases that have to be handled if such a
- condition is allowed.
-
- Setting it to partition root will take the CPUs away from the
- effective CPUs of the parent cgroup. Once it is set, this
- file cannot be reverted back to "member" if there are any child
- cgroups with cpuset enabled.
-
- A parent partition cannot distribute all its CPUs to its
- child partitions. There must be at least one cpu left in the
- parent partition.
-
- Once becoming a partition root, changes to "cpuset.cpus" is
- generally allowed as long as the first condition above is true,
- the change will not take away all the CPUs from the parent
- partition and the new "cpuset.cpus" value is a superset of its
- children's "cpuset.cpus" values.
-
- Sometimes, external factors like changes to ancestors'
- "cpuset.cpus" or cpu hotplug can cause the state of the partition
- root to change. On read, the "cpuset.sched.partition" file
- can show the following values.
-
- "member" Non-root member of a partition
- "root" Partition root
- "root invalid" Invalid partition root
-
- It is a partition root if the first 2 partition root conditions
- above are true and at least one CPU from "cpuset.cpus" is
- granted by the parent cgroup.
-
- A partition root can become invalid if none of CPUs requested
- in "cpuset.cpus" can be granted by the parent cgroup or the
- parent cgroup is no longer a partition root itself. In this
- case, it is not a real partition even though the restriction
- of the first partition root condition above will still apply.
- The cpu affinity of all the tasks in the cgroup will then be
- associated with CPUs in the nearest ancestor partition.
-
- An invalid partition root can be transitioned back to a
- real partition root if at least one of the requested CPUs
- can now be granted by its parent. In this case, the cpu
- affinity of all the tasks in the formerly invalid partition
- will be associated to the CPUs of the newly formed partition.
- Changing the partition state of an invalid partition root to
- "member" is always allowed even if child cpusets are present.
+ It accepts only the following input values when written to.
+
+ ========== =====================================
+ "member" Non-root member of a partition
+ "root" Partition root
+ "isolated" Partition root without load balancing
+ ========== =====================================
+
+ A cpuset partition is a collection of cpuset-enabled cgroups with
+ a partition root at the top of the hierarchy and its descendants
+ except those that are separate partition roots themselves and
+ their descendants. A partition has exclusive access to the
+ set of exclusive CPUs allocated to it. Other cgroups outside
+ of that partition cannot use any CPUs in that set.
+
+ There are two types of partitions - local and remote. A local
+ partition is one whose parent cgroup is also a valid partition
+ root. A remote partition is one whose parent cgroup is not a
+ valid partition root itself. Writing to "cpuset.cpus.exclusive"
+ is optional for the creation of a local partition as its
+ "cpuset.cpus.exclusive" file will assume an implicit value that
+ is the same as "cpuset.cpus" if it is not set. Writing the
+ proper "cpuset.cpus.exclusive" values down the cgroup hierarchy
+ before the target partition root is mandatory for the creation
+ of a remote partition.
+
+ Currently, a remote partition cannot be created under a local
+ partition. None of the ancestors of a remote partition root,
+ other than the root cgroup, can be a partition root.
+
+ The root cgroup is always a partition root and its state cannot
+ be changed. All other non-root cgroups start out as "member".
+
+ When set to "root", the current cgroup is the root of a new
+ partition or scheduling domain. The set of exclusive CPUs is
+ determined by the value of its "cpuset.cpus.exclusive.effective".
+
+ When set to "isolated", the CPUs in that partition will be in
+ an isolated state without any load balancing from the scheduler
+ and excluded from the unbound workqueues. Tasks placed in such
+ a partition with multiple CPUs should be carefully distributed
+ and bound to each of the individual CPUs for optimal performance.
+
+ A partition root ("root" or "isolated") can be in one of the
+ two possible states - valid or invalid. An invalid partition
+ root is in a degraded state where some state information may
+ be retained, but behaves more like a "member".
+
+ All possible state transitions among "member", "root" and
+ "isolated" are allowed.
+
+ On read, the "cpuset.cpus.partition" file can show the following
+ values.
+
+ ============================= =====================================
+ "member" Non-root member of a partition
+ "root" Partition root
+ "isolated" Partition root without load balancing
+ "root invalid (<reason>)" Invalid partition root
+ "isolated invalid (<reason>)" Invalid isolated partition root
+ ============================= =====================================
+
+ In the case of an invalid partition root, a descriptive string on
+ why the partition is invalid is included within parentheses.
+
+ For a local partition root to be valid, the following conditions
+ must be met.
+
+ 1) The parent cgroup is a valid partition root.
+ 2) The "cpuset.cpus.exclusive.effective" file cannot be empty,
+ though it may contain offline CPUs.
+ 3) The "cpuset.cpus.effective" cannot be empty unless there is
+ no task associated with this partition.
+
+ For a remote partition root to be valid, all the above conditions
+ except the first one must be met.
+
+ External events like hotplug or changes to "cpuset.cpus" or
+ "cpuset.cpus.exclusive" can cause a valid partition root to
+ become invalid and vice versa. Note that a task cannot be
+ moved to a cgroup with empty "cpuset.cpus.effective".
+
+ A valid non-root parent partition may distribute out all its CPUs
+ to its child local partitions when there is no task associated
+ with it.
+
+ Care must be taken when changing a valid partition root to "member",
+ as all its child local partitions, if present, will become
+ invalid, causing disruption to tasks running in those child
+ partitions. These inactivated partitions could be recovered if
+ their parent is switched back to a partition root with a proper
+ value in "cpuset.cpus" or "cpuset.cpus.exclusive".
+
+ Poll and inotify events are triggered whenever the state of
+ "cpuset.cpus.partition" changes. That includes changes caused
+ by writes to "cpuset.cpus.partition", cpu hotplug or other
+ changes that modify the validity status of the partition.
+ This will allow user space agents to monitor unexpected changes
+ to "cpuset.cpus.partition" without the need to do continuous
+ polling.
+
+ A user can pre-configure certain CPUs to an isolated state
+ with load balancing disabled at boot time with the "isolcpus"
+ kernel boot command line option. If those CPUs are to be put
+ into a partition, they have to be used in an isolated partition.
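+
+ For example, the following sequence, run in a cpuset-enabled child
+ of the root cgroup, carves CPUs 2-3 out into an isolated partition
+ (the CPU numbers are illustrative)::
+
+   # echo 2-3 > cpuset.cpus
+   # echo 2-3 > cpuset.cpus.exclusive
+   # echo isolated > cpuset.cpus.partition
+   # cat cpuset.cpus.partition
+   isolated
+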
Device controller
@@ -2002,26 +2473,26 @@ existing device files.
Cgroup v2 device controller has no interface files and is implemented
on top of cgroup BPF. To control access to device files, a user may
-create bpf programs of the BPF_CGROUP_DEVICE type and attach them
-to cgroups. On an attempt to access a device file, corresponding
-BPF programs will be executed, and depending on the return value
-the attempt will succeed or fail with -EPERM.
+create bpf programs of type BPF_PROG_TYPE_CGROUP_DEVICE and attach
+them to cgroups with BPF_CGROUP_DEVICE flag. On an attempt to access a
+device file, corresponding BPF programs will be executed, and depending
+on the return value the attempt will succeed or fail with -EPERM.
-A BPF_CGROUP_DEVICE program takes a pointer to the bpf_cgroup_dev_ctx
-structure, which describes the device access attempt: access type
-(mknod/read/write) and device (type, major and minor numbers).
-If the program returns 0, the attempt fails with -EPERM, otherwise
-it succeeds.
+A BPF_PROG_TYPE_CGROUP_DEVICE program takes a pointer to the
+bpf_cgroup_dev_ctx structure, which describes the device access attempt:
+access type (mknod/read/write) and device (type, major and minor numbers).
+If the program returns 0, the attempt fails with -EPERM, otherwise it
+succeeds.
-An example of BPF_CGROUP_DEVICE program may be found in the kernel
-source tree in the tools/testing/selftests/bpf/dev_cgroup.c file.
+An example of BPF_PROG_TYPE_CGROUP_DEVICE program may be found in
+tools/testing/selftests/bpf/progs/dev_cgroup.c in the kernel source tree.
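+
+One possible way to load and attach such a program, assuming a compiled
+BPF object and bpftool are available (the paths are illustrative)::
+
+  # bpftool prog load dev_cgroup.bpf.o /sys/fs/bpf/dev_cgroup
+  # bpftool cgroup attach /sys/fs/cgroup/container1 device pinned /sys/fs/bpf/dev_cgroup
+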
RDMA
----
The "rdma" controller regulates the distribution and accounting of
-of RDMA resources.
+RDMA resources.
RDMA Interface Files
~~~~~~~~~~~~~~~~~~~~
@@ -2056,10 +2527,118 @@ RDMA Interface Files
mlx4_0 hca_handle=1 hca_object=20
ocrdma1 hca_handle=1 hca_object=23
+HugeTLB
+-------
+
+The HugeTLB controller allows limiting HugeTLB usage per control group and
+enforces the limit during page fault.
+
+HugeTLB Interface Files
+~~~~~~~~~~~~~~~~~~~~~~~
+
+ hugetlb.<hugepagesize>.current
+ Show current usage for "hugepagesize" hugetlb. It exists for all
+ cgroups except the root.
+
+ hugetlb.<hugepagesize>.max
+ Set/show the hard limit of "hugepagesize" hugetlb usage.
+ The default value is "max". It exists for all cgroups except the root.
+
+ hugetlb.<hugepagesize>.events
+ A read-only flat-keyed file which exists on non-root cgroups.
+
+ max
+ The number of allocation failures due to the HugeTLB limit
+
+ hugetlb.<hugepagesize>.events.local
+ Similar to hugetlb.<hugepagesize>.events but the fields in the file
+ are local to the cgroup i.e. not hierarchical. The file modified event
+ generated on this file reflects only the local events.
+
+ hugetlb.<hugepagesize>.numa_stat
+ Similar to memory.numa_stat, it shows the NUMA information of the
+ hugetlb pages of <hugepagesize> in this cgroup. Only hugetlb pages
+ that are actively in use are included. The per-node values are in bytes.
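+
+A usage sketch, assuming a 2MB hugepage size (the limit value is
+illustrative)::
+
+  # echo 1G > hugetlb.2MB.max
+  # cat hugetlb.2MB.current
+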
Misc
----
+The Miscellaneous cgroup provides the resource limiting and tracking
+mechanism for the scalar resources which cannot be abstracted like the other
+cgroup resources. The controller is enabled by the CONFIG_CGROUP_MISC config
+option.
+
+A resource can be added to the controller via enum misc_res_type{} in the
+include/linux/misc_cgroup.h file and the corresponding name via misc_res_name[]
+in the kernel/cgroup/misc.c file. The provider of the resource must set its
+capacity prior to using the resource by calling misc_cg_set_capacity().
+
+Once a capacity is set, the resource usage can be updated using the charge
+and uncharge APIs. All of the APIs to interact with the misc controller are in
+include/linux/misc_cgroup.h.
+
+Misc Interface Files
+~~~~~~~~~~~~~~~~~~~~
+
+The miscellaneous controller provides the following interface files. If two
+misc resources (res_a and res_b) are registered, then:
+
+ misc.capacity
+ A read-only flat-keyed file shown only in the root cgroup. It shows
+ miscellaneous scalar resources available on the platform along with
+ their quantities::
+
+ $ cat misc.capacity
+ res_a 50
+ res_b 10
+
+ misc.current
+ A read-only flat-keyed file shown in all cgroups. It shows
+ the current usage of the resources in the cgroup and its children::
+
+ $ cat misc.current
+ res_a 3
+ res_b 0
+
+ misc.max
+ A read-write flat-keyed file shown in the non-root cgroups. Allowed
+ maximum usage of the resources in the cgroup and its children::
+
+ $ cat misc.max
+ res_a max
+ res_b 4
+
+ Limit can be set by::
+
+ # echo res_a 1 > misc.max
+
+ Limit can be set to max by::
+
+ # echo res_a max > misc.max
+
+ Limits can be set higher than the capacity value in the misc.capacity
+ file.
+
+ misc.events
+ A read-only flat-keyed file which exists on non-root cgroups. The
+ following entries are defined. Unless specified otherwise, a value
+ change in this file generates a file modified event. All fields in
+ this file are hierarchical.
+
+ max
+ The number of times the cgroup's resource usage was
+ about to go over the max boundary.
+
+Migration and Ownership
+~~~~~~~~~~~~~~~~~~~~~~~
+
+A miscellaneous scalar resource is charged to the cgroup in which it is used
+first, and stays charged to that cgroup until that resource is freed. Migrating
+a process to a different cgroup does not move the charge to the destination
+cgroup where the process has moved.
+
+Others
+------
+
perf_event
~~~~~~~~~~
@@ -2116,7 +2695,7 @@ Without cgroup namespace, the "/proc/$PID/cgroup" file shows the
complete path of the cgroup of a process. In a container setup where
a set of cgroups and namespaces are intended to isolate processes the
"/proc/$PID/cgroup" file may leak potential system level information
-to the isolated processes. For Example::
+to the isolated processes. For example::
# cat /proc/self/cgroup
0::/batchjobs/container_id1