summaryrefslogtreecommitdiff
path: root/Documentation/accounting/delay-accounting.rst
diff options
context:
space:
mode:
Diffstat (limited to 'Documentation/accounting/delay-accounting.rst')
-rw-r--r--Documentation/accounting/delay-accounting.rst214
1 files changed, 214 insertions, 0 deletions
diff --git a/Documentation/accounting/delay-accounting.rst b/Documentation/accounting/delay-accounting.rst
new file mode 100644
index 000000000000..86d7902a657f
--- /dev/null
+++ b/Documentation/accounting/delay-accounting.rst
@@ -0,0 +1,214 @@
+================
+Delay accounting
+================
+
+Tasks encounter delays in execution when they wait
+for some kernel resource to become available e.g. a
+runnable task may wait for a free CPU to run on.
+
+The per-task delay accounting functionality measures
+the delays experienced by a task while
+
+a) waiting for a CPU (while being runnable)
+b) completion of synchronous block I/O initiated by the task
+c) swapping in pages
+d) memory reclaim
+e) thrashing
+f) direct compact
+g) write-protect copy
+h) IRQ/SOFTIRQ
+
+and makes these statistics available to userspace through
+the taskstats interface.
+
+Such delays provide feedback for setting a task's cpu priority,
+io priority and rss limit values appropriately. Long delays for
+important tasks could be a trigger for raising its corresponding priority.
+
+The functionality, through its use of the taskstats interface, also provides
+delay statistics aggregated for all tasks (or threads) belonging to a
+thread group (corresponding to a traditional Unix process). This is a commonly
+needed aggregation that is more efficiently done by the kernel.
+
+Userspace utilities, particularly resource management applications, can also
+aggregate delay statistics into arbitrary groups. To enable this, delay
+statistics of a task are available both during its lifetime as well as on its
+exit, ensuring continuous and complete monitoring can be done.
+
+
+Interface
+---------
+
+Delay accounting uses the taskstats interface which is described
+in detail in a separate document in this directory. Taskstats returns a
+generic data structure to userspace corresponding to per-pid and per-tgid
+statistics. The delay accounting functionality populates specific fields of
+this structure. See
+
+ include/uapi/linux/taskstats.h
+
+for a description of the fields pertaining to delay accounting.
+It will generally be in the form of counters returning the cumulative
+delay seen for cpu, sync block I/O, swapin, memory reclaim, thrash page
+cache, direct compact, write-protect copy, IRQ/SOFTIRQ etc.
+
+Taking the difference of two successive readings of a given
+counter (say cpu_delay_total) for a task will give the delay
+experienced by the task waiting for the corresponding resource
+in that interval.
+
+When a task exits, records containing the per-task statistics
+are sent to userspace without requiring a command. If it is the last exiting
+task of a thread group, the per-tgid statistics are also sent. More details
+are given in the taskstats interface description.
+
+The getdelays.c userspace utility in tools/accounting directory allows simple
+commands to be run and the corresponding delay statistics to be displayed. It
+also serves as an example of using the taskstats interface.
+
+Usage
+-----
+
+Compile the kernel with::
+
+ CONFIG_TASK_DELAY_ACCT=y
+ CONFIG_TASKSTATS=y
+
+Delay accounting is disabled by default at boot up.
+To enable, add::
+
+ delayacct
+
+to the kernel boot options. The rest of the instructions below assume this has
+been done. Alternatively, use sysctl kernel.task_delayacct to switch the state
+at runtime. Note however that only tasks started after enabling it will have
+delayacct information.
+
+After the system has booted up, use a utility
+similar to getdelays.c to access the delays
+seen by a given task or a task group (tgid).
+The utility also allows a given command to be
+executed and the corresponding delays to be
+seen.
+
+General format of the getdelays command::
+
+ getdelays [-dilv] [-t tgid] [-p pid]
+
+Get delays, since system boot, for pid 10::
+
+ # ./getdelays -d -p 10
+ (output similar to next case)
+
+Get sum and peak of delays, since system boot, for all pids with tgid 242::
+
+ bash-4.4# ./getdelays -d -t 242
+ print delayacct stats ON
+ TGID 242
+
+
+ CPU count real total virtual total delay total delay average delay max delay min
+ 39 156000000 156576579 2111069 0.054ms 0.212296ms 0.031307ms
+ IO count delay total delay average delay max delay min
+ 0 0 0.000ms 0.000000ms 0.000000ms
+ SWAP count delay total delay average delay max delay min
+ 0 0 0.000ms 0.000000ms 0.000000ms
+ RECLAIM count delay total delay average delay max delay min
+ 0 0 0.000ms 0.000000ms 0.000000ms
+ THRASHING count delay total delay average delay max delay min
+ 0 0 0.000ms 0.000000ms 0.000000ms
+ COMPACT count delay total delay average delay max delay min
+ 0 0 0.000ms 0.000000ms 0.000000ms
+ WPCOPY count delay total delay average delay max delay min
+ 156 11215873 0.072ms 0.207403ms 0.033913ms
+ IRQ count delay total delay average delay max delay min
+ 0 0 0.000ms 0.000000ms 0.000000ms
+
+Get IO accounting for pid 1, it works only with -p::
+
+ # ./getdelays -i -p 1
+ printing IO accounting
+ linuxrc: read=65536, write=0, cancelled_write=0
+
+The above command can be used with -v to get more debug information.
+
+After the system starts, use `delaytop` to get the system-wide delay information,
+which includes system-wide PSI information and Top-N high-latency tasks.
+Note: PSI support requires `CONFIG_PSI=y` and `psi=1` for full functionality.
+
+`delaytop` is an interactive tool for monitoring system pressure and task delays.
+It supports multiple sorting options, display modes, and real-time keyboard controls.
+
+Basic usage with default settings (sorts by CPU delay, shows top 20 tasks, refreshes every 2 seconds)::
+
+ bash# ./delaytop
+ System Pressure Information: (avg10/avg60vg300/total)
+ CPU some: 0.0%/ 0.0%/ 0.0%/ 106137(ms)
+ CPU full: 0.0%/ 0.0%/ 0.0%/ 0(ms)
+ Memory full: 0.0%/ 0.0%/ 0.0%/ 0(ms)
+ Memory some: 0.0%/ 0.0%/ 0.0%/ 0(ms)
+ IO full: 0.0%/ 0.0%/ 0.0%/ 2240(ms)
+ IO some: 0.0%/ 0.0%/ 0.0%/ 2783(ms)
+ IRQ full: 0.0%/ 0.0%/ 0.0%/ 0(ms)
+ [o]sort [M]memverbose [q]quit
+ Top 20 processes (sorted by cpu delay):
+ PID TGID COMMAND CPU(ms) IO(ms) IRQ(ms) MEM(ms)
+ ------------------------------------------------------------------------
+ 110 110 kworker/15:0H-s 27.91 0.00 0.00 0.00
+ 57 57 cpuhp/7 3.18 0.00 0.00 0.00
+ 99 99 cpuhp/14 2.97 0.00 0.00 0.00
+ 51 51 cpuhp/6 0.90 0.00 0.00 0.00
+ 44 44 kworker/4:0H-sy 0.80 0.00 0.00 0.00
+ 60 60 ksoftirqd/7 0.74 0.00 0.00 0.00
+ 76 76 idle_inject/10 0.31 0.00 0.00 0.00
+ 100 100 idle_inject/14 0.30 0.00 0.00 0.00
+ 1309 1309 systemsettings 0.29 0.00 0.00 0.00
+ 45 45 cpuhp/5 0.22 0.00 0.00 0.00
+ 63 63 cpuhp/8 0.20 0.00 0.00 0.00
+ 87 87 cpuhp/12 0.18 0.00 0.00 0.00
+ 93 93 cpuhp/13 0.17 0.00 0.00 0.00
+ 1265 1265 acpid 0.17 0.00 0.00 0.00
+ 1552 1552 sshd 0.17 0.00 0.00 0.00
+ 2584 2584 sddm-helper 0.16 0.00 0.00 0.00
+ 1284 1284 rtkit-daemon 0.15 0.00 0.00 0.00
+ 1326 1326 nde-netfilter 0.14 0.00 0.00 0.00
+ 27 27 cpuhp/2 0.13 0.00 0.00 0.00
+ 631 631 kworker/11:2-rc 0.11 0.00 0.00 0.00
+
+Interactive keyboard controls during runtime::
+
+ o - Select sort field (CPU, IO, IRQ, Memory, etc.)
+ M - Toggle display mode (Default/Memory Verbose)
+ q - Quit
+
+Available sort fields(use -s/--sort or interactive command)::
+
+ cpu(c) - CPU delay
+ blkio(i) - I/O delay
+ irq(q) - IRQ delay
+ mem(m) - Total memory delay
+ swapin(s) - Swapin delay (memory verbose mode only)
+ freepages(r) - Freepages reclaim delay (memory verbose mode only)
+ thrashing(t) - Thrashing delay (memory verbose mode only)
+ compact(p) - Compaction delay (memory verbose mode only)
+ wpcopy(w) - Write page copy delay (memory verbose mode only)
+
+Advanced usage examples::
+
+ # ./delaytop -s blkio
+ Sorted by IO delay
+
+ # ./delaytop -s mem -M
+ Sorted by memory delay in memory verbose mode
+
+ # ./delaytop -p pid
+ Print delayacct stats
+
+ # ./delaytop -P num
+ Display the top N tasks
+
+ # ./delaytop -n num
+ Set delaytop refresh frequency (num times)
+
+ # ./delaytop -d secs
+ Specify refresh interval as secs