Diffstat (limited to 'Documentation/trace')
-rw-r--r-- | Documentation/trace/boottime-trace.rst | 4
-rw-r--r-- | Documentation/trace/coresight/coresight-perf.rst | 31
-rw-r--r-- | Documentation/trace/coresight/panic.rst | 4
-rw-r--r-- | Documentation/trace/eprobetrace.rst | 269
-rw-r--r-- | Documentation/trace/ftrace-design.rst | 12
-rw-r--r-- | Documentation/trace/ftrace.rst | 13
-rw-r--r-- | Documentation/trace/histogram.rst | 2
-rw-r--r-- | Documentation/trace/index.rst | 99
-rw-r--r-- | Documentation/trace/rv/da_monitor_synthesis.rst | 147
-rw-r--r-- | Documentation/trace/rv/index.rst | 4
-rw-r--r-- | Documentation/trace/rv/linear_temporal_logic.rst | 134
-rw-r--r-- | Documentation/trace/rv/monitor_rtapp.rst | 133
-rw-r--r-- | Documentation/trace/rv/monitor_sched.rst | 307
-rw-r--r-- | Documentation/trace/rv/monitor_synthesis.rst | 271
-rw-r--r-- | Documentation/trace/tracepoints.rst | 17
15 files changed, 1225 insertions, 222 deletions
diff --git a/Documentation/trace/boottime-trace.rst b/Documentation/trace/boottime-trace.rst index d594597201fd..3efac10adb36 100644 --- a/Documentation/trace/boottime-trace.rst +++ b/Documentation/trace/boottime-trace.rst @@ -198,8 +198,8 @@ Most of the subsystems and architecture dependent drivers will be initialized after that (arch_initcall or subsys_initcall). Thus, you can trace those with boot-time tracing. If you want to trace events before core_initcall, you can use the options -starting with ``kernel``. Some of them will be enabled eariler than the initcall -processing (for example,. ``kernel.ftrace=function`` and ``kernel.trace_event`` +starting with ``kernel``. Some of them will be enabled earlier than the initcall +processing (for example, ``kernel.ftrace=function`` and ``kernel.trace_event`` will start before the initcall.) diff --git a/Documentation/trace/coresight/coresight-perf.rst b/Documentation/trace/coresight/coresight-perf.rst index d087aae7d492..30be89320621 100644 --- a/Documentation/trace/coresight/coresight-perf.rst +++ b/Documentation/trace/coresight/coresight-perf.rst @@ -78,6 +78,37 @@ enabled like:: Please refer to the kernel configuration help for more information. +Fine-grained tracing with AUX pause and resume +---------------------------------------------- + +Arm CoreSight may generate a large amount of hardware trace data, which +will lead to overhead in recording and distract users when reviewing +profiling result. To mitigate the issue of excessive trace data, Perf +provides AUX pause and resume functionality for fine-grained tracing. + +The AUX pause and resume can be triggered by associated events. These +events can be ftrace tracepoints (including static and dynamic +tracepoints) or PMU events (e.g. CPU PMU cycle event). To create a perf +session with AUX pause / resume, three configuration terms are +introduced: + +- "aux-action=start-paused": it is specified for the cs_etm PMU event to + launch in a paused state. 
+- "aux-action=pause": an associated event is specified with this term + to pause AUX trace. +- "aux-action=resume": an associated event is specified with this term + to resume AUX trace. + +Example for triggering AUX pause and resume with ftrace tracepoints:: + + perf record -e cs_etm/aux-action=start-paused/k,syscalls:sys_enter_openat/aux-action=resume/,syscalls:sys_exit_openat/aux-action=pause/ ls + +Example for triggering AUX pause and resume with PMU event:: + + perf record -a -e cs_etm/aux-action=start-paused/k \ + -e cycles/aux-action=pause,period=10000000/ \ + -e cycles/aux-action=resume,period=1050000/ -- sleep 1 + Perf test - Verify kernel and userspace perf CoreSight work ----------------------------------------------------------- diff --git a/Documentation/trace/coresight/panic.rst b/Documentation/trace/coresight/panic.rst index a58aa914c241..6e4bde953cae 100644 --- a/Documentation/trace/coresight/panic.rst +++ b/Documentation/trace/coresight/panic.rst @@ -67,8 +67,8 @@ Trace data captured at the time of panic, can be read from rebooted kernel or from crashdump kernel using a special device file /dev/crash_tmc_xxx. This device file is created only when there is a valid crashdata available. -General flow of trace capture and decode incase of kernel panic -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +General flow of trace capture and decode in case of kernel panic +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 1. Enable source and sink on all the cores using the sysfs interface. ETR sinks should have trace buffers allocated from reserved memory, by selecting "resrv" buffer mode from sysfs. diff --git a/Documentation/trace/eprobetrace.rst b/Documentation/trace/eprobetrace.rst new file mode 100644 index 000000000000..89b5157cfab8 --- /dev/null +++ b/Documentation/trace/eprobetrace.rst @@ -0,0 +1,269 @@ +.. 
SPDX-License-Identifier: GPL-2.0 + +================================== +Eprobe - Event-based Probe Tracing +================================== + +:Author: Steven Rostedt <rostedt@goodmis.org> + +- Written for v6.17 + +Overview +======== + +Eprobes are dynamic events that are placed on existing events to either +dereference a field that is a pointer, or simply to limit what fields are +recorded in the trace event. + +Eprobes depend on kprobe events so to enable this feature, build your kernel +with CONFIG_EPROBE_EVENTS=y. + +Eprobes are created via the /sys/kernel/tracing/dynamic_events file. + +Synopsis of eprobe_events +------------------------- +:: + + e[:[EGRP/][EEVENT]] GRP.EVENT [FETCHARGS] : Set a probe + -:[EGRP/][EEVENT] : Clear a probe + + EGRP : Group name of the new event. If omitted, use "eprobes" for it. + EEVENT : Event name. If omitted, the event name is generated and will + be the same event name as the event it attached to. + GRP : Group name of the event to attach to. + EVENT : Event name of the event to attach to. + + FETCHARGS : Arguments. Each probe can have up to 128 args. + $FIELD : Fetch the value of the event field called FIELD. + @ADDR : Fetch memory at ADDR (ADDR should be in kernel) + @SYM[+|-offs] : Fetch memory at SYM +|- offs (SYM should be a data symbol) + $comm : Fetch current task comm. + +|-[u]OFFS(FETCHARG) : Fetch memory at FETCHARG +|- OFFS address.(\*3)(\*4) + \IMM : Store an immediate value to the argument. + NAME=FETCHARG : Set NAME as the argument name of FETCHARG. + FETCHARG:TYPE : Set TYPE as the type of FETCHARG. Currently, basic types + (u8/u16/u32/u64/s8/s16/s32/s64), hexadecimal types + (x8/x16/x32/x64), VFS layer common type(%pd/%pD), "char", + "string", "ustring", "symbol", "symstr" and "bitfield" are + supported. + +Types +----- +The FETCHARGS above is very similar to the kprobe events as described in +Documentation/trace/kprobetrace.rst. 
+ +The difference between eprobes and kprobes FETCHARGS is that eprobes has a +$FIELD command that returns the content of the event field of the event +that is attached. Eprobes do not have access to registers, stacks and function +arguments that kprobes has. + +If a field argument is a pointer, it may be dereferenced just like a memory +address using the FETCHARGS syntax. + + +Attaching to dynamic events +--------------------------- + +Eprobes may attach to dynamic events as well as to normal events. It may +attach to a kprobe event, a synthetic event or a fprobe event. This is useful +if the type of a field needs to be changed. See Example 2 below. + +Usage examples +============== + +Example 1 +--------- + +The basic usage of eprobes is to limit the data that is being recorded into +the tracing buffer. For example, a common event to trace is the sched_switch +trace event. That has a format of:: + + field:unsigned short common_type; offset:0; size:2; signed:0; + field:unsigned char common_flags; offset:2; size:1; signed:0; + field:unsigned char common_preempt_count; offset:3; size:1; signed:0; + field:int common_pid; offset:4; size:4; signed:1; + + field:char prev_comm[16]; offset:8; size:16; signed:0; + field:pid_t prev_pid; offset:24; size:4; signed:1; + field:int prev_prio; offset:28; size:4; signed:1; + field:long prev_state; offset:32; size:8; signed:1; + field:char next_comm[16]; offset:40; size:16; signed:0; + field:pid_t next_pid; offset:56; size:4; signed:1; + field:int next_prio; offset:60; size:4; signed:1; + +The first four fields are common to all events and can not be limited. But the +rest of the event has 60 bytes of information. It records the names of the +previous and next tasks being scheduled out and in, as well as their pids and +priorities. It also records the state of the previous task. If only the pids +of the tasks are of interest, why waste the ring buffer with all the other +fields? + +An eprobe can limit what gets recorded. 
Note, it does not help in performance, +as all the fields are recorded in a temporary buffer to process the eprobe. +:: + + # echo 'e:sched/switch sched.sched_switch prev=$prev_pid:u32 next=$next_pid:u32' >> /sys/kernel/tracing/dynamic_events + # echo 1 > /sys/kernel/tracing/events/sched/switch/enable + # cat /sys/kernel/tracing/trace + + # tracer: nop + # + # entries-in-buffer/entries-written: 2721/2721 #P:8 + # + # _-----=> irqs-off/BH-disabled + # / _----=> need-resched + # | / _---=> hardirq/softirq + # || / _--=> preempt-depth + # ||| / _-=> migrate-disable + # |||| / delay + # TASK-PID CPU# ||||| TIMESTAMP FUNCTION + # | | | ||||| | | + sshd-session-1082 [004] d..4. 5041.239906: switch: (sched.sched_switch) prev=1082 next=0 + bash-1085 [001] d..4. 5041.240198: switch: (sched.sched_switch) prev=1085 next=141 + kworker/u34:5-141 [001] d..4. 5041.240259: switch: (sched.sched_switch) prev=141 next=1085 + <idle>-0 [004] d..4. 5041.240354: switch: (sched.sched_switch) prev=0 next=1082 + bash-1085 [001] d..4. 5041.240385: switch: (sched.sched_switch) prev=1085 next=141 + kworker/u34:5-141 [001] d..4. 5041.240410: switch: (sched.sched_switch) prev=141 next=1085 + bash-1085 [001] d..4. 5041.240478: switch: (sched.sched_switch) prev=1085 next=0 + sshd-session-1082 [004] d..4. 5041.240526: switch: (sched.sched_switch) prev=1082 next=0 + <idle>-0 [001] d..4. 5041.247524: switch: (sched.sched_switch) prev=0 next=90 + <idle>-0 [002] d..4. 5041.247545: switch: (sched.sched_switch) prev=0 next=16 + kworker/1:1-90 [001] d..4. 5041.247580: switch: (sched.sched_switch) prev=90 next=0 + rcu_sched-16 [002] d..4. 5041.247591: switch: (sched.sched_switch) prev=16 next=0 + <idle>-0 [002] d..4. 5041.257536: switch: (sched.sched_switch) prev=0 next=16 + rcu_sched-16 [002] d..4. 5041.257573: switch: (sched.sched_switch) prev=16 next=0 + +Note, without adding the "u32" after the prev_pid and next_pid, the values +would default showing in hexadecimal. 
+ +Example 2 +--------- + +If a specific system call is to be recorded but the syscall events are not +enabled, the raw_syscalls events can still be used (system call +events are not normal events, but are created from the raw_syscalls events +within the kernel). In order to trace the openat system call, one can create +an event probe on top of the raw_syscalls event: +:: + + # cd /sys/kernel/tracing + # cat events/raw_syscalls/sys_enter/format + name: sys_enter + ID: 395 + format: + field:unsigned short common_type; offset:0; size:2; signed:0; + field:unsigned char common_flags; offset:2; size:1; signed:0; + field:unsigned char common_preempt_count; offset:3; size:1; signed:0; + field:int common_pid; offset:4; size:4; signed:1; + + field:long id; offset:8; size:8; signed:1; + field:unsigned long args[6]; offset:16; size:48; signed:0; + + print fmt: "NR %ld (%lx, %lx, %lx, %lx, %lx, %lx)", REC->id, REC->args[0], REC->args[1], REC->args[2], REC->args[3], REC->args[4], REC->args[5] + +From the source code, the sys_openat() has: +:: + + int sys_openat(int dirfd, const char *path, int flags, mode_t mode) + { + return my_syscall4(__NR_openat, dirfd, path, flags, mode); + } + +The path is the second parameter, and that is what is wanted. +:: + + # echo 'e:openat raw_syscalls.sys_enter nr=$id filename=+8($args):ustring' >> dynamic_events + +This is being run on x86_64 where the word size is 8 bytes and the openat +system call __NR_openat is set at 257. +:: + + # echo 'nr == 257' > events/eprobes/openat/filter + +Now enable the event and look at the trace. +:: + + # echo 1 > events/eprobes/openat/enable + # cat trace + + # tracer: nop + # + # entries-in-buffer/entries-written: 4/4 #P:8 + # + # _-----=> irqs-off/BH-disabled + # / _----=> need-resched + # | / _---=> hardirq/softirq + # || / _--=> preempt-depth + # ||| / _-=> migrate-disable + # |||| / delay + # TASK-PID CPU# ||||| TIMESTAMP FUNCTION + # | | | ||||| | | + cat-1298 [003] ...2. 
2060.875970: openat: (raw_syscalls.sys_enter) nr=0x101 filename=(fault) + cat-1298 [003] ...2. 2060.876197: openat: (raw_syscalls.sys_enter) nr=0x101 filename=(fault) + cat-1298 [003] ...2. 2060.879126: openat: (raw_syscalls.sys_enter) nr=0x101 filename=(fault) + cat-1298 [003] ...2. 2060.879639: openat: (raw_syscalls.sys_enter) nr=0x101 filename=(fault) + +The filename shows "(fault)". This is likely because the filename has not been +pulled into memory yet and currently trace events cannot fault in memory that +is not present. When an eprobe tries to read memory that has not been faulted +in yet, it will show the "(fault)" text. + +To get around this, note that the kernel will likely pull in this filename and +make it present by the time the system call returns. A synthetic event can pass +the address of the filename from the entry of the system call to its exit, where +it can be used to show the filename. + +Remove the old eprobe (it must be disabled first):: + + # echo 0 > events/eprobes/openat/enable + # echo '-:openat' >> dynamic_events + +This time make an eprobe where the address of the filename is saved:: + + # echo 'e:openat_start raw_syscalls.sys_enter nr=$id filename=+8($args):x64' >> dynamic_events + +Create a synthetic event that passes the address of the filename to the +end of the event:: + + # echo 's:filename u64 file' >> dynamic_events + # echo 'hist:keys=common_pid:f=filename if nr == 257' > events/eprobes/openat_start/trigger + # echo 'hist:keys=common_pid:file=$f:onmatch(eprobes.openat_start).trace(filename,$file) if id == 257' > events/raw_syscalls/sys_exit/trigger + +Now that the address of the filename has been passed to the end of the +system call, create another eprobe to attach to the exit event to show the +string:: + + # echo 'e:openat synthetic.filename filename=+0($file):ustring' >> dynamic_events + # echo 1 > events/eprobes/openat/enable + # cat trace + + # tracer: nop + # + # entries-in-buffer/entries-written: 4/4 #P:8 + # + # _-----=> 
irqs-off/BH-disabled + # / _----=> need-resched + # | / _---=> hardirq/softirq + # || / _--=> preempt-depth + # ||| / _-=> migrate-disable + # |||| / delay + # TASK-PID CPU# ||||| TIMESTAMP FUNCTION + # | | | ||||| | | + cat-1331 [001] ...5. 2944.787977: openat: (synthetic.filename) filename="/etc/ld.so.cache" + cat-1331 [001] ...5. 2944.788480: openat: (synthetic.filename) filename="/lib/x86_64-linux-gnu/libc.so.6" + cat-1331 [001] ...5. 2944.793426: openat: (synthetic.filename) filename="/usr/lib/locale/locale-archive" + cat-1331 [001] ...5. 2944.831362: openat: (synthetic.filename) filename="trace" + +Example 3 +--------- + +If syscall trace events are available, the above would not need the first +eprobe, but it would still need the last one:: + + # echo 's:filename u64 file' >> dynamic_events + # echo 'hist:keys=common_pid:f=filename' > events/syscalls/sys_enter_openat/trigger + # echo 'hist:keys=common_pid:file=$f:onmatch(syscalls.sys_enter_openat).trace(filename,$file)' > events/syscalls/sys_exit_openat/trigger + # echo 'e:openat synthetic.filename filename=+0($file):ustring' >> dynamic_events + # echo 1 > events/eprobes/openat/enable + +And this would produce the same result as Example 2. diff --git a/Documentation/trace/ftrace-design.rst b/Documentation/trace/ftrace-design.rst index dc82d64b3a44..8f4fab3f9324 100644 --- a/Documentation/trace/ftrace-design.rst +++ b/Documentation/trace/ftrace-design.rst @@ -238,19 +238,15 @@ You need very few things to get the syscalls tracing in an arch. - Tag this arch as HAVE_SYSCALL_TRACEPOINTS. -HAVE_FTRACE_MCOUNT_RECORD -------------------------- +HAVE_DYNAMIC_FTRACE +------------------- See scripts/recordmcount.pl for more info. Just fill in the arch-specific details for how to locate the addresses of mcount call sites via objdump. This option doesn't make much sense without also implementing dynamic ftrace. 
- -HAVE_DYNAMIC_FTRACE -------------------- - -You will first need HAVE_FTRACE_MCOUNT_RECORD and HAVE_FUNCTION_TRACER, so -scroll your reader back up if you got over eager. +You will first need HAVE_FUNCTION_TRACER, so scroll your reader back up if you +got over eager. Once those are out of the way, you will need to implement: - asm/ftrace.h: diff --git a/Documentation/trace/ftrace.rst b/Documentation/trace/ftrace.rst index c9e88bf65709..af66a05e18cc 100644 --- a/Documentation/trace/ftrace.rst +++ b/Documentation/trace/ftrace.rst @@ -1205,6 +1205,19 @@ Here are the available options: default instance. The only way the top level instance has this flag cleared, is by it being set in another instance. + copy_trace_marker + If there are applications that hard code writing into the top level + trace_marker file (/sys/kernel/tracing/trace_marker or trace_marker_raw), + and the tooling would like it to go into an instance, this option can + be used. Create an instance and set this option, and then all writes + into the top level trace_marker file will also be redirected into this + instance. + + Note, by default this option is set for the top level instance. If it + is disabled, then writes to the trace_marker or trace_marker_raw files + will not be written into the top level file. If no instance has this + option set, then a write will error with the errno of ENODEV. + annotate It is sometimes confusing when the CPU buffers are full and one CPU buffer had a lot of events recently, thus diff --git a/Documentation/trace/histogram.rst b/Documentation/trace/histogram.rst index 0aada18c38c6..2b98c1720a54 100644 --- a/Documentation/trace/histogram.rst +++ b/Documentation/trace/histogram.rst @@ -249,7 +249,7 @@ Extended error information table, it should keep a running total of the number of bytes requested by that call_site. 
- We'll let it run for awhile and then dump the contents of the 'hist' + We'll let it run for a while and then dump the contents of the 'hist' file in the kmalloc event's subdirectory (for readability, a number of entries have been omitted):: diff --git a/Documentation/trace/index.rst b/Documentation/trace/index.rst index 2c991dc96ace..b4a429dc4f7a 100644 --- a/Documentation/trace/index.rst +++ b/Documentation/trace/index.rst @@ -1,39 +1,104 @@ -========================== -Linux Tracing Technologies -========================== +================================ +Linux Tracing Technologies Guide +================================ + +Tracing in the Linux kernel is a powerful mechanism that allows +developers and system administrators to analyze and debug system +behavior. This guide provides documentation on various tracing +frameworks and tools available in the Linux kernel. + +Introduction to Tracing +----------------------- + +This section provides an overview of Linux tracing mechanisms +and debugging approaches. .. toctree:: - :maxdepth: 2 + :maxdepth: 1 - ftrace-design + debugging + tracepoints tracepoint-analysis + ring-buffer-map + +Core Tracing Frameworks +----------------------- + +The following are the primary tracing frameworks integrated into +the Linux kernel. + +.. toctree:: + :maxdepth: 1 + ftrace + ftrace-design ftrace-uses - fprobe kprobes kprobetrace - uprobetracer fprobetrace - tracepoints + eprobetrace + fprobe + ring-buffer-design + +Event Tracing and Analysis +-------------------------- + +A detailed explanation of event tracing mechanisms and their +applications. + +.. 
toctree:: + :maxdepth: 1 + events events-kmem events-power events-nmi events-msr - mmiotrace + boottime-trace histogram histogram-design - boottime-trace - debugging - hwlat_detector - osnoise-tracer - timerlat-tracer + +Hardware and Performance Tracing +-------------------------------- + +This section covers tracing features that monitor hardware +interactions and system performance. + +.. toctree:: + :maxdepth: 1 + intel_th - ring-buffer-design - ring-buffer-map stm sys-t coresight/index - user_events rv/index hisi-ptt + mmiotrace + hwlat_detector + osnoise-tracer + timerlat-tracer + +User-Space Tracing +------------------ + +These tools allow tracing user-space applications and +interactions. + +.. toctree:: + :maxdepth: 1 + + user_events + uprobetracer + +Additional Resources +-------------------- + +For more details, refer to the respective documentation of each +tracing tool and framework. + +.. only:: subproject and html + + Indices + ======= + + * :ref:`genindex` diff --git a/Documentation/trace/rv/da_monitor_synthesis.rst b/Documentation/trace/rv/da_monitor_synthesis.rst deleted file mode 100644 index 0a92729c8a9b..000000000000 --- a/Documentation/trace/rv/da_monitor_synthesis.rst +++ /dev/null @@ -1,147 +0,0 @@ -Deterministic Automata Monitor Synthesis -======================================== - -The starting point for the application of runtime verification (RV) techniques -is the *specification* or *modeling* of the desired (or undesired) behavior -of the system under scrutiny. - -The formal representation needs to be then *synthesized* into a *monitor* -that can then be used in the analysis of the trace of the system. The -*monitor* connects to the system via an *instrumentation* that converts -the events from the *system* to the events of the *specification*. - - -In Linux terms, the runtime verification monitors are encapsulated inside -the *RV monitor* abstraction. 
The RV monitor includes a set of instances -of the monitor (per-cpu monitor, per-task monitor, and so on), the helper -functions that glue the monitor to the system reference model, and the -trace output as a reaction to event parsing and exceptions, as depicted -below:: - - Linux +----- RV Monitor ----------------------------------+ Formal - Realm | | Realm - +-------------------+ +----------------+ +-----------------+ - | Linux kernel | | Monitor | | Reference | - | Tracing | -> | Instance(s) | <- | Model | - | (instrumentation) | | (verification) | | (specification) | - +-------------------+ +----------------+ +-----------------+ - | | | - | V | - | +----------+ | - | | Reaction | | - | +--+--+--+-+ | - | | | | | - | | | +-> trace output ? | - +------------------------|--|----------------------+ - | +----> panic ? - +-------> <user-specified> - -DA monitor synthesis --------------------- - -The synthesis of automata-based models into the Linux *RV monitor* abstraction -is automated by the dot2k tool and the rv/da_monitor.h header file that -contains a set of macros that automatically generate the monitor's code. - -dot2k ------ - -The dot2k utility leverages dot2c by converting an automaton model in -the DOT format into the C representation [1] and creating the skeleton of -a kernel monitor in C. - -For example, it is possible to transform the wip.dot model present in -[1] into a per-cpu monitor with the following command:: - - $ dot2k -d wip.dot -t per_cpu - -This will create a directory named wip/ with the following files: - -- wip.h: the wip model in C -- wip.c: the RV monitor - -The wip.c file contains the monitor declaration and the starting point for -the system instrumentation. - -Monitor macros --------------- - -The rv/da_monitor.h enables automatic code generation for the *Monitor -Instance(s)* using C macros. 
- -The benefits of the usage of macro for monitor synthesis are 3-fold as it: - -- Reduces the code duplication; -- Facilitates the bug fix/improvement; -- Avoids the case of developers changing the core of the monitor code - to manipulate the model in a (let's say) non-standard way. - -This initial implementation presents three different types of monitor instances: - -- ``#define DECLARE_DA_MON_GLOBAL(name, type)`` -- ``#define DECLARE_DA_MON_PER_CPU(name, type)`` -- ``#define DECLARE_DA_MON_PER_TASK(name, type)`` - -The first declares the functions for a global deterministic automata monitor, -the second for monitors with per-cpu instances, and the third with per-task -instances. - -In all cases, the 'name' argument is a string that identifies the monitor, and -the 'type' argument is the data type used by dot2k on the representation of -the model in C. - -For example, the wip model with two states and three events can be -stored in an 'unsigned char' type. Considering that the preemption control -is a per-cpu behavior, the monitor declaration in the 'wip.c' file is:: - - DECLARE_DA_MON_PER_CPU(wip, unsigned char); - -The monitor is executed by sending events to be processed via the functions -presented below:: - - da_handle_event_$(MONITOR_NAME)($(event from event enum)); - da_handle_start_event_$(MONITOR_NAME)($(event from event enum)); - da_handle_start_run_event_$(MONITOR_NAME)($(event from event enum)); - -The function ``da_handle_event_$(MONITOR_NAME)()`` is the regular case where -the event will be processed if the monitor is processing events. - -When a monitor is enabled, it is placed in the initial state of the automata. -However, the monitor does not know if the system is in the *initial state*. - -The ``da_handle_start_event_$(MONITOR_NAME)()`` function is used to notify the -monitor that the system is returning to the initial state, so the monitor can -start monitoring the next event. 
- -The ``da_handle_start_run_event_$(MONITOR_NAME)()`` function is used to notify -the monitor that the system is known to be in the initial state, so the -monitor can start monitoring and monitor the current event. - -Using the wip model as example, the events "preempt_disable" and -"sched_waking" should be sent to monitor, respectively, via [2]:: - - da_handle_event_wip(preempt_disable_wip); - da_handle_event_wip(sched_waking_wip); - -While the event "preempt_enabled" will use:: - - da_handle_start_event_wip(preempt_enable_wip); - -To notify the monitor that the system will be returning to the initial state, -so the system and the monitor should be in sync. - -Final remarks -------------- - -With the monitor synthesis in place using the rv/da_monitor.h and -dot2k, the developer's work should be limited to the instrumentation -of the system, increasing the confidence in the overall approach. - -[1] For details about deterministic automata format and the translation -from one representation to another, see:: - - Documentation/trace/rv/deterministic_automata.rst - -[2] dot2k appends the monitor's name suffix to the events enums to -avoid conflicting variables when exporting the global vmlinux.h -use by BPF programs. 
diff --git a/Documentation/trace/rv/index.rst b/Documentation/trace/rv/index.rst index e80e0057feb4..a2812ac5cfeb 100644 --- a/Documentation/trace/rv/index.rst +++ b/Documentation/trace/rv/index.rst @@ -8,8 +8,10 @@ Runtime Verification runtime-verification.rst deterministic_automata.rst - da_monitor_synthesis.rst + linear_temporal_logic.rst + monitor_synthesis.rst da_monitor_instrumentation.rst monitor_wip.rst monitor_wwnr.rst monitor_sched.rst + monitor_rtapp.rst diff --git a/Documentation/trace/rv/linear_temporal_logic.rst b/Documentation/trace/rv/linear_temporal_logic.rst new file mode 100644 index 000000000000..9eee09d9cacf --- /dev/null +++ b/Documentation/trace/rv/linear_temporal_logic.rst @@ -0,0 +1,134 @@ +Linear temporal logic +===================== + +Introduction +------------ + +Runtime verification monitor is a verification technique which checks that the +kernel follows a specification. It does so by using tracepoints to monitor the +kernel's execution trace, and verifying that the execution trace satisfies the +specification. + +Initially, the specification could only be written in the form of a deterministic +automaton (DA). However, while attempting to implement DA monitors for some +complex specifications, a deterministic automaton was found to be inappropriate as +the specification language. The automaton is complicated, hard to understand, +and error-prone. + +Thus, RV monitors based on linear temporal logic (LTL) are introduced. This type +of monitor uses LTL as specification instead of DA. For some cases, writing the +specification as LTL is more concise and intuitive. + +Many materials explain LTL in detail. One book is:: + + Christel Baier and Joost-Pieter Katoen: Principles of Model Checking, The MIT + Press, 2008. + +Grammar +------- + +Unlike some existing syntax, the kernel's implementation of LTL is more verbose. +This is motivated by considering that the people who read the LTL specifications +may not be well-versed in LTL. 
+ +Grammar: + ltl ::= opd | ( ltl ) | ltl binop ltl | unop ltl + +Operands (opd): + true, false, user-defined names consisting of upper-case characters, digits, + and underscore. + +Unary Operators (unop): + always + eventually + next + not + +Binary Operators (binop): + until + and + or + imply + equivalent + +This grammar is ambiguous: operator precedence is not defined. Parentheses must +be used. + +Example linear temporal logic +----------------------------- +.. code-block:: + + RAIN imply (GO_OUTSIDE imply HAVE_UMBRELLA) + +means: if it is raining, going outside means having an umbrella. + +.. code-block:: + + RAIN imply (WET until not RAIN) + +means: if it is raining, it is going to be wet until the rain stops. + +.. code-block:: + + RAIN imply eventually not RAIN + +means: if it is raining, rain will eventually stop. + +The above examples are referring to the current time instance only. For kernel +verification, the `always` operator is usually desirable, to specify that +something is always true at the present and for all future. For example:: + + always (RAIN imply eventually not RAIN) + +means: *all* rain eventually stops. + +In the above examples, `RAIN`, `GO_OUTSIDE`, `HAVE_UMBRELLA` and `WET` are the +"atomic propositions". + +Monitor synthesis +----------------- + +To synthesize an LTL into a kernel monitor, the `rvgen` tool can be used: +`tools/verification/rvgen`. The specification needs to be provided as a file, +and it must have a "RULE = LTL" assignment. For example:: + + RULE = always (ACQUIRE imply ((not KILLED and not CRASHED) until RELEASE)) + +which says: if `ACQUIRE`, then `RELEASE` must happen before `KILLED` or +`CRASHED`. + +The LTL can be broken down using sub-expressions. The above is equivalent to: + + .. 
code-block:: + + RULE = always (ACQUIRE imply (ALIVE until RELEASE)) + ALIVE = not KILLED and not CRASHED + +From this specification, `rvgen` generates the C implementation of a Buchi +automaton - a non-deterministic state machine which checks the satisfiability of +the LTL. See Documentation/trace/rv/monitor_synthesis.rst for details on using +`rvgen`. + +References +---------- + +One book covering model checking and linear temporal logic is:: + + Christel Baier and Joost-Pieter Katoen: Principles of Model Checking, The MIT + Press, 2008. + +For an example of using linear temporal logic in software testing, see:: + + Ruijie Meng, Zhen Dong, Jialin Li, Ivan Beschastnikh, and Abhik Roychoudhury. + 2022. Linear-time temporal logic guided greybox fuzzing. In Proceedings of the + 44th International Conference on Software Engineering (ICSE '22). Association + for Computing Machinery, New York, NY, USA, 1343–1355. + https://doi.org/10.1145/3510003.3510082 + +The kernel's LTL monitor implementation is based on:: + + Gerth, R., Peled, D., Vardi, M.Y., Wolper, P. (1996). Simple On-the-fly + Automatic Verification of Linear Temporal Logic. In: Dembiński, P., Średniawa, + M. (eds) Protocol Specification, Testing and Verification XV. PSTV 1995. IFIP + Advances in Information and Communication Technology. Springer, Boston, MA. + https://doi.org/10.1007/978-0-387-34892-6_1 diff --git a/Documentation/trace/rv/monitor_rtapp.rst b/Documentation/trace/rv/monitor_rtapp.rst new file mode 100644 index 000000000000..c8104eda924a --- /dev/null +++ b/Documentation/trace/rv/monitor_rtapp.rst @@ -0,0 +1,133 @@ +Real-time application monitors +============================== + +- Name: rtapp +- Type: container for multiple monitors +- Author: Nam Cao <namcao@linutronix.de> + +Description +----------- + +Real-time applications may have design flaws such that they experience +unexpected latency and fail to meet their time requirements. 
Often, these flaws +follow a few patterns: + + - Page faults: A real-time thread may access memory that does not have a + mapped physical backing or must first be copied (such as for copy-on-write). + Thus a page fault is raised and the kernel must first perform the expensive + action. This causes significant delays to the real-time thread. + - Priority inversion: A real-time thread blocks waiting for a lower-priority + thread. This causes the real-time thread to effectively take on the + scheduling priority of the lower-priority thread. For example, the real-time + thread needs to access a shared resource that is protected by a + non-pi-mutex, but the mutex is currently owned by a non-real-time thread. + +The `rtapp` monitor detects these patterns. It helps developers identify +reasons for unexpected latency with real-time applications. It is a container of +multiple sub-monitors described in the following sections. + +Monitor pagefault ++++++++++++++++++ + +The `pagefault` monitor reports real-time tasks raising page faults. Its +specification is:: + + RULE = always (RT imply not PAGEFAULT) + +To fix warnings reported by this monitor, `mlockall()` or `mlock()` can be used +to ensure physical backing for memory. + +This monitor may have false negatives because the pages used by the real-time +threads may just happen to be directly available during testing. To minimize +this, the system can be put under memory pressure (e.g. invoking the OOM killer +using a program that does `ptr = malloc(SIZE_OF_RAM); memset(ptr, 0, +SIZE_OF_RAM);`) so that the kernel executes aggressive strategies to recycle as +much physical memory as possible. + +Monitor sleep ++++++++++++++ + +The `sleep` monitor reports real-time threads sleeping in a manner that may +cause undesirable latency. Real-time applications should only put a real-time +thread to sleep for one of the following reasons: + + - Cyclic work: real-time thread sleeps waiting for the next cycle. 
For this + case, only the `clock_nanosleep` syscall should be used with `TIMER_ABSTIME` + (to avoid time drift) and `CLOCK_MONOTONIC` (to avoid the clock being + changed). No other method is safe for real-time. For example, threads + waiting for timerfd can be woken by softirq which provides no real-time + guarantee. + - Real-time thread waiting for something to happen (e.g. another thread + releasing shared resources, or a completion signal from another thread). In + this case, only futexes (FUTEX_LOCK_PI, FUTEX_LOCK_PI2 or one of + FUTEX_WAIT_*) should be used. Applications usually do not use futexes + directly, but use PI mutexes and PI condition variables which are built on + top of futexes. Be aware that the C library might not implement condition + variables as safe for real-time. As an alternative, the librtpi library + exists to provide a condition variable implementation that is correct for + real-time applications in Linux. + +Besides the reason for sleeping, the eventual waker should also be +real-time-safe. Namely, one of: + + - An equal-or-higher-priority thread + - Hard interrupt handler + - Non-maskable interrupt handler + +This monitor's warning usually means one of the following: + + - Real-time thread is blocked by a non-real-time thread (e.g. due to + contention on a mutex without priority inheritance). This is priority + inversion. + - Time-critical work waits for something which is not safe for real-time (e.g. + timerfd). + - The work executed by the real-time thread does not need to run at real-time + priority at all. This is not a problem for the real-time thread itself, but + it is potentially taking the CPU away from other important real-time work. + +Application developers may purposely choose to have their real-time application +sleep in a way that is not safe for real-time. It is debatable whether that is a +problem. Application developers must analyze the warnings to make a proper +assessment. 
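The cyclic-work case listed above can be sketched in userspace as follows. This is only an illustration and not part of the monitor: the ``sketch_`` names are hypothetical, and the example assumes a Linux system with a C library that exposes ``clock_nanosleep()``. It sleeps until an absolute ``CLOCK_MONOTONIC`` deadline, so a late wakeup in one cycle does not shift the following cycles.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <time.h>

#define SKETCH_NSEC_PER_SEC 1000000000L

/* Hypothetical helper: advance an absolute timestamp by less than one second. */
void sketch_timespec_add_ns(struct timespec *t, long ns)
{
	assert(ns >= 0 && ns < SKETCH_NSEC_PER_SEC);
	t->tv_nsec += ns;
	if (t->tv_nsec >= SKETCH_NSEC_PER_SEC) {
		t->tv_nsec -= SKETCH_NSEC_PER_SEC;
		t->tv_sec += 1;
	}
}

/*
 * Hypothetical cyclic loop: run `cycles` periods of `period_ns` each.
 * Returns 0 on success or an errno value on failure.
 */
int sketch_run_cycles(unsigned int cycles, long period_ns)
{
	struct timespec next;
	int ret;

	if (clock_gettime(CLOCK_MONOTONIC, &next))
		return errno;

	while (cycles--) {
		sketch_timespec_add_ns(&next, period_ns);
		/* clock_nanosleep() returns the error number directly. */
		do {
			ret = clock_nanosleep(CLOCK_MONOTONIC, TIMER_ABSTIME,
					      &next, NULL);
		} while (ret == EINTR);
		if (ret)
			return ret;
		/* The time-critical work for this cycle would go here. */
	}
	return 0;
}
```

For example, ``sketch_run_cycles(1000, 1000000)`` would run one thousand 1 ms cycles. A real application would presumably also select a real-time scheduling policy and lock its memory (see the `pagefault` monitor above); both are omitted here to keep the sketch short.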
+ +The monitor's specification is:: + + RULE = always ((RT and SLEEP) imply (RT_FRIENDLY_SLEEP or ALLOWLIST)) + + RT_FRIENDLY_SLEEP = (RT_VALID_SLEEP_REASON or KERNEL_THREAD) + and ((not WAKE) until RT_FRIENDLY_WAKE) + + RT_VALID_SLEEP_REASON = FUTEX_WAIT + or RT_FRIENDLY_NANOSLEEP + + RT_FRIENDLY_NANOSLEEP = CLOCK_NANOSLEEP + and NANOSLEEP_TIMER_ABSTIME + and NANOSLEEP_CLOCK_MONOTONIC + + RT_FRIENDLY_WAKE = WOKEN_BY_EQUAL_OR_HIGHER_PRIO + or WOKEN_BY_HARDIRQ + or WOKEN_BY_NMI + or KTHREAD_SHOULD_STOP + + ALLOWLIST = BLOCK_ON_RT_MUTEX + or FUTEX_LOCK_PI + or TASK_IS_RCU + or TASK_IS_MIGRATION + +Besides the scenarios described above, this specification also handles some +special cases: + + - `KERNEL_THREAD`: kernel tasks do not have any pattern that can be recognized + as valid real-time sleeping reasons. Therefore, the sleeping reason is not + checked for kernel tasks. + - `KTHREAD_SHOULD_STOP`: a non-real-time thread may stop a real-time kernel + thread by waking it and waiting for it to exit (`kthread_stop()`). This + wakeup is safe for real-time. + - `ALLOWLIST`: to handle known false positives with the kernel. + - `BLOCK_ON_RT_MUTEX` is included in the allowlist due to its implementation. + In the release path of rt_mutex, a boosted task is de-boosted before waking + the rt_mutex's waiter. Consequently, the monitor may see a real-time-unsafe + wakeup (e.g. non-real-time task waking real-time task). This is actually + real-time-safe because preemption is disabled for the duration. + - `FUTEX_LOCK_PI` is included in the allowlist for the same reason as + `BLOCK_ON_RT_MUTEX`. diff --git a/Documentation/trace/rv/monitor_sched.rst b/Documentation/trace/rv/monitor_sched.rst index 24b2c62a3bc2..3f8381ad9ec7 100644 --- a/Documentation/trace/rv/monitor_sched.rst +++ b/Documentation/trace/rv/monitor_sched.rst @@ -40,26 +40,6 @@ defined in by Daniel Bristot in [1]. 
Currently we included the following: -Monitor tss -~~~~~~~~~~~ - -The task switch while scheduling (tss) monitor ensures a task switch happens -only in scheduling context, that is inside a call to `__schedule`:: - - | - | - v - +-----------------+ - | thread | <+ - +-----------------+ | - | | - | schedule_entry | schedule_exit - v | - sched_switch | - +--------------- | - | sched | - +--------------> -+ - Monitor sco ~~~~~~~~~~~ @@ -144,26 +124,277 @@ does not enable preemption:: | scheduling_contex -+ -Monitor sncid -~~~~~~~~~~~~~ +Monitor sts +~~~~~~~~~~~ -The schedule not called with interrupt disabled (sncid) monitor ensures -schedule is not called with interrupt disabled:: +The schedule implies task switch (sts) monitor ensures a task switch happens +only in scheduling context and up to once, as well as scheduling occurs with +interrupts enabled but no task switch can happen before interrupts are +disabled. When the next task picked for execution is the same as the previously +running one, no real task switch occurs but interrupts are disabled nonetheless:: - | - | - v - schedule_entry +--------------+ - schedule_exit | | - +----------------- | can_sched | - | | | - +----------------> | | <+ - +--------------+ | - | | - | irq_disable | irq_enable - v | - | - cant_sched -+ + irq_entry | + +----+ | + v | v + +------------+ irq_enable #===================# irq_disable + | | ------------> H H irq_entry + | cant_sched | <------------ H H irq_enable + | | irq_disable H can_sched H --------------+ + +------------+ H H | + H H | + +---------------> H H <-------------+ + | #===================# + | | + schedule_exit | schedule_entry + | v + | +-------------------+ irq_enable + | | scheduling | <---------------+ + | +-------------------+ | + | | | + | | irq_disable +--------+ irq_entry + | v | | --------+ + | +-------------------+ irq_entry | in_irq | | + | | | -----------> | | <-------+ + | | disable_to_switch | +--------+ + | | | --+ + | +-------------------+ | + | | 
| + | | sched_switch | + | v | + | +-------------------+ | + | | switching | | irq_enable + | +-------------------+ | + | | | + | | irq_enable | + | v | + | +-------------------+ | + +-- | enable_to_exit | <-+ + +-------------------+ + ^ | irq_disable + | | irq_entry + +---------------+ irq_enable + +Monitor nrp +----------- + +The need resched preempts (nrp) monitor ensures preemption requires +``need_resched``. Only kernel preemption is considered, since preemption +while returning to userspace, for this monitor, is indistinguishable from +``sched_switch_yield`` (described in the sssw monitor). +A kernel preemption is whenever ``__schedule`` is called with the preemption +flag set to true (e.g. from preempt_enable or exiting from interrupts). This +type of preemption occurs after the need for ``rescheduling`` has been set. +This is not valid for the *lazy* variant of the flag, which causes only +userspace preemption. +A ``schedule_entry_preempt`` may involve a task switch or not, in the latter +case, a task goes through the scheduler from a preemption context but it is +picked as the next task to run. Since the scheduler runs, this clears the need +to reschedule. The ``any_thread_running`` state does not imply the monitored +task is not running as this monitor does not track the outcome of scheduling. + +In theory, a preemption can only occur after the ``need_resched`` flag is set. In +practice, however, it is possible to see a preemption where the flag is not +set. This can happen in one specific condition:: + + need_resched + preempt_schedule() + preempt_schedule_irq() + __schedule() + !need_resched + __schedule() + +In the situation above, standard preemption starts (e.g. from preempt_enable +when the flag is set), an interrupt occurs before scheduling and, on its exit +path, it schedules, which clears the ``need_resched`` flag. +When the preempted task runs again, the standard preemption started earlier +resumes, although the flag is no longer set. 
The monitor considers this a +``nested_preemption``, this allows another preemption without re-setting the +flag. This condition relaxes the monitor constraints and may catch false +negatives (i.e. no real ``nested_preemptions``) but makes the monitor more +robust and able to validate other scenarios. +For simplicity, the monitor starts in ``preempt_irq``, although no interrupt +occurred, as the situation above is hard to pinpoint:: + + schedule_entry + irq_entry #===========================================# + +-------------------------- H H + | H H + +-------------------------> H any_thread_running H + H H + +-------------------------> H H + | #===========================================# + | schedule_entry | ^ + | schedule_entry_preempt | sched_need_resched | schedule_entry + | | schedule_entry_preempt + | v | + | +----------------------+ | + | +--- | | | + | sched_need_resched | | rescheduling | -+ + | +--> | | + | +----------------------+ + | | irq_entry + | v + | +----------------------+ + | | | ---+ + | ---> | | | sched_need_resched + | | preempt_irq | | irq_entry + | | | <--+ + | | | <--+ + | +----------------------+ | + | | schedule_entry | sched_need_resched + | | schedule_entry_preempt | + | v | + | +-----------------------+ | + +-------------------------- | nested_preempt | --+ + +-----------------------+ + ^ irq_entry | + +-------------------+ + +Due to how the ``need_resched`` flag on the preemption count works on arm64, +this monitor is unstable on that architecture, as it often records preemption +when the flag is not set, even in presence of the workaround above. +For the time being, the monitor is disabled by default on arm64. + +Monitor sssw +------------ + +The set state sleep and wakeup (sssw) monitor ensures ``set_state`` to +sleepable leads to sleeping and sleeping tasks require wakeup. 
It includes the +following types of switch: + +* ``switch_suspend``: + a task puts itself to sleep, this can happen only after explicitly setting + the task to ``sleepable``. After a task is suspended, it needs to be woken up + (``waking`` state) before being switched in again. + Setting the task's state to ``sleepable`` can be reverted before switching if it + is woken up or set to ``runnable``. +* ``switch_blocking``: + a special case of a ``switch_suspend`` where the task is waiting on a + sleeping RT lock (``PREEMPT_RT`` only), it is common to see wakeup and set + state events racing with each other and this leads the model to perceive this + type of switch when the task is not set to sleepable. This is a limitation of + the model in SMP system and workarounds may slow down the system. +* ``switch_preempt``: + a task switch as a result of kernel preemption (``schedule_entry_preempt`` in + the nrp model). +* ``switch_yield``: + a task explicitly calls the scheduler or is preempted while returning to + userspace. It can happen after a ``yield`` system call, from the idle task or + if the ``need_resched`` flag is set. By definition, a task cannot yield while + ``sleepable`` as that would be a suspension. A special case of a yield occurs + when a task in ``TASK_INTERRUPTIBLE`` calls the scheduler while a signal is + pending. The task doesn't go through the usual blocking/waking and is set + back to runnable, the resulting switch (if there) looks like a yield to the + ``signal_wakeup`` state and is followed by the signal delivery. From this + state, the monitor expects a signal even if it sees a wakeup event, although + not necessary, to rule out false negatives. + +This monitor doesn't include a running state, ``sleepable`` and ``runnable`` +are only referring to the task's desired state, which could be scheduled out +(e.g. due to preemption). However, it does include the event +``sched_switch_in`` to represent when a task is allowed to become running. 
This +can be triggered also by preemption, but cannot occur after the task got to +``sleeping`` before a ``wakeup`` occurs:: + + +--------------------------------------------------------------------------+ + | | + | | + | switch_suspend | | + | switch_blocking | | + v v | + +----------+ #==========================# set_state_runnable | + | | H H wakeup | + | | H H switch_in | + | | H H switch_yield | + | sleeping | H H switch_preempt | + | | H H signal_deliver | + | | switch_ H H ------+ | + | | _blocking H runnable H | | + | | <----------- H H <-----+ | + +----------+ H H | + | wakeup H H | + +---------------------> H H | + H H | + +---------> H H | + | #==========================# | + | | ^ | + | | | set_state_runnable | + | | | wakeup | + | set_state_sleepable | +------------------------+ + | v | | + | +--------------------------+ set_state_sleepable + | | | switch_in + | | | switch_preempt + signal_deliver | sleepable | signal_deliver + | | | ------+ + | | | | + | | | <-----+ + | +--------------------------+ + | | ^ + | switch_yield | set_state_sleepable + | v | + | +---------------+ | + +---------- | signal_wakeup | -+ + +---------------+ + ^ | switch_in + | | switch_preempt + | | switch_yield + +-----------+ wakeup + +Monitor opid +------------ + +The operations with preemption and irq disabled (opid) monitor ensures +operations like ``wakeup`` and ``need_resched`` occur with interrupts and +preemption disabled or during interrupt context, in such case preemption may +not be disabled explicitly. 
+``need_resched`` can be set by some RCU internals functions, in which case it +doesn't match a task wakeup and might occur with only interrupts disabled:: + + | sched_need_resched + | sched_waking + | irq_entry + | +--------------------+ + v v | + +------------------------------------------------------+ + +----------- | disabled | <+ + | +------------------------------------------------------+ | + | | ^ | + | | preempt_disable sched_need_resched | + | preempt_enable | +--------------------+ | + | v | v | | + | +------------------------------------------------------+ | + | | irq_disabled | | + | +------------------------------------------------------+ | + | | | ^ | + | irq_entry irq_entry | | | + | sched_need_resched v | irq_disable | + | sched_waking +--------------+ | | | + | +----- | | irq_enable | | + | | | in_irq | | | | + | +----> | | | | | + | +--------------+ | | irq_disable + | | | | | + | irq_enable | irq_enable | | | + | v v | | + | #======================================================# | + | H enabled H | + | #======================================================# | + | | ^ ^ preempt_enable | | + | preempt_disable preempt_enable +--------------------+ | + | v | | + | +------------------+ | | + +----------> | preempt_disabled | -+ | + +------------------+ | + | | + +-------------------------------------------------------+ + +This monitor is designed to work on ``PREEMPT_RT`` kernels, the special case of +events occurring in interrupt context is a shortcut to identify valid scenarios +where the preemption tracepoints might not be visible, during interrupts +preemption is always disabled. On non- ``PREEMPT_RT`` kernels, the interrupts +might invoke a softirq to set ``need_resched`` and wake up a task. This is +another special case that is currently not supported by the monitor. 
References ---------- diff --git a/Documentation/trace/rv/monitor_synthesis.rst b/Documentation/trace/rv/monitor_synthesis.rst new file mode 100644 index 000000000000..ac808a7554f5 --- /dev/null +++ b/Documentation/trace/rv/monitor_synthesis.rst @@ -0,0 +1,271 @@ +Runtime Verification Monitor Synthesis +====================================== + +The starting point for the application of runtime verification (RV) techniques +is the *specification* or *modeling* of the desired (or undesired) behavior +of the system under scrutiny. + +The formal representation needs to be then *synthesized* into a *monitor* +that can then be used in the analysis of the trace of the system. The +*monitor* connects to the system via an *instrumentation* that converts +the events from the *system* to the events of the *specification*. + + +In Linux terms, the runtime verification monitors are encapsulated inside +the *RV monitor* abstraction. The RV monitor includes a set of instances +of the monitor (per-cpu monitor, per-task monitor, and so on), the helper +functions that glue the monitor to the system reference model, and the +trace output as a reaction to event parsing and exceptions, as depicted +below:: + + Linux +----- RV Monitor ----------------------------------+ Formal + Realm | | Realm + +-------------------+ +----------------+ +-----------------+ + | Linux kernel | | Monitor | | Reference | + | Tracing | -> | Instance(s) | <- | Model | + | (instrumentation) | | (verification) | | (specification) | + +-------------------+ +----------------+ +-----------------+ + | | | + | V | + | +----------+ | + | | Reaction | | + | +--+--+--+-+ | + | | | | | + | | | +-> trace output ? | + +------------------------|--|----------------------+ + | +----> panic ? 
+ +-------> <user-specified> + +RV monitor synthesis +-------------------- + +The synthesis of a specification into the Linux *RV monitor* abstraction is +automated by the rvgen tool and the header files containing common code for +creating monitors. The header files are: + + * rv/da_monitor.h for deterministic automaton monitor. + * rv/ltl_monitor.h for linear temporal logic monitor. + +rvgen +----- + +The rvgen utility converts a specification into its C representation and creates +the skeleton of a kernel monitor in C. + +For example, it is possible to transform the wip.dot model present in +[1] into a per-cpu monitor with the following command:: + + $ rvgen monitor -c da -s wip.dot -t per_cpu + +This will create a directory named wip/ with the following files: + +- wip.h: the wip model in C +- wip.c: the RV monitor + +The wip.c file contains the monitor declaration and the starting point for +the system instrumentation. + +Similarly, a linear temporal logic monitor can be generated with the following +command:: + + $ rvgen monitor -c ltl -s pagefault.ltl -t per_task + +This generates the pagefault/ directory with: + +- pagefault.h: The Buchi automaton (the non-deterministic state machine to + verify the specification) +- pagefault.c: The skeleton for the RV monitor + +Monitor header files +-------------------- + +The header files: + +- `rv/da_monitor.h` for deterministic automaton monitor +- `rv/ltl_monitor.h` for linear temporal logic monitor + +include common macros and static functions for implementing *Monitor +Instance(s)*. + +The benefits of having all common functionalities in a single header file are +3-fold: + + - Reduce the code duplication; + - Facilitate the bug fix/improvement; + - Avoid the case of developers changing the core of the monitor code to + manipulate the model in a (let's say) non-standard way. 
+ +rv/da_monitor.h ++++++++++++++++ + +This initial implementation presents three different types of monitor instances: + +- ``#define DECLARE_DA_MON_GLOBAL(name, type)`` +- ``#define DECLARE_DA_MON_PER_CPU(name, type)`` +- ``#define DECLARE_DA_MON_PER_TASK(name, type)`` + +The first declares the functions for a global deterministic automaton monitor, +the second for monitors with per-cpu instances, and the third for monitors with +per-task instances. + +In all cases, the 'name' argument is a string that identifies the monitor, and +the 'type' argument is the data type used by rvgen in the representation of +the model in C. + +For example, the wip model with two states and three events can be +stored in an 'unsigned char' type. Considering that the preemption control +is a per-cpu behavior, the monitor declaration in the 'wip.c' file is:: + + DECLARE_DA_MON_PER_CPU(wip, unsigned char); + +The monitor is executed by sending events to be processed via the functions +presented below:: + + da_handle_event_$(MONITOR_NAME)($(event from event enum)); + da_handle_start_event_$(MONITOR_NAME)($(event from event enum)); + da_handle_start_run_event_$(MONITOR_NAME)($(event from event enum)); + +The function ``da_handle_event_$(MONITOR_NAME)()`` is the regular case where +the event will be processed if the monitor is processing events. + +When a monitor is enabled, it is placed in the initial state of the automaton. +However, the monitor does not know if the system is in the *initial state*. + +The ``da_handle_start_event_$(MONITOR_NAME)()`` function is used to notify the +monitor that the system is returning to the initial state, so the monitor can +start monitoring the next event. + +The ``da_handle_start_run_event_$(MONITOR_NAME)()`` function is used to notify +the monitor that the system is known to be in the initial state, so the +monitor can start monitoring and monitor the current event. 
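The difference between the three handlers can be illustrated with a small userspace model. The sketch below is not the code from rv/da_monitor.h: every ``sketch_`` name is hypothetical, the transition table is an assumption modelled on the wip description (two states, three events), and stopping the monitor after a violation is likewise an assumption made for illustration.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical two-state, three-event model, loosely following wip. */
enum sketch_state { SKETCH_PREEMPTIVE, SKETCH_NON_PREEMPTIVE, SKETCH_STATE_MAX };
enum sketch_event { SKETCH_PREEMPT_DISABLE, SKETCH_PREEMPT_ENABLE,
		    SKETCH_SCHED_WAKING, SKETCH_EVENT_MAX };
#define SKETCH_INVALID SKETCH_STATE_MAX

/* Assumed transition table; SKETCH_INVALID marks a forbidden event. */
static const unsigned char sketch_next[SKETCH_STATE_MAX][SKETCH_EVENT_MAX] = {
	[SKETCH_PREEMPTIVE]     = { SKETCH_NON_PREEMPTIVE, SKETCH_INVALID, SKETCH_INVALID },
	[SKETCH_NON_PREEMPTIVE] = { SKETCH_INVALID, SKETCH_PREEMPTIVE, SKETCH_NON_PREEMPTIVE },
};

struct sketch_monitor {
	bool monitoring;		/* is the monitor processing events? */
	unsigned char curr;		/* current state of the automaton */
	unsigned int violations;	/* forbidden events seen so far */
};

static void sketch_start(struct sketch_monitor *m)
{
	m->curr = SKETCH_PREEMPTIVE;	/* initial state */
	m->monitoring = true;
}

static void sketch_process(struct sketch_monitor *m, enum sketch_event ev)
{
	unsigned char next;

	assert(ev < SKETCH_EVENT_MAX);
	next = sketch_next[m->curr][ev];
	if (next == SKETCH_INVALID) {
		/* Assumption: count the violation and stop monitoring. */
		m->violations++;
		m->monitoring = false;
		return;
	}
	m->curr = next;
}

/* Regular case: process the event only if already monitoring. */
void sketch_handle_event(struct sketch_monitor *m, enum sketch_event ev)
{
	if (m->monitoring)
		sketch_process(m, ev);
}

/* System is returning to the initial state: start with the next event. */
void sketch_handle_start_event(struct sketch_monitor *m, enum sketch_event ev)
{
	if (!m->monitoring) {
		sketch_start(m);
		return;
	}
	sketch_process(m, ev);
}

/* System is known to be in the initial state: start and process this event. */
void sketch_handle_start_run_event(struct sketch_monitor *m, enum sketch_event ev)
{
	if (!m->monitoring)
		sketch_start(m);
	sketch_process(m, ev);
}
```

In this model, a freshly zeroed ``struct sketch_monitor`` ignores everything passed to ``sketch_handle_event()`` until one of the two start variants has been called, which mirrors the distinction drawn in the paragraphs above.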
+ +Using the wip model as an example, the events "preempt_disable" and +"sched_waking" should be sent to the monitor, respectively, via [2]:: + + da_handle_event_wip(preempt_disable_wip); + da_handle_event_wip(sched_waking_wip); + +While the event "preempt_enable" will use:: + + da_handle_start_event_wip(preempt_enable_wip); + +This notifies the monitor that the system is returning to the initial state, +so that the system and the monitor stay in sync. + +rv/ltl_monitor.h +++++++++++++++++ +This file must be combined with the $(MODEL_NAME).h file (generated by `rvgen`) +to be complete. For example, for the `pagefault` monitor, the `pagefault.c` +source file must include:: + + #include "pagefault.h" + #include <rv/ltl_monitor.h> + +(the skeleton monitor file generated by `rvgen` already does this). + +`$(MODEL_NAME).h` (`pagefault.h` in the above example) includes the +implementation of the Buchi automaton - a non-deterministic state machine that +verifies the LTL specification. `rv/ltl_monitor.h`, in turn, includes the common +helper functions to interact with the Buchi automaton and to implement an RV +monitor. An important definition in `$(MODEL_NAME).h` is:: + + enum ltl_atom { + LTL_$(FIRST_ATOMIC_PROPOSITION), + LTL_$(SECOND_ATOMIC_PROPOSITION), + ... + LTL_NUM_ATOM + }; + +which is the list of atomic propositions present in the LTL specification +(prefixed with "LTL\_" to avoid name collision). This `enum` is passed to the +functions interacting with the Buchi automaton. + +While generating code, `rvgen` cannot understand the meaning of the atomic +propositions. Thus, that task is left for manual work. The recommended practice +is to add tracepoints to places where the atomic propositions change and, in the +tracepoints' handlers, to execute the Buchi automaton using:: + + void ltl_atom_update(struct task_struct *task, enum ltl_atom atom, bool value) + +which tells the Buchi automaton that the atomic proposition `atom` is now +`value`. 
The Buchi automaton checks whether the LTL specification is still +satisfied, and invokes the monitor's error tracepoint and the reactor if +violation is detected. + +Tracepoints and `ltl_atom_update()` should be used whenever possible. However, +it is sometimes not the most convenient. For some atomic propositions which are +changed in multiple places in the kernel, it is cumbersome to trace all those +places. Furthermore, it may not be important that the atomic propositions are +updated at precise times. For example, considering the following linear temporal +logic:: + + RULE = always (RT imply not PAGEFAULT) + +This LTL states that a real-time task does not raise page faults. For this +specification, it is not important when `RT` changes, as long as it has the +correct value when `PAGEFAULT` is true. Motivated by this case, another +function is introduced:: + + void ltl_atom_fetch(struct task_struct *task, struct ltl_monitor *mon) + +This function is called whenever the Buchi automaton is triggered. Therefore, it +can be manually implemented to "fetch" `RT`:: + + void ltl_atom_fetch(struct task_struct *task, struct ltl_monitor *mon) + { + ltl_atom_set(mon, LTL_RT, rt_task(task)); + } + +Effectively, whenever `PAGEFAULT` is updated with a call to `ltl_atom_update()`, +`RT` is also fetched. Thus, the LTL specification can be verified without +tracing `RT` everywhere. + +For atomic propositions which act like events, they usually need to be set (or +cleared) and then immediately cleared (or set). 
A convenient function is +provided:: + + void ltl_atom_pulse(struct task_struct *task, enum ltl_atom atom, bool value) + +which is equivalent to:: + + ltl_atom_update(task, atom, value); + ltl_atom_update(task, atom, !value); + +To initialize the atomic propositions, the following function must be +implemented:: + + ltl_atoms_init(struct task_struct *task, struct ltl_monitor *mon, bool task_creation) + +This function is called for all running tasks when the monitor is enabled. It is +also called for new tasks created after enabling the monitor. It should +initialize as many atomic propositions as possible, for example:: + + void ltl_atom_init(struct task_struct *task, struct ltl_monitor *mon, bool task_creation) + { + ltl_atom_set(mon, LTL_RT, rt_task(task)); + if (task_creation) + ltl_atom_set(mon, LTL_PAGEFAULT, false); + } + +Atomic propositions not initialized by `ltl_atom_init()` will stay in the +unknown state until relevant tracepoints are hit, which can take some time. As +monitoring for a task cannot be done until all atomic propositions are known for +the task, the monitor may need some time to start validating tasks which have +been running before the monitor is enabled. Therefore, it is recommended to +start the tasks of interest after enabling the monitor. + +Final remarks +------------- + +With the monitor synthesis in place using the header files and +rvgen, the developer's work should be limited to the instrumentation +of the system, increasing the confidence in the overall approach. + +[1] For details about deterministic automata format and the translation +from one representation to another, see:: + + Documentation/trace/rv/deterministic_automata.rst + +[2] rvgen appends the monitor's name suffix to the events enums to +avoid conflicting variables when exporting the global vmlinux.h +used by BPF programs. 
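As an illustration of the atomic proposition helpers described in the rv/ltl_monitor.h section, the following userspace sketch models the "unknown until set" behavior and the update/pulse pair for the rule ``always (RT imply not PAGEFAULT)``. It is a simplification under stated assumptions: all ``sketch_`` names are hypothetical, the task argument is dropped, and the rule is checked directly instead of through a generated Buchi automaton.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical atoms for: RULE = always (RT imply not PAGEFAULT) */
enum sketch_atom { SKETCH_RT, SKETCH_PAGEFAULT, SKETCH_NUM_ATOM };

struct sketch_ltl_monitor {
	bool known[SKETCH_NUM_ATOM];	/* false: atom is still unknown */
	bool value[SKETCH_NUM_ATOM];
	unsigned int violations;
};

/* Record an atom's value without triggering validation. */
void sketch_atom_set(struct sketch_ltl_monitor *mon, enum sketch_atom atom,
		     bool value)
{
	assert(atom < SKETCH_NUM_ATOM);
	mon->known[atom] = true;
	mon->value[atom] = value;
}

static bool sketch_all_known(const struct sketch_ltl_monitor *mon)
{
	for (int i = 0; i < SKETCH_NUM_ATOM; i++)
		if (!mon->known[i])
			return false;
	return true;
}

/* Direct check of the rule; validation waits until every atom is known. */
static void sketch_validate(struct sketch_ltl_monitor *mon)
{
	if (!sketch_all_known(mon))
		return;
	if (mon->value[SKETCH_RT] && mon->value[SKETCH_PAGEFAULT])
		mon->violations++;
}

/* Set an atom and re-check the rule. */
void sketch_atom_update(struct sketch_ltl_monitor *mon, enum sketch_atom atom,
			bool value)
{
	sketch_atom_set(mon, atom, value);
	sketch_validate(mon);
}

/* Event-like atom: set it, then immediately flip it back. */
void sketch_atom_pulse(struct sketch_ltl_monitor *mon, enum sketch_atom atom,
		       bool value)
{
	sketch_atom_update(mon, atom, value);
	sketch_atom_update(mon, atom, !value);
}
```

In this model, a pulse of ``SKETCH_PAGEFAULT`` is ignored while ``SKETCH_RT`` is still unknown, which corresponds to the start-up delay described above for tasks that were already running when the monitor was enabled.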
diff --git a/Documentation/trace/tracepoints.rst b/Documentation/trace/tracepoints.rst index decabcc77b56..b35c40e3abbe 100644 --- a/Documentation/trace/tracepoints.rst +++ b/Documentation/trace/tracepoints.rst @@ -71,7 +71,7 @@ In subsys/file.c (where the tracing statement must be added):: void somefct(void) { ... - trace_subsys_eventname(arg, task); + trace_subsys_eventname_tp(arg, task); ... } @@ -129,12 +129,12 @@ within an if statement with the following:: for (i = 0; i < count; i++) tot += calculate_nuggets(); - trace_foo_bar(tot); + trace_foo_bar_tp(tot); } -All trace_<tracepoint>() calls have a matching trace_<tracepoint>_enabled() +All trace_<tracepoint>_tp() calls have a matching trace_<tracepoint>_enabled() function defined that returns true if the tracepoint is enabled and -false otherwise. The trace_<tracepoint>() should always be within the +false otherwise. The trace_<tracepoint>_tp() should always be within the block of the if (trace_<tracepoint>_enabled()) to prevent races between the tracepoint being enabled and the check being seen. @@ -143,7 +143,10 @@ the static_key of the tracepoint to allow the if statement to be implemented with jump labels and avoid conditional branches. .. note:: The convenience macro TRACE_EVENT provides an alternative way to - define tracepoints. Check http://lwn.net/Articles/379903, + define tracepoints. Note, DECLARE_TRACE(foo) creates a function + "trace_foo_tp()" whereas TRACE_EVENT(foo) creates a function + "trace_foo()", and also exposes the tracepoint as a trace event in + /sys/kernel/tracing/events directory. Check http://lwn.net/Articles/379903, http://lwn.net/Articles/381064 and http://lwn.net/Articles/383362 for a series of articles with more details. @@ -159,7 +162,9 @@ In a C file:: void do_trace_foo_bar_wrapper(args) { - trace_foo_bar(args); + trace_foo_bar_tp(args); // for tracepoints created via DECLARE_TRACE + // or + trace_foo_bar(args); // for tracepoints created via TRACE_EVENT } In the header file:: |