15 files changed, 1225 insertions, 222 deletions
diff --git a/Documentation/trace/boottime-trace.rst b/Documentation/trace/boottime-trace.rst
index d594597201fd..3efac10adb36 100644
--- a/Documentation/trace/boottime-trace.rst
+++ b/Documentation/trace/boottime-trace.rst
@@ -198,8 +198,8 @@ Most of the subsystems and architecture dependent drivers will be initialized
 after that (arch_initcall or subsys_initcall). Thus, you can trace those with
 boot-time tracing.
 If you want to trace events before core_initcall, you can use the options
-starting with ``kernel``. Some of them will be enabled eariler than the initcall
-processing (for example,. ``kernel.ftrace=function`` and ``kernel.trace_event``
+starting with ``kernel``. Some of them will be enabled earlier than the initcall
+processing (for example, ``kernel.ftrace=function`` and ``kernel.trace_event``
 will start before the initcall.)
 
 
diff --git a/Documentation/trace/coresight/coresight-perf.rst b/Documentation/trace/coresight/coresight-perf.rst
index d087aae7d492..30be89320621 100644
--- a/Documentation/trace/coresight/coresight-perf.rst
+++ b/Documentation/trace/coresight/coresight-perf.rst
@@ -78,6 +78,37 @@ enabled like::
 
 Please refer to the kernel configuration help for more information.
 
+Fine-grained tracing with AUX pause and resume
+----------------------------------------------
+
+Arm CoreSight may generate a large amount of hardware trace data, which
+will lead to overhead in recording and distract users when reviewing
+profiling result. To mitigate the issue of excessive trace data, Perf
+provides AUX pause and resume functionality for fine-grained tracing.
+
+The AUX pause and resume can be triggered by associated events. These
+events can be ftrace tracepoints (including static and dynamic
+tracepoints) or PMU events (e.g. CPU PMU cycle event). To create a perf
+session with AUX pause / resume, three configuration terms are
+introduced:
+
+- "aux-action=start-paused": it is specified for the cs_etm PMU event to
+  launch in a paused state.
+- "aux-action=pause": an associated event is specified with this term
+  to pause AUX trace.
+- "aux-action=resume": an associated event is specified with this term
+  to resume AUX trace.
+
+Example for triggering AUX pause and resume with ftrace tracepoints::
+
+  perf record -e cs_etm/aux-action=start-paused/k,syscalls:sys_enter_openat/aux-action=resume/,syscalls:sys_exit_openat/aux-action=pause/ ls
+
+Example for triggering AUX pause and resume with PMU event::
+
+  perf record -a -e cs_etm/aux-action=start-paused/k \
+        -e cycles/aux-action=pause,period=10000000/ \
+        -e cycles/aux-action=resume,period=1050000/ -- sleep 1
+
 Perf test - Verify kernel and userspace perf CoreSight work
 -----------------------------------------------------------
 
diff --git a/Documentation/trace/coresight/panic.rst b/Documentation/trace/coresight/panic.rst
index a58aa914c241..6e4bde953cae 100644
--- a/Documentation/trace/coresight/panic.rst
+++ b/Documentation/trace/coresight/panic.rst
@@ -67,8 +67,8 @@ Trace data captured at the time of panic, can be read from rebooted kernel
 or from crashdump kernel using a special device file /dev/crash_tmc_xxx.
 This device file is created only when there is a valid crashdata available.
 
-General flow of trace capture and decode incase of kernel panic
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+General flow of trace capture and decode in case of kernel panic
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 1. Enable source and sink on all the cores using the sysfs interface.
    ETR sinks should have trace buffers allocated from reserved memory,
    by selecting "resrv" buffer mode from sysfs.
diff --git a/Documentation/trace/eprobetrace.rst b/Documentation/trace/eprobetrace.rst
new file mode 100644
index 000000000000..89b5157cfab8
--- /dev/null
+++ b/Documentation/trace/eprobetrace.rst
@@ -0,0 +1,269 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+==================================
+Eprobe - Event-based Probe Tracing
+==================================
+
+:Author: Steven Rostedt <rostedt@goodmis.org>
+
+- Written for v6.17
+
+Overview
+========
+
+Eprobes are dynamic events that are placed on existing events to either
+dereference a field that is a pointer, or simply to limit what fields are
+recorded in the trace event.
+
+Eprobes depend on kprobe events so to enable this feature, build your kernel
+with CONFIG_EPROBE_EVENTS=y.
+
+Eprobes are created via the /sys/kernel/tracing/dynamic_events file.
+
+Synopsis of eprobe_events
+-------------------------
+::
+
+  e[:[EGRP/][EEVENT]] GRP.EVENT [FETCHARGS]	: Set a probe
+  -:[EGRP/][EEVENT]				: Clear a probe
+
+ EGRP		: Group name of the new event. If omitted, use "eprobes" for it.
+ EEVENT		: Event name. If omitted, the event name is generated and will
+		  be the same event name as the event it attached to.
+ GRP		: Group name of the event to attach to.
+ EVENT		: Event name of the event to attach to.
+
+ FETCHARGS	: Arguments. Each probe can have up to 128 args.
+  $FIELD	: Fetch the value of the event field called FIELD.
+  @ADDR		: Fetch memory at ADDR (ADDR should be in kernel)
+  @SYM[+|-offs]	: Fetch memory at SYM +|- offs (SYM should be a data symbol)
+  $comm		: Fetch current task comm.
+  +|-[u]OFFS(FETCHARG) : Fetch memory at FETCHARG +|- OFFS address.(\*3)(\*4)
+  \IMM		: Store an immediate value to the argument.
+  NAME=FETCHARG : Set NAME as the argument name of FETCHARG.
+  FETCHARG:TYPE : Set TYPE as the type of FETCHARG. Currently, basic types
+		  (u8/u16/u32/u64/s8/s16/s32/s64), hexadecimal types
+		  (x8/x16/x32/x64), VFS layer common type(%pd/%pD), "char",
+                  "string", "ustring", "symbol", "symstr" and "bitfield" are
+                  supported.
+
+Types
+-----
+The FETCHARGS above is very similar to the kprobe events as described in
+Documentation/trace/kprobetrace.rst.
+
+The difference between eprobes and kprobes FETCHARGS is that eprobes has a
+$FIELD command that returns the content of the event field of the event
+that is attached. Eprobes do not have access to registers, stacks and function
+arguments that kprobes has.
+
+If a field argument is a pointer, it may be dereferenced just like a memory
+address using the FETCHARGS syntax.
+
+
+Attaching to dynamic events
+---------------------------
+
+Eprobes may attach to dynamic events as well as to normal events. It may
+attach to a kprobe event, a synthetic event or a fprobe event. This is useful
+if the type of a field needs to be changed. See Example 2 below.
+
+Usage examples
+==============
+
+Example 1
+---------
+
+The basic usage of eprobes is to limit the data that is being recorded into
+the tracing buffer. For example, a common event to trace is the sched_switch
+trace event. That has a format of::
+
+	field:unsigned short common_type;	offset:0;	size:2;	signed:0;
+	field:unsigned char common_flags;	offset:2;	size:1;	signed:0;
+	field:unsigned char common_preempt_count;	offset:3;	size:1;	signed:0;
+	field:int common_pid;	offset:4;	size:4;	signed:1;
+
+	field:char prev_comm[16];	offset:8;	size:16;	signed:0;
+	field:pid_t prev_pid;	offset:24;	size:4;	signed:1;
+	field:int prev_prio;	offset:28;	size:4;	signed:1;
+	field:long prev_state;	offset:32;	size:8;	signed:1;
+	field:char next_comm[16];	offset:40;	size:16;	signed:0;
+	field:pid_t next_pid;	offset:56;	size:4;	signed:1;
+	field:int next_prio;	offset:60;	size:4;	signed:1;
+
+The first four fields are common to all events and can not be limited. But the
+rest of the event has 60 bytes of information. It records the names of the
+previous and next tasks being scheduled out and in, as well as their pids and
+priorities. It also records the state of the previous task. If only the pids
+of the tasks are of interest, why waste the ring buffer with all the other
+fields?
+
+An eprobe can limit what gets recorded. Note, it does not help in performance,
+as all the fields are recorded in a temporary buffer to process the eprobe.
+::
+
+ # echo 'e:sched/switch sched.sched_switch prev=$prev_pid:u32 next=$next_pid:u32' >> /sys/kernel/tracing/dynamic_events
+ # echo 1 > /sys/kernel/tracing/events/sched/switch/enable
+ # cat /sys/kernel/tracing/trace
+
+ # tracer: nop
+ #
+ # entries-in-buffer/entries-written: 2721/2721   #P:8
+ #
+ #                                _-----=> irqs-off/BH-disabled
+ #                               / _----=> need-resched
+ #                              | / _---=> hardirq/softirq
+ #                              || / _--=> preempt-depth
+ #                              ||| / _-=> migrate-disable
+ #                              |||| /     delay
+ #           TASK-PID     CPU#  |||||  TIMESTAMP  FUNCTION
+ #              | |         |   |||||     |         |
+     sshd-session-1082    [004] d..4.  5041.239906: switch: (sched.sched_switch) prev=1082 next=0
+             bash-1085    [001] d..4.  5041.240198: switch: (sched.sched_switch) prev=1085 next=141
+    kworker/u34:5-141     [001] d..4.  5041.240259: switch: (sched.sched_switch) prev=141 next=1085
+           <idle>-0       [004] d..4.  5041.240354: switch: (sched.sched_switch) prev=0 next=1082
+             bash-1085    [001] d..4.  5041.240385: switch: (sched.sched_switch) prev=1085 next=141
+    kworker/u34:5-141     [001] d..4.  5041.240410: switch: (sched.sched_switch) prev=141 next=1085
+             bash-1085    [001] d..4.  5041.240478: switch: (sched.sched_switch) prev=1085 next=0
+     sshd-session-1082    [004] d..4.  5041.240526: switch: (sched.sched_switch) prev=1082 next=0
+           <idle>-0       [001] d..4.  5041.247524: switch: (sched.sched_switch) prev=0 next=90
+           <idle>-0       [002] d..4.  5041.247545: switch: (sched.sched_switch) prev=0 next=16
+      kworker/1:1-90      [001] d..4.  5041.247580: switch: (sched.sched_switch) prev=90 next=0
+        rcu_sched-16      [002] d..4.  5041.247591: switch: (sched.sched_switch) prev=16 next=0
+           <idle>-0       [002] d..4.  5041.257536: switch: (sched.sched_switch) prev=0 next=16
+        rcu_sched-16      [002] d..4.  5041.257573: switch: (sched.sched_switch) prev=16 next=0
+
+Note, without adding the "u32" after the prev_pid and next_pid, the values
+would default showing in hexadecimal.
+
+Example 2
+---------
+
+If a specific system call is to be recorded but the syscalls events are not
+enabled, the raw_syscalls can still be used (syscalls are system call
+events are not normal events, but are created from the raw_syscalls events
+within the kernel). In order to trace the openat system call, one can create
+an event probe on top of the raw_syscalls event:
+::
+
+ # cd /sys/kernel/tracing
+ # cat events/raw_syscalls/sys_enter/format
+ name: sys_enter
+ ID: 395
+ format:
+	field:unsigned short common_type;	offset:0;	size:2;	signed:0;
+	field:unsigned char common_flags;	offset:2;	size:1;	signed:0;
+	field:unsigned char common_preempt_count;	offset:3;	size:1;	signed:0;
+	field:int common_pid;	offset:4;	size:4;	signed:1;
+
+	field:long id;	offset:8;	size:8;	signed:1;
+	field:unsigned long args[6];	offset:16;	size:48;	signed:0;
+
+ print fmt: "NR %ld (%lx, %lx, %lx, %lx, %lx, %lx)", REC->id, REC->args[0], REC->args[1], REC->args[2], REC->args[3], REC->args[4], REC->args[5]
+
+From the source code, the sys_openat() has:
+::
+
+ int sys_openat(int dirfd, const char *path, int flags, mode_t mode)
+ {
+	return my_syscall4(__NR_openat, dirfd, path, flags, mode);
+ }
+
+The path is the second parameter, and that is what is wanted.
+::
+
+ # echo 'e:openat raw_syscalls.sys_enter nr=$id filename=+8($args):ustring' >> dynamic_events
+
+This is being run on x86_64 where the word size is 8 bytes and the openat
+system call __NR_openat is set at 257.
+::
+
+ # echo 'nr == 257' > events/eprobes/openat/filter
+
+Now enable the event and look at the trace.
+::
+
+ # echo 1 > events/eprobes/openat/enable
+ # cat trace
+
+ # tracer: nop
+ #
+ # entries-in-buffer/entries-written: 4/4   #P:8
+ #
+ #                                _-----=> irqs-off/BH-disabled
+ #                               / _----=> need-resched
+ #                              | / _---=> hardirq/softirq
+ #                              || / _--=> preempt-depth
+ #                              ||| / _-=> migrate-disable
+ #                              |||| /     delay
+ #           TASK-PID     CPU#  |||||  TIMESTAMP  FUNCTION
+ #              | |         |   |||||     |         |
+              cat-1298    [003] ...2.  2060.875970: openat: (raw_syscalls.sys_enter) nr=0x101 filename=(fault)
+              cat-1298    [003] ...2.  2060.876197: openat: (raw_syscalls.sys_enter) nr=0x101 filename=(fault)
+              cat-1298    [003] ...2.  2060.879126: openat: (raw_syscalls.sys_enter) nr=0x101 filename=(fault)
+              cat-1298    [003] ...2.  2060.879639: openat: (raw_syscalls.sys_enter) nr=0x101 filename=(fault)
+
+The filename shows "(fault)". This is likely because the filename has not been
+pulled into memory yet and currently trace events cannot fault in memory that
+is not present. When an eprobe tries to read memory that has not been faulted
+in yet, it will show the "(fault)" text.
+
+To get around this, as the kernel will likely pull in this filename and make
+it present, attaching it to a synthetic event that can pass the address of the
+filename from the entry of the event to the end of the event, this can be used
+to show the filename when the system call returns.
+
+Remove the old eprobe::
+
+ # echo 1 > events/eprobes/openat/enable
+ # echo '-:openat' >> dynamic_events
+
+This time make an eprobe where the address of the filename is saved::
+
+ # echo 'e:openat_start raw_syscalls.sys_enter nr=$id filename=+8($args):x64' >> dynamic_events
+
+Create a synthetic event that passes the address of the filename to the
+end of the event::
+
+ # echo 's:filename u64 file' >> dynamic_events
+ # echo 'hist:keys=common_pid:f=filename if nr == 257' > events/eprobes/openat_start/trigger
+ # echo 'hist:keys=common_pid:file=$f:onmatch(eprobes.openat_start).trace(filename,$file) if id == 257' > events/raw_syscalls/sys_exit/trigger
+
+Now that the address of the filename has been passed to the end of the
+system call, create another eprobe to attach to the exit event to show the
+string::
+
+ # echo 'e:openat synthetic.filename filename=+0($file):ustring' >> dynamic_events
+ # echo 1 > events/eprobes/openat/enable
+ # cat trace
+
+ # tracer: nop
+ #
+ # entries-in-buffer/entries-written: 4/4   #P:8
+ #
+ #                                _-----=> irqs-off/BH-disabled
+ #                               / _----=> need-resched
+ #                              | / _---=> hardirq/softirq
+ #                              || / _--=> preempt-depth
+ #                              ||| / _-=> migrate-disable
+ #                              |||| /     delay
+ #           TASK-PID     CPU#  |||||  TIMESTAMP  FUNCTION
+ #              | |         |   |||||     |         |
+              cat-1331    [001] ...5.  2944.787977: openat: (synthetic.filename) filename="/etc/ld.so.cache"
+              cat-1331    [001] ...5.  2944.788480: openat: (synthetic.filename) filename="/lib/x86_64-linux-gnu/libc.so.6"
+              cat-1331    [001] ...5.  2944.793426: openat: (synthetic.filename) filename="/usr/lib/locale/locale-archive"
+              cat-1331    [001] ...5.  2944.831362: openat: (synthetic.filename) filename="trace"
+
+Example 3
+---------
+
+If syscall trace events are available, the above would not need the first
+eprobe, but it would still need the last one::
+
+ # echo 's:filename u64 file' >> dynamic_events
+ # echo 'hist:keys=common_pid:f=filename' > events/syscalls/sys_enter_openat/trigger
+ # echo 'hist:keys=common_pid:file=$f:onmatch(syscalls.sys_enter_openat).trace(filename,$file)' > events/syscalls/sys_exit_openat/trigger
+ # echo 'e:openat synthetic.filename filename=+0($file):ustring' >> dynamic_events
+ # echo 1 > events/eprobes/openat/enable
+
+And this would produce the same result as Example 2.
diff --git a/Documentation/trace/ftrace-design.rst b/Documentation/trace/ftrace-design.rst
index dc82d64b3a44..8f4fab3f9324 100644
--- a/Documentation/trace/ftrace-design.rst
+++ b/Documentation/trace/ftrace-design.rst
@@ -238,19 +238,15 @@ You need very few things to get the syscalls tracing in an arch.
   - Tag this arch as HAVE_SYSCALL_TRACEPOINTS.
 
 
-HAVE_FTRACE_MCOUNT_RECORD
--------------------------
+HAVE_DYNAMIC_FTRACE
+-------------------
 
 See scripts/recordmcount.pl for more info.  Just fill in the arch-specific
 details for how to locate the addresses of mcount call sites via objdump.
 This option doesn't make much sense without also implementing dynamic ftrace.
 
-
-HAVE_DYNAMIC_FTRACE
--------------------
-
-You will first need HAVE_FTRACE_MCOUNT_RECORD and HAVE_FUNCTION_TRACER, so
-scroll your reader back up if you got over eager.
+You will first need HAVE_FUNCTION_TRACER, so scroll your reader back up if you
+got over eager.
 
 Once those are out of the way, you will need to implement:
 	- asm/ftrace.h:
diff --git a/Documentation/trace/ftrace.rst b/Documentation/trace/ftrace.rst
index c9e88bf65709..af66a05e18cc 100644
--- a/Documentation/trace/ftrace.rst
+++ b/Documentation/trace/ftrace.rst
@@ -1205,6 +1205,19 @@ Here are the available options:
 	default instance. The only way the top level instance has this flag
 	cleared, is by it being set in another instance.
 
+  copy_trace_marker
+	If there are applications that hard code writing into the top level
+	trace_marker file (/sys/kernel/tracing/trace_marker or trace_marker_raw),
+	and the tooling would like it to go into an instance, this option can
+	be used. Create an instance and set this option, and then all writes
+	into the top level trace_marker file will also be redirected into this
+	instance.
+
+	Note, by default this option is set for the top level instance. If it
+	is disabled, then writes to the trace_marker or trace_marker_raw files
+	will not be written into the top level file. If no instance has this
+	option set, then a write will error with the errno of ENODEV.
+
   annotate
 	It is sometimes confusing when the CPU buffers are full
 	and one CPU buffer had a lot of events recently, thus
diff --git a/Documentation/trace/histogram.rst b/Documentation/trace/histogram.rst
index 0aada18c38c6..2b98c1720a54 100644
--- a/Documentation/trace/histogram.rst
+++ b/Documentation/trace/histogram.rst
@@ -249,7 +249,7 @@ Extended error information
   table, it should keep a running total of the number of bytes
   requested by that call_site.
 
-  We'll let it run for awhile and then dump the contents of the 'hist'
+  We'll let it run for a while and then dump the contents of the 'hist'
   file in the kmalloc event's subdirectory (for readability, a number
   of entries have been omitted)::
 
diff --git a/Documentation/trace/index.rst b/Documentation/trace/index.rst
index 2c991dc96ace..b4a429dc4f7a 100644
--- a/Documentation/trace/index.rst
+++ b/Documentation/trace/index.rst
@@ -1,39 +1,104 @@
-==========================
-Linux Tracing Technologies
-==========================
+================================
+Linux Tracing Technologies Guide
+================================
+
+Tracing in the Linux kernel is a powerful mechanism that allows
+developers and system administrators to analyze and debug system
+behavior. This guide provides documentation on various tracing
+frameworks and tools available in the Linux kernel.
+
+Introduction to Tracing
+-----------------------
+
+This section provides an overview of Linux tracing mechanisms
+and debugging approaches.
 
 .. toctree::
-   :maxdepth: 2
+   :maxdepth: 1
 
-   ftrace-design
+   debugging
+   tracepoints
    tracepoint-analysis
+   ring-buffer-map
+
+Core Tracing Frameworks
+-----------------------
+
+The following are the primary tracing frameworks integrated into
+the Linux kernel.
+
+.. toctree::
+   :maxdepth: 1
+
    ftrace
+   ftrace-design
    ftrace-uses
-   fprobe
    kprobes
    kprobetrace
-   uprobetracer
    fprobetrace
-   tracepoints
+   eprobetrace
+   fprobe
+   ring-buffer-design
+
+Event Tracing and Analysis
+--------------------------
+
+A detailed explanation of event tracing mechanisms and their
+applications.
+
+.. toctree::
+   :maxdepth: 1
+
    events
    events-kmem
    events-power
    events-nmi
    events-msr
-   mmiotrace
+   boottime-trace
    histogram
    histogram-design
-   boottime-trace
-   debugging
-   hwlat_detector
-   osnoise-tracer
-   timerlat-tracer
+
+Hardware and Performance Tracing
+--------------------------------
+
+This section covers tracing features that monitor hardware
+interactions and system performance.
+
+.. toctree::
+   :maxdepth: 1
+
    intel_th
-   ring-buffer-design
-   ring-buffer-map
    stm
    sys-t
    coresight/index
-   user_events
    rv/index
    hisi-ptt
+   mmiotrace
+   hwlat_detector
+   osnoise-tracer
+   timerlat-tracer
+
+User-Space Tracing
+------------------
+
+These tools allow tracing user-space applications and
+interactions.
+
+.. toctree::
+   :maxdepth: 1
+
+   user_events
+   uprobetracer
+
+Additional Resources
+--------------------
+
+For more details, refer to the respective documentation of each
+tracing tool and framework.
+
+.. only:: subproject and html
+
+   Indices
+   =======
+
+   * :ref:`genindex`
diff --git a/Documentation/trace/rv/da_monitor_synthesis.rst b/Documentation/trace/rv/da_monitor_synthesis.rst
deleted file mode 100644
index 0a92729c8a9b..000000000000
--- a/Documentation/trace/rv/da_monitor_synthesis.rst
+++ /dev/null
@@ -1,147 +0,0 @@
-Deterministic Automata Monitor Synthesis
-========================================
-
-The starting point for the application of runtime verification (RV) techniques
-is the *specification* or *modeling* of the desired (or undesired) behavior
-of the system under scrutiny.
-
-The formal representation needs to be then *synthesized* into a *monitor*
-that can then be used in the analysis of the trace of the system. The
-*monitor* connects to the system via an *instrumentation* that converts
-the events from the *system* to the events of the *specification*.
-
-
-In Linux terms, the runtime verification monitors are encapsulated inside
-the *RV monitor* abstraction. The RV monitor includes a set of instances
-of the monitor (per-cpu monitor, per-task monitor, and so on), the helper
-functions that glue the monitor to the system reference model, and the
-trace output as a reaction to event parsing and exceptions, as depicted
-below::
-
- Linux  +----- RV Monitor ----------------------------------+ Formal
-  Realm |                                                   |  Realm
-  +-------------------+     +----------------+     +-----------------+
-  |   Linux kernel    |     |     Monitor    |     |     Reference   |
-  |     Tracing       |  -> |   Instance(s)  | <-  |       Model     |
-  | (instrumentation) |     | (verification) |     | (specification) |
-  +-------------------+     +----------------+     +-----------------+
-         |                          |                       |
-         |                          V                       |
-         |                     +----------+                 |
-         |                     | Reaction |                 |
-         |                     +--+--+--+-+                 |
-         |                        |  |  |                   |
-         |                        |  |  +-> trace output ?  |
-         +------------------------|--|----------------------+
-                                  |  +----> panic ?
-                                  +-------> <user-specified>
-
-DA monitor synthesis
---------------------
-
-The synthesis of automata-based models into the Linux *RV monitor* abstraction
-is automated by the dot2k tool and the rv/da_monitor.h header file that
-contains a set of macros that automatically generate the monitor's code.
-
-dot2k
------
-
-The dot2k utility leverages dot2c by converting an automaton model in
-the DOT format into the C representation [1] and creating the skeleton of
-a kernel monitor in C.
-
-For example, it is possible to transform the wip.dot model present in
-[1] into a per-cpu monitor with the following command::
-
-  $ dot2k -d wip.dot -t per_cpu
-
-This will create a directory named wip/ with the following files:
-
-- wip.h: the wip model in C
-- wip.c: the RV monitor
-
-The wip.c file contains the monitor declaration and the starting point for
-the system instrumentation.
-
-Monitor macros
---------------
-
-The rv/da_monitor.h enables automatic code generation for the *Monitor
-Instance(s)* using C macros.
-
-The benefits of the usage of macro for monitor synthesis are 3-fold as it:
-
-- Reduces the code duplication;
-- Facilitates the bug fix/improvement;
-- Avoids the case of developers changing the core of the monitor code
-  to manipulate the model in a (let's say) non-standard way.
-
-This initial implementation presents three different types of monitor instances:
-
-- ``#define DECLARE_DA_MON_GLOBAL(name, type)``
-- ``#define DECLARE_DA_MON_PER_CPU(name, type)``
-- ``#define DECLARE_DA_MON_PER_TASK(name, type)``
-
-The first declares the functions for a global deterministic automata monitor,
-the second for monitors with per-cpu instances, and the third with per-task
-instances.
-
-In all cases, the 'name' argument is a string that identifies the monitor, and
-the 'type' argument is the data type used by dot2k on the representation of
-the model in C.
-
-For example, the wip model with two states and three events can be
-stored in an 'unsigned char' type. Considering that the preemption control
-is a per-cpu behavior, the monitor declaration in the 'wip.c' file is::
-
-  DECLARE_DA_MON_PER_CPU(wip, unsigned char);
-
-The monitor is executed by sending events to be processed via the functions
-presented below::
-
-  da_handle_event_$(MONITOR_NAME)($(event from event enum));
-  da_handle_start_event_$(MONITOR_NAME)($(event from event enum));
-  da_handle_start_run_event_$(MONITOR_NAME)($(event from event enum));
-
-The function ``da_handle_event_$(MONITOR_NAME)()`` is the regular case where
-the event will be processed if the monitor is processing events.
-
-When a monitor is enabled, it is placed in the initial state of the automata.
-However, the monitor does not know if the system is in the *initial state*.
-
-The ``da_handle_start_event_$(MONITOR_NAME)()`` function is used to notify the
-monitor that the system is returning to the initial state, so the monitor can
-start monitoring the next event.
-
-The ``da_handle_start_run_event_$(MONITOR_NAME)()`` function is used to notify
-the monitor that the system is known to be in the initial state, so the
-monitor can start monitoring and monitor the current event.
-
-Using the wip model as example, the events "preempt_disable" and
-"sched_waking" should be sent to monitor, respectively, via [2]::
-
-  da_handle_event_wip(preempt_disable_wip);
-  da_handle_event_wip(sched_waking_wip);
-
-While the event "preempt_enabled" will use::
-
-  da_handle_start_event_wip(preempt_enable_wip);
-
-To notify the monitor that the system will be returning to the initial state,
-so the system and the monitor should be in sync.
-
-Final remarks
--------------
-
-With the monitor synthesis in place using the rv/da_monitor.h and
-dot2k, the developer's work should be limited to the instrumentation
-of the system, increasing the confidence in the overall approach.
-
-[1] For details about deterministic automata format and the translation
-from one representation to another, see::
-
-  Documentation/trace/rv/deterministic_automata.rst
-
-[2] dot2k appends the monitor's name suffix to the events enums to
-avoid conflicting variables when exporting the global vmlinux.h
-use by BPF programs.
diff --git a/Documentation/trace/rv/index.rst b/Documentation/trace/rv/index.rst
index e80e0057feb4..a2812ac5cfeb 100644
--- a/Documentation/trace/rv/index.rst
+++ b/Documentation/trace/rv/index.rst
@@ -8,8 +8,10 @@ Runtime Verification
 
    runtime-verification.rst
    deterministic_automata.rst
-   da_monitor_synthesis.rst
+   linear_temporal_logic.rst
+   monitor_synthesis.rst
    da_monitor_instrumentation.rst
    monitor_wip.rst
    monitor_wwnr.rst
    monitor_sched.rst
+   monitor_rtapp.rst
diff --git a/Documentation/trace/rv/linear_temporal_logic.rst b/Documentation/trace/rv/linear_temporal_logic.rst
new file mode 100644
index 000000000000..9eee09d9cacf
--- /dev/null
+++ b/Documentation/trace/rv/linear_temporal_logic.rst
@@ -0,0 +1,134 @@
+Linear temporal logic
+=====================
+
+Introduction
+------------
+
+Runtime verification monitor is a verification technique which checks that the
+kernel follows a specification. It does so by using tracepoints to monitor the
+kernel's execution trace, and verifying that the execution trace sastifies the
+specification.
+
+Initially, the specification can only be written in the form of deterministic
+automaton (DA).  However, while attempting to implement DA monitors for some
+complex specifications, deterministic automaton is found to be inappropriate as
+the specification language. The automaton is complicated, hard to understand,
+and error-prone.
+
+Thus, RV monitors based on linear temporal logic (LTL) are introduced. This type
+of monitor uses LTL as specification instead of DA. For some cases, writing the
+specification as LTL is more concise and intuitive.
+
+Many materials explain LTL in details. One book is::
+
+  Christel Baier and Joost-Pieter Katoen: Principles of Model Checking, The MIT
+  Press, 2008.
+
+Grammar
+-------
+
+Unlike some existing syntax, kernel's implementation of LTL is more verbose.
+This is motivated by considering that the people who read the LTL specifications
+may not be well-versed in LTL.
+
+Grammar:
+    ltl ::= opd | ( ltl ) | ltl binop ltl | unop ltl
+
+Operands (opd):
+    true, false, user-defined names consisting of upper-case characters, digits,
+    and underscore.
+
+Unary Operators (unop):
+    always
+    eventually
+    next
+    not
+
+Binary Operators (binop):
+    until
+    and
+    or
+    imply
+    equivalent
+
+This grammar is ambiguous: operator precedence is not defined. Parentheses must
+be used.
+
+Example linear temporal logic
+-----------------------------
+.. code-block::
+
+   RAIN imply (GO_OUTSIDE imply HAVE_UMBRELLA)
+
+means: if it is raining, going outside means having an umbrella.
+
+.. code-block::
+
+   RAIN imply (WET until not RAIN)
+
+means: if it is raining, it is going to be wet until the rain stops.
+
+.. code-block::
+
+   RAIN imply eventually not RAIN
+
+means: if it is raining, rain will eventually stop.
+
+The above examples are referring to the current time instance only. For kernel
+verification, the `always` operator is usually desirable, to specify that
+something is always true at the present and for all future. For example::
+
+    always (RAIN imply eventually not RAIN)
+
+means: *all* rain eventually stops.
+
+In the above examples, `RAIN`, `GO_OUTSIDE`, `HAVE_UMBRELLA` and `WET` are the
+"atomic propositions".
+
+Monitor synthesis
+-----------------
+
+To synthesize an LTL into a kernel monitor, the `rvgen` tool can be used:
+`tools/verification/rvgen`. The specification needs to be provided as a file,
+and it must have a "RULE = LTL" assignment. For example::
+
+    RULE = always (ACQUIRE imply ((not KILLED and not CRASHED) until RELEASE))
+
+which says: if `ACQUIRE`, then `RELEASE` must happen before `KILLED` or
+`CRASHED`.
+
+The LTL can be broken down using sub-expressions. The above is equivalent to:
+
+   .. code-block::
+
+    RULE = always (ACQUIRE imply (ALIVE until RELEASE))
+    ALIVE = not KILLED and not CRASHED
+
+From this specification, `rvgen` generates the C implementation of a Buchi
+automaton - a non-deterministic state machine which checks the satisfiability of
+the LTL. See Documentation/trace/rv/monitor_synthesis.rst for details on using
+`rvgen`.
+
+References
+----------
+
+One book covering model checking and linear temporal logic is::
+
+  Christel Baier and Joost-Pieter Katoen: Principles of Model Checking, The MIT
+  Press, 2008.
+
+For an example of using linear temporal logic in software testing, see::
+
+  Ruijie Meng, Zhen Dong, Jialin Li, Ivan Beschastnikh, and Abhik Roychoudhury.
+  2022. Linear-time temporal logic guided greybox fuzzing. In Proceedings of the
+  44th International Conference on Software Engineering (ICSE '22).  Association
+  for Computing Machinery, New York, NY, USA, 1343–1355.
+  https://doi.org/10.1145/3510003.3510082
+
+The kernel's LTL monitor implementation is based on::
+
+  Gerth, R., Peled, D., Vardi, M.Y., Wolper, P. (1996). Simple On-the-fly
+  Automatic Verification of Linear Temporal Logic. In: Dembiński, P., Średniawa,
+  M. (eds) Protocol Specification, Testing and Verification XV. PSTV 1995. IFIP
+  Advances in Information and Communication Technology. Springer, Boston, MA.
+  https://doi.org/10.1007/978-0-387-34892-6_1
diff --git a/Documentation/trace/rv/monitor_rtapp.rst b/Documentation/trace/rv/monitor_rtapp.rst
new file mode 100644
index 000000000000..c8104eda924a
--- /dev/null
+++ b/Documentation/trace/rv/monitor_rtapp.rst
@@ -0,0 +1,133 @@
+Real-time application monitors
+==============================
+
+- Name: rtapp
+- Type: container for multiple monitors
+- Author: Nam Cao <namcao@linutronix.de>
+
+Description
+-----------
+
+Real-time applications may have design flaws such that they experience
+unexpected latency and fail to meet their time requirements. Often, these flaws
+follow a few patterns:
+
+  - Page faults: A real-time thread may access memory that does not have a
+    mapped physical backing or must first be copied (such as for copy-on-write).
+    Thus a page fault is raised and the kernel must first perform the expensive
+    action. This causes significant delays to the real-time thread
+  - Priority inversion: A real-time thread blocks waiting for a lower-priority
+    thread. This causes the real-time thread to effectively take on the
+    scheduling priority of the lower-priority thread. For example, the real-time
+    thread needs to access a shared resource that is protected by a
+    non-pi-mutex, but the mutex is currently owned by a non-real-time thread.
+
+The `rtapp` monitor detects these patterns. It aids developers to identify
+reasons for unexpected latency with real-time applications. It is a container of
+multiple sub-monitors described in the following sections.
+
+Monitor pagefault
++++++++++++++++++
+
+The `pagefault` monitor reports real-time tasks raising page faults. Its
+specification is::
+
+  RULE = always (RT imply not PAGEFAULT)
+
+To fix warnings reported by this monitor, `mlockall()` or `mlock()` can be used
+to ensure physical backing for memory.
+
+This monitor may have false negatives because the pages used by the real-time
+threads may just happen to be directly available during testing.  To minimize
+this, the system can be put under memory pressure (e.g.  invoking the OOM killer
+using a program that does `ptr = malloc(SIZE_OF_RAM); memset(ptr, 0,
+SIZE_OF_RAM);`) so that the kernel executes aggressive strategies to recycle as
+much physical memory as possible.
+
+Monitor sleep
++++++++++++++
+
+The `sleep` monitor reports real-time threads sleeping in a manner that may
+cause undesirable latency. Real-time applications should only put a real-time
+thread to sleep for one of the following reasons:
+
+  - Cyclic work: real-time thread sleeps waiting for the next cycle. For this
+    case, only the `clock_nanosleep` syscall should be used with `TIMER_ABSTIME`
+    (to avoid time drift) and `CLOCK_MONOTONIC` (to avoid the clock being
+    changed). No other method is safe for real-time. For example, threads
+    waiting for timerfd can be woken by softirq which provides no real-time
+    guarantee.
+  - Real-time thread waiting for something to happen (e.g. another thread
+    releasing shared resources, or a completion signal from another thread). In
+    this case, only futexes (FUTEX_LOCK_PI, FUTEX_LOCK_PI2 or one of
+    FUTEX_WAIT_*) should be used.  Applications usually do not use futexes
+    directly, but use PI mutexes and PI condition variables which are built on
+    top of futexes. Be aware that the C library might not implement conditional
+    variables as safe for real-time. As an alternative, the librtpi library
+    exists to provide a conditional variable implementation that is correct for
+    real-time applications in Linux.
+
+Beside the reason for sleeping, the eventual waker should also be
+real-time-safe. Namely, one of:
+
+  - An equal-or-higher-priority thread
+  - Hard interrupt handler
+  - Non-maskable interrupt handler
+
+This monitor's warning usually means one of the following:
+
+  - Real-time thread is blocked by a non-real-time thread (e.g. due to
+    contention on a mutex without priority inheritance). This is priority
+    inversion.
+  - Time-critical work waits for something which is not safe for real-time (e.g.
+    timerfd).
+  - The work executed by the real-time thread does not need to run at real-time
+    priority at all.  This is not a problem for the real-time thread itself, but
+    it is potentially taking the CPU away from other important real-time work.
+
+Application developers may purposely choose to have their real-time application
+sleep in a way that is not safe for real-time. It is debatable whether that is a
+problem. Application developers must analyze the warnings to make a proper
+assessment.
+
+The monitor's specification is::
+
+  RULE = always ((RT and SLEEP) imply (RT_FRIENDLY_SLEEP or ALLOWLIST))
+
+  RT_FRIENDLY_SLEEP = (RT_VALID_SLEEP_REASON or KERNEL_THREAD)
+                  and ((not WAKE) until RT_FRIENDLY_WAKE)
+
+  RT_VALID_SLEEP_REASON = FUTEX_WAIT
+                       or RT_FRIENDLY_NANOSLEEP
+
+  RT_FRIENDLY_NANOSLEEP = CLOCK_NANOSLEEP
+                      and NANOSLEEP_TIMER_ABSTIME
+                      and NANOSLEEP_CLOCK_MONOTONIC
+
+  RT_FRIENDLY_WAKE = WOKEN_BY_EQUAL_OR_HIGHER_PRIO
+                  or WOKEN_BY_HARDIRQ
+                  or WOKEN_BY_NMI
+                  or KTHREAD_SHOULD_STOP
+
+  ALLOWLIST = BLOCK_ON_RT_MUTEX
+           or FUTEX_LOCK_PI
+           or TASK_IS_RCU
+           or TASK_IS_MIGRATION
+
+Beside the scenarios described above, this specification also handle some
+special cases:
+
+  - `KERNEL_THREAD`: kernel tasks do not have any pattern that can be recognized
+    as valid real-time sleeping reasons. Therefore sleeping reason is not
+    checked for kernel tasks.
+  - `KTHREAD_SHOULD_STOP`: a non-real-time thread may stop a real-time kernel
+    thread by waking it and waiting for it to exit (`kthread_stop()`). This
+    wakeup is safe for real-time.
+  - `ALLOWLIST`: to handle known false positives with the kernel.
+  - `BLOCK_ON_RT_MUTEX` is included in the allowlist due to its implementation.
+    In the release path of rt_mutex, a boosted task is de-boosted before waking
+    the rt_mutex's waiter. Consequently, the monitor may see a real-time-unsafe
+    wakeup (e.g. non-real-time task waking real-time task). This is actually
+    real-time-safe because preemption is disabled for the duration.
+  - `FUTEX_LOCK_PI` is included in the allowlist for the same reason as
+    `BLOCK_ON_RT_MUTEX`.
diff --git a/Documentation/trace/rv/monitor_sched.rst b/Documentation/trace/rv/monitor_sched.rst
index 24b2c62a3bc2..3f8381ad9ec7 100644
--- a/Documentation/trace/rv/monitor_sched.rst
+++ b/Documentation/trace/rv/monitor_sched.rst
@@ -40,26 +40,6 @@ defined in by Daniel Bristot in [1].
 
 Currently we included the following:
 
-Monitor tss
-~~~~~~~~~~~
-
-The task switch while scheduling (tss) monitor ensures a task switch happens
-only in scheduling context, that is inside a call to `__schedule`::
-
-                     |
-                     |
-                     v
-                   +-----------------+
-                   |     thread      | <+
-                   +-----------------+  |
-                     |                  |
-                     | schedule_entry   | schedule_exit
-                     v                  |
-    sched_switch                        |
-  +---------------                      |
-  |                       sched         |
-  +-------------->                     -+
-
 Monitor sco
 ~~~~~~~~~~~
 
@@ -144,26 +124,277 @@ does not enable preemption::
                                                   |
                           scheduling_contex      -+
 
-Monitor sncid
-~~~~~~~~~~~~~
+Monitor sts
+~~~~~~~~~~~
 
-The schedule not called with interrupt disabled (sncid) monitor ensures
-schedule is not called with interrupt disabled::
+The schedule implies task switch (sts) monitor ensures a task switch happens
+only in scheduling context and up to once, as well as scheduling occurs with
+interrupts enabled but no task switch can happen before interrupts are
+disabled. When the next task picked for execution is the same as the previously
+running one, no real task switch occurs but interrupts are disabled nonetheless::
 
-                       |
-                       |
-                       v
-    schedule_entry   +--------------+
-    schedule_exit    |              |
-  +----------------- |  can_sched   |
-  |                  |              |
-  +----------------> |              | <+
-                     +--------------+  |
-                       |               |
-                       | irq_disable   | irq_enable
-                       v               |
-                                       |
-                        cant_sched    -+
+    irq_entry                      |
+     +----+                        |
+     v    |                        v
+ +------------+ irq_enable    #===================#   irq_disable
+ |            | ------------> H                   H   irq_entry
+ | cant_sched | <------------ H                   H   irq_enable
+ |            | irq_disable   H     can_sched     H --------------+
+ +------------+               H                   H               |
+                              H                   H               |
+            +---------------> H                   H <-------------+
+            |                 #===================#
+            |                   |
+      schedule_exit             | schedule_entry
+            |                   v
+            |   +-------------------+     irq_enable
+            |   |    scheduling     | <---------------+
+            |   +-------------------+                 |
+            |     |                                   |
+            |     | irq_disable                    +--------+  irq_entry
+            |     v                                |        | --------+
+            |   +-------------------+  irq_entry   | in_irq |         |
+            |   |                   | -----------> |        | <-------+
+            |   | disable_to_switch |              +--------+
+            |   |                   | --+
+            |   +-------------------+   |
+            |     |                     |
+            |     | sched_switch        |
+            |     v                     |
+            |   +-------------------+   |
+            |   |     switching     |   | irq_enable
+            |   +-------------------+   |
+            |     |                     |
+            |     | irq_enable          |
+            |     v                     |
+            |   +-------------------+   |
+            +-- |  enable_to_exit   | <-+
+                +-------------------+
+                  ^               | irq_disable
+                  |               | irq_entry
+                  +---------------+ irq_enable
+
+Monitor nrp
+-----------
+
+The need resched preempts (nrp) monitor ensures preemption requires
+``need_resched``. Only kernel preemption is considered, since preemption
+while returning to userspace, for this monitor, is indistinguishable from
+``sched_switch_yield`` (described in the sssw monitor).
+A kernel preemption is whenever ``__schedule`` is called with the preemption
+flag set to true (e.g. from preempt_enable or exiting from interrupts). This
+type of preemption occurs after the need for ``rescheduling`` has been set.
+This is not valid for the *lazy* variant of the flag, which causes only
+userspace preemption.
+A ``schedule_entry_preempt`` may involve a task switch or not, in the latter
+case, a task goes through the scheduler from a preemption context but it is
+picked as the next task to run. Since the scheduler runs, this clears the need
+to reschedule. The ``any_thread_running`` state does not imply the monitored
+task is not running as this monitor does not track the outcome of scheduling.
+
+In theory, a preemption can only occur after the ``need_resched`` flag is set. In
+practice, however, it is possible to see a preemption where the flag is not
+set. This can happen in one specific condition::
+
+  need_resched
+                   preempt_schedule()
+                                           preempt_schedule_irq()
+                                                   __schedule()
+  !need_resched
+                           __schedule()
+
+In the situation above, standard preemption starts (e.g. from preempt_enable
+when the flag is set), an interrupt occurs before scheduling and, on its exit
+path, it schedules, which clears the ``need_resched`` flag.
+When the preempted task runs again, the standard preemption started earlier
+resumes, although the flag is no longer set. The monitor considers this a
+``nested_preemption``, this allows another preemption without re-setting the
+flag. This condition relaxes the monitor constraints and may catch false
+negatives (i.e. no real ``nested_preemptions``) but makes the monitor more
+robust and able to validate other scenarios.
+For simplicity, the monitor starts in ``preempt_irq``, although no interrupt
+occurred, as the situation above is hard to pinpoint::
+
+    schedule_entry
+    irq_entry                 #===========================================#
+  +-------------------------- H                                           H
+  |                           H                                           H
+  +-------------------------> H             any_thread_running            H
+                              H                                           H
+  +-------------------------> H                                           H
+  |                           #===========================================#
+  | schedule_entry              |                       ^
+  | schedule_entry_preempt      | sched_need_resched    | schedule_entry
+  |                             |                      schedule_entry_preempt
+  |                             v                       |
+  |                           +----------------------+  |
+  |                      +--- |                      |  |
+  |   sched_need_resched |    |     rescheduling     | -+
+  |                      +--> |                      |
+  |                           +----------------------+
+  |                             | irq_entry
+  |                             v
+  |                           +----------------------+
+  |                           |                      | ---+
+  |                      ---> |                      |    | sched_need_resched
+  |                           |      preempt_irq     |    | irq_entry
+  |                           |                      | <--+
+  |                           |                      | <--+
+  |                           +----------------------+    |
+  |                             | schedule_entry          | sched_need_resched
+  |                             | schedule_entry_preempt  |
+  |                             v                         |
+  |                           +-----------------------+   |
+  +-------------------------- |    nested_preempt     | --+
+                              +-----------------------+
+                                ^ irq_entry         |
+                                +-------------------+
+
+Due to how the ``need_resched`` flag on the preemption count works on arm64,
+this monitor is unstable on that architecture, as it often records preemption
+when the flag is not set, even in presence of the workaround above.
+For the time being, the monitor is disabled by default on arm64.
+
+Monitor sssw
+------------
+
+The set state sleep and wakeup (sssw) monitor ensures ``set_state`` to
+sleepable leads to sleeping and sleeping tasks require wakeup. It includes the
+following types of switch:
+
+* ``switch_suspend``:
+  a task puts itself to sleep, this can happen only after explicitly setting
+  the task to ``sleepable``. After a task is suspended, it needs to be woken up
+  (``waking`` state) before being switched in again.
+  Setting the task's state to ``sleepable`` can be reverted before switching if it
+  is woken up or set to ``runnable``.
+* ``switch_blocking``:
+  a special case of a ``switch_suspend`` where the task is waiting on a
+  sleeping RT lock (``PREEMPT_RT`` only), it is common to see wakeup and set
+  state events racing with each other and this leads the model to perceive this
+  type of switch when the task is not set to sleepable. This is a limitation of
+  the model in SMP system and workarounds may slow down the system.
+* ``switch_preempt``:
+  a task switch as a result of kernel preemption (``schedule_entry_preempt`` in
+  the nrp model).
+* ``switch_yield``:
+  a task explicitly calls the scheduler or is preempted while returning to
+  userspace. It can happen after a ``yield`` system call, from the idle task or
+  if the ``need_resched`` flag is set. By definition, a task cannot yield while
+  ``sleepable`` as that would be a suspension. A special case of a yield occurs
+  when a task in ``TASK_INTERRUPTIBLE`` calls the scheduler while a signal is
+  pending. The task doesn't go through the usual blocking/waking and is set
+  back to runnable, the resulting switch (if there) looks like a yield to the
+  ``signal_wakeup`` state and is followed by the signal delivery. From this
+  state, the monitor expects a signal even if it sees a wakeup event, although
+  not necessary, to rule out false negatives.
+
+This monitor doesn't include a running state, ``sleepable`` and ``runnable``
+are only referring to the task's desired state, which could be scheduled out
+(e.g. due to preemption). However, it does include the event
+``sched_switch_in`` to represent when a task is allowed to become running. This
+can be triggered also by preemption, but cannot occur after the task got to
+``sleeping`` before a ``wakeup`` occurs::
+
+   +--------------------------------------------------------------------------+
+   |                                                                          |
+   |                                                                          |
+   | switch_suspend           |                                               |
+   | switch_blocking          |                                               |
+   v                          v                                               |
+ +----------+              #==========================#   set_state_runnable  |
+ |          |              H                          H   wakeup              |
+ |          |              H                          H   switch_in           |
+ |          |              H                          H   switch_yield        |
+ | sleeping |              H                          H   switch_preempt      |
+ |          |              H                          H   signal_deliver      |
+ |          |  switch_     H                          H ------+               |
+ |          |  _blocking   H         runnable         H       |               |
+ |          | <----------- H                          H <-----+               |
+ +----------+              H                          H                       |
+   |   wakeup              H                          H                       |
+   +---------------------> H                          H                       |
+                           H                          H                       |
+               +---------> H                          H                       |
+               |           #==========================#                       |
+               |             |                ^                               |
+               |             |                | set_state_runnable            |
+               |             |                | wakeup                        |
+               |    set_state_sleepable       |      +------------------------+
+               |             v                |      |
+               |           +--------------------------+  set_state_sleepable
+               |           |                          |  switch_in
+               |           |                          |  switch_preempt
+   signal_deliver          |        sleepable         |  signal_deliver
+               |           |                          | ------+
+               |           |                          |       |
+               |           |                          | <-----+
+               |           +--------------------------+
+               |             |                ^
+               |        switch_yield          | set_state_sleepable
+               |             v                |
+               |           +---------------+  |
+               +---------- | signal_wakeup | -+
+                           +---------------+
+                             ^           | switch_in
+                             |           | switch_preempt
+                             |           | switch_yield
+                             +-----------+ wakeup
+
+Monitor opid
+------------
+
+The operations with preemption and irq disabled (opid) monitor ensures
+operations like ``wakeup`` and ``need_resched`` occur with interrupts and
+preemption disabled or during interrupt context, in such case preemption may
+not be disabled explicitly.
+``need_resched`` can be set by some RCU internals functions, in which case it
+doesn't match a task wakeup and might occur with only interrupts disabled::
+
+                 |                     sched_need_resched
+                 |                     sched_waking
+                 |                     irq_entry
+                 |                   +--------------------+
+                 v                   v                    |
+               +------------------------------------------------------+
+  +----------- |                     disabled                         | <+
+  |            +------------------------------------------------------+  |
+  |              |                 ^                                     |
+  |              |          preempt_disable      sched_need_resched      |
+  |       preempt_enable           |           +--------------------+    |
+  |              v                 |           v                    |    |
+  |            +------------------------------------------------------+  |
+  |            |                   irq_disabled                       |  |
+  |            +------------------------------------------------------+  |
+  |                              |             |        ^                |
+  |     irq_entry            irq_entry         |        |                |
+  |     sched_need_resched       v             |   irq_disable           |
+  |     sched_waking +--------------+          |        |                |
+  |           +----- |              |     irq_enable    |                |
+  |           |      |    in_irq    |          |        |                |
+  |           +----> |              |          |        |                |
+  |                  +--------------+          |        |          irq_disable
+  |                     |                      |        |                |
+  | irq_enable          | irq_enable           |        |                |
+  |                     v                      v        |                |
+  |            #======================================================#  |
+  |            H                     enabled                          H  |
+  |            #======================================================#  |
+  |              |                   ^         ^ preempt_enable     |    |
+  |       preempt_disable     preempt_enable   +--------------------+    |
+  |              v                   |                                   |
+  |            +------------------+  |                                   |
+  +----------> | preempt_disabled | -+                                   |
+               +------------------+                                      |
+                 |                                                       |
+                 +-------------------------------------------------------+
+
+This monitor is designed to work on ``PREEMPT_RT`` kernels, the special case of
+events occurring in interrupt context is a shortcut to identify valid scenarios
+where the preemption tracepoints might not be visible, during interrupts
+preemption is always disabled. On non- ``PREEMPT_RT`` kernels, the interrupts
+might invoke a softirq to set ``need_resched`` and wake up a task. This is
+another special case that is currently not supported by the monitor.
 
 References
 ----------
diff --git a/Documentation/trace/rv/monitor_synthesis.rst b/Documentation/trace/rv/monitor_synthesis.rst
new file mode 100644
index 000000000000..ac808a7554f5
--- /dev/null
+++ b/Documentation/trace/rv/monitor_synthesis.rst
@@ -0,0 +1,271 @@
+Runtime Verification Monitor Synthesis
+======================================
+
+The starting point for the application of runtime verification (RV) techniques
+is the *specification* or *modeling* of the desired (or undesired) behavior
+of the system under scrutiny.
+
+The formal representation needs to be then *synthesized* into a *monitor*
+that can then be used in the analysis of the trace of the system. The
+*monitor* connects to the system via an *instrumentation* that converts
+the events from the *system* to the events of the *specification*.
+
+
+In Linux terms, the runtime verification monitors are encapsulated inside
+the *RV monitor* abstraction. The RV monitor includes a set of instances
+of the monitor (per-cpu monitor, per-task monitor, and so on), the helper
+functions that glue the monitor to the system reference model, and the
+trace output as a reaction to event parsing and exceptions, as depicted
+below::
+
+ Linux  +----- RV Monitor ----------------------------------+ Formal
+  Realm |                                                   |  Realm
+  +-------------------+     +----------------+     +-----------------+
+  |   Linux kernel    |     |     Monitor    |     |     Reference   |
+  |     Tracing       |  -> |   Instance(s)  | <-  |       Model     |
+  | (instrumentation) |     | (verification) |     | (specification) |
+  +-------------------+     +----------------+     +-----------------+
+         |                          |                       |
+         |                          V                       |
+         |                     +----------+                 |
+         |                     | Reaction |                 |
+         |                     +--+--+--+-+                 |
+         |                        |  |  |                   |
+         |                        |  |  +-> trace output ?  |
+         +------------------------|--|----------------------+
+                                  |  +----> panic ?
+                                  +-------> <user-specified>
+
+RV monitor synthesis
+--------------------
+
+The synthesis of a specification into the Linux *RV monitor* abstraction is
+automated by the rvgen tool and the header file containing common code for
+creating monitors. The header files are:
+
+  * rv/da_monitor.h for deterministic automaton monitor.
+  * rv/ltl_monitor.h for linear temporal logic monitor.
+
+rvgen
+-----
+
+The rvgen utility converts a specification into the C presentation and creating
+the skeleton of a kernel monitor in C.
+
+For example, it is possible to transform the wip.dot model present in
+[1] into a per-cpu monitor with the following command::
+
+  $ rvgen monitor -c da -s wip.dot -t per_cpu
+
+This will create a directory named wip/ with the following files:
+
+- wip.h: the wip model in C
+- wip.c: the RV monitor
+
+The wip.c file contains the monitor declaration and the starting point for
+the system instrumentation.
+
+Similarly, a linear temporal logic monitor can be generated with the following
+command::
+
+  $ rvgen monitor -c ltl -s pagefault.ltl -t per_task
+
+This generates pagefault/ directory with:
+
+- pagefault.h: The Buchi automaton (the non-deterministic state machine to
+  verify the specification)
+- pagefault.c: The skeleton for the RV monitor
+
+Monitor header files
+--------------------
+
+The header files:
+
+- `rv/da_monitor.h` for deterministic automaton monitor
+- `rv/ltl_monitor` for linear temporal logic monitor
+
+include common macros and static functions for implementing *Monitor
+Instance(s)*.
+
+The benefits of having all common functionalities in a single header file are
+3-fold:
+
+  - Reduce the code duplication;
+  - Facilitate the bug fix/improvement;
+  - Avoid the case of developers changing the core of the monitor code to
+    manipulate the model in a (let's say) non-standard way.
+
+rv/da_monitor.h
++++++++++++++++
+
+This initial implementation presents three different types of monitor instances:
+
+- ``#define DECLARE_DA_MON_GLOBAL(name, type)``
+- ``#define DECLARE_DA_MON_PER_CPU(name, type)``
+- ``#define DECLARE_DA_MON_PER_TASK(name, type)``
+
+The first declares the functions for a global deterministic automata monitor,
+the second for monitors with per-cpu instances, and the third with per-task
+instances.
+
+In all cases, the 'name' argument is a string that identifies the monitor, and
+the 'type' argument is the data type used by rvgen on the representation of
+the model in C.
+
+For example, the wip model with two states and three events can be
+stored in an 'unsigned char' type. Considering that the preemption control
+is a per-cpu behavior, the monitor declaration in the 'wip.c' file is::
+
+  DECLARE_DA_MON_PER_CPU(wip, unsigned char);
+
+The monitor is executed by sending events to be processed via the functions
+presented below::
+
+  da_handle_event_$(MONITOR_NAME)($(event from event enum));
+  da_handle_start_event_$(MONITOR_NAME)($(event from event enum));
+  da_handle_start_run_event_$(MONITOR_NAME)($(event from event enum));
+
+The function ``da_handle_event_$(MONITOR_NAME)()`` is the regular case where
+the event will be processed if the monitor is processing events.
+
+When a monitor is enabled, it is placed in the initial state of the automata.
+However, the monitor does not know if the system is in the *initial state*.
+
+The ``da_handle_start_event_$(MONITOR_NAME)()`` function is used to notify the
+monitor that the system is returning to the initial state, so the monitor can
+start monitoring the next event.
+
+The ``da_handle_start_run_event_$(MONITOR_NAME)()`` function is used to notify
+the monitor that the system is known to be in the initial state, so the
+monitor can start monitoring and monitor the current event.
+
+Using the wip model as example, the events "preempt_disable" and
+"sched_waking" should be sent to monitor, respectively, via [2]::
+
+  da_handle_event_wip(preempt_disable_wip);
+  da_handle_event_wip(sched_waking_wip);
+
+While the event "preempt_enabled" will use::
+
+  da_handle_start_event_wip(preempt_enable_wip);
+
+To notify the monitor that the system will be returning to the initial state,
+so the system and the monitor should be in sync.
+
+rv/ltl_monitor.h
+++++++++++++++++
+This file must be combined with the $(MODEL_NAME).h file (generated by `rvgen`)
+to be complete. For example, for the `pagefault` monitor, the `pagefault.c`
+source file must include::
+
+  #include "pagefault.h"
+  #include <rv/ltl_monitor.h>
+
+(the skeleton monitor file generated by `rvgen` already does this).
+
+`$(MODEL_NAME).h` (`pagefault.h` in the above example) includes the
+implementation of the Buchi automaton - a non-deterministic state machine that
+verifies the LTL specification. While `rv/ltl_monitor.h` includes the common
+helper functions to interact with the Buchi automaton and to implement an RV
+monitor. An important definition in `$(MODEL_NAME).h` is::
+
+  enum ltl_atom {
+      LTL_$(FIRST_ATOMIC_PROPOSITION),
+      LTL_$(SECOND_ATOMIC_PROPOSITION),
+      ...
+      LTL_NUM_ATOM
+  };
+
+which is the list of atomic propositions present in the LTL specification
+(prefixed with "LTL\_" to avoid name collision). This `enum` is passed to the
+functions interacting with the Buchi automaton.
+
+While generating code, `rvgen` cannot understand the meaning of the atomic
+propositions. Thus, that task is left for manual work. The recommended pratice
+is adding tracepoints to places where the atomic propositions change; and in the
+tracepoints' handlers: the Buchi automaton is executed using::
+
+  void ltl_atom_update(struct task_struct *task, enum ltl_atom atom, bool value)
+
+which tells the Buchi automaton that the atomic proposition `atom` is now
+`value`. The Buchi automaton checks whether the LTL specification is still
+satisfied, and invokes the monitor's error tracepoint and the reactor if
+violation is detected.
+
+Tracepoints and `ltl_atom_update()` should be used whenever possible. However,
+it is sometimes not the most convenient. For some atomic propositions which are
+changed in multiple places in the kernel, it is cumbersome to trace all those
+places. Furthermore, it may not be important that the atomic propositions are
+updated at precise times. For example, considering the following linear temporal
+logic::
+
+  RULE = always (RT imply not PAGEFAULT)
+
+This LTL states that a real-time task does not raise page faults. For this
+specification, it is not important when `RT` changes, as long as it has the
+correct value when `PAGEFAULT` is true.  Motivated by this case, another
+function is introduced::
+
+  void ltl_atom_fetch(struct task_struct *task, struct ltl_monitor *mon)
+
+This function is called whenever the Buchi automaton is triggered. Therefore, it
+can be manually implemented to "fetch" `RT`::
+
+  void ltl_atom_fetch(struct task_struct *task, struct ltl_monitor *mon)
+  {
+      ltl_atom_set(mon, LTL_RT, rt_task(task));
+  }
+
+Effectively, whenever `PAGEFAULT` is updated with a call to `ltl_atom_update()`,
+`RT` is also fetched. Thus, the LTL specification can be verified without
+tracing `RT` everywhere.
+
+For atomic propositions which act like events, they usually need to be set (or
+cleared) and then immediately cleared (or set). A convenient function is
+provided::
+
+  void ltl_atom_pulse(struct task_struct *task, enum ltl_atom atom, bool value)
+
+which is equivalent to::
+
+  ltl_atom_update(task, atom, value);
+  ltl_atom_update(task, atom, !value);
+
+To initialize the atomic propositions, the following function must be
+implemented::
+
+  ltl_atoms_init(struct task_struct *task, struct ltl_monitor *mon, bool task_creation)
+
+This function is called for all running tasks when the monitor is enabled. It is
+also called for new tasks created after the enabling the monitor. It should
+initialize as many atomic propositions as possible, for example::
+
+  void ltl_atom_init(struct task_struct *task, struct ltl_monitor *mon, bool task_creation)
+  {
+      ltl_atom_set(mon, LTL_RT, rt_task(task));
+      if (task_creation)
+          ltl_atom_set(mon, LTL_PAGEFAULT, false);
+  }
+
+Atomic propositions not initialized by `ltl_atom_init()` will stay in the
+unknown state until relevant tracepoints are hit, which can take some time. As
+monitoring for a task cannot be done until all atomic propositions is known for
+the task, the monitor may need some time to start validating tasks which have
+been running before the monitor is enabled. Therefore, it is recommended to
+start the tasks of interest after enabling the monitor.
+
+Final remarks
+-------------
+
+With the monitor synthesis in place using the header files and
+rvgen, the developer's work should be limited to the instrumentation
+of the system, increasing the confidence in the overall approach.
+
+[1] For details about deterministic automata format and the translation
+from one representation to another, see::
+
+  Documentation/trace/rv/deterministic_automata.rst
+
+[2] rvgen appends the monitor's name suffix to the events enums to
+avoid conflicting variables when exporting the global vmlinux.h
+use by BPF programs.
diff --git a/Documentation/trace/tracepoints.rst b/Documentation/trace/tracepoints.rst
index decabcc77b56..b35c40e3abbe 100644
--- a/Documentation/trace/tracepoints.rst
+++ b/Documentation/trace/tracepoints.rst
@@ -71,7 +71,7 @@ In subsys/file.c (where the tracing statement must be added)::
 	void somefct(void)
 	{
 		...
-		trace_subsys_eventname(arg, task);
+		trace_subsys_eventname_tp(arg, task);
 		...
 	}
 
@@ -129,12 +129,12 @@ within an if statement with the following::
 		for (i = 0; i < count; i++)
 			tot += calculate_nuggets();
 
-		trace_foo_bar(tot);
+		trace_foo_bar_tp(tot);
 	}
 
-All trace_<tracepoint>() calls have a matching trace_<tracepoint>_enabled()
+All trace_<tracepoint>_tp() calls have a matching trace_<tracepoint>_enabled()
 function defined that returns true if the tracepoint is enabled and
-false otherwise. The trace_<tracepoint>() should always be within the
+false otherwise. The trace_<tracepoint>_tp() should always be within the
 block of the if (trace_<tracepoint>_enabled()) to prevent races between
 the tracepoint being enabled and the check being seen.
 
@@ -143,7 +143,10 @@ the static_key of the tracepoint to allow the if statement to be implemented
 with jump labels and avoid conditional branches.
 
 .. note:: The convenience macro TRACE_EVENT provides an alternative way to
-      define tracepoints. Check http://lwn.net/Articles/379903,
+      define tracepoints. Note, DECLARE_TRACE(foo) creates a function
+      "trace_foo_tp()" whereas TRACE_EVENT(foo) creates a function
+      "trace_foo()", and also exposes the tracepoint as a trace event in
+      /sys/kernel/tracing/events directory.  Check http://lwn.net/Articles/379903,
       http://lwn.net/Articles/381064 and http://lwn.net/Articles/383362
       for a series of articles with more details.
 
@@ -159,7 +162,9 @@ In a C file::
 
 	void do_trace_foo_bar_wrapper(args)
 	{
-		trace_foo_bar(args);
+		trace_foo_bar_tp(args); // for tracepoints created via DECLARE_TRACE
+					//   or
+		trace_foo_bar(args);    // for tracepoints created via TRACE_EVENT
 	}
 
 In the header file::