Diffstat (limited to 'Documentation/trace')
-rw-r--r--  Documentation/trace/debugging.rst                |  159
-rw-r--r--  Documentation/trace/events.rst                   |   24
-rw-r--r--  Documentation/trace/fprobe.rst                   |   42
-rw-r--r--  Documentation/trace/fprobetrace.rst              |   31
-rw-r--r--  Documentation/trace/ftrace-design.rst            |   12
-rw-r--r--  Documentation/trace/ftrace.rst                   |   27
-rw-r--r--  Documentation/trace/hisi-ptt.rst                 |    4
-rw-r--r--  Documentation/trace/histogram.rst                |    2
-rw-r--r--  Documentation/trace/index.rst                    |    2
-rw-r--r--  Documentation/trace/kprobes.rst                  |    1
-rw-r--r--  Documentation/trace/kprobetrace.rst              |   17
-rw-r--r--  Documentation/trace/osnoise-tracer.rst           |    2
-rw-r--r--  Documentation/trace/ring-buffer-map.rst          |  106
-rw-r--r--  Documentation/trace/rv/runtime-verification.rst  |    4
-rw-r--r--  Documentation/trace/tracepoints.rst              |    2
-rw-r--r--  Documentation/trace/user_events.rst              |   27
16 files changed, 421 insertions, 41 deletions
diff --git a/Documentation/trace/debugging.rst b/Documentation/trace/debugging.rst
new file mode 100644
index 000000000000..54fb16239d70
--- /dev/null
+++ b/Documentation/trace/debugging.rst
@@ -0,0 +1,159 @@
+==============================
+Using the tracer for debugging
+==============================
+
+Copyright 2024 Google LLC.
+
+:Author: Steven Rostedt <rostedt@goodmis.org>
+:License: The GNU Free Documentation License, Version 1.2
+ (dual licensed under the GPL v2)
+
+- Written for: 6.12
+
+Introduction
+------------
+The tracing infrastructure can be very useful for debugging the Linux
+kernel. This document is a place to add various methods of using the tracer
+for debugging.
+
+First, make sure that the tracefs file system is mounted::
+
+ $ sudo mount -t tracefs tracefs /sys/kernel/tracing
+
+
+Using trace_printk()
+--------------------
+
+trace_printk() is a very lightweight utility that can be used in any context
+inside the kernel, with the exception of "noinstr" sections. It can be used
+in normal, softirq, interrupt and even NMI context. The trace data is
+written to the tracing ring buffer in a lockless way. To make it even
+lighter weight, when possible, it will only record the pointer to the format
+string, and save the raw arguments into the buffer. The format and the
+arguments will be post processed when the ring buffer is read. This way the
+trace_printk() format conversions are not done during the hot path, where
+the trace is being recorded.
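+
+For example, during a debugging session one might temporarily add a line
+like the following to the code being investigated (the function and the
+structure fields here are made up for illustration)::
+
+	static void my_driver_handle_irq(struct my_dev *dev)
+	{
+		/* Record the device state seen at interrupt time */
+		trace_printk("irq: status=%x pending=%d\n",
+			     dev->status, dev->pending);
+	}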
+
+trace_printk() is meant only for debugging, and should never be added into
+a subsystem of the kernel. If you need debugging traces, add trace events
+instead. If a trace_printk() is found in the kernel, the following will
+appear in the dmesg::
+
+ **********************************************************
+ ** NOTICE NOTICE NOTICE NOTICE NOTICE NOTICE NOTICE **
+ ** **
+ ** trace_printk() being used. Allocating extra memory. **
+ ** **
+ ** This means that this is a DEBUG kernel and it is **
+ ** unsafe for production use. **
+ ** **
+ ** If you see this message and you are not debugging **
+ ** the kernel, report this immediately to your vendor! **
+ ** **
+ ** NOTICE NOTICE NOTICE NOTICE NOTICE NOTICE NOTICE **
+ **********************************************************
+
+Debugging kernel crashes
+------------------------
+There are various methods of acquiring the state of the system when a kernel
+crash occurs. This could be from the oops message in printk, or one could
+use kexec/kdump. But these just show what happened at the time of the crash.
+It can also be very useful to know what happened leading up to the crash.
+The tracing ring buffer, by default, is a circular buffer that will
+overwrite older events with newer ones. When a crash happens, the content of
+the ring buffer will be all the events that led up to the crash.
+
+There are several kernel command line parameters that can be used to help in
+this. The first is "ftrace_dump_on_oops". This will dump the tracing ring
+buffer to the console when an oops occurs. This can be useful if the console
+is being logged somewhere. If a serial console is used, it may be prudent to
+make sure the ring buffer is relatively small, otherwise the dumping of the
+ring buffer may take several minutes to hours to finish. Here's an example
+of the kernel command line::
+
+ ftrace_dump_on_oops trace_buf_size=50K
+
+Note, the tracing buffer is made up of per CPU buffers where each of these
+buffers is broken up into sub-buffers that are by default PAGE_SIZE. The
+trace_buf_size option above sets each of the per CPU buffers to 50K,
+so, on a machine with 8 CPUs, that's actually 400K total.
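+
+The resulting sizes can be checked at run time (a quick sanity check; the
+exact values reported may be rounded by the ring buffer)::
+
+	# cat /sys/kernel/tracing/buffer_size_kb
+	# cat /sys/kernel/tracing/buffer_total_size_kb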
+
+Persistent buffers across boots
+-------------------------------
+If the system memory allows it, the tracing ring buffer can be specified at
+a specific location in memory. If the location is the same across boots and
+the memory is not modified, the tracing buffer can be retrieved from the
+following boot. There are two ways to reserve memory for the use of the ring
+buffer.
+
+The more reliable way (on x86) is to reserve memory with the "memmap" kernel
+command line option and then use that memory for the trace_instance. This
+requires a bit of knowledge of the physical memory layout of the system. The
+advantage of using this method is that the memory for the ring buffer will
+always be the same::
+
+ memmap=12M$0x284500000 trace_instance=boot_map@0x284500000:12M
+
+The memmap above reserves 12 megabytes of memory at the physical memory
+location 0x284500000. Then the trace_instance option will create a trace
+instance "boot_map" at that same location with the same amount of memory
+reserved. As the ring buffer is broken up into per CPU buffers, the 12
+megabytes will be divided evenly between those CPUs. If you have 8 CPUs,
+each per CPU ring buffer will be 1.5 megabytes in size. Note that this also
+includes metadata, so the amount of memory actually used by the ring buffer
+will be slightly smaller.
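+
+If the same kernel finds the previous boot's buffer intact, its events can
+be read from the new instance after the reboot (assuming the "boot_map"
+name used in the example above)::
+
+	# cat /sys/kernel/tracing/instances/boot_map/trace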
+
+Another more generic but less robust way to allocate a ring buffer mapping
+at boot is with the "reserve_mem" option::
+
+ reserve_mem=12M:4096:trace trace_instance=boot_map@trace
+
+The reserve_mem option above will find 12 megabytes of available memory at
+boot up, aligned to 4096 bytes. It labels this memory "trace" so that it
+can be used by later command line options.
+
+The trace_instance option creates a "boot_map" instance and will use the
+memory reserved by reserve_mem that was labeled as "trace". This method is
+more generic but may not be as reliable. Due to KASLR, the memory reserved
+by reserve_mem may not be located at the same place on every boot. If this
+happens, then the ring buffer will not contain data from the previous boot
+and will be reset.
+
+Sometimes a larger alignment can keep KASLR from moving things around in
+such a way that the location of the memory reserved by reserve_mem changes.
+By using a larger alignment, you may find that the buffer is placed more
+consistently from boot to boot::
+
+ reserve_mem=12M:0x2000000:trace trace_instance=boot_map@trace
+
+On boot up, the memory reserved for the ring buffer is validated. It will go
+through a series of tests to make sure that the ring buffer contains valid
+data. If the data is valid, it will then be made available to read from the
+instance. If it fails any of the tests, the entire ring buffer is cleared
+and initialized as new.
+
+The layout of this mapped memory may not be consistent from kernel to
+kernel, so only the same kernel is guaranteed to work if the mapping is
+preserved. Switching to a different kernel version may find a different
+layout and mark the buffer as invalid.
+
+Using trace_printk() in the boot instance
+-----------------------------------------
+By default, the content of trace_printk() goes into the top level tracing
+instance. But this instance is never preserved across boots. To have the
+trace_printk() content, and some other internal tracing (like stack dumps),
+go to the preserved buffer, either set the instance to be the trace_printk()
+destination from the kernel command line, or set it after boot up via the
+trace_printk_dest option.
+
+After boot up::
+
+ echo 1 > /sys/kernel/tracing/instances/boot_map/options/trace_printk_dest
+
+From the kernel command line::
+
+ reserve_mem=12M:4096:trace trace_instance=boot_map^traceprintk^traceoff@trace
+
+If setting it from the kernel command line, it is recommended to also
+disable tracing with the "traceoff" flag, and enable tracing after boot up.
+Otherwise the trace from the most recent boot will be mixed with the trace
+from the previous boot, which may make it confusing to read.
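+
+For example, after saving the previous boot's data, tracing in the instance
+can be turned back on (a sketch, assuming the "boot_map" instance and paths
+used above)::
+
+	# cat /sys/kernel/tracing/instances/boot_map/trace > /root/last-boot-trace.txt
+	# echo 1 > /sys/kernel/tracing/instances/boot_map/tracing_on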
diff --git a/Documentation/trace/events.rst b/Documentation/trace/events.rst
index 759907c20e75..2d88a2acacc0 100644
--- a/Documentation/trace/events.rst
+++ b/Documentation/trace/events.rst
@@ -55,6 +55,30 @@ command::
# echo 'irq:*' > /sys/kernel/tracing/set_event
+The set_event file may also be used to enable events associated with only
+a specific module::
+
+ # echo ':mod:<module>' > /sys/kernel/tracing/set_event
+
+This will enable all events in the module ``<module>``. If the module is not
+yet loaded, the string will be saved, and when a module that matches
+``<module>`` is loaded, its events will be enabled at that time.
+
+The text before ``:mod:`` will be parsed to select specific events that the
+module creates::
+
+ # echo '<match>:mod:<module>' > /sys/kernel/tracing/set_event
+
+The above will enable any system or event that ``<match>`` matches. If
+``<match>`` is ``"*"`` then it will match all events.
+
+To enable only a specific event within a system::
+
+ # echo '<system>:<event>:mod:<module>' > /sys/kernel/tracing/set_event
+
+If ``<event>`` is ``"*"`` then it will match all events within the system
+for a given module.
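+
+For example, to enable all events created by one module (the module name
+here is only illustrative)::
+
+	# echo '*:mod:i915' > /sys/kernel/tracing/set_event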
+
2.2 Via the 'enable' toggle
---------------------------
diff --git a/Documentation/trace/fprobe.rst b/Documentation/trace/fprobe.rst
index 196f52386aaa..71cd40472d36 100644
--- a/Documentation/trace/fprobe.rst
+++ b/Documentation/trace/fprobe.rst
@@ -9,9 +9,10 @@ Fprobe - Function entry/exit probe
Introduction
============
-Fprobe is a function entry/exit probe mechanism based on ftrace.
-Instead of using ftrace full feature, if you only want to attach callbacks
-on function entry and exit, similar to the kprobes and kretprobes, you can
+Fprobe is a function entry/exit probe based on the function-graph tracing
+feature in ftrace.
+Instead of tracing all functions, if you want to attach callbacks on specific
+function entry and exit, similar to the kprobes and kretprobes, you can
use fprobe. Compared with kprobes and kretprobes, fprobe gives faster
instrumentation for multiple functions with single handler. This document
describes how to use fprobe.
@@ -91,12 +92,14 @@ The prototype of the entry/exit callback function are as follows:
.. code-block:: c
- int entry_callback(struct fprobe *fp, unsigned long entry_ip, unsigned long ret_ip, struct pt_regs *regs, void *entry_data);
+ int entry_callback(struct fprobe *fp, unsigned long entry_ip, unsigned long ret_ip, struct ftrace_regs *fregs, void *entry_data);
- void exit_callback(struct fprobe *fp, unsigned long entry_ip, unsigned long ret_ip, struct pt_regs *regs, void *entry_data);
+ void exit_callback(struct fprobe *fp, unsigned long entry_ip, unsigned long ret_ip, struct ftrace_regs *fregs, void *entry_data);
-Note that the @entry_ip is saved at function entry and passed to exit handler.
-If the entry callback function returns !0, the corresponding exit callback will be cancelled.
+Note that the @entry_ip is saved at function entry and passed to exit
+handler.
+If the entry callback function returns !0, the corresponding exit callback
+will be cancelled.
@fp
This is the address of `fprobe` data structure related to this handler.
@@ -112,12 +115,10 @@ If the entry callback function returns !0, the corresponding exit callback will
This is the return address that the traced function will return to,
somewhere in the caller. This can be used at both entry and exit.
-@regs
- This is the `pt_regs` data structure at the entry and exit. Note that
- the instruction pointer of @regs may be different from the @entry_ip
- in the entry_handler. If you need traced instruction pointer, you need
- to use @entry_ip. On the other hand, in the exit_handler, the instruction
- pointer of @regs is set to the current return address.
+@fregs
+ This is the `ftrace_regs` data structure at the entry and exit. This
+ includes the function parameters, or the return values. So the user can
+ access those values via the appropriate `ftrace_regs_*` APIs.
@entry_data
This is a local storage to share the data between entry and exit handlers.
@@ -125,6 +126,17 @@ If the entry callback function returns !0, the corresponding exit callback will
and `entry_data_size` field when registering the fprobe, the storage is
allocated and passed to both `entry_handler` and `exit_handler`.
+Entry data size and exit handlers on the same function
+======================================================
+
+Since the entry data is passed via a per-task stack that has a limited size,
+the entry data size per probe is limited to `15 * sizeof(long)`. Also take
+care that when different fprobes probe the same function, this limit becomes
+smaller. The entry data size is aligned to `sizeof(long)` and each fprobe
+which has an exit handler uses a `sizeof(long)` of space on that stack, so
+you should keep the number of fprobes on the same function as small as
+possible.
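+
+A minimal sketch of a probe that shares data between the entry and exit
+handlers is shown below. The probed function, the data layout, and the use
+of the `ftrace_regs_*` helpers are illustrative only; see `linux/fprobe.h`
+for the current API.
+
+.. code-block:: c
+
+ struct my_data {
+         unsigned long arg0;
+ };
+
+ static int my_entry(struct fprobe *fp, unsigned long entry_ip,
+                     unsigned long ret_ip, struct ftrace_regs *fregs,
+                     void *entry_data)
+ {
+         struct my_data *data = entry_data;
+
+         /* Save the first argument so the exit handler can use it */
+         data->arg0 = ftrace_regs_get_argument(fregs, 0);
+         return 0;
+ }
+
+ static void my_exit(struct fprobe *fp, unsigned long entry_ip,
+                     unsigned long ret_ip, struct ftrace_regs *fregs,
+                     void *entry_data)
+ {
+         struct my_data *data = entry_data;
+
+         pr_info("arg0=%lx ret=%lx\n", data->arg0,
+                 ftrace_regs_get_return_value(fregs));
+ }
+
+ static struct fprobe my_fprobe = {
+         .entry_handler = my_entry,
+         .exit_handler = my_exit,
+         .entry_data_size = sizeof(struct my_data),
+ };
+
+ /* For example, in module init: register_fprobe(&my_fprobe, "vfs_open", NULL); */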
+
Share the callbacks with kprobes
================================
@@ -165,8 +177,8 @@ This counter counts up when;
- fprobe fails to take ftrace_recursion lock. This usually means that a function
which is traced by other ftrace users is called from the entry_handler.
- - fprobe fails to setup the function exit because of the shortage of rethook
- (the shadow stack for hooking the function return.)
+ - fprobe fails to setup the function exit because of failing to allocate the
+ data buffer from the per-task shadow stack.
The `fprobe::nmissed` field counts up in both cases. Therefore, the former
skips both of entry and exit callback and the latter skips the exit
diff --git a/Documentation/trace/fprobetrace.rst b/Documentation/trace/fprobetrace.rst
index e35e6b18df40..b4c2ca3d02c1 100644
--- a/Documentation/trace/fprobetrace.rst
+++ b/Documentation/trace/fprobetrace.rst
@@ -70,6 +70,14 @@ Synopsis of fprobe-events
For the details of TYPE, see :ref:`kprobetrace documentation <kprobetrace_types>`.
+Function arguments at exit
+--------------------------
+Function arguments can be accessed at the exit probe using the $arg<N>
+fetcharg. This is useful to record the function parameters and return value
+at once, and to trace changes in structure fields (for example, to debug
+whether a function correctly updates the given data structure or not).
+See the :ref:`sample<fprobetrace_exit_args_sample>` below for how it works.
+
BTF arguments
-------------
BTF (BPF Type Format) argument allows user to trace function and tracepoint
@@ -218,3 +226,26 @@ traceprobe event, you can trace that field as below.
<idle>-0 [000] d..3. 5606.690317: sched_switch: (__probestub_sched_switch+0x4/0x10) comm="kworker/0:1" usage=1 start_time=137000000
kworker/0:1-14 [000] d..3. 5606.690339: sched_switch: (__probestub_sched_switch+0x4/0x10) comm="swapper/0" usage=2 start_time=0
<idle>-0 [000] d..3. 5606.692368: sched_switch: (__probestub_sched_switch+0x4/0x10) comm="kworker/0:1" usage=1 start_time=137000000
+
+.. _fprobetrace_exit_args_sample:
+
+The return probe allows us to access the results of functions that return
+an error code and pass their results via a function parameter, such as a
+structure-initialization function.
+
+For example, vfs_open() will link the file structure to the inode and update
+the mode. You can trace those changes with a return probe.
+::
+
+ # echo 'f vfs_open mode=file->f_mode:x32 inode=file->f_inode:x64' >> dynamic_events
+ # echo 'f vfs_open%%return mode=file->f_mode:x32 inode=file->f_inode:x64' >> dynamic_events
+ # echo 1 > events/fprobes/enable
+ # cat trace
+ sh-131 [006] ...1. 1945.714346: vfs_open__entry: (vfs_open+0x4/0x40) mode=0x2 inode=0x0
+ sh-131 [006] ...1. 1945.714358: vfs_open__exit: (do_open+0x274/0x3d0 <- vfs_open) mode=0x4d801e inode=0xffff888008470168
+ cat-143 [007] ...1. 1945.717949: vfs_open__entry: (vfs_open+0x4/0x40) mode=0x1 inode=0x0
+ cat-143 [007] ...1. 1945.717956: vfs_open__exit: (do_open+0x274/0x3d0 <- vfs_open) mode=0x4a801d inode=0xffff888005f78d28
+ cat-143 [007] ...1. 1945.720616: vfs_open__entry: (vfs_open+0x4/0x40) mode=0x1 inode=0x0
+ cat-143 [007] ...1. 1945.728263: vfs_open__exit: (do_open+0x274/0x3d0 <- vfs_open) mode=0xa800d inode=0xffff888004ada8d8
+
+You can see the `file::f_mode` and `file::f_inode` are updated in `vfs_open()`.
diff --git a/Documentation/trace/ftrace-design.rst b/Documentation/trace/ftrace-design.rst
index 6893399157f0..dc82d64b3a44 100644
--- a/Documentation/trace/ftrace-design.rst
+++ b/Documentation/trace/ftrace-design.rst
@@ -217,18 +217,6 @@ along to ftrace_push_return_trace() instead of a stub value of 0.
Similarly, when you call ftrace_return_to_handler(), pass it the frame pointer.
-HAVE_FUNCTION_GRAPH_RET_ADDR_PTR
---------------------------------
-
-An arch may pass in a pointer to the return address on the stack. This
-prevents potential stack unwinding issues where the unwinder gets out of
-sync with ret_stack and the wrong addresses are reported by
-ftrace_graph_ret_addr().
-
-Adding support for it is easy: just define the macro in asm/ftrace.h and
-pass the return address pointer as the 'retp' argument to
-ftrace_push_return_trace().
-
HAVE_SYSCALL_TRACEPOINTS
------------------------
diff --git a/Documentation/trace/ftrace.rst b/Documentation/trace/ftrace.rst
index 7e7b8ec17934..2b74f96d09d5 100644
--- a/Documentation/trace/ftrace.rst
+++ b/Documentation/trace/ftrace.rst
@@ -810,6 +810,12 @@ Here is the list of current tracers that may be configured.
to draw a graph of function calls similar to C code
source.
+ Note that the function graph tracer calculates the timings of when
+ the function starts and returns internally, and does so per instance.
+ If there are two instances that run the function graph tracer and
+ trace the same functions, the reported timings may be slightly off, as
+ each instance reads the timestamp separately and not at the same time.
+
"blk"
The block tracer. The tracer used by the blktrace user
@@ -1031,14 +1037,15 @@ explains which is which.
CPU#: The CPU which the process was running on.
irqs-off: 'd' interrupts are disabled. '.' otherwise.
- .. caution:: If the architecture does not support a way to
- read the irq flags variable, an 'X' will always
- be printed here.
need-resched:
+ - 'B' all, TIF_NEED_RESCHED, PREEMPT_NEED_RESCHED and TIF_RESCHED_LAZY is set,
- 'N' both TIF_NEED_RESCHED and PREEMPT_NEED_RESCHED is set,
- 'n' only TIF_NEED_RESCHED is set,
- 'p' only PREEMPT_NEED_RESCHED is set,
+ - 'L' both PREEMPT_NEED_RESCHED and TIF_RESCHED_LAZY is set,
+ - 'b' both TIF_NEED_RESCHED and TIF_RESCHED_LAZY is set,
+ - 'l' only TIF_RESCHED_LAZY is set
- '.' otherwise.
hardirq/softirq:
@@ -1186,6 +1193,18 @@ Here are the available options:
trace_printk
Can disable trace_printk() from writing into the buffer.
+ trace_printk_dest
+ Set to have trace_printk() and similar internal tracing functions
+ write into this instance. Note, only one trace instance can have
+ this set. Setting this flag clears the trace_printk_dest flag of the
+ instance that previously had it set. By default, the top level trace
+ has this flag set, and it will be set there again if another instance
+ that has it set clears it.
+
+ This flag cannot be cleared by the top level instance, as it is the
+ default instance. The only way for the top level instance to have this
+ flag cleared is for it to be set in another instance.
+
annotate
It is sometimes confusing when the CPU buffers are full
and one CPU buffer had a lot of events recently, thus
@@ -1968,7 +1987,7 @@ wakeup
One common case that people are interested in tracing is the
time it takes for a task that is woken to actually wake up.
Now for non Real-Time tasks, this can be arbitrary. But tracing
-it none the less can be interesting.
+it nonetheless can be interesting.
Without function tracing::
diff --git a/Documentation/trace/hisi-ptt.rst b/Documentation/trace/hisi-ptt.rst
index 989255eb5622..6eef28ebb0c7 100644
--- a/Documentation/trace/hisi-ptt.rst
+++ b/Documentation/trace/hisi-ptt.rst
@@ -40,7 +40,7 @@ IO dies (SICL, Super I/O Cluster), where there's one PCIe Root
Complex for each SICL.
::
- /sys/devices/hisi_ptt<sicl_id>_<core_id>
+ /sys/bus/event_source/devices/hisi_ptt<sicl_id>_<core_id>
Tune
====
@@ -53,7 +53,7 @@ Each event is presented as a file under $(PTT PMU dir)/tune, and
a simple open/read/write/close cycle will be used to tune the event.
::
- $ cd /sys/devices/hisi_ptt<sicl_id>_<core_id>/tune
+ $ cd /sys/bus/event_source/devices/hisi_ptt<sicl_id>_<core_id>/tune
$ ls
qos_tx_cpl qos_tx_np qos_tx_p
tx_path_rx_req_alloc_buf_level
diff --git a/Documentation/trace/histogram.rst b/Documentation/trace/histogram.rst
index 3c9b263de9c2..0aada18c38c6 100644
--- a/Documentation/trace/histogram.rst
+++ b/Documentation/trace/histogram.rst
@@ -81,7 +81,7 @@ Documentation written by Tom Zanussi
.usecs display a common_timestamp in microseconds
.percent display a number of percentage value
.graph display a bar-graph of a value
- .stacktrace display as a stacktrace (must by a long[] type)
+ .stacktrace display as a stacktrace (must be a long[] type)
============= =================================================
Note that in general the semantics of a given field aren't
diff --git a/Documentation/trace/index.rst b/Documentation/trace/index.rst
index 5092d6c13af5..2c991dc96ace 100644
--- a/Documentation/trace/index.rst
+++ b/Documentation/trace/index.rst
@@ -24,11 +24,13 @@ Linux Tracing Technologies
histogram
histogram-design
boottime-trace
+ debugging
hwlat_detector
osnoise-tracer
timerlat-tracer
intel_th
ring-buffer-design
+ ring-buffer-map
stm
sys-t
coresight/index
diff --git a/Documentation/trace/kprobes.rst b/Documentation/trace/kprobes.rst
index e1636e579c9c..5e606730cec6 100644
--- a/Documentation/trace/kprobes.rst
+++ b/Documentation/trace/kprobes.rst
@@ -322,6 +322,7 @@ architectures:
- s390
- parisc
- loongarch
+- riscv
Configuring Kprobes
===================
diff --git a/Documentation/trace/kprobetrace.rst b/Documentation/trace/kprobetrace.rst
index bf9cecb69fc9..3b6791c17e9b 100644
--- a/Documentation/trace/kprobetrace.rst
+++ b/Documentation/trace/kprobetrace.rst
@@ -58,8 +58,9 @@ Synopsis of kprobe_events
NAME=FETCHARG : Set NAME as the argument name of FETCHARG.
FETCHARG:TYPE : Set TYPE as the type of FETCHARG. Currently, basic types
(u8/u16/u32/u64/s8/s16/s32/s64), hexadecimal types
- (x8/x16/x32/x64), "char", "string", "ustring", "symbol", "symstr"
- and bitfield are supported.
+ (x8/x16/x32/x64), VFS layer common type (%pd/%pD), "char",
+ "string", "ustring", "symbol", "symstr" and bitfield are
+ supported.
(\*1) only for the probe on function entry (offs == 0). Note, this argument access
is best effort, because depending on the argument type, it may be passed on
@@ -70,6 +71,15 @@ Synopsis of kprobe_events
(\*3) this is useful for fetching a field of data structures.
(\*4) "u" means user-space dereference. See :ref:`user_mem_access`.
+Function arguments at kretprobe
+-------------------------------
+Function arguments can be accessed at a kretprobe using the $arg<N>
+fetcharg. This is useful to record the function parameters and return value
+at once, and to trace changes in structure fields (for example, to debug
+whether a function correctly updates the given data structure or not).
+See the :ref:`sample<fprobetrace_exit_args_sample>` in fprobe event for how
+it works.
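+
+A minimal kretprobe form of the same idea (the probed function and event
+name here are only illustrative)::
+
+ # echo 'r:myretprobe vfs_open $arg1 $retval' >> /sys/kernel/tracing/kprobe_events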
+
.. _kprobetrace_types:
Types
@@ -113,6 +123,9 @@ With 'symstr' type, you can filter the event with wildcard pattern of the
symbols, and you don't need to solve symbol name by yourself.
For $comm, the default type is "string"; any other type is invalid.
+The VFS layer common type (%pd/%pD) is a special type, which fetches a
+dentry's or file's name from a struct dentry's or struct file's address.
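+
+For example, %pd can fetch a dentry name directly from a struct dentry
+pointer (the probed function and event name here are only illustrative)::
+
+ # echo 'p:mydentry dput name=$arg1:%pd' >> /sys/kernel/tracing/kprobe_events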
+
.. _user_mem_access:
User Memory Access
diff --git a/Documentation/trace/osnoise-tracer.rst b/Documentation/trace/osnoise-tracer.rst
index 140ef2533d26..a520adbd3476 100644
--- a/Documentation/trace/osnoise-tracer.rst
+++ b/Documentation/trace/osnoise-tracer.rst
@@ -108,7 +108,7 @@ The tracer has a set of options inside the osnoise directory, they are:
option.
- tracing_threshold: the minimum delta between two time() reads to be
considered as noise, in us. When set to 0, the default value will
- be used, which is currently 5 us.
+ be used, which is currently 1 us.
- osnoise/options: a set of on/off options that can be enabled by
writing the option name to the file or disabled by writing the option
name preceded with the 'NO\_' prefix. For example, writing
diff --git a/Documentation/trace/ring-buffer-map.rst b/Documentation/trace/ring-buffer-map.rst
new file mode 100644
index 000000000000..8e296bcc0d7f
--- /dev/null
+++ b/Documentation/trace/ring-buffer-map.rst
@@ -0,0 +1,106 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+==================================
+Tracefs ring-buffer memory mapping
+==================================
+
+:Author: Vincent Donnefort <vdonnefort@google.com>
+
+Overview
+========
+Tracefs ring-buffer memory map provides an efficient method to stream data
+as no memory copy is necessary. The application mapping the ring-buffer then
+becomes a consumer for that ring-buffer, in a similar fashion to trace_pipe.
+
+Memory mapping setup
+====================
+The mapping works with a mmap() of the trace_pipe_raw interface.
+
+The first system page of the mapping contains ring-buffer statistics and
+description. It is referred to as the meta-page. One of the most important
+fields of the meta-page is the reader. It contains the sub-buffer ID which can
+be safely read by the mapper (see ring-buffer-design.rst).
+
+The meta-page is followed by all the sub-buffers, ordered by ascending ID. It is
+therefore effortless to know where the reader starts in the mapping:
+
+.. code-block:: c
+
+ reader_id = meta->reader->id;
+ reader_offset = meta->meta_page_size + reader_id * meta->subbuf_size;
+
+When the application is done with the current reader, it can get a new one using
+the trace_pipe_raw ioctl() TRACE_MMAP_IOCTL_GET_READER. This ioctl also updates
+the meta-page fields.
+
+Limitations
+===========
+When a mapping is in place on a Tracefs ring-buffer, it is not possible to
+resize it (either by increasing the entire size of the ring-buffer or the
+size of each subbuf). It is also not possible to use snapshot, and splice
+will copy the ring buffer data instead of using the copyless swap from the
+ring buffer.
+
+Concurrent readers (either another application mapping that ring-buffer or the
+kernel with trace_pipe) are allowed but not recommended. They will compete for
+the ring-buffer and the output is unpredictable, just like concurrent readers on
+trace_pipe would be.
+
+Example
+=======
+
+.. code-block:: c
+
+ #include <fcntl.h>
+ #include <stdio.h>
+ #include <stdlib.h>
+ #include <unistd.h>
+
+ #include <linux/trace_mmap.h>
+
+ #include <sys/mman.h>
+ #include <sys/ioctl.h>
+
+ #define TRACE_PIPE_RAW "/sys/kernel/tracing/per_cpu/cpu0/trace_pipe_raw"
+
+ int main(void)
+ {
+ int page_size = getpagesize(), fd, reader_id;
+ unsigned long meta_len, data_len;
+ struct trace_buffer_meta *meta;
+ void *map, *reader, *data;
+
+ fd = open(TRACE_PIPE_RAW, O_RDONLY | O_NONBLOCK);
+ if (fd < 0)
+ exit(EXIT_FAILURE);
+
+ map = mmap(NULL, page_size, PROT_READ, MAP_SHARED, fd, 0);
+ if (map == MAP_FAILED)
+ exit(EXIT_FAILURE);
+
+ meta = (struct trace_buffer_meta *)map;
+ meta_len = meta->meta_page_size;
+
+ printf("entries: %llu\n", meta->entries);
+ printf("overrun: %llu\n", meta->overrun);
+ printf("read: %llu\n", meta->read);
+ printf("nr_subbufs: %u\n", meta->nr_subbufs);
+
+ data_len = meta->subbuf_size * meta->nr_subbufs;
+ data = mmap(NULL, data_len, PROT_READ, MAP_SHARED, fd, meta_len);
+ if (data == MAP_FAILED)
+ exit(EXIT_FAILURE);
+
+ if (ioctl(fd, TRACE_MMAP_IOCTL_GET_READER) < 0)
+ exit(EXIT_FAILURE);
+
+ reader_id = meta->reader.id;
+ reader = data + meta->subbuf_size * reader_id;
+
+ printf("Current reader address: %p\n", reader);
+
+ munmap(data, data_len);
+ munmap(meta, meta_len);
+ close(fd);
+
+ return 0;
+ }
diff --git a/Documentation/trace/rv/runtime-verification.rst b/Documentation/trace/rv/runtime-verification.rst
index dae78dfa7cdc..c700dde9259c 100644
--- a/Documentation/trace/rv/runtime-verification.rst
+++ b/Documentation/trace/rv/runtime-verification.rst
@@ -8,14 +8,14 @@ checking* and *theorem proving*) with a more practical approach for complex
systems.
Instead of relying on a fine-grained model of a system (e.g., a
-re-implementation a instruction level), RV works by analyzing the trace of the
+re-implementation at instruction level), RV works by analyzing the trace of the
system's actual execution, comparing it against a formal specification of
the system behavior.
The main advantage is that RV can give precise information on the runtime
behavior of the monitored system, without the pitfalls of developing models
that require a re-implementation of the entire system in a modeling language.
-Moreover, given an efficient monitoring method, it is possible execute an
+Moreover, given an efficient monitoring method, it is possible to execute an
*online* verification of a system, enabling the *reaction* for unexpected
events, avoiding, for example, the propagation of a failure on safety-critical
systems.
diff --git a/Documentation/trace/tracepoints.rst b/Documentation/trace/tracepoints.rst
index 0cb8d9ca3d60..decabcc77b56 100644
--- a/Documentation/trace/tracepoints.rst
+++ b/Documentation/trace/tracepoints.rst
@@ -27,7 +27,7 @@ the tracepoint site).
You can put tracepoints at important locations in the code. They are
lightweight hooks that can pass an arbitrary number of parameters,
-which prototypes are described in a tracepoint declaration placed in a
+whose prototypes are described in a tracepoint declaration placed in a
header file.
They can be used for tracing and performance accounting.
diff --git a/Documentation/trace/user_events.rst b/Documentation/trace/user_events.rst
index d8f12442aaa6..1d5a7626e6a6 100644
--- a/Documentation/trace/user_events.rst
+++ b/Documentation/trace/user_events.rst
@@ -92,6 +92,24 @@ The following flags are currently supported.
process closes or unregisters the event. Requires CAP_PERFMON otherwise
-EPERM is returned.
++ USER_EVENT_REG_MULTI_FORMAT: The event can contain multiple formats. This
+ allows programs to prevent themselves from being blocked when their event
+ format changes and they wish to use the same name. When this flag is used the
+ tracepoint name will be in the new format of "name.unique_id" vs the older
+ format of "name". A tracepoint will be created for each unique pair of name
+ and format. This means if several processes use the same name and format,
+ they will use the same tracepoint. If yet another process uses the same name,
+ but a different format than the other processes, it will use a different
+ tracepoint with a new unique id. Recording programs need to scan tracefs for
+ the various different formats of the event name they are interested in
+ recording. The system name of the tracepoint will also use "user_events_multi"
+ instead of "user_events". This prevents single-format event names conflicting
+ with any multi-format event names within tracefs. The unique_id is output as
+ a hex string. Recording programs should ensure the tracepoint name starts with
+ the event name they registered and has a suffix that starts with . and only
+ has hex characters. For example, to find all versions of the event "test" you
+ can use the regex "^test\.[0-9a-fA-F]+$".
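+
+ A recording program might list the matching tracepoints with something like
+ the following (the path and event name here are only illustrative)::
+
+   # ls /sys/kernel/tracing/events/user_events_multi | grep -E '^test\.[0-9a-fA-F]+$'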
+
Upon successful registration the following is set.
+ write_index: The index to use for this file descriptor that represents this
@@ -106,6 +124,9 @@ or perf record -e user_events:[name] when attaching/recording.
**NOTE:** The event subsystem name by default is "user_events". Callers should
not assume it will always be "user_events". Operators reserve the right in the
future to change the subsystem name per-process to accommodate event isolation.
+In addition, if the USER_EVENT_REG_MULTI_FORMAT flag is used, the tracepoint
+name will have a unique id appended to it and the system name will be
+will have a unique id appended to it and the system name will be
+"user_events_multi" as described above.
Command Format
^^^^^^^^^^^^^^
@@ -156,7 +177,11 @@ to request deletes than the one used for registration due to this.
to the event. If programs do not want auto-delete, they must use the
USER_EVENT_REG_PERSIST flag when registering the event. Once that flag is used
the event exists until DIAG_IOCSDEL is invoked. Both register and delete of an
-event that persists requires CAP_PERFMON, otherwise -EPERM is returned.
+event that persists requires CAP_PERFMON, otherwise -EPERM is returned. When
+there are multiple formats of the same event name, deletion will be attempted
+for all events with that name. If only a specific version is to be deleted,
+then the /sys/kernel/tracing/dynamic_events file should be used for that
+specific format of the event.
Unregistering
-------------