summaryrefslogtreecommitdiff
path: root/Documentation/trace
diff options
context:
space:
mode:
Diffstat (limited to 'Documentation/trace')
-rw-r--r--Documentation/trace/boottime-trace.rst301
-rw-r--r--Documentation/trace/coresight/coresight-config.rst294
-rw-r--r--Documentation/trace/coresight/coresight-cpu-debug.rst3
-rw-r--r--Documentation/trace/coresight/coresight-dummy.rst32
-rw-r--r--Documentation/trace/coresight/coresight-ect.rst226
-rw-r--r--Documentation/trace/coresight/coresight-etm4x-reference.rst33
-rw-r--r--Documentation/trace/coresight/coresight-perf.rst158
-rw-r--r--Documentation/trace/coresight/coresight-tpda.rst52
-rw-r--r--Documentation/trace/coresight/coresight-tpdm.rst45
-rw-r--r--Documentation/trace/coresight/coresight-trbe.rst38
-rw-r--r--Documentation/trace/coresight/coresight.rst209
-rw-r--r--Documentation/trace/coresight/ultrasoc-smb.rst83
-rw-r--r--Documentation/trace/events-kmem.rst2
-rw-r--r--Documentation/trace/events-msr.rst6
-rw-r--r--Documentation/trace/events-nmi.rst6
-rw-r--r--Documentation/trace/events-power.rst21
-rw-r--r--Documentation/trace/events.rst652
-rw-r--r--Documentation/trace/fprobe.rst186
-rw-r--r--Documentation/trace/fprobetrace.rst251
-rw-r--r--Documentation/trace/ftrace-design.rst8
-rw-r--r--Documentation/trace/ftrace-uses.rst102
-rw-r--r--Documentation/trace/ftrace.rst372
-rw-r--r--Documentation/trace/hisi-ptt.rst304
-rw-r--r--Documentation/trace/histogram-design.rst2115
-rw-r--r--Documentation/trace/histogram.rst633
-rw-r--r--Documentation/trace/hwlat_detector.rst15
-rw-r--r--Documentation/trace/index.rst11
-rw-r--r--Documentation/trace/intel_th.rst2
-rw-r--r--Documentation/trace/kprobes.rst788
-rw-r--r--Documentation/trace/kprobetrace.rst92
-rw-r--r--Documentation/trace/mmiotrace.rst22
-rw-r--r--Documentation/trace/osnoise-tracer.rst180
-rw-r--r--Documentation/trace/postprocess/decode_msr.py2
-rw-r--r--Documentation/trace/postprocess/trace-pagealloc-postprocess.pl6
-rw-r--r--Documentation/trace/postprocess/trace-vmscan-postprocess.pl48
-rw-r--r--Documentation/trace/ring-buffer-design.rst (renamed from Documentation/trace/ring-buffer-design.txt)812
-rw-r--r--Documentation/trace/rv/da_monitor_instrumentation.rst171
-rw-r--r--Documentation/trace/rv/da_monitor_synthesis.rst147
-rw-r--r--Documentation/trace/rv/deterministic_automata.rst184
-rw-r--r--Documentation/trace/rv/index.rst14
-rw-r--r--Documentation/trace/rv/monitor_wip.rst55
-rw-r--r--Documentation/trace/rv/monitor_wwnr.rst45
-rw-r--r--Documentation/trace/rv/runtime-verification.rst231
-rw-r--r--Documentation/trace/stm.rst4
-rw-r--r--Documentation/trace/timerlat-tracer.rst260
-rw-r--r--Documentation/trace/tracepoint-analysis.rst8
-rw-r--r--Documentation/trace/tracepoints.rst27
-rw-r--r--Documentation/trace/uprobetracer.rst32
-rw-r--r--Documentation/trace/user_events.rst304
49 files changed, 8766 insertions, 826 deletions
diff --git a/Documentation/trace/boottime-trace.rst b/Documentation/trace/boottime-trace.rst
new file mode 100644
index 000000000000..d594597201fd
--- /dev/null
+++ b/Documentation/trace/boottime-trace.rst
@@ -0,0 +1,301 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=================
+Boot-time tracing
+=================
+
+:Author: Masami Hiramatsu <mhiramat@kernel.org>
+
+Overview
+========
+
+Boot-time tracing allows users to trace boot-time process including
+device initialization with full features of ftrace including per-event
+filter and actions, histograms, kprobe-events and synthetic-events,
+and trace instances.
+Since kernel command line is not enough to control these complex features,
+this uses bootconfig file to describe tracing feature programming.
+
+Options in the Boot Config
+==========================
+
+Here is the list of available options list for boot time tracing in
+boot config file [1]_. All options are under "ftrace." or "kernel."
+prefix. See kernel parameters for the options which starts
+with "kernel." prefix [2]_.
+
+.. [1] See :ref:`Documentation/admin-guide/bootconfig.rst <bootconfig>`
+.. [2] See :ref:`Documentation/admin-guide/kernel-parameters.rst <kernelparameters>`
+
+Ftrace Global Options
+---------------------
+
+Ftrace global options have "kernel." prefix in boot config, which means
+these options are passed as a part of kernel legacy command line.
+
+kernel.tp_printk
+ Output trace-event data on printk buffer too.
+
+kernel.dump_on_oops [= MODE]
+ Dump ftrace on Oops. If MODE = 1 or omitted, dump trace buffer
+ on all CPUs. If MODE = 2, dump a buffer on a CPU which kicks Oops.
+
+kernel.traceoff_on_warning
+ Stop tracing if WARN_ON() occurs.
+
+kernel.fgraph_max_depth = MAX_DEPTH
+ Set MAX_DEPTH to maximum depth of fgraph tracer.
+
+kernel.fgraph_filters = FILTER[, FILTER2...]
+ Add fgraph tracing function filters.
+
+kernel.fgraph_notraces = FILTER[, FILTER2...]
+ Add fgraph non-tracing function filters.
+
+
+Ftrace Per-instance Options
+---------------------------
+
+These options can be used for each instance including global ftrace node.
+
+ftrace.[instance.INSTANCE.]options = OPT1[, OPT2[...]]
+ Enable given ftrace options.
+
+ftrace.[instance.INSTANCE.]tracing_on = 0|1
+ Enable/Disable tracing on this instance when starting boot-time tracing.
+ (you can enable it by the "traceon" event trigger action)
+
+ftrace.[instance.INSTANCE.]trace_clock = CLOCK
+ Set given CLOCK to ftrace's trace_clock.
+
+ftrace.[instance.INSTANCE.]buffer_size = SIZE
+ Configure ftrace buffer size to SIZE. You can use "KB" or "MB"
+ for that SIZE.
+
+ftrace.[instance.INSTANCE.]alloc_snapshot
+ Allocate snapshot buffer.
+
+ftrace.[instance.INSTANCE.]cpumask = CPUMASK
+ Set CPUMASK as trace cpu-mask.
+
+ftrace.[instance.INSTANCE.]events = EVENT[, EVENT2[...]]
+ Enable given events on boot. You can use a wild card in EVENT.
+
+ftrace.[instance.INSTANCE.]tracer = TRACER
+ Set TRACER to current tracer on boot. (e.g. function)
+
+ftrace.[instance.INSTANCE.]ftrace.filters
+ This will take an array of tracing function filter rules.
+
+ftrace.[instance.INSTANCE.]ftrace.notraces
+ This will take an array of NON-tracing function filter rules.
+
+
+Ftrace Per-Event Options
+------------------------
+
+These options are setting per-event options.
+
+ftrace.[instance.INSTANCE.]event.GROUP.EVENT.enable
+ Enable GROUP:EVENT tracing.
+
+ftrace.[instance.INSTANCE.]event.GROUP.enable
+ Enable all event tracing within GROUP.
+
+ftrace.[instance.INSTANCE.]event.enable
+ Enable all event tracing.
+
+ftrace.[instance.INSTANCE.]event.GROUP.EVENT.filter = FILTER
+ Set FILTER rule to the GROUP:EVENT.
+
+ftrace.[instance.INSTANCE.]event.GROUP.EVENT.actions = ACTION[, ACTION2[...]]
+ Set ACTIONs to the GROUP:EVENT.
+
+ftrace.[instance.INSTANCE.]event.kprobes.EVENT.probes = PROBE[, PROBE2[...]]
+ Defines new kprobe event based on PROBEs. It is able to define
+ multiple probes on one event, but those must have same type of
+ arguments. This option is available only for the event which
+ group name is "kprobes".
+
+ftrace.[instance.INSTANCE.]event.synthetic.EVENT.fields = FIELD[, FIELD2[...]]
+ Defines new synthetic event with FIELDs. Each field should be
+ "type varname".
+
+Note that kprobe and synthetic event definitions can be written under
+instance node, but those are also visible from other instances. So please
+take care for event name conflict.
+
+Ftrace Histogram Options
+------------------------
+
+Since it is too long to write a histogram action as a string for per-event
+action option, there are tree-style options under per-event 'hist' subkey
+for the histogram actions. For the detail of the each parameter,
+please read the event histogram document (Documentation/trace/histogram.rst)
+
+ftrace.[instance.INSTANCE.]event.GROUP.EVENT.hist.[N.]keys = KEY1[, KEY2[...]]
+ Set histogram key parameters. (Mandatory)
+ The 'N' is a digit string for the multiple histogram. You can omit it
+ if there is one histogram on the event.
+
+ftrace.[instance.INSTANCE.]event.GROUP.EVENT.hist.[N.]values = VAL1[, VAL2[...]]
+ Set histogram value parameters.
+
+ftrace.[instance.INSTANCE.]event.GROUP.EVENT.hist.[N.]sort = SORT1[, SORT2[...]]
+ Set histogram sort parameter options.
+
+ftrace.[instance.INSTANCE.]event.GROUP.EVENT.hist.[N.]size = NR_ENTRIES
+ Set histogram size (number of entries).
+
+ftrace.[instance.INSTANCE.]event.GROUP.EVENT.hist.[N.]name = NAME
+ Set histogram name.
+
+ftrace.[instance.INSTANCE.]event.GROUP.EVENT.hist.[N.]var.VARIABLE = EXPR
+ Define a new VARIABLE by EXPR expression.
+
+ftrace.[instance.INSTANCE.]event.GROUP.EVENT.hist.[N.]<pause|continue|clear>
+ Set histogram control parameter. You can set one of them.
+
+ftrace.[instance.INSTANCE.]event.GROUP.EVENT.hist.[N.]onmatch.[M.]event = GROUP.EVENT
+ Set histogram 'onmatch' handler matching event parameter.
+ The 'M' is a digit string for the multiple 'onmatch' handler. You can omit it
+ if there is one 'onmatch' handler on this histogram.
+
+ftrace.[instance.INSTANCE.]event.GROUP.EVENT.hist.[N.]onmatch.[M.]trace = EVENT[, ARG1[...]]
+ Set histogram 'trace' action for 'onmatch'.
+ EVENT must be a synthetic event name, and ARG1... are parameters
+ for that event. Mandatory if 'onmatch.event' option is set.
+
+ftrace.[instance.INSTANCE.]event.GROUP.EVENT.hist.[N.]onmax.[M.]var = VAR
+ Set histogram 'onmax' handler variable parameter.
+
+ftrace.[instance.INSTANCE.]event.GROUP.EVENT.hist.[N.]onchange.[M.]var = VAR
+ Set histogram 'onchange' handler variable parameter.
+
+ftrace.[instance.INSTANCE.]event.GROUP.EVENT.hist.[N.]<onmax|onchange>.[M.]save = ARG1[, ARG2[...]]
+ Set histogram 'save' action parameters for 'onmax' or 'onchange' handler.
+ This option or below 'snapshot' option is mandatory if 'onmax.var' or
+ 'onchange.var' option is set.
+
+ftrace.[instance.INSTANCE.]event.GROUP.EVENT.hist.[N.]<onmax|onchange>.[M.]snapshot
+ Set histogram 'snapshot' action for 'onmax' or 'onchange' handler.
+ This option or above 'save' option is mandatory if 'onmax.var' or
+ 'onchange.var' option is set.
+
+ftrace.[instance.INSTANCE.]event.GROUP.EVENT.hist.filter = FILTER_EXPR
+ Set histogram filter expression. You don't need 'if' in the FILTER_EXPR.
+
+Note that this 'hist' option can conflict with the per-event 'actions'
+option if the 'actions' option has a histogram action.
+
+
+When to Start
+=============
+
+All boot-time tracing options starting with ``ftrace`` will be enabled at the
+end of core_initcall. This means you can trace the events from postcore_initcall.
+Most of the subsystems and architecture dependent drivers will be initialized
+after that (arch_initcall or subsys_initcall). Thus, you can trace those with
+boot-time tracing.
+If you want to trace events before core_initcall, you can use the options
+starting with ``kernel``. Some of them will be enabled eariler than the initcall
+processing (for example,. ``kernel.ftrace=function`` and ``kernel.trace_event``
+will start before the initcall.)
+
+
+Examples
+========
+
+For example, to add filter and actions for each event, define kprobe
+events, and synthetic events with histogram, write a boot config like
+below::
+
+ ftrace.event {
+ task.task_newtask {
+ filter = "pid < 128"
+ enable
+ }
+ kprobes.vfs_read {
+ probes = "vfs_read $arg1 $arg2"
+ filter = "common_pid < 200"
+ enable
+ }
+ synthetic.initcall_latency {
+ fields = "unsigned long func", "u64 lat"
+ hist {
+ keys = func.sym, lat
+ values = lat
+ sort = lat
+ }
+ }
+ initcall.initcall_start.hist {
+ keys = func
+ var.ts0 = common_timestamp.usecs
+ }
+ initcall.initcall_finish.hist {
+ keys = func
+ var.lat = common_timestamp.usecs - $ts0
+ onmatch {
+ event = initcall.initcall_start
+ trace = initcall_latency, func, $lat
+ }
+ }
+ }
+
+Also, boot-time tracing supports "instance" node, which allows us to run
+several tracers for different purpose at once. For example, one tracer
+is for tracing functions starting with "user\_", and others tracing
+"kernel\_" functions, you can write boot config as below::
+
+ ftrace.instance {
+ foo {
+ tracer = "function"
+ ftrace.filters = "user_*"
+ }
+ bar {
+ tracer = "function"
+ ftrace.filters = "kernel_*"
+ }
+ }
+
+The instance node also accepts event nodes so that each instance
+can customize its event tracing.
+
+With the trigger action and kprobes, you can trace function-graph while
+a function is called. For example, this will trace all function calls in
+the pci_proc_init()::
+
+ ftrace {
+ tracing_on = 0
+ tracer = function_graph
+ event.kprobes {
+ start_event {
+ probes = "pci_proc_init"
+ actions = "traceon"
+ }
+ end_event {
+ probes = "pci_proc_init%return"
+ actions = "traceoff"
+ }
+ }
+ }
+
+
+This boot-time tracing also supports ftrace kernel parameters via boot
+config.
+For example, following kernel parameters::
+
+ trace_options=sym-addr trace_event=initcall:* tp_printk trace_buf_size=1M ftrace=function ftrace_filter="vfs*"
+
+This can be written in boot config like below::
+
+ kernel {
+ trace_options = sym-addr
+ trace_event = "initcall:*"
+ tp_printk
+ trace_buf_size = 1M
+ ftrace = function
+ ftrace_filter = "vfs*"
+ }
+
+Note that parameters start with "kernel" prefix instead of "ftrace".
diff --git a/Documentation/trace/coresight/coresight-config.rst b/Documentation/trace/coresight/coresight-config.rst
new file mode 100644
index 000000000000..6d5ffa6f7347
--- /dev/null
+++ b/Documentation/trace/coresight/coresight-config.rst
@@ -0,0 +1,294 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+======================================
+CoreSight System Configuration Manager
+======================================
+
+ :Author: Mike Leach <mike.leach@linaro.org>
+ :Date: October 2020
+
+Introduction
+============
+
+The CoreSight System Configuration manager is an API that allows the
+programming of the CoreSight system with pre-defined configurations that
+can then be easily enabled from sysfs or perf.
+
+Many CoreSight components can be programmed in complex ways - especially ETMs.
+In addition, components can interact across the CoreSight system, often via
+the cross trigger components such as CTI and CTM. These system settings can
+be defined and enabled as named configurations.
+
+
+Basic Concepts
+==============
+
+This section introduces the basic concepts of a CoreSight system configuration.
+
+
+Features
+--------
+
+A feature is a named set of programming for a CoreSight device. The programming
+is device dependent, and can be defined in terms of absolute register values,
+resource usage and parameter values.
+
+The feature is defined using a descriptor. This descriptor is used to load onto
+a matching device, either when the feature is loaded into the system, or when the
+CoreSight device is registered with the configuration manager.
+
+The load process involves interpreting the descriptor into a set of register
+accesses in the driver - the resource usage and parameter descriptions
+translated into appropriate register accesses. This interpretation makes it easy
+and efficient for the feature to be programmed onto the device when required.
+
+The feature will not be active on the device until the feature is enabled, and
+the device itself is enabled. When the device is enabled then enabled features
+will be programmed into the device hardware.
+
+A feature is enabled as part of a configuration being enabled on the system.
+
+
+Parameter Value
+~~~~~~~~~~~~~~~
+
+A parameter value is a named value that may be set by the user prior to the
+feature being enabled that can adjust the behaviour of the operation programmed
+by the feature.
+
+For example, this could be a count value in a programmed operation that repeats
+at a given rate. When the feature is enabled then the current value of the
+parameter is used in programming the device.
+
+The feature descriptor defines a default value for a parameter, which is used
+if the user does not supply a new value.
+
+Users can update parameter values using the configfs API for the CoreSight
+system - which is described below.
+
+The current value of the parameter is loaded into the device when the feature
+is enabled on that device.
+
+
+Configurations
+--------------
+
+A configuration defines a set of features that are to be used in a trace
+session where the configuration is selected. For any trace session only one
+configuration may be selected.
+
+The features defined may be on any type of device that is registered
+to support system configuration. A configuration may select features to be
+enabled on a class of devices - i.e. any ETMv4, or specific devices, e.g. a
+specific CTI on the system.
+
+As with the feature, a descriptor is used to define the configuration.
+This will define the features that must be enabled as part of the configuration
+as well as any preset values that can be used to override default parameter
+values.
+
+
+Preset Values
+~~~~~~~~~~~~~
+
+Preset values are easily selectable sets of parameter values for the features
+that the configuration uses. The number of values in a single preset set, equals
+the sum of parameter values in the features used by the configuration.
+
+e.g. a configuration consists of 3 features, one has 2 parameters, one has
+a single parameter, and another has no parameters. A single preset set will
+therefore have 3 values.
+
+Presets are optionally defined by the configuration, up to 15 can be defined.
+If no preset is selected, then the parameter values defined in the feature
+are used as normal.
+
+
+Operation
+~~~~~~~~~
+
+The following steps take place in the operation of a configuration.
+
+1) In this example, the configuration is 'autofdo', which has an
+ associated feature 'strobing' that works on ETMv4 CoreSight Devices.
+
+2) The configuration is enabled. For example 'perf' may select the
+ configuration as part of its command line::
+
+ perf record -e cs_etm/autofdo/ myapp
+
+ which will enable the 'autofdo' configuration.
+
+3) perf starts tracing on the system. As each ETMv4 that perf uses for
+ trace is enabled, the configuration manager will check if the ETMv4
+ has a feature that relates to the currently active configuration.
+ In this case 'strobing' is enabled & programmed into the ETMv4.
+
+4) When the ETMv4 is disabled, any registers marked as needing to be
+ saved will be read back.
+
+5) At the end of the perf session, the configuration will be disabled.
+
+
+Viewing Configurations and Features
+===================================
+
+The set of configurations and features that are currently loaded into the
+system can be viewed using the configfs API.
+
+Mount configfs as normal and the 'cs-syscfg' subsystem will appear::
+
+ $ ls /config
+ cs-syscfg stp-policy
+
+This has two sub-directories::
+
+ $ cd cs-syscfg/
+ $ ls
+ configurations features
+
+The system has the configuration 'autofdo' built in. It may be examined as
+follows::
+
+ $ cd configurations/
+ $ ls
+ autofdo
+ $ cd autofdo/
+ $ ls
+ description feature_refs preset1 preset3 preset5 preset7 preset9
+ enable preset preset2 preset4 preset6 preset8
+ $ cat description
+ Setup ETMs with strobing for autofdo
+ $ cat feature_refs
+ strobing
+
+Each preset declared has a 'preset<n>' subdirectory declared. The values for
+the preset can be examined::
+
+ $ cat preset1/values
+ strobing.window = 0x1388 strobing.period = 0x2
+ $ cat preset2/values
+ strobing.window = 0x1388 strobing.period = 0x4
+
+The 'enable' and 'preset' files allow the control of a configuration when
+using CoreSight with sysfs.
+
+The features referenced by the configuration can be examined in the features
+directory::
+
+ $ cd ../../features/strobing/
+ $ ls
+ description matches nr_params params
+ $ cat description
+ Generate periodic trace capture windows.
+ parameter 'window': a number of CPU cycles (W)
+ parameter 'period': trace enabled for W cycles every period x W cycles
+ $ cat matches
+ SRC_ETMV4
+ $ cat nr_params
+ 2
+
+Move to the params directory to examine and adjust parameters::
+
+ cd params
+ $ ls
+ period window
+ $ cd period
+ $ ls
+ value
+ $ cat value
+ 0x2710
+ # echo 15000 > value
+ # cat value
+ 0x3a98
+
+Parameters adjusted in this way are reflected in all device instances that have
+loaded the feature.
+
+
+Using Configurations in perf
+============================
+
+The configurations loaded into the CoreSight configuration management are
+also declared in the perf 'cs_etm' event infrastructure so that they can
+be selected when running trace under perf::
+
+ $ ls /sys/devices/cs_etm
+ cpu0 cpu2 events nr_addr_filters power subsystem uevent
+ cpu1 cpu3 format perf_event_mux_interval_ms sinks type
+
+The key directory here is 'events' - a generic perf directory which allows
+selection on the perf command line. As with the sinks entries, this provides
+a hash of the configuration name.
+
+The entry in the 'events' directory uses perfs built in syntax generator
+to substitute the syntax for the name when evaluating the command::
+
+ $ ls events/
+ autofdo
+ $ cat events/autofdo
+ configid=0xa7c3dddd
+
+The 'autofdo' configuration may be selected on the perf command line::
+
+ $ perf record -e cs_etm/autofdo/u --per-thread <application>
+
+A preset to override the current parameter values can also be selected::
+
+ $ perf record -e cs_etm/autofdo,preset=1/u --per-thread <application>
+
+When configurations are selected in this way, then the trace sink used is
+automatically selected.
+
+Using Configurations in sysfs
+=============================
+
+Coresight can be controlled using sysfs. When this is in use then a configuration
+can be made active for the devices that are used in the sysfs session.
+
+In a configuration there are 'enable' and 'preset' files.
+
+To enable a configuration for use with sysfs::
+
+ $ cd configurations/autofdo
+ $ echo 1 > enable
+
+This will then use any default parameter values in the features - which can be
+adjusted as described above.
+
+To use a preset<n> set of parameter values::
+
+ $ echo 3 > preset
+
+This will select preset3 for the configuration.
+The valid values for preset are 0 - to deselect presets, and any value of
+<n> where a preset<n> sub-directory is present.
+
+Note that the active sysfs configuration is a global parameter, therefore
+only a single configuration can be active for sysfs at any one time.
+Attempting to enable a second configuration will result in an error.
+Additionally, attempting to disable the configuration while in use will
+also result in an error.
+
+The use of the active configuration by sysfs is independent of the configuration
+used in perf.
+
+
+Creating and Loading Custom Configurations
+==========================================
+
+Custom configurations and / or features can be dynamically loaded into the
+system by using a loadable module.
+
+An example of a custom configuration is found in ./samples/coresight.
+
+This creates a new configuration that uses the existing built in
+strobing feature, but provides a different set of presets.
+
+When the module is loaded, then the configuration appears in the configfs
+file system and is selectable in the same way as the built in configuration
+described above.
+
+Configurations can use previously loaded features. The system will ensure
+that it is not possible to unload a feature that is currently in use, by
+enforcing the unload order as the strict reverse of the load order.
diff --git a/Documentation/trace/coresight/coresight-cpu-debug.rst b/Documentation/trace/coresight/coresight-cpu-debug.rst
index 993dd294b81b..836b35532667 100644
--- a/Documentation/trace/coresight/coresight-cpu-debug.rst
+++ b/Documentation/trace/coresight/coresight-cpu-debug.rst
@@ -117,7 +117,8 @@ divide into below cases:
Device Tree Bindings
--------------------
-See Documentation/devicetree/bindings/arm/coresight-cpu-debug.txt for details.
+See Documentation/devicetree/bindings/arm/arm,coresight-cpu-debug.yaml for
+details.
How to use the module
diff --git a/Documentation/trace/coresight/coresight-dummy.rst b/Documentation/trace/coresight/coresight-dummy.rst
new file mode 100644
index 000000000000..eff7c553ee6c
--- /dev/null
+++ b/Documentation/trace/coresight/coresight-dummy.rst
@@ -0,0 +1,32 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=============================
+Coresight Dummy Trace Module
+=============================
+
+ :Author: Hao Zhang <quic_hazha@quicinc.com>
+ :Date: June 2023
+
+Introduction
+------------
+
+The Coresight dummy trace module is for the specific devices that kernel don't
+have permission to access or configure, e.g., CoreSight TPDMs on Qualcomm
+platforms. For these devices, a dummy driver is needed to register them as
+Coresight devices. The module may also be used to define components that may
+not have any programming interfaces, so that paths can be created in the driver.
+It provides Coresight API for operations on dummy devices, such as enabling and
+disabling them. It also provides the Coresight dummy sink/source paths for
+debugging.
+
+Config details
+--------------
+
+There are two types of nodes, dummy sink and dummy source. These nodes
+are available at ``/sys/bus/coresight/devices``.
+
+Example output::
+
+ $ ls -l /sys/bus/coresight/devices | grep dummy
+ dummy_sink0 -> ../../../devices/platform/soc@0/soc@0:sink/dummy_sink0
+ dummy_source0 -> ../../../devices/platform/soc@0/soc@0:source/dummy_source0
diff --git a/Documentation/trace/coresight/coresight-ect.rst b/Documentation/trace/coresight/coresight-ect.rst
new file mode 100644
index 000000000000..a68732c5c6d6
--- /dev/null
+++ b/Documentation/trace/coresight/coresight-ect.rst
@@ -0,0 +1,226 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=============================================
+CoreSight Embedded Cross Trigger (CTI & CTM).
+=============================================
+
+ :Author: Mike Leach <mike.leach@linaro.org>
+ :Date: November 2019
+
+Hardware Description
+--------------------
+
+The CoreSight Cross Trigger Interface (CTI) is a hardware device that takes
+individual input and output hardware signals known as triggers to and from
+devices and interconnects them via the Cross Trigger Matrix (CTM) to other
+devices via numbered channels, in order to propagate events between devices.
+
+e.g.::
+
+ 0000000 in_trigs :::::::
+ 0 C 0----------->: : +======>(other CTI channel IO)
+ 0 P 0<-----------: : v
+ 0 U 0 out_trigs : : Channels ***** :::::::
+ 0000000 : CTI :<=========>*CTM*<====>: CTI :---+
+ ####### in_trigs : : (id 0-3) ***** ::::::: v
+ # ETM #----------->: : ^ #######
+ # #<-----------: : +---# ETR #
+ ####### out_trigs ::::::: #######
+
+The CTI driver enables the programming of the CTI to attach triggers to
+channels. When an input trigger becomes active, the attached channel will
+become active. Any output trigger attached to that channel will also
+become active. The active channel is propagated to other CTIs via the CTM,
+activating connected output triggers there, unless filtered by the CTI
+channel gate.
+
+It is also possible to activate a channel using system software directly
+programming registers in the CTI.
+
+The CTIs are registered by the system to be associated with CPUs and/or other
+CoreSight devices on the trace data path. When these devices are enabled the
+attached CTIs will also be enabled. By default/on power up the CTIs have
+no programmed trigger/channel attachments, so will not affect the system
+until explicitly programmed.
+
+The hardware trigger connections between CTIs and devices is implementation
+defined, unless the CPU/ETM combination is a v8 architecture, in which case
+the connections have an architecturally defined standard layout.
+
+The hardware trigger signals can also be connected to non-CoreSight devices
+(e.g. UART), or be propagated off chip as hardware IO lines.
+
+All the CTI devices are associated with a CTM. On many systems there will be a
+single effective CTM (one CTM, or multiple CTMs all interconnected), but it is
+possible that systems can have nets of CTIs+CTM that are not interconnected by
+a CTM to each other. On these systems a CTM index is declared to associate
+CTI devices that are interconnected via a given CTM.
+
+Sysfs files and directories
+---------------------------
+
+The CTI devices appear on the existing CoreSight bus alongside the other
+CoreSight devices::
+
+ >$ ls /sys/bus/coresight/devices
+ cti_cpu0 cti_cpu2 cti_sys0 etm0 etm2 funnel0 replicator0 tmc_etr0
+ cti_cpu1 cti_cpu3 cti_sys1 etm1 etm3 funnel1 tmc_etf0 tpiu0
+
+The ``cti_cpu<N>`` named CTIs are associated with a CPU, and any ETM used by
+that core. The ``cti_sys<N>`` CTIs are general system infrastructure CTIs that
+can be associated with other CoreSight devices, or other system hardware
+capable of generating or using trigger signals.::
+
+ >$ ls /sys/bus/coresight/devices/etm0/cti_cpu0
+ channels ctmid enable nr_trigger_cons mgmt power powered regs
+ connections subsystem triggers0 triggers1 uevent
+
+*Key file items are:-*
+ * ``enable``: enables/disables the CTI. Read to determine current state.
+ If this shows as enabled (1), but ``powered`` shows unpowered (0), then
+ the enable indicates a request to enabled when the device is powered.
+ * ``ctmid`` : associated CTM - only relevant if system has multiple CTI+CTM
+ clusters that are not interconnected.
+ * ``nr_trigger_cons`` : total connections - triggers<N> directories.
+ * ``powered`` : Read to determine if the CTI is currently powered.
+
+*Sub-directories:-*
+ * ``triggers<N>``: contains list of triggers for an individual connection.
+ * ``channels``: Contains the channel API - CTI main programming interface.
+ * ``regs``: Gives access to the raw programmable CTI regs.
+ * ``mgmt``: the standard CoreSight management registers.
+ * ``connections``: Links to connected *CoreSight* devices. The number of
+ links can be 0 to ``nr_trigger_cons``. Actual number given by ``nr_links``
+ in this directory.
+
+
+triggers<N> directories
+~~~~~~~~~~~~~~~~~~~~~~~
+
+Individual trigger connection information. This describes trigger signals for
+CoreSight and non-CoreSight connections.
+
+Each triggers directory has a set of parameters describing the triggers for
+the connection.
+
+ * ``name`` : name of connection
+ * ``in_signals`` : input trigger signal indexes used in this connection.
+ * ``in_types`` : functional types for in signals.
+ * ``out_signals`` : output trigger signals for this connection.
+ * ``out_types`` : functional types for out signals.
+
+e.g::
+
+ >$ ls ./cti_cpu0/triggers0/
+ in_signals in_types name out_signals out_types
+ >$ cat ./cti_cpu0/triggers0/name
+ cpu0
+ >$ cat ./cti_cpu0/triggers0/out_signals
+ 0-2
+ >$ cat ./cti_cpu0/triggers0/out_types
+ pe_edbgreq pe_dbgrestart pe_ctiirq
+ >$ cat ./cti_cpu0/triggers0/in_signals
+ 0-1
+ >$ cat ./cti_cpu0/triggers0/in_types
+ pe_dbgtrigger pe_pmuirq
+
+If a connection has zero signals in either the 'in' or 'out' triggers then
+those parameters will be omitted.
+
+Channels API Directory
+~~~~~~~~~~~~~~~~~~~~~~
+
+This provides an easy way to attach triggers to channels, without needing
+the multiple register operations that are required if manipulating the
+'regs' sub-directory elements directly.
+
+A number of files provide this API::
+
+ >$ ls ./cti_sys0/channels/
+ chan_clear chan_inuse chan_xtrigs_out trigin_attach
+ chan_free chan_pulse chan_xtrigs_reset trigin_detach
+ chan_gate_disable chan_set chan_xtrigs_sel trigout_attach
+ chan_gate_enable chan_xtrigs_in trig_filter_enable trigout_detach
+ trigout_filtered
+
+Most access to these elements take the form::
+
+ echo <chan> [<trigger>] > /<device_path>/<operation>
+
+where the optional <trigger> is only needed for trigXX_attach | detach
+operations.
+
+e.g.::
+
+ >$ echo 0 1 > ./cti_sys0/channels/trigout_attach
+ >$ echo 0 > ./cti_sys0/channels/chan_set
+
+Attaches trigout(1) to channel(0), then activates channel(0) generating a
+set state on cti_sys0.trigout(1)
+
+
+*API operations*
+
+ * ``trigin_attach, trigout_attach``: Attach a channel to a trigger signal.
+ * ``trigin_detach, trigout_detach``: Detach a channel from a trigger signal.
+ * ``chan_set``: Set the channel - the set state will be propagated around
+ the CTM to other connected devices.
+ * ``chan_clear``: Clear the channel.
+ * ``chan_pulse``: Set the channel for a single CoreSight clock cycle.
+ * ``chan_gate_enable``: Write operation sets the CTI gate to propagate
+ (enable) the channel to other devices. This operation takes a channel
+ number. CTI gate is enabled for all channels by default at power up. Read
+ to list the currently enabled channels on the gate.
+ * ``chan_gate_disable``: Write channel number to disable gate for that
+ channel.
+ * ``chan_inuse``: Show the current channels attached to any signal
+ * ``chan_free``: Show channels with no attached signals.
+ * ``chan_xtrigs_sel``: write a channel number to select a channel to view,
+ read to show the selected channel number.
+ * ``chan_xtrigs_in``: Read to show the input triggers attached to
+ the selected view channel.
+ * ``chan_xtrigs_out``:Read to show the output triggers attached to
+ the selected view channel.
+ * ``trig_filter_enable``: Defaults to enabled, disable to allow potentially
+ dangerous output signals to be set.
+ * ``trigout_filtered``: Trigger out signals that are prevented from being
+ set if filtering ``trig_filter_enable`` is enabled. One use is to prevent
+ accidental ``EDBGREQ`` signals stopping a core.
+ * ``chan_xtrigs_reset``: Write 1 to clear all channel / trigger programming.
+ Resets device hardware to default state.
+
+
+The example below attaches input trigger index 1 to channel 2, and output
+trigger index 6 to the same channel. It then examines the state of the
+channel / trigger connections using the appropriate sysfs attributes.
+
+The settings mean that if either input trigger 1, or channel 2 go active then
+trigger out 6 will go active. We then enable the CTI, and use the software
+channel control to activate channel 2. We see the active channel on the
+``choutstatus`` register and the active signal on the ``trigoutstatus``
+register. Finally clearing the channel removes this.
+
+e.g.::
+
+ .../cti_sys0/channels# echo 2 1 > trigin_attach
+ .../cti_sys0/channels# echo 2 6 > trigout_attach
+ .../cti_sys0/channels# cat chan_free
+ 0-1,3
+ .../cti_sys0/channels# cat chan_inuse
+ 2
+ .../cti_sys0/channels# echo 2 > chan_xtrigs_sel
+ .../cti_sys0/channels# cat chan_xtrigs_trigin
+ 1
+ .../cti_sys0/channels# cat chan_xtrigs_trigout
+ 6
+ .../cti_sys0/# echo 1 > enable
+ .../cti_sys0/channels# echo 2 > chan_set
+ .../cti_sys0/channels# cat ../regs/choutstatus
+ 0x4
+ .../cti_sys0/channels# cat ../regs/trigoutstatus
+ 0x40
+ .../cti_sys0/channels# echo 2 > chan_clear
+ .../cti_sys0/channels# cat ../regs/trigoutstatus
+ 0x0
+ .../cti_sys0/channels# cat ../regs/choutstatus
+ 0x0
diff --git a/Documentation/trace/coresight/coresight-etm4x-reference.rst b/Documentation/trace/coresight/coresight-etm4x-reference.rst
index b64d9a9c79df..89ac4e6fc96f 100644
--- a/Documentation/trace/coresight/coresight-etm4x-reference.rst
+++ b/Documentation/trace/coresight/coresight-etm4x-reference.rst
@@ -71,6 +71,20 @@ the ‘TRC’ prefix.
----
+:File: ``ts_source`` (ro)
+:Trace Registers: None.
+:Notes:
+ When FEAT_TRF is implemented, value of TRFCR_ELx.TS used for trace session. Otherwise -1
+ indicates an unknown time source. Check trcidr0.tssize to see if a global timestamp is
+ available.
+
+:Example:
+ ``$> cat ts_source``
+
+ ``$> 1``
+
+----
+
:File: ``addr_idx`` (rw)
:Trace Registers: None.
:Notes:
@@ -427,7 +441,7 @@ the ‘TRC’ prefix.
:Syntax:
``echo idx > vmid_idx``
- Where idx <  numvmidc
+ Where idx < numvmidc
----
@@ -650,13 +664,26 @@ Bit assignments shown below:-
parameter is set this value is applied to the currently indexed
address range.
+.. _coresight-branch-broadcast:
**bit (4):**
ETM_MODE_BB
**description:**
- Set to enable branch broadcast if supported in hardware [IDR0].
+ Set to enable branch broadcast if supported in hardware [IDR0]. The primary use for this feature
+ is when code is patched dynamically at run time and the full program flow may not be able to be
+ reconstructed using only conditional branches.
+
+ There is currently no support in Perf for supplying modified binaries to the decoder, so this
+ feature is only intended to be used for debugging purposes or with a 3rd party tool.
+
+ Choosing this option will result in a significant increase in the amount of trace generated -
+ possible danger of overflows, or fewer instructions covered. Note, that this option also
+ overrides any setting of :ref:`ETM_MODE_RETURNSTACK <coresight-return-stack>`, so where a branch
+ broadcast range overlaps a return stack range, return stacks will not be available for that
+ range.
+.. _coresight-cycle-accurate:
**bit (5):**
ETMv4_MODE_CYCACC
@@ -678,6 +705,7 @@ Bit assignments shown below:-
**description:**
Set to enable virtual machine ID tracing if supported [IDR2].
+.. _coresight-timestamp:
**bit (11):**
ETMv4_MODE_TIMESTAMP
@@ -685,6 +713,7 @@ Bit assignments shown below:-
**description:**
Set to enable timestamp generation if supported [IDR0].
+.. _coresight-return-stack:
**bit (12):**
ETM_MODE_RETURNSTACK
diff --git a/Documentation/trace/coresight/coresight-perf.rst b/Documentation/trace/coresight/coresight-perf.rst
new file mode 100644
index 000000000000..d087aae7d492
--- /dev/null
+++ b/Documentation/trace/coresight/coresight-perf.rst
@@ -0,0 +1,158 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+================
+CoreSight - Perf
+================
+
+ :Author: Carsten Haitzler <carsten.haitzler@arm.com>
+ :Date: June 29th, 2022
+
+Perf is able to locally access CoreSight trace data and store it to the
+output perf data files. This data can then be later decoded to give the
+instructions that were traced for debugging or profiling purposes. You
+can log such data with a perf record command like::
+
+ perf record -e cs_etm//u testbinary
+
+This would run some test binary (testbinary) until it exits and record
+a perf.data trace file. That file would have AUX sections if CoreSight
+is working correctly. You can dump the content of this file as
+readable text with a command like::
+
+ perf report --stdio --dump -i perf.data
+
+You should find some sections of this file have AUX data blocks like::
+
+ 0x1e78 [0x30]: PERF_RECORD_AUXTRACE size: 0x11dd0 offset: 0 ref: 0x1b614fc1061b0ad1 idx: 0 tid: 531230 cpu: -1
+
+ . ... CoreSight ETM Trace data: size 73168 bytes
+ Idx:0; ID:10; I_ASYNC : Alignment Synchronisation.
+ Idx:12; ID:10; I_TRACE_INFO : Trace Info.; INFO=0x0 { CC.0 }
+ Idx:17; ID:10; I_ADDR_L_64IS0 : Address, Long, 64 bit, IS0.; Addr=0x0000000000000000;
+ Idx:26; ID:10; I_TRACE_ON : Trace On.
+ Idx:27; ID:10; I_ADDR_CTXT_L_64IS0 : Address & Context, Long, 64 bit, IS0.; Addr=0x0000FFFFB6069140; Ctxt: AArch64,EL0, NS;
+ Idx:38; ID:10; I_ATOM_F6 : Atom format 6.; EEEEEEEEEEEEEEEEEEEEEEEE
+ Idx:39; ID:10; I_ATOM_F6 : Atom format 6.; EEEEEEEEEEEEEEEEEEEEEEEE
+ Idx:40; ID:10; I_ATOM_F6 : Atom format 6.; EEEEEEEEEEEEEEEEEEEEEEEE
+ Idx:41; ID:10; I_ATOM_F6 : Atom format 6.; EEEEEEEEEEEN
+ ...
+
+If you see these above, then your system is tracing CoreSight data
+correctly.
+
+To compile perf with CoreSight support in the tools/perf directory do::
+
+ make CORESIGHT=1
+
+This requires OpenCSD to build. You may install distribution packages
+for the support such as libopencsd and libopencsd-dev or download it
+and build yourself. Upstream OpenCSD is located at:
+
+ https://github.com/Linaro/OpenCSD
+
+For complete information on building perf with CoreSight support and
+more extensive usage look at:
+
+ https://github.com/Linaro/OpenCSD/blob/master/HOWTO.md
+
+
+Kernel CoreSight Support
+------------------------
+
+You will also want CoreSight support enabled in your kernel config.
+Ensure it is enabled with::
+
+ CONFIG_CORESIGHT=y
+
+There are various other CoreSight options you probably also want
+enabled like::
+
+ CONFIG_CORESIGHT_LINKS_AND_SINKS=y
+ CONFIG_CORESIGHT_LINK_AND_SINK_TMC=y
+ CONFIG_CORESIGHT_CATU=y
+ CONFIG_CORESIGHT_SINK_TPIU=y
+ CONFIG_CORESIGHT_SINK_ETBV10=y
+ CONFIG_CORESIGHT_SOURCE_ETM4X=y
+ CONFIG_CORESIGHT_CTI=y
+ CONFIG_CORESIGHT_CTI_INTEGRATION_REGS=y
+
+Please refer to the kernel configuration help for more information.
+
+Perf test - Verify kernel and userspace perf CoreSight work
+-----------------------------------------------------------
+
+When you run perf test, it will do a lot of self tests. Some of those
+tests will cover CoreSight (only if enabled and on ARM64). You
+generally would run perf test from the tools/perf directory in the
+kernel tree. Some tests will check some internal perf support like:
+
+ Check Arm CoreSight trace data recording and synthesized samples
+ Check Arm SPE trace data recording and synthesized samples
+
+Some others will actually use perf record and some test binaries that
+are in tests/shell/coresight and will collect traces to ensure a
+minimum level of functionality is met. The scripts that launch these
+tests are in the same directory. These will all look like:
+
+ CoreSight / ASM Pure Loop
+ CoreSight / Memcpy 16k 10 Threads
+ CoreSight / Thread Loop 10 Threads - Check TID
+ etc.
+
+These perf record tests will not run if the tool binaries do not exist
+in tests/shell/coresight/\*/ and will be skipped. If you do not have
+CoreSight support in hardware then either do not build perf with
+CoreSight support or remove these binaries in order to not have these
+tests fail and have them skip instead.
+
+These tests will log historical results in the current working
+directory (e.g. tools/perf) and will be named stats-\*.csv like:
+
+ stats-asm_pure_loop-out.csv
+ stats-memcpy_thread-16k_10.csv
+ ...
+
+These statistic files log some aspects of the AUX data sections in
+the perf data output counting some numbers of certain encodings (a
+good way to know that it's working in a very simple way). One problem
+with CoreSight is that given a large enough amount of data needing to
+be logged, some of it can be lost due to the processor not waking up
+in time to read out all the data from buffers etc.. You will notice
+that the amount of data collected can vary a lot per run of perf test.
+If you wish to see how this changes over time, simply run perf test
+multiple times and all these csv files will have more and more data
+appended to it that you can later examine, graph and otherwise use to
+figure out if things have become worse or better.
+
+This means sometimes these tests fail as they don't capture all the
+data needed. This is about tracking quality and amount of data
+produced over time and to see when changes to the Linux kernel improve
+quality of traces.
+
+Be aware that some of these tests take quite a while to run, specifically
+in processing the perf data file and dumping contents to then examine what
+is inside.
+
+You can change where these csv logs are stored by setting the
+PERF_TEST_CORESIGHT_STATDIR environment variable before running perf
+test like::
+
+ export PERF_TEST_CORESIGHT_STATDIR=/var/tmp
+ perf test
+
+They will also store resulting perf output data in the current
+directory for later inspection like::
+
+ perf-asm_pure_loop-out.data
+ perf-memcpy_thread-16k_10.data
+ ...
+
+You can alter where the perf data files are stored by setting the
+PERF_TEST_CORESIGHT_DATADIR environment variable such as::
+
+ PERF_TEST_CORESIGHT_DATADIR=/var/tmp
+ perf test
+
+You may wish to set these above environment variables if you wish to
+keep the output of tests outside of the current working directory for
+longer term storage and examination.
diff --git a/Documentation/trace/coresight/coresight-tpda.rst b/Documentation/trace/coresight/coresight-tpda.rst
new file mode 100644
index 000000000000..a37f387ceaea
--- /dev/null
+++ b/Documentation/trace/coresight/coresight-tpda.rst
@@ -0,0 +1,52 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=================================================================
+The trace performance monitoring and diagnostics aggregator(TPDA)
+=================================================================
+
+ :Author: Jinlong Mao <quic_jinlmao@quicinc.com>
+ :Date: January 2023
+
+Hardware Description
+--------------------
+
+TPDA - The trace performance monitoring and diagnostics aggregator or
+TPDA in short serves as an arbitration and packetization engine for the
+performance monitoring and diagnostics network specification.
+The primary use case of the TPDA is to provide packetization, funneling
+and timestamping of Monitor data.
+
+
+Sysfs files and directories
+---------------------------
+Root: ``/sys/bus/coresight/devices/tpda<N>``
+
+Config details
+---------------------------
+
+The tpdm and tpda nodes should be observed at the coresight path
+"/sys/bus/coresight/devices".
+e.g.
+/sys/bus/coresight/devices # ls -l | grep tpd
+tpda0 -> ../../../devices/platform/soc@0/6004000.tpda/tpda0
+tpdm0 -> ../../../devices/platform/soc@0/6c08000.mm.tpdm/tpdm0
+
+We can use the commands are similar to the below to validate TPDMs.
+Enable coresight sink first. The port of tpda which is connected to
+the tpdm will be enabled after commands below.
+
+echo 1 > /sys/bus/coresight/devices/tmc_etf0/enable_sink
+echo 1 > /sys/bus/coresight/devices/tpdm0/enable_source
+echo 1 > /sys/bus/coresight/devices/tpdm0/integration_test
+echo 2 > /sys/bus/coresight/devices/tpdm0/integration_test
+
+The test data will be collected in the coresight sink which is enabled.
+If rwp register of the sink is keeping updating when do
+integration_test (by cat tmc_etf0/mgmt/rwp), it means there is data
+generated from TPDM to sink.
+
+There must be a tpda between tpdm and the sink. When there are some
+other trace event hw components in the same HW block with tpdm, tpdm
+and these hw components will connect to the coresight funnel. When
+there is only tpdm trace hw in the HW block, tpdm will connect to
+tpda directly.
diff --git a/Documentation/trace/coresight/coresight-tpdm.rst b/Documentation/trace/coresight/coresight-tpdm.rst
new file mode 100644
index 000000000000..72fd5c855d45
--- /dev/null
+++ b/Documentation/trace/coresight/coresight-tpdm.rst
@@ -0,0 +1,45 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+==========================================================
+Trace performance monitoring and diagnostics monitor(TPDM)
+==========================================================
+
+ :Author: Jinlong Mao <quic_jinlmao@quicinc.com>
+ :Date: January 2023
+
+Hardware Description
+--------------------
+TPDM - The trace performance monitoring and diagnostics monitor or TPDM in
+short serves as data collection component for various dataset types.
+The primary use case of the TPDM is to collect data from different data
+sources and send it to a TPDA for packetization, timestamping and funneling.
+
+Sysfs files and directories
+---------------------------
+Root: ``/sys/bus/coresight/devices/tpdm<N>``
+
+----
+
+:File: ``enable_source`` (RW)
+:Notes:
+ - > 0 : enable the datasets of TPDM.
+
+ - = 0 : disable the datasets of TPDM.
+
+:Syntax:
+ ``echo 1 > enable_source``
+
+----
+
+:File: ``integration_test`` (wo)
+:Notes:
+ Integration test will generate test data for tpdm.
+
+:Syntax:
+ ``echo value > integration_test``
+
+ value - 1 or 2.
+
+----
+
+.. This text is intentionally added to make Sphinx happy.
diff --git a/Documentation/trace/coresight/coresight-trbe.rst b/Documentation/trace/coresight/coresight-trbe.rst
new file mode 100644
index 000000000000..b9928ef148da
--- /dev/null
+++ b/Documentation/trace/coresight/coresight-trbe.rst
@@ -0,0 +1,38 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+==============================
+Trace Buffer Extension (TRBE).
+==============================
+
+ :Author: Anshuman Khandual <anshuman.khandual@arm.com>
+ :Date: November 2020
+
+Hardware Description
+--------------------
+
+Trace Buffer Extension (TRBE) is a percpu hardware which captures in system
+memory, CPU traces generated from a corresponding percpu tracing unit. This
+gets plugged in as a coresight sink device because the corresponding trace
+generators (ETE), are plugged in as source device.
+
+The TRBE is not compliant to CoreSight architecture specifications, but is
+driven via the CoreSight driver framework to support the ETE (which is
+CoreSight compliant) integration.
+
+Sysfs files and directories
+---------------------------
+
+The TRBE devices appear on the existing coresight bus alongside the other
+coresight devices::
+
+ >$ ls /sys/bus/coresight/devices
+ trbe0 trbe1 trbe2 trbe3
+
+The ``trbe<N>`` named TRBEs are associated with a CPU.::
+
+ >$ ls /sys/bus/coresight/devices/trbe0/
+ align flag
+
+*Key file items are:-*
+ * ``align``: TRBE write pointer alignment
+ * ``flag``: TRBE updates memory with access and dirty flags
diff --git a/Documentation/trace/coresight/coresight.rst b/Documentation/trace/coresight/coresight.rst
index a566719f8e7e..d4f93d6a2d63 100644
--- a/Documentation/trace/coresight/coresight.rst
+++ b/Documentation/trace/coresight/coresight.rst
@@ -130,7 +130,7 @@ Misc:
Device Tree Bindings
--------------------
-See Documentation/devicetree/bindings/arm/coresight.txt for details.
+See ``Documentation/devicetree/bindings/arm/arm,coresight-*.yaml`` for details.
As of this writing drivers for ITM, STMs and CTIs are not provided but are
expected to be added as the solution matures.
@@ -241,6 +241,92 @@ to the newer scheme, to give a confirmation that what you see on your
system is not unexpected. One must use the "names" as they appear on
the system under specified locations.
+Topology Representation
+-----------------------
+
+Each CoreSight component has a ``connections`` directory which will contain
+links to other CoreSight components. This allows the user to explore the trace
+topology and for larger systems, determine the most appropriate sink for a
+given source. The connection information can also be used to establish
+which CTI devices are connected to a given component. This directory contains a
+``nr_links`` attribute detailing the number of links in the directory.
+
+For an ETM source, in this case ``etm0`` on a Juno platform, a typical
+arrangement will be::
+
+ linaro-developer:~# ls - l /sys/bus/coresight/devices/etm0/connections
+ <file details> cti_cpu0 -> ../../../23020000.cti/cti_cpu0
+ <file details> nr_links
+ <file details> out:0 -> ../../../230c0000.funnel/funnel2
+
+Following the out port to ``funnel2``::
+
+ linaro-developer:~# ls -l /sys/bus/coresight/devices/funnel2/connections
+ <file details> in:0 -> ../../../23040000.etm/etm0
+ <file details> in:1 -> ../../../23140000.etm/etm3
+ <file details> in:2 -> ../../../23240000.etm/etm4
+ <file details> in:3 -> ../../../23340000.etm/etm5
+ <file details> nr_links
+ <file details> out:0 -> ../../../20040000.funnel/funnel0
+
+And again to ``funnel0``::
+
+ linaro-developer:~# ls -l /sys/bus/coresight/devices/funnel0/connections
+ <file details> in:0 -> ../../../220c0000.funnel/funnel1
+ <file details> in:1 -> ../../../230c0000.funnel/funnel2
+ <file details> nr_links
+ <file details> out:0 -> ../../../20010000.etf/tmc_etf0
+
+Finding the first sink ``tmc_etf0``. This can be used to collect data
+as a sink, or as a link to propagate further along the chain::
+
+ linaro-developer:~# ls -l /sys/bus/coresight/devices/tmc_etf0/connections
+ <file details> cti_sys0 -> ../../../20020000.cti/cti_sys0
+ <file details> in:0 -> ../../../20040000.funnel/funnel0
+ <file details> nr_links
+ <file details> out:0 -> ../../../20150000.funnel/funnel4
+
+via ``funnel4``::
+
+ linaro-developer:~# ls -l /sys/bus/coresight/devices/funnel4/connections
+ <file details> in:0 -> ../../../20010000.etf/tmc_etf0
+ <file details> in:1 -> ../../../20140000.etf/tmc_etf1
+ <file details> nr_links
+ <file details> out:0 -> ../../../20120000.replicator/replicator0
+
+and a ``replicator0``::
+
+ linaro-developer:~# ls -l /sys/bus/coresight/devices/replicator0/connections
+ <file details> in:0 -> ../../../20150000.funnel/funnel4
+ <file details> nr_links
+ <file details> out:0 -> ../../../20030000.tpiu/tpiu0
+ <file details> out:1 -> ../../../20070000.etr/tmc_etr0
+
+Arriving at the final sink in the chain, ``tmc_etr0``::
+
+ linaro-developer:~# ls -l /sys/bus/coresight/devices/tmc_etr0/connections
+ <file details> cti_sys0 -> ../../../20020000.cti/cti_sys0
+ <file details> in:0 -> ../../../20120000.replicator/replicator0
+ <file details> nr_links
+
+As described below, when using sysfs it is sufficient to enable a sink and
+a source for successful trace. The framework will correctly enable all
+intermediate links as required.
+
+Note: ``cti_sys0`` appears in two of the connections lists above.
+CTIs can connect to multiple devices and are arranged in a star topology
+via the CTM. See (Documentation/trace/coresight/coresight-ect.rst)
+[#fourth]_ for further details.
+Looking at this device we see 4 connections::
+
+ linaro-developer:~# ls -l /sys/bus/coresight/devices/cti_sys0/connections
+ <file details> nr_links
+ <file details> stm0 -> ../../../20100000.stm/stm0
+ <file details> tmc_etf0 -> ../../../20010000.etf/tmc_etf0
+ <file details> tmc_etr0 -> ../../../20070000.etr/tmc_etr0
+ <file details> tpiu0 -> ../../../20030000.tpiu/tpiu0
+
+
How to use the tracer modules
-----------------------------
@@ -253,7 +339,8 @@ Preference is given to the former as using the sysFS interface
requires a deep understanding of the Coresight HW. The following sections
provide details on using both methods.
-1) Using the sysFS interface:
+Using the sysFS interface
+~~~~~~~~~~~~~~~~~~~~~~~~~
Before trace collection can start, a coresight sink needs to be identified.
There is no limit on the amount of sinks (nor sources) that can be enabled at
@@ -360,7 +447,8 @@ wealth of possibilities that coresight provides.
Instruction 0 0x8026B588 E8BD8000 true LDM sp!,{pc}
Timestamp Timestamp: 17107041535
-2) Using perf framework:
+Using perf framework
+~~~~~~~~~~~~~~~~~~~~
Coresight tracers are represented using the Perf framework's Performance
Monitoring Unit (PMU) abstraction. As such the perf framework takes charge of
@@ -409,7 +497,11 @@ More information on the above and other example on how to use Coresight with
the perf tools can be found in the "HOWTO.md" file of the openCSD gitHub
repository [#third]_.
-2.1) AutoFDO analysis using the perf tools:
+Advanced perf framework usage
+-----------------------------
+
+AutoFDO analysis using the perf tools
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
perf can be used to record and analyze trace of programs.
@@ -427,9 +519,42 @@ The --itrace option controls the type and frequency of synthesized events
Note that only 64-bit programs are currently supported - further work is
required to support instruction decode of 32-bit Arm programs.
+Tracing PID
+~~~~~~~~~~~
+
+The kernel can be built to write the PID value into the PE ContextID registers.
+For a kernel running at EL1, the PID is stored in CONTEXTIDR_EL1. A PE may
+implement Arm Virtualization Host Extensions (VHE), which the kernel can
+run at EL2 as a virtualisation host; in this case, the PID value is stored in
+CONTEXTIDR_EL2.
+
+perf provides PMU formats that program the ETM to insert these values into the
+trace data; the PMU formats are defined as below:
+
+ "contextid1": Available on both EL1 kernel and EL2 kernel. When the
+ kernel is running at EL1, "contextid1" enables the PID
+ tracing; when the kernel is running at EL2, this enables
+ tracing the PID of guest applications.
+
+ "contextid2": Only usable when the kernel is running at EL2. When
+ selected, enables PID tracing on EL2 kernel.
+
+ "contextid": Will be an alias for the option that enables PID
+ tracing. I.e,
+ contextid == contextid1, on EL1 kernel.
+ contextid == contextid2, on EL2 kernel.
+
+perf will always enable PID tracing at the relevant EL, this is accomplished by
+automatically enable the "contextid" config - but for EL2 it is possible to make
+specific adjustments using configs "contextid1" and "contextid2", E.g. if a user
+wants to trace PIDs for both host and guest, the two configs "contextid1" and
+"contextid2" can be set at the same time:
+
+ perf record -e cs_etm/contextid1,contextid2/u -- vm
+
Generating coverage files for Feedback Directed Optimization: AutoFDO
----------------------------------------------------------------------
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
'perf inject' accepts the --itrace option in which case tracing data is
removed and replaced with the synthesized events. e.g.
@@ -460,6 +585,49 @@ sort example is from the AutoFDO tutorial (https://gcc.gnu.org/wiki/AutoFDO/Tuto
Bubble sorting array of 30000 elements
5806 ms
+Config option formats
+~~~~~~~~~~~~~~~~~~~~~
+
+The following strings can be provided between // on the perf command line to enable various options.
+They are also listed in the folder /sys/bus/event_source/devices/cs_etm/format/
+
+.. list-table::
+ :header-rows: 1
+
+ * - Option
+ - Description
+ * - branch_broadcast
+ - Session local version of the system wide setting:
+ :ref:`ETM_MODE_BB <coresight-branch-broadcast>`
+ * - contextid
+ - See `Tracing PID`_
+ * - contextid1
+ - See `Tracing PID`_
+ * - contextid2
+ - See `Tracing PID`_
+ * - configid
+ - Selection for a custom configuration. This is an implementation detail and not used directly,
+ see :ref:`trace/coresight/coresight-config:Using Configurations in perf`
+ * - preset
+ - Override for parameters in a custom configuration, see
+ :ref:`trace/coresight/coresight-config:Using Configurations in perf`
+ * - sinkid
+ - Hashed version of the string to select a sink, automatically set when using the @ notation.
+ This is an internal implementation detail and is not used directly, see `Using perf
+ framework`_.
+ * - cycacc
+ - Session local version of the system wide setting: :ref:`ETMv4_MODE_CYCACC
+ <coresight-cycle-accurate>`
+ * - retstack
+ - Session local version of the system wide setting: :ref:`ETM_MODE_RETURNSTACK
+ <coresight-return-stack>`
+ * - timestamp
+ - Session local version of the system wide setting: :ref:`ETMv4_MODE_TIMESTAMP
+ <coresight-timestamp>`
+ * - cc_threshold
+ - Cycle count threshold value. If nothing is provided here or the provided value is 0, then the
+ default value i.e 0x100 will be used. If provided value is less than minimum cycles threshold
+ value, as indicated via TRCIDR3.CCITMIN, then the minimum value will be used instead.
How to use the STM module
-------------------------
@@ -489,10 +657,39 @@ interface provided for that purpose by the generic STM API::
crw------- 1 root root 10, 61 Jan 3 18:11 /dev/stm0
root@genericarmv8:~#
-Details on how to use the generic STM API can be found here:- :doc:`../stm` [#second]_.
+Details on how to use the generic STM API can be found here:
+- Documentation/trace/stm.rst [#second]_.
+
+The CTI & CTM Modules
+---------------------
+
+The CTI (Cross Trigger Interface) provides a set of trigger signals between
+individual CTIs and components, and can propagate these between all CTIs via
+channels on the CTM (Cross Trigger Matrix).
+
+A separate documentation file is provided to explain the use of these devices.
+(Documentation/trace/coresight/coresight-ect.rst) [#fourth]_.
+
+CoreSight System Configuration
+------------------------------
+
+CoreSight components can be complex devices with many programming options.
+Furthermore, components can be programmed to interact with each other across the
+complete system.
+
+A CoreSight System Configuration manager is provided to allow these complex programming
+configurations to be selected and used easily from perf and sysfs.
+
+See the separate document for further information.
+(Documentation/trace/coresight/coresight-config.rst) [#fifth]_.
+
.. [#first] Documentation/ABI/testing/sysfs-bus-coresight-devices-stm
.. [#second] Documentation/trace/stm.rst
.. [#third] https://github.com/Linaro/perf-opencsd
+
+.. [#fourth] Documentation/trace/coresight/coresight-ect.rst
+
+.. [#fifth] Documentation/trace/coresight/coresight-config.rst
diff --git a/Documentation/trace/coresight/ultrasoc-smb.rst b/Documentation/trace/coresight/ultrasoc-smb.rst
new file mode 100644
index 000000000000..a05488a75ff0
--- /dev/null
+++ b/Documentation/trace/coresight/ultrasoc-smb.rst
@@ -0,0 +1,83 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+======================================
+UltraSoc - HW Assisted Tracing on SoC
+======================================
+ :Author: Qi Liu <liuqi115@huawei.com>
+ :Date: January 2023
+
+Introduction
+------------
+
+UltraSoc SMB is a per SCCL (Super CPU Cluster) hardware. It provides a
+way to buffer and store CPU trace messages in a region of shared system
+memory. The device acts as a coresight sink device and the
+corresponding trace generators (ETM) are attached as source devices.
+
+Sysfs files and directories
+---------------------------
+
+The SMB devices appear on the existing coresight bus alongside other
+devices::
+
+ $# ls /sys/bus/coresight/devices/
+ ultra_smb0 ultra_smb1 ultra_smb2 ultra_smb3
+
+The ``ultra_smb<N>`` names SMB device associated with SCCL.::
+
+ $# ls /sys/bus/coresight/devices/ultra_smb0
+ enable_sink mgmt
+ $# ls /sys/bus/coresight/devices/ultra_smb0/mgmt
+ buf_size buf_status read_pos write_pos
+
+Key file items are:
+
+ * ``read_pos``: Shows the value on the read pointer register.
+ * ``write_pos``: Shows the value on the write pointer register.
+ * ``buf_status``: Shows the value on the status register.
+ BIT(0) is zero value which means the buffer is empty.
+ * ``buf_size``: Shows the buffer size of each device.
+
+Firmware Bindings
+-----------------
+
+The device is only supported with ACPI. Its binding describes device
+identifier, resource information and graph structure.
+
+The device is identified as ACPI HID "HISI03A1". Device resources are allocated
+using the _CRS method. Each device must present two base address; the first one
+is the configuration base address of the device, the second one is the 32-bit
+base address of shared system memory.
+
+Example::
+
+ Device(USMB) { \
+ Name(_HID, "HISI03A1") \
+ Name(_CRS, ResourceTemplate() { \
+ QWordMemory (ResourceConsumer, , MinFixed, MaxFixed, NonCacheable, \
+ ReadWrite, 0x0, 0x95100000, 0x951FFFFF, 0x0, 0x100000) \
+ QWordMemory (ResourceConsumer, , MinFixed, MaxFixed, Cacheable, \
+ ReadWrite, 0x0, 0x50000000, 0x53FFFFFF, 0x0, 0x4000000) \
+ }) \
+ Name(_DSD, Package() { \
+ ToUUID("ab02a46b-74c7-45a2-bd68-f7d344ef2153"), \
+ /* Use CoreSight Graph ACPI bindings to describe connections topology */
+ Package() { \
+ 0, \
+ 1, \
+ Package() { \
+ 1, \
+ ToUUID("3ecbc8b6-1d0e-4fb3-8107-e627f805c6cd"), \
+ 8, \
+ Package() {0x8, 0, \_SB.S00.SL11.CL28.F008, 0}, \
+ Package() {0x9, 0, \_SB.S00.SL11.CL29.F009, 0}, \
+ Package() {0xa, 0, \_SB.S00.SL11.CL2A.F010, 0}, \
+ Package() {0xb, 0, \_SB.S00.SL11.CL2B.F011, 0}, \
+ Package() {0xc, 0, \_SB.S00.SL11.CL2C.F012, 0}, \
+ Package() {0xd, 0, \_SB.S00.SL11.CL2D.F013, 0}, \
+ Package() {0xe, 0, \_SB.S00.SL11.CL2E.F014, 0}, \
+ Package() {0xf, 0, \_SB.S00.SL11.CL2F.F015, 0}, \
+ } \
+ } \
+ }) \
+ }
diff --git a/Documentation/trace/events-kmem.rst b/Documentation/trace/events-kmem.rst
index 555484110e36..68fa75247488 100644
--- a/Documentation/trace/events-kmem.rst
+++ b/Documentation/trace/events-kmem.rst
@@ -69,7 +69,7 @@ When pages are freed in batch, the also mm_page_free_batched is triggered.
Broadly speaking, pages are taken off the LRU lock in bulk and
freed in batch with a page list. Significant amounts of activity here could
indicate that the system is under memory pressure and can also indicate
-contention on the zone->lru_lock.
+contention on the lruvec->lru_lock.
4. Per-CPU Allocator Activity
=============================
diff --git a/Documentation/trace/events-msr.rst b/Documentation/trace/events-msr.rst
index e938aa0b6f4f..35d06dc66bc2 100644
--- a/Documentation/trace/events-msr.rst
+++ b/Documentation/trace/events-msr.rst
@@ -4,11 +4,11 @@ MSR Trace Events
The x86 kernel supports tracing most MSR (Model Specific Register) accesses.
To see the definition of the MSRs on Intel systems please see the SDM
-at http://www.intel.com/sdm (Volume 3)
+at https://www.intel.com/sdm (Volume 3)
Available trace points:
-/sys/kernel/debug/tracing/events/msr/
+/sys/kernel/tracing/events/msr/
Trace MSR reads:
@@ -34,7 +34,7 @@ rdpmc
The trace data can be post processed with the postprocess/decode_msr.py script::
- cat /sys/kernel/debug/tracing/trace | decode_msr.py /usr/src/linux/include/asm/msr-index.h
+ cat /sys/kernel/tracing/trace | decode_msr.py /usr/src/linux/include/asm/msr-index.h
to add symbolic MSR names.
diff --git a/Documentation/trace/events-nmi.rst b/Documentation/trace/events-nmi.rst
index 9e0a7289d80a..22ac1be0ea6f 100644
--- a/Documentation/trace/events-nmi.rst
+++ b/Documentation/trace/events-nmi.rst
@@ -4,7 +4,7 @@ NMI Trace Events
These events normally show up here:
- /sys/kernel/debug/tracing/events/nmi
+ /sys/kernel/tracing/events/nmi
nmi_handler
@@ -31,13 +31,13 @@ really hogging a lot of CPU time, like a millisecond at a time.
Note that the kernel's output is in milliseconds, but the input
to the filter is in nanoseconds! You can filter on 'delta_ns'::
- cd /sys/kernel/debug/tracing/events/nmi/nmi_handler
+ cd /sys/kernel/tracing/events/nmi/nmi_handler
echo 'handler==0xffffffff81625600 && delta_ns>1000000' > filter
echo 1 > enable
Your output would then look like::
- $ cat /sys/kernel/debug/tracing/trace_pipe
+ $ cat /sys/kernel/tracing/trace_pipe
<idle>-0 [000] d.h3 505.397558: nmi_handler: perf_event_nmi_handler() delta_ns: 3236765 handled: 1
<idle>-0 [000] d.h3 505.805893: nmi_handler: perf_event_nmi_handler() delta_ns: 3174234 handled: 1
<idle>-0 [000] d.h3 506.158206: nmi_handler: perf_event_nmi_handler() delta_ns: 3084642 handled: 1
diff --git a/Documentation/trace/events-power.rst b/Documentation/trace/events-power.rst
index 2ef318962e29..f45bf11fa88d 100644
--- a/Documentation/trace/events-power.rst
+++ b/Documentation/trace/events-power.rst
@@ -75,16 +75,6 @@ The PM QoS events are used for QoS add/update/remove request and for
target/flags update.
::
- pm_qos_add_request "pm_qos_class=%s value=%d"
- pm_qos_update_request "pm_qos_class=%s value=%d"
- pm_qos_remove_request "pm_qos_class=%s value=%d"
- pm_qos_update_request_timeout "pm_qos_class=%s value=%d, timeout_us=%ld"
-
-The first parameter gives the QoS class name (e.g. "CPU_DMA_LATENCY").
-The second parameter is value to be added/updated/removed.
-The third parameter is timeout value in usec.
-::
-
pm_qos_update_target "action=%s prev_value=%d curr_value=%d"
pm_qos_update_flags "action=%s prev_value=0x%x curr_value=0x%x"
@@ -92,7 +82,7 @@ The first parameter gives the QoS action name (e.g. "ADD_REQ").
The second parameter is the previous QoS value.
The third parameter is the current QoS value to update.
-And, there are also events used for device PM QoS add/update/remove request.
+There are also events used for device PM QoS add/update/remove request.
::
dev_pm_qos_add_request "device=%s type=%s new_value=%d"
@@ -103,3 +93,12 @@ The first parameter gives the device name which tries to add/update/remove
QoS requests.
The second parameter gives the request type (e.g. "DEV_PM_QOS_RESUME_LATENCY").
The third parameter is value to be added/updated/removed.
+
+And, there are events used for CPU latency QoS add/update/remove request.
+::
+
+ pm_qos_add_request "value=%d"
+ pm_qos_update_request "value=%d"
+ pm_qos_remove_request "value=%d"
+
+The parameter is the value to be added/updated/removed.
diff --git a/Documentation/trace/events.rst b/Documentation/trace/events.rst
index f7e1fcc0953c..759907c20e75 100644
--- a/Documentation/trace/events.rst
+++ b/Documentation/trace/events.rst
@@ -24,27 +24,27 @@ tracing information should be printed.
---------------------------------
The events which are available for tracing can be found in the file
-/sys/kernel/debug/tracing/available_events.
+/sys/kernel/tracing/available_events.
To enable a particular event, such as 'sched_wakeup', simply echo it
-to /sys/kernel/debug/tracing/set_event. For example::
+to /sys/kernel/tracing/set_event. For example::
- # echo sched_wakeup >> /sys/kernel/debug/tracing/set_event
+ # echo sched_wakeup >> /sys/kernel/tracing/set_event
.. Note:: '>>' is necessary, otherwise it will firstly disable all the events.
To disable an event, echo the event name to the set_event file prefixed
with an exclamation point::
- # echo '!sched_wakeup' >> /sys/kernel/debug/tracing/set_event
+ # echo '!sched_wakeup' >> /sys/kernel/tracing/set_event
To disable all events, echo an empty line to the set_event file::
- # echo > /sys/kernel/debug/tracing/set_event
+ # echo > /sys/kernel/tracing/set_event
To enable all events, echo ``*:*`` or ``*:`` to the set_event file::
- # echo *:* > /sys/kernel/debug/tracing/set_event
+ # echo *:* > /sys/kernel/tracing/set_event
The events are organized into subsystems, such as ext4, irq, sched,
etc., and a full event name looks like this: <subsystem>:<event>. The
@@ -53,29 +53,29 @@ file. All of the events in a subsystem can be specified via the syntax
``<subsystem>:*``; for example, to enable all irq events, you can use the
command::
- # echo 'irq:*' > /sys/kernel/debug/tracing/set_event
+ # echo 'irq:*' > /sys/kernel/tracing/set_event
2.2 Via the 'enable' toggle
---------------------------
-The events available are also listed in /sys/kernel/debug/tracing/events/ hierarchy
+The events available are also listed in /sys/kernel/tracing/events/ hierarchy
of directories.
To enable event 'sched_wakeup'::
- # echo 1 > /sys/kernel/debug/tracing/events/sched/sched_wakeup/enable
+ # echo 1 > /sys/kernel/tracing/events/sched/sched_wakeup/enable
To disable it::
- # echo 0 > /sys/kernel/debug/tracing/events/sched/sched_wakeup/enable
+ # echo 0 > /sys/kernel/tracing/events/sched/sched_wakeup/enable
To enable all events in sched subsystem::
- # echo 1 > /sys/kernel/debug/tracing/events/sched/enable
+ # echo 1 > /sys/kernel/tracing/events/sched/enable
To enable all events::
- # echo 1 > /sys/kernel/debug/tracing/events/enable
+ # echo 1 > /sys/kernel/tracing/events/enable
When reading one of these enable files, there are four results:
@@ -126,7 +126,7 @@ is the size of the data item, in bytes.
For example, here's the information displayed for the 'sched_wakeup'
event::
- # cat /sys/kernel/debug/tracing/events/sched/sched_wakeup/format
+ # cat /sys/kernel/tracing/events/sched/sched_wakeup/format
name: sched_wakeup
ID: 60
@@ -198,6 +198,41 @@ The glob (~) accepts a wild card character (\*,?) and character classes
prev_comm ~ "*sh*"
prev_comm ~ "ba*sh"
+If the field is a pointer that points into user space (for example
+"filename" from sys_enter_openat), then you have to append ".ustring" to the
+field name::
+
+ filename.ustring ~ "password"
+
+As the kernel will have to know how to retrieve the memory that the pointer
+is at from user space.
+
+You can convert any long type to a function address and search by function name::
+
+ call_site.function == security_prepare_creds
+
+The above will filter when the field "call_site" falls on the address within
+"security_prepare_creds". That is, it will compare the value of "call_site" and
+the filter will return true if it is greater than or equal to the start of
+the function "security_prepare_creds" and less than the end of that function.
+
+The ".function" postfix can only be attached to values of size long, and can only
+be compared with "==" or "!=".
+
+Cpumask fields or scalar fields that encode a CPU number can be filtered using
+a user-provided cpumask in cpulist format. The format is as follows::
+
+ CPUS{$cpulist}
+
+Operators available to cpumask filtering are:
+
+& (intersection), ==, !=
+
+For example, this will filter events that have their .target_cpu field present
+in the given cpumask::
+
+ target_cpu & CPUS{17-42}
+
5.2 Setting filters
-------------------
@@ -206,19 +241,19 @@ to the 'filter' file for the given event.
For example::
- # cd /sys/kernel/debug/tracing/events/sched/sched_wakeup
+ # cd /sys/kernel/tracing/events/sched/sched_wakeup
# echo "common_preempt_count > 4" > filter
A slightly more involved example::
- # cd /sys/kernel/debug/tracing/events/signal/signal_generate
+ # cd /sys/kernel/tracing/events/signal/signal_generate
# echo "((sig >= 10 && sig < 15) || sig == 17) && comm != bash" > filter
If there is an error in the expression, you'll get an 'Invalid
argument' error when setting it, and the erroneous string along with
an error message can be seen by looking at the filter e.g.::
- # cd /sys/kernel/debug/tracing/events/signal/signal_generate
+ # cd /sys/kernel/tracing/events/signal/signal_generate
# echo "((sig >= 10 && sig < 15) || dsig == 17) && comm != bash" > filter
-bash: echo: write error: Invalid argument
# cat filter
@@ -230,6 +265,16 @@ Currently the caret ('^') for an error always appears at the beginning of
the filter string; the error message should still be useful though
even without more accurate position info.
+5.2.1 Filter limitations
+------------------------
+
+If a filter is placed on a string pointer ``(char *)`` that does not point
+to a string on the ring buffer, but instead points to kernel or user space
+memory, then, for safety reasons, at most 1024 bytes of the content is
+copied onto a temporary buffer to do the compare. If the copy of the memory
+faults (the pointer points to memory that should not be accessed), then the
+string compare will be treated as not matching.
+
5.3 Clearing filters
--------------------
@@ -239,7 +284,7 @@ file.
To clear the filters for all events in a subsystem, write a '0' to the
subsystem's filter file.
-5.3 Subsystem filters
+5.4 Subsystem filters
---------------------
For convenience, filters for every event in a subsystem can be set or
@@ -258,7 +303,7 @@ above points:
Clear the filters on all events in the sched subsystem::
- # cd /sys/kernel/debug/tracing/events/sched
+ # cd /sys/kernel/tracing/events/sched
# echo 0 > filter
# cat sched_switch/filter
none
@@ -268,7 +313,7 @@ Clear the filters on all events in the sched subsystem::
Set a filter using only common fields for all events in the sched
subsystem (all events end up with the same filter)::
- # cd /sys/kernel/debug/tracing/events/sched
+ # cd /sys/kernel/tracing/events/sched
# echo common_pid == 0 > filter
# cat sched_switch/filter
common_pid == 0
@@ -279,14 +324,14 @@ Attempt to set a filter using a non-common field for all events in the
sched subsystem (all events but those that have a prev_pid field retain
their old filters)::
- # cd /sys/kernel/debug/tracing/events/sched
+ # cd /sys/kernel/tracing/events/sched
# echo prev_pid == 0 > filter
# cat sched_switch/filter
prev_pid == 0
# cat sched_wakeup/filter
common_pid == 0
-5.4 PID filtering
+5.5 PID filtering
-----------------
The set_event_pid file in the same directory as the top events directory
@@ -294,7 +339,7 @@ exists, will filter all events from tracing any task that does not have the
PID listed in the set_event_pid file.
::
- # cd /sys/kernel/debug/tracing
+ # cd /sys/kernel/tracing
# echo $$ > set_event_pid
# echo 1 > events/enable
@@ -342,7 +387,8 @@ section of Documentation/trace/ftrace.rst), but there are major
differences and the implementation isn't currently tied to it in any
way, so beware about making generalizations between the two.
-Note: Writing into trace_marker (See Documentation/trace/ftrace.rst)
+.. Note::
+ Writing into trace_marker (See Documentation/trace/ftrace.rst)
can also enable triggers that are written into
/sys/kernel/tracing/events/ftrace/print/trigger
@@ -389,14 +435,14 @@ The following commands are supported:
specifies that this enablement happens only once::
# echo 'enable_event:kmem:kmalloc:1' > \
- /sys/kernel/debug/tracing/events/syscalls/sys_enter_read/trigger
+ /sys/kernel/tracing/events/syscalls/sys_enter_read/trigger
The following trigger causes kmalloc events to stop being traced
when a read system call exits. This disablement happens on every
read system call exit::
# echo 'disable_event:kmem:kmalloc' > \
- /sys/kernel/debug/tracing/events/syscalls/sys_exit_read/trigger
+ /sys/kernel/tracing/events/syscalls/sys_exit_read/trigger
The format is::
@@ -406,10 +452,10 @@ The following commands are supported:
To remove the above commands::
# echo '!enable_event:kmem:kmalloc:1' > \
- /sys/kernel/debug/tracing/events/syscalls/sys_enter_read/trigger
+ /sys/kernel/tracing/events/syscalls/sys_enter_read/trigger
# echo '!disable_event:kmem:kmalloc' > \
- /sys/kernel/debug/tracing/events/syscalls/sys_exit_read/trigger
+ /sys/kernel/tracing/events/syscalls/sys_exit_read/trigger
Note that there can be any number of enable/disable_event triggers
per triggering event, but there can only be one trigger per
@@ -428,13 +474,13 @@ The following commands are supported:
kmalloc tracepoint is hit::
# echo 'stacktrace' > \
- /sys/kernel/debug/tracing/events/kmem/kmalloc/trigger
+ /sys/kernel/tracing/events/kmem/kmalloc/trigger
The following trigger dumps a stacktrace the first 5 times a kmalloc
request happens with a size >= 64K::
# echo 'stacktrace:5 if bytes_req >= 65536' > \
- /sys/kernel/debug/tracing/events/kmem/kmalloc/trigger
+ /sys/kernel/tracing/events/kmem/kmalloc/trigger
The format is::
@@ -443,16 +489,16 @@ The following commands are supported:
To remove the above commands::
# echo '!stacktrace' > \
- /sys/kernel/debug/tracing/events/kmem/kmalloc/trigger
+ /sys/kernel/tracing/events/kmem/kmalloc/trigger
# echo '!stacktrace:5 if bytes_req >= 65536' > \
- /sys/kernel/debug/tracing/events/kmem/kmalloc/trigger
+ /sys/kernel/tracing/events/kmem/kmalloc/trigger
The latter can also be removed more simply by the following (without
the filter)::
# echo '!stacktrace:5' > \
- /sys/kernel/debug/tracing/events/kmem/kmalloc/trigger
+ /sys/kernel/tracing/events/kmem/kmalloc/trigger
Note that there can be only one stacktrace trigger per triggering
event.
@@ -468,20 +514,20 @@ The following commands are supported:
capture those events when the trigger event occurred::
# echo 'snapshot if nr_rq > 1' > \
- /sys/kernel/debug/tracing/events/block/block_unplug/trigger
+ /sys/kernel/tracing/events/block/block_unplug/trigger
To only snapshot once::
# echo 'snapshot:1 if nr_rq > 1' > \
- /sys/kernel/debug/tracing/events/block/block_unplug/trigger
+ /sys/kernel/tracing/events/block/block_unplug/trigger
To remove the above commands::
# echo '!snapshot if nr_rq > 1' > \
- /sys/kernel/debug/tracing/events/block/block_unplug/trigger
+ /sys/kernel/tracing/events/block/block_unplug/trigger
# echo '!snapshot:1 if nr_rq > 1' > \
- /sys/kernel/debug/tracing/events/block/block_unplug/trigger
+ /sys/kernel/tracing/events/block/block_unplug/trigger
Note that there can be only one snapshot trigger per triggering
event.
@@ -499,20 +545,20 @@ The following commands are supported:
trigger event::
# echo 'traceoff:1 if nr_rq > 1' > \
- /sys/kernel/debug/tracing/events/block/block_unplug/trigger
+ /sys/kernel/tracing/events/block/block_unplug/trigger
To always disable tracing when nr_rq > 1::
# echo 'traceoff if nr_rq > 1' > \
- /sys/kernel/debug/tracing/events/block/block_unplug/trigger
+ /sys/kernel/tracing/events/block/block_unplug/trigger
To remove the above commands::
# echo '!traceoff:1 if nr_rq > 1' > \
- /sys/kernel/debug/tracing/events/block/block_unplug/trigger
+ /sys/kernel/tracing/events/block/block_unplug/trigger
# echo '!traceoff if nr_rq > 1' > \
- /sys/kernel/debug/tracing/events/block/block_unplug/trigger
+ /sys/kernel/tracing/events/block/block_unplug/trigger
Note that there can be only one traceon or traceoff trigger per
triggering event.
@@ -525,3 +571,529 @@ The following commands are supported:
event counts (hitcount).
See Documentation/trace/histogram.rst for details and examples.
+
+7. In-kernel trace event API
+============================
+
+In most cases, the command-line interface to trace events is more than
+sufficient. Sometimes, however, applications might find the need for
+more complex relationships than can be expressed through a simple
+series of linked command-line expressions, or putting together sets of
+commands may be simply too cumbersome. An example might be an
+application that needs to 'listen' to the trace stream in order to
+maintain an in-kernel state machine detecting, for instance, when an
+illegal kernel state occurs in the scheduler.
+
+The trace event subsystem provides an in-kernel API allowing modules
+or other kernel code to generate user-defined 'synthetic' events at
+will, which can be used to either augment the existing trace stream
+and/or signal that a particular important state has occurred.
+
+A similar in-kernel API is also available for creating kprobe and
+kretprobe events.
+
+Both the synthetic event and k/ret/probe event APIs are built on top
+of a lower-level "dynevent_cmd" event command API, which is also
+available for more specialized applications, or as the basis of other
+higher-level trace event APIs.
+
+The API provided for these purposes is describe below and allows the
+following:
+
+ - dynamically creating synthetic event definitions
+ - dynamically creating kprobe and kretprobe event definitions
+ - tracing synthetic events from in-kernel code
+ - the low-level "dynevent_cmd" API
+
+7.1 Dyamically creating synthetic event definitions
+---------------------------------------------------
+
+There are a couple ways to create a new synthetic event from a kernel
+module or other kernel code.
+
+The first creates the event in one step, using synth_event_create().
+In this method, the name of the event to create and an array defining
+the fields is supplied to synth_event_create(). If successful, a
+synthetic event with that name and fields will exist following that
+call. For example, to create a new "schedtest" synthetic event::
+
+ ret = synth_event_create("schedtest", sched_fields,
+ ARRAY_SIZE(sched_fields), THIS_MODULE);
+
+The sched_fields param in this example points to an array of struct
+synth_field_desc, each of which describes an event field by type and
+name::
+
+ static struct synth_field_desc sched_fields[] = {
+ { .type = "pid_t", .name = "next_pid_field" },
+ { .type = "char[16]", .name = "next_comm_field" },
+ { .type = "u64", .name = "ts_ns" },
+ { .type = "u64", .name = "ts_ms" },
+ { .type = "unsigned int", .name = "cpu" },
+ { .type = "char[64]", .name = "my_string_field" },
+ { .type = "int", .name = "my_int_field" },
+ };
+
+See synth_field_size() for available types.
+
+If field_name contains [n], the field is considered to be a static array.
+
+If field_names contains[] (no subscript), the field is considered to
+be a dynamic array, which will only take as much space in the event as
+is required to hold the array.
+
+Because space for an event is reserved before assigning field values
+to the event, using dynamic arrays implies that the piecewise
+in-kernel API described below can't be used with dynamic arrays. The
+other non-piecewise in-kernel APIs can, however, be used with dynamic
+arrays.
+
+If the event is created from within a module, a pointer to the module
+must be passed to synth_event_create(). This will ensure that the
+trace buffer won't contain unreadable events when the module is
+removed.
+
+At this point, the event object is ready to be used for generating new
+events.
+
+In the second method, the event is created in several steps. This
+allows events to be created dynamically and without the need to create
+and populate an array of fields beforehand.
+
+To use this method, an empty or partially empty synthetic event should
+first be created using synth_event_gen_cmd_start() or
+synth_event_gen_cmd_array_start(). For synth_event_gen_cmd_start(),
+the name of the event along with one or more pairs of args each pair
+representing a 'type field_name;' field specification should be
+supplied. For synth_event_gen_cmd_array_start(), the name of the
+event along with an array of struct synth_field_desc should be
+supplied. Before calling synth_event_gen_cmd_start() or
+synth_event_gen_cmd_array_start(), the user should create and
+initialize a dynevent_cmd object using synth_event_cmd_init().
+
+For example, to create a new "schedtest" synthetic event with two
+fields::
+
+ struct dynevent_cmd cmd;
+ char *buf;
+
+ /* Create a buffer to hold the generated command */
+ buf = kzalloc(MAX_DYNEVENT_CMD_LEN, GFP_KERNEL);
+
+ /* Before generating the command, initialize the cmd object */
+ synth_event_cmd_init(&cmd, buf, MAX_DYNEVENT_CMD_LEN);
+
+ ret = synth_event_gen_cmd_start(&cmd, "schedtest", THIS_MODULE,
+ "pid_t", "next_pid_field",
+ "u64", "ts_ns");
+
+Alternatively, using an array of struct synth_field_desc fields
+containing the same information::
+
+ ret = synth_event_gen_cmd_array_start(&cmd, "schedtest", THIS_MODULE,
+ fields, n_fields);
+
+Once the synthetic event object has been created, it can then be
+populated with more fields. Fields are added one by one using
+synth_event_add_field(), supplying the dynevent_cmd object, a field
+type, and a field name. For example, to add a new int field named
+"intfield", the following call should be made::
+
+ ret = synth_event_add_field(&cmd, "int", "intfield");
+
+See synth_field_size() for available types. If field_name contains [n]
+the field is considered to be an array.
+
+A group of fields can also be added all at once using an array of
+synth_field_desc with add_synth_fields(). For example, this would add
+just the first four sched_fields::
+
+ ret = synth_event_add_fields(&cmd, sched_fields, 4);
+
+If you already have a string of the form 'type field_name',
+synth_event_add_field_str() can be used to add it as-is; it will
+also automatically append a ';' to the string.
+
+Once all the fields have been added, the event should be finalized and
+registered by calling the synth_event_gen_cmd_end() function::
+
+ ret = synth_event_gen_cmd_end(&cmd);
+
+At this point, the event object is ready to be used for tracing new
+events.
+
+7.2 Tracing synthetic events from in-kernel code
+------------------------------------------------
+
+To trace a synthetic event, there are several options. The first
+option is to trace the event in one call, using synth_event_trace()
+with a variable number of values, or synth_event_trace_array() with an
+array of values to be set. A second option can be used to avoid the
+need for a pre-formed array of values or list of arguments, via
+synth_event_trace_start() and synth_event_trace_end() along with
+synth_event_add_next_val() or synth_event_add_val() to add the values
+piecewise.
+
+7.2.1 Tracing a synthetic event all at once
+-------------------------------------------
+
+To trace a synthetic event all at once, the synth_event_trace() or
+synth_event_trace_array() functions can be used.
+
+The synth_event_trace() function is passed the trace_event_file
+representing the synthetic event (which can be retrieved using
+trace_get_event_file() using the synthetic event name, "synthetic" as
+the system name, and the trace instance name (NULL if using the global
+trace array)), along with an variable number of u64 args, one for each
+synthetic event field, and the number of values being passed.
+
+So, to trace an event corresponding to the synthetic event definition
+above, code like the following could be used::
+
+ ret = synth_event_trace(create_synth_test, 7, /* number of values */
+ 444, /* next_pid_field */
+ (u64)"clackers", /* next_comm_field */
+ 1000000, /* ts_ns */
+ 1000, /* ts_ms */
+ smp_processor_id(),/* cpu */
+ (u64)"Thneed", /* my_string_field */
+ 999); /* my_int_field */
+
+All vals should be cast to u64, and string vals are just pointers to
+strings, cast to u64. Strings will be copied into space reserved in
+the event for the string, using these pointers.
+
+Alternatively, the synth_event_trace_array() function can be used to
+accomplish the same thing. It is passed the trace_event_file
+representing the synthetic event (which can be retrieved using
+trace_get_event_file() using the synthetic event name, "synthetic" as
+the system name, and the trace instance name (NULL if using the global
+trace array)), along with an array of u64, one for each synthetic
+event field.
+
+To trace an event corresponding to the synthetic event definition
+above, code like the following could be used::
+
+ u64 vals[7];
+
+ vals[0] = 777; /* next_pid_field */
+ vals[1] = (u64)"tiddlywinks"; /* next_comm_field */
+ vals[2] = 1000000; /* ts_ns */
+ vals[3] = 1000; /* ts_ms */
+ vals[4] = smp_processor_id(); /* cpu */
+ vals[5] = (u64)"thneed"; /* my_string_field */
+ vals[6] = 398; /* my_int_field */
+
+The 'vals' array is just an array of u64, the number of which must
+match the number of field in the synthetic event, and which must be in
+the same order as the synthetic event fields.
+
+All vals should be cast to u64, and string vals are just pointers to
+strings, cast to u64. Strings will be copied into space reserved in
+the event for the string, using these pointers.
+
+In order to trace a synthetic event, a pointer to the trace event file
+is needed. The trace_get_event_file() function can be used to get
+it - it will find the file in the given trace instance (in this case
+NULL since the top trace array is being used) while at the same time
+preventing the instance containing it from going away::
+
+ schedtest_event_file = trace_get_event_file(NULL, "synthetic",
+ "schedtest");
+
+Before tracing the event, it should be enabled in some way, otherwise
+the synthetic event won't actually show up in the trace buffer.
+
+To enable a synthetic event from the kernel, trace_array_set_clr_event()
+can be used (which is not specific to synthetic events, so does need
+the "synthetic" system name to be specified explicitly).
+
+To enable the event, pass 'true' to it::
+
+ trace_array_set_clr_event(schedtest_event_file->tr,
+ "synthetic", "schedtest", true);
+
+To disable it pass false::
+
+ trace_array_set_clr_event(schedtest_event_file->tr,
+ "synthetic", "schedtest", false);
+
+Finally, synth_event_trace_array() can be used to actually trace the
+event, which should be visible in the trace buffer afterwards::
+
+ ret = synth_event_trace_array(schedtest_event_file, vals,
+ ARRAY_SIZE(vals));
+
+To remove the synthetic event, the event should be disabled, and the
+trace instance should be 'put' back using trace_put_event_file()::
+
+ trace_array_set_clr_event(schedtest_event_file->tr,
+ "synthetic", "schedtest", false);
+ trace_put_event_file(schedtest_event_file);
+
+If those have been successful, synth_event_delete() can be called to
+remove the event::
+
+ ret = synth_event_delete("schedtest");
+
+7.2.2 Tracing a synthetic event piecewise
+-----------------------------------------
+
+To trace a synthetic using the piecewise method described above, the
+synth_event_trace_start() function is used to 'open' the synthetic
+event trace::
+
+ struct synth_event_trace_state trace_state;
+
+ ret = synth_event_trace_start(schedtest_event_file, &trace_state);
+
+It's passed the trace_event_file representing the synthetic event
+using the same methods as described above, along with a pointer to a
+struct synth_event_trace_state object, which will be zeroed before use and
+used to maintain state between this and following calls.
+
+Once the event has been opened, which means space for it has been
+reserved in the trace buffer, the individual fields can be set. There
+are two ways to do that, either one after another for each field in
+the event, which requires no lookups, or by name, which does. The
+tradeoff is flexibility in doing the assignments vs the cost of a
+lookup per field.
+
+To assign the values one after the other without lookups,
+synth_event_add_next_val() should be used. Each call is passed the
+same synth_event_trace_state object used in the synth_event_trace_start(),
+along with the value to set the next field in the event. After each
+field is set, the 'cursor' points to the next field, which will be set
+by the subsequent call, continuing until all the fields have been set
+in order. The same sequence of calls as in the above examples using
+this method would be (without error-handling code)::
+
+ /* next_pid_field */
+ ret = synth_event_add_next_val(777, &trace_state);
+
+ /* next_comm_field */
+ ret = synth_event_add_next_val((u64)"slinky", &trace_state);
+
+ /* ts_ns */
+ ret = synth_event_add_next_val(1000000, &trace_state);
+
+ /* ts_ms */
+ ret = synth_event_add_next_val(1000, &trace_state);
+
+ /* cpu */
+ ret = synth_event_add_next_val(smp_processor_id(), &trace_state);
+
+ /* my_string_field */
+ ret = synth_event_add_next_val((u64)"thneed_2.01", &trace_state);
+
+ /* my_int_field */
+ ret = synth_event_add_next_val(395, &trace_state);
+
+To assign the values in any order, synth_event_add_val() should be
+used. Each call is passed the same synth_event_trace_state object used in
+the synth_event_trace_start(), along with the field name of the field
+to set and the value to set it to. The same sequence of calls as in
+the above examples using this method would be (without error-handling
+code)::
+
+ ret = synth_event_add_val("next_pid_field", 777, &trace_state);
+ ret = synth_event_add_val("next_comm_field", (u64)"silly putty",
+ &trace_state);
+ ret = synth_event_add_val("ts_ns", 1000000, &trace_state);
+ ret = synth_event_add_val("ts_ms", 1000, &trace_state);
+ ret = synth_event_add_val("cpu", smp_processor_id(), &trace_state);
+ ret = synth_event_add_val("my_string_field", (u64)"thneed_9",
+ &trace_state);
+ ret = synth_event_add_val("my_int_field", 3999, &trace_state);
+
+Note that synth_event_add_next_val() and synth_event_add_val() are
+incompatible if used within the same trace of an event - either one
+can be used but not both at the same time.
+
+Finally, the event won't be actually traced until it's 'closed',
+which is done using synth_event_trace_end(), which takes only the
+struct synth_event_trace_state object used in the previous calls::
+
+ ret = synth_event_trace_end(&trace_state);
+
+Note that synth_event_trace_end() must be called at the end regardless
+of whether any of the add calls failed (say due to a bad field name
+being passed in).
+
+7.3 Dyamically creating kprobe and kretprobe event definitions
+--------------------------------------------------------------
+
+To create a kprobe or kretprobe trace event from kernel code, the
+kprobe_event_gen_cmd_start() or kretprobe_event_gen_cmd_start()
+functions can be used.
+
+To create a kprobe event, an empty or partially empty kprobe event
+should first be created using kprobe_event_gen_cmd_start(). The name
+of the event and the probe location should be specified along with one
+or args each representing a probe field should be supplied to this
+function. Before calling kprobe_event_gen_cmd_start(), the user
+should create and initialize a dynevent_cmd object using
+kprobe_event_cmd_init().
+
+For example, to create a new "schedtest" kprobe event with two fields::
+
+ struct dynevent_cmd cmd;
+ char *buf;
+
+ /* Create a buffer to hold the generated command */
+ buf = kzalloc(MAX_DYNEVENT_CMD_LEN, GFP_KERNEL);
+
+ /* Before generating the command, initialize the cmd object */
+ kprobe_event_cmd_init(&cmd, buf, MAX_DYNEVENT_CMD_LEN);
+
+ /*
+ * Define the gen_kprobe_test event with the first 2 kprobe
+ * fields.
+ */
+ ret = kprobe_event_gen_cmd_start(&cmd, "gen_kprobe_test", "do_sys_open",
+ "dfd=%ax", "filename=%dx");
+
+Once the kprobe event object has been created, it can then be
+populated with more fields. Fields can be added using
+kprobe_event_add_fields(), supplying the dynevent_cmd object along
+with a variable arg list of probe fields. For example, to add a
+couple additional fields, the following call could be made::
+
+ ret = kprobe_event_add_fields(&cmd, "flags=%cx", "mode=+4($stack)");
+
+Once all the fields have been added, the event should be finalized and
+registered by calling the kprobe_event_gen_cmd_end() or
+kretprobe_event_gen_cmd_end() functions, depending on whether a kprobe
+or kretprobe command was started::
+
+ ret = kprobe_event_gen_cmd_end(&cmd);
+
+or::
+
+ ret = kretprobe_event_gen_cmd_end(&cmd);
+
+At this point, the event object is ready to be used for tracing new
+events.
+
+Similarly, a kretprobe event can be created using
+kretprobe_event_gen_cmd_start() with a probe name and location and
+additional params such as $retval::
+
+ ret = kretprobe_event_gen_cmd_start(&cmd, "gen_kretprobe_test",
+ "do_sys_open", "$retval");
+
+Similar to the synthetic event case, code like the following can be
+used to enable the newly created kprobe event::
+
+ gen_kprobe_test = trace_get_event_file(NULL, "kprobes", "gen_kprobe_test");
+
+ ret = trace_array_set_clr_event(gen_kprobe_test->tr,
+ "kprobes", "gen_kprobe_test", true);
+
+Finally, also similar to synthetic events, the following code can be
+used to give the kprobe event file back and delete the event::
+
+ trace_put_event_file(gen_kprobe_test);
+
+ ret = kprobe_event_delete("gen_kprobe_test");
+
+7.4 The "dynevent_cmd" low-level API
+------------------------------------
+
+Both the in-kernel synthetic event and kprobe interfaces are built on
+top of a lower-level "dynevent_cmd" interface. This interface is
+meant to provide the basis for higher-level interfaces such as the
+synthetic and kprobe interfaces, which can be used as examples.
+
+The basic idea is simple and amounts to providing a general-purpose
+layer that can be used to generate trace event commands. The
+generated command strings can then be passed to the command-parsing
+and event creation code that already exists in the trace event
+subsystem for creating the corresponding trace events.
+
+In a nutshell, the way it works is that the higher-level interface
+code creates a struct dynevent_cmd object, then uses a couple
+functions, dynevent_arg_add() and dynevent_arg_pair_add() to build up
+a command string, which finally causes the command to be executed
+using the dynevent_create() function. The details of the interface
+are described below.
+
+The first step in building a new command string is to create and
+initialize an instance of a dynevent_cmd. Here, for instance, we
+create a dynevent_cmd on the stack and initialize it::
+
+ struct dynevent_cmd cmd;
+ char *buf;
+ int ret;
+
+ buf = kzalloc(MAX_DYNEVENT_CMD_LEN, GFP_KERNEL);
+
+ dynevent_cmd_init(cmd, buf, maxlen, DYNEVENT_TYPE_FOO,
+ foo_event_run_command);
+
+The dynevent_cmd initialization needs to be given a user-specified
+buffer and the length of the buffer (MAX_DYNEVENT_CMD_LEN can be used
+for this purpose - at 2k it's generally too big to be comfortably put
+on the stack, so is dynamically allocated), a dynevent type id, which
+is meant to be used to check that further API calls are for the
+correct command type, and a pointer to an event-specific run_command()
+callback that will be called to actually execute the event-specific
+command function.
+
+Once that's done, the command string can by built up by successive
+calls to argument-adding functions.
+
+To add a single argument, define and initialize a struct dynevent_arg
+or struct dynevent_arg_pair object. Here's an example of the simplest
+possible arg addition, which is simply to append the given string as
+a whitespace-separated argument to the command::
+
+ struct dynevent_arg arg;
+
+ dynevent_arg_init(&arg, NULL, 0);
+
+ arg.str = name;
+
+ ret = dynevent_arg_add(cmd, &arg);
+
+The arg object is first initialized using dynevent_arg_init() and in
+this case the parameters are NULL or 0, which means there's no
+optional sanity-checking function or separator appended to the end of
+the arg.
+
+Here's another more complicated example using an 'arg pair', which is
+used to create an argument that consists of a couple components added
+together as a unit, for example, a 'type field_name;' arg or a simple
+expression arg e.g. 'flags=%cx'::
+
+ struct dynevent_arg_pair arg_pair;
+
+ dynevent_arg_pair_init(&arg_pair, dynevent_foo_check_arg_fn, 0, ';');
+
+ arg_pair.lhs = type;
+ arg_pair.rhs = name;
+
+ ret = dynevent_arg_pair_add(cmd, &arg_pair);
+
+Again, the arg_pair is first initialized, in this case with a callback
+function used to check the sanity of the args (for example, that
+neither part of the pair is NULL), along with a character to be used
+to add an operator between the pair (here none) and a separator to be
+appended onto the end of the arg pair (here ';').
+
+There's also a dynevent_str_add() function that can be used to simply
+add a string as-is, with no spaces, delimiters, or arg check.
+
+Any number of dynevent_*_add() calls can be made to build up the string
+(until its length surpasses cmd->maxlen). When all the arguments have
+been added and the command string is complete, the only thing left to
+do is run the command, which happens by simply calling
+dynevent_create()::
+
+ ret = dynevent_create(&cmd);
+
+At that point, if the return value is 0, the dynamic event has been
+created and is ready to use.
+
+See the dynevent_cmd function definitions themselves for the details
+of the API.
diff --git a/Documentation/trace/fprobe.rst b/Documentation/trace/fprobe.rst
new file mode 100644
index 000000000000..196f52386aaa
--- /dev/null
+++ b/Documentation/trace/fprobe.rst
@@ -0,0 +1,186 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+==================================
+Fprobe - Function entry/exit probe
+==================================
+
+.. Author: Masami Hiramatsu <mhiramat@kernel.org>
+
+Introduction
+============
+
+Fprobe is a function entry/exit probe mechanism based on ftrace.
+Instead of using ftrace full feature, if you only want to attach callbacks
+on function entry and exit, similar to the kprobes and kretprobes, you can
+use fprobe. Compared with kprobes and kretprobes, fprobe gives faster
+instrumentation for multiple functions with single handler. This document
+describes how to use fprobe.
+
+The usage of fprobe
+===================
+
+The fprobe is a wrapper of ftrace (+ kretprobe-like return callback) to
+attach callbacks to multiple function entry and exit. User needs to set up
+the `struct fprobe` and pass it to `register_fprobe()`.
+
+Typically, `fprobe` data structure is initialized with the `entry_handler`
+and/or `exit_handler` as below.
+
+.. code-block:: c
+
+ struct fprobe fp = {
+ .entry_handler = my_entry_callback,
+ .exit_handler = my_exit_callback,
+ };
+
+To enable the fprobe, call one of register_fprobe(), register_fprobe_ips(), and
+register_fprobe_syms(). These functions register the fprobe with different types
+of parameters.
+
+The register_fprobe() enables a fprobe by function-name filters.
+E.g. this enables @fp on "func*()" function except "func2()".::
+
+ register_fprobe(&fp, "func*", "func2");
+
+The register_fprobe_ips() enables a fprobe by ftrace-location addresses.
+E.g.
+
+.. code-block:: c
+
+ unsigned long ips[] = { 0x.... };
+
+ register_fprobe_ips(&fp, ips, ARRAY_SIZE(ips));
+
+And the register_fprobe_syms() enables a fprobe by symbol names.
+E.g.
+
+.. code-block:: c
+
+ char syms[] = {"func1", "func2", "func3"};
+
+ register_fprobe_syms(&fp, syms, ARRAY_SIZE(syms));
+
+To disable (remove from functions) this fprobe, call::
+
+ unregister_fprobe(&fp);
+
+You can temporally (soft) disable the fprobe by::
+
+ disable_fprobe(&fp);
+
+and resume by::
+
+ enable_fprobe(&fp);
+
+The above is defined by including the header::
+
+ #include <linux/fprobe.h>
+
+Same as ftrace, the registered callbacks will start being called some time
+after the register_fprobe() is called and before it returns. See
+:file:`Documentation/trace/ftrace.rst`.
+
+Also, the unregister_fprobe() will guarantee that the both enter and exit
+handlers are no longer being called by functions after unregister_fprobe()
+returns as same as unregister_ftrace_function().
+
+The fprobe entry/exit handler
+=============================
+
+The prototype of the entry/exit callback function are as follows:
+
+.. code-block:: c
+
+ int entry_callback(struct fprobe *fp, unsigned long entry_ip, unsigned long ret_ip, struct pt_regs *regs, void *entry_data);
+
+ void exit_callback(struct fprobe *fp, unsigned long entry_ip, unsigned long ret_ip, struct pt_regs *regs, void *entry_data);
+
+Note that the @entry_ip is saved at function entry and passed to exit handler.
+If the entry callback function returns !0, the corresponding exit callback will be cancelled.
+
+@fp
+ This is the address of `fprobe` data structure related to this handler.
+ You can embed the `fprobe` to your data structure and get it by
+ container_of() macro from @fp. The @fp must not be NULL.
+
+@entry_ip
+ This is the ftrace address of the traced function (both entry and exit).
+ Note that this may not be the actual entry address of the function but
+ the address where the ftrace is instrumented.
+
+@ret_ip
+ This is the return address that the traced function will return to,
+ somewhere in the caller. This can be used at both entry and exit.
+
+@regs
+ This is the `pt_regs` data structure at the entry and exit. Note that
+ the instruction pointer of @regs may be different from the @entry_ip
+ in the entry_handler. If you need traced instruction pointer, you need
+ to use @entry_ip. On the other hand, in the exit_handler, the instruction
+ pointer of @regs is set to the current return address.
+
+@entry_data
+ This is a local storage to share the data between entry and exit handlers.
+ This storage is NULL by default. If the user specify `exit_handler` field
+ and `entry_data_size` field when registering the fprobe, the storage is
+ allocated and passed to both `entry_handler` and `exit_handler`.
+
+Share the callbacks with kprobes
+================================
+
+Since the recursion safeness of the fprobe (and ftrace) is a bit different
+from the kprobes, this may cause an issue if user wants to run the same
+code from the fprobe and the kprobes.
+
+Kprobes has per-cpu 'current_kprobe' variable which protects the kprobe
+handler from recursion in all cases. On the other hand, fprobe uses
+only ftrace_test_recursion_trylock(). This allows interrupt context to
+call another (or same) fprobe while the fprobe user handler is running.
+
+This is not a matter if the common callback code has its own recursion
+detection, or it can handle the recursion in the different contexts
+(normal/interrupt/NMI.)
+But if it relies on the 'current_kprobe' recursion lock, it has to check
+kprobe_running() and use kprobe_busy_*() APIs.
+
+Fprobe has FPROBE_FL_KPROBE_SHARED flag to do this. If your common callback
+code will be shared with kprobes, please set FPROBE_FL_KPROBE_SHARED
+*before* registering the fprobe, like:
+
+.. code-block:: c
+
+ fprobe.flags = FPROBE_FL_KPROBE_SHARED;
+
+ register_fprobe(&fprobe, "func*", NULL);
+
+This will protect your common callback from the nested call.
+
+The missed counter
+==================
+
+The `fprobe` data structure has `fprobe::nmissed` counter field as same as
+kprobes.
+This counter counts up when;
+
+ - fprobe fails to take ftrace_recursion lock. This usually means that a function
+ which is traced by other ftrace users is called from the entry_handler.
+
+ - fprobe fails to setup the function exit because of the shortage of rethook
+ (the shadow stack for hooking the function return.)
+
+The `fprobe::nmissed` field counts up in both cases. Therefore, the former
+skips both of entry and exit callback and the latter skips the exit
+callback, but in both case the counter will increase by 1.
+
+Note that if you set the FTRACE_OPS_FL_RECURSION and/or FTRACE_OPS_FL_RCU to
+`fprobe::ops::flags` (ftrace_ops::flags) when registering the fprobe, this
+counter may not work correctly, because ftrace skips the fprobe function which
+increase the counter.
+
+
+Functions and structures
+========================
+
+.. kernel-doc:: include/linux/fprobe.h
+.. kernel-doc:: kernel/trace/fprobe.c
+
diff --git a/Documentation/trace/fprobetrace.rst b/Documentation/trace/fprobetrace.rst
new file mode 100644
index 000000000000..0f187e3796e4
--- /dev/null
+++ b/Documentation/trace/fprobetrace.rst
@@ -0,0 +1,251 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+==========================
+Fprobe-based Event Tracing
+==========================
+
+.. Author: Masami Hiramatsu <mhiramat@kernel.org>
+
+Overview
+--------
+
+Fprobe event is similar to the kprobe event, but limited to probe on
+the function entry and exit only. It is good enough for many use cases
+which only traces some specific functions.
+
+This document also covers tracepoint probe events (tprobe) since this
+is also works only on the tracepoint entry. User can trace a part of
+tracepoint argument, or the tracepoint without trace-event, which is
+not exposed on tracefs.
+
+As same as other dynamic events, fprobe events and tracepoint probe
+events are defined via `dynamic_events` interface file on tracefs.
+
+Synopsis of fprobe-events
+-------------------------
+::
+
+ f[:[GRP1/][EVENT1]] SYM [FETCHARGS] : Probe on function entry
+ f[MAXACTIVE][:[GRP1/][EVENT1]] SYM%return [FETCHARGS] : Probe on function exit
+ t[:[GRP2/][EVENT2]] TRACEPOINT [FETCHARGS] : Probe on tracepoint
+
+ GRP1 : Group name for fprobe. If omitted, use "fprobes" for it.
+ GRP2 : Group name for tprobe. If omitted, use "tracepoints" for it.
+ EVENT1 : Event name for fprobe. If omitted, the event name is
+ "SYM__entry" or "SYM__exit".
+ EVENT2 : Event name for tprobe. If omitted, the event name is
+ the same as "TRACEPOINT", but if the "TRACEPOINT" starts
+ with a digit character, "_TRACEPOINT" is used.
+ MAXACTIVE : Maximum number of instances of the specified function that
+ can be probed simultaneously, or 0 for the default value
+ as defined in Documentation/trace/fprobe.rst
+
+ FETCHARGS : Arguments. Each probe can have up to 128 args.
+ ARG : Fetch "ARG" function argument using BTF (only for function
+ entry or tracepoint.) (\*1)
+ @ADDR : Fetch memory at ADDR (ADDR should be in kernel)
+ @SYM[+|-offs] : Fetch memory at SYM +|- offs (SYM should be a data symbol)
+ $stackN : Fetch Nth entry of stack (N >= 0)
+ $stack : Fetch stack address.
+ $argN : Fetch the Nth function argument. (N >= 1) (\*2)
+ $retval : Fetch return value.(\*3)
+ $comm : Fetch current task comm.
+ +|-[u]OFFS(FETCHARG) : Fetch memory at FETCHARG +|- OFFS address.(\*4)(\*5)
+ \IMM : Store an immediate value to the argument.
+ NAME=FETCHARG : Set NAME as the argument name of FETCHARG.
+ FETCHARG:TYPE : Set TYPE as the type of FETCHARG. Currently, basic types
+ (u8/u16/u32/u64/s8/s16/s32/s64), hexadecimal types
+ (x8/x16/x32/x64), "char", "string", "ustring", "symbol", "symstr"
+ and bitfield are supported.
+
+ (\*1) This is available only when BTF is enabled.
+ (\*2) only for the probe on function entry (offs == 0). Note, this argument access
+ is best effort, because depending on the argument type, it may be passed on
+ the stack. But this only support the arguments via registers.
+ (\*3) only for return probe. Note that this is also best effort. Depending on the
+ return value type, it might be passed via a pair of registers. But this only
+ accesses one register.
+ (\*4) this is useful for fetching a field of data structures.
+ (\*5) "u" means user-space dereference.
+
+For the details of TYPE, see :ref:`kprobetrace documentation <kprobetrace_types>`.
+
+Function arguments at exit
+--------------------------
+Function arguments can be accessed at exit probe using $arg<N> fetcharg. This
+is useful to record the function parameter and return value at once, and
+trace the difference of structure fields (for debuging a function whether it
+correctly updates the given data structure or not)
+See the :ref:`sample<fprobetrace_exit_args_sample>` below for how it works.
+
+BTF arguments
+-------------
+BTF (BPF Type Format) argument allows user to trace function and tracepoint
+parameters by its name instead of ``$argN``. This feature is available if the
+kernel is configured with CONFIG_BPF_SYSCALL and CONFIG_DEBUG_INFO_BTF.
+If user only specify the BTF argument, the event's argument name is also
+automatically set by the given name. ::
+
+ # echo 'f:myprobe vfs_read count pos' >> dynamic_events
+ # cat dynamic_events
+ f:fprobes/myprobe vfs_read count=count pos=pos
+
+It also chooses the fetch type from BTF information. For example, in the above
+example, the ``count`` is unsigned long, and the ``pos`` is a pointer. Thus,
+both are converted to 64bit unsigned long, but only ``pos`` has "%Lx"
+print-format as below ::
+
+ # cat events/fprobes/myprobe/format
+ name: myprobe
+ ID: 1313
+ format:
+ field:unsigned short common_type; offset:0; size:2; signed:0;
+ field:unsigned char common_flags; offset:2; size:1; signed:0;
+ field:unsigned char common_preempt_count; offset:3; size:1; signed:0;
+ field:int common_pid; offset:4; size:4; signed:1;
+
+ field:unsigned long __probe_ip; offset:8; size:8; signed:0;
+ field:u64 count; offset:16; size:8; signed:0;
+ field:u64 pos; offset:24; size:8; signed:0;
+
+ print fmt: "(%lx) count=%Lu pos=0x%Lx", REC->__probe_ip, REC->count, REC->pos
+
+If user unsures the name of arguments, ``$arg*`` will be helpful. The ``$arg*``
+is expanded to all function arguments of the function or the tracepoint. ::
+
+ # echo 'f:myprobe vfs_read $arg*' >> dynamic_events
+ # cat dynamic_events
+ f:fprobes/myprobe vfs_read file=file buf=buf count=count pos=pos
+
+BTF also affects the ``$retval``. If user doesn't set any type, the retval
+type is automatically picked from the BTF. If the function returns ``void``,
+``$retval`` is rejected.
+
+You can access the data fields of a data structure using allow operator ``->``
+(for pointer type) and dot operator ``.`` (for data structure type.)::
+
+# echo 't sched_switch preempt prev_pid=prev->pid next_pid=next->pid' >> dynamic_events
+
+The field access operators, ``->`` and ``.`` can be combined for accessing deeper
+members and other structure members pointed by the member. e.g. ``foo->bar.baz->qux``
+If there is non-name union member, you can directly access it as the C code does.
+For example::
+
+ struct {
+ union {
+ int a;
+ int b;
+ };
+ } *foo;
+
+To access ``a`` and ``b``, use ``foo->a`` and ``foo->b`` in this case.
+
+This data field access is available for the return value via ``$retval``,
+e.g. ``$retval->name``.
+
+For these BTF arguments and fields, ``:string`` and ``:ustring`` change the
+behavior. If these are used for BTF argument or field, it checks whether
+the BTF type of the argument or the data field is ``char *`` or ``char []``,
+or not. If not, it rejects applying the string types. Also, with the BTF
+support, you don't need a memory dereference operator (``+0(PTR)``) for
+accessing the string pointed by a ``PTR``. It automatically adds the memory
+dereference operator according to the BTF type. e.g. ::
+
+# echo 't sched_switch prev->comm:string' >> dynamic_events
+# echo 'f getname_flags%return $retval->name:string' >> dynamic_events
+
+The ``prev->comm`` is an embedded char array in the data structure, and
+``$retval->name`` is a char pointer in the data structure. But in both
+cases, you can use ``:string`` type to get the string.
+
+
+Usage examples
+--------------
+Here is an example to add fprobe events on ``vfs_read()`` function entry
+and exit, with BTF arguments.
+::
+
+ # echo 'f vfs_read $arg*' >> dynamic_events
+ # echo 'f vfs_read%return $retval' >> dynamic_events
+ # cat dynamic_events
+ f:fprobes/vfs_read__entry vfs_read file=file buf=buf count=count pos=pos
+ f:fprobes/vfs_read__exit vfs_read%return arg1=$retval
+ # echo 1 > events/fprobes/enable
+ # head -n 20 trace | tail
+ # TASK-PID CPU# ||||| TIMESTAMP FUNCTION
+ # | | | ||||| | |
+ sh-70 [000] ...1. 335.883195: vfs_read__entry: (vfs_read+0x4/0x340) file=0xffff888005cf9a80 buf=0x7ffef36c6879 count=1 pos=0xffffc900005aff08
+ sh-70 [000] ..... 335.883208: vfs_read__exit: (ksys_read+0x75/0x100 <- vfs_read) arg1=1
+ sh-70 [000] ...1. 335.883220: vfs_read__entry: (vfs_read+0x4/0x340) file=0xffff888005cf9a80 buf=0x7ffef36c6879 count=1 pos=0xffffc900005aff08
+ sh-70 [000] ..... 335.883224: vfs_read__exit: (ksys_read+0x75/0x100 <- vfs_read) arg1=1
+ sh-70 [000] ...1. 335.883232: vfs_read__entry: (vfs_read+0x4/0x340) file=0xffff888005cf9a80 buf=0x7ffef36c687a count=1 pos=0xffffc900005aff08
+ sh-70 [000] ..... 335.883237: vfs_read__exit: (ksys_read+0x75/0x100 <- vfs_read) arg1=1
+ sh-70 [000] ...1. 336.050329: vfs_read__entry: (vfs_read+0x4/0x340) file=0xffff888005cf9a80 buf=0x7ffef36c6879 count=1 pos=0xffffc900005aff08
+ sh-70 [000] ..... 336.050343: vfs_read__exit: (ksys_read+0x75/0x100 <- vfs_read) arg1=1
+
+You can see all function arguments and return values are recorded as signed int.
+
+Also, here is an example of tracepoint events on ``sched_switch`` tracepoint.
+To compare the result, this also enables the ``sched_switch`` traceevent too.
+::
+
+ # echo 't sched_switch $arg*' >> dynamic_events
+ # echo 1 > events/sched/sched_switch/enable
+ # echo 1 > events/tracepoints/sched_switch/enable
+ # echo > trace
+ # head -n 20 trace | tail
+ # TASK-PID CPU# ||||| TIMESTAMP FUNCTION
+ # | | | ||||| | |
+ sh-70 [000] d..2. 3912.083993: sched_switch: prev_comm=sh prev_pid=70 prev_prio=120 prev_state=S ==> next_comm=swapper/0 next_pid=0 next_prio=120
+ sh-70 [000] d..3. 3912.083995: sched_switch: (__probestub_sched_switch+0x4/0x10) preempt=0 prev=0xffff88800664e100 next=0xffffffff828229c0 prev_state=1
+ <idle>-0 [000] d..2. 3912.084183: sched_switch: prev_comm=swapper/0 prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=rcu_preempt next_pid=16 next_prio=120
+ <idle>-0 [000] d..3. 3912.084184: sched_switch: (__probestub_sched_switch+0x4/0x10) preempt=0 prev=0xffffffff828229c0 next=0xffff888004208000 prev_state=0
+ rcu_preempt-16 [000] d..2. 3912.084196: sched_switch: prev_comm=rcu_preempt prev_pid=16 prev_prio=120 prev_state=I ==> next_comm=swapper/0 next_pid=0 next_prio=120
+ rcu_preempt-16 [000] d..3. 3912.084196: sched_switch: (__probestub_sched_switch+0x4/0x10) preempt=0 prev=0xffff888004208000 next=0xffffffff828229c0 prev_state=1026
+ <idle>-0 [000] d..2. 3912.085191: sched_switch: prev_comm=swapper/0 prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=rcu_preempt next_pid=16 next_prio=120
+ <idle>-0 [000] d..3. 3912.085191: sched_switch: (__probestub_sched_switch+0x4/0x10) preempt=0 prev=0xffffffff828229c0 next=0xffff888004208000 prev_state=0
+
+As you can see, the ``sched_switch`` trace-event shows *cooked* parameters, on
+the other hand, the ``sched_switch`` tracepoint probe event shows *raw*
+parameters. This means you can access any field values in the task
+structure pointed by the ``prev`` and ``next`` arguments.
+
+For example, usually ``task_struct::start_time`` is not traced, but with this
+traceprobe event, you can trace that field as below.
+::
+
+ # echo 't sched_switch comm=next->comm:string next->start_time' > dynamic_events
+ # head -n 20 trace | tail
+ # TASK-PID CPU# ||||| TIMESTAMP FUNCTION
+ # | | | ||||| | |
+ sh-70 [000] d..3. 5606.686577: sched_switch: (__probestub_sched_switch+0x4/0x10) comm="rcu_preempt" usage=1 start_time=245000000
+ rcu_preempt-16 [000] d..3. 5606.686602: sched_switch: (__probestub_sched_switch+0x4/0x10) comm="sh" usage=1 start_time=1596095526
+ sh-70 [000] d..3. 5606.686637: sched_switch: (__probestub_sched_switch+0x4/0x10) comm="swapper/0" usage=2 start_time=0
+ <idle>-0 [000] d..3. 5606.687190: sched_switch: (__probestub_sched_switch+0x4/0x10) comm="rcu_preempt" usage=1 start_time=245000000
+ rcu_preempt-16 [000] d..3. 5606.687202: sched_switch: (__probestub_sched_switch+0x4/0x10) comm="swapper/0" usage=2 start_time=0
+ <idle>-0 [000] d..3. 5606.690317: sched_switch: (__probestub_sched_switch+0x4/0x10) comm="kworker/0:1" usage=1 start_time=137000000
+ kworker/0:1-14 [000] d..3. 5606.690339: sched_switch: (__probestub_sched_switch+0x4/0x10) comm="swapper/0" usage=2 start_time=0
+ <idle>-0 [000] d..3. 5606.692368: sched_switch: (__probestub_sched_switch+0x4/0x10) comm="kworker/0:1" usage=1 start_time=137000000
+
+.. _fprobetrace_exit_args_sample:
+
+The return probe allows us to access the results of some functions, which returns
+the error code and its results are passed via function parameter, such as an
+structure-initialization function.
+
+For example, vfs_open() will link the file structure to the inode and update
+mode. You can trace that changes with return probe.
+::
+
+ # echo 'f vfs_open mode=file->f_mode:x32 inode=file->f_inode:x64' >> dynamic_events
+ # echo 'f vfs_open%%return mode=file->f_mode:x32 inode=file->f_inode:x64' >> dynamic_events
+ # echo 1 > events/fprobes/enable
+ # cat trace
+ sh-131 [006] ...1. 1945.714346: vfs_open__entry: (vfs_open+0x4/0x40) mode=0x2 inode=0x0
+ sh-131 [006] ...1. 1945.714358: vfs_open__exit: (do_open+0x274/0x3d0 <- vfs_open) mode=0x4d801e inode=0xffff888008470168
+ cat-143 [007] ...1. 1945.717949: vfs_open__entry: (vfs_open+0x4/0x40) mode=0x1 inode=0x0
+ cat-143 [007] ...1. 1945.717956: vfs_open__exit: (do_open+0x274/0x3d0 <- vfs_open) mode=0x4a801d inode=0xffff888005f78d28
+ cat-143 [007] ...1. 1945.720616: vfs_open__entry: (vfs_open+0x4/0x40) mode=0x1 inode=0x0
+ cat-143 [007] ...1. 1945.728263: vfs_open__exit: (do_open+0x274/0x3d0 <- vfs_open) mode=0xa800d inode=0xffff888004ada8d8
+
+You can see the `file::f_mode` and `file::f_inode` are upated in `vfs_open()`.
diff --git a/Documentation/trace/ftrace-design.rst b/Documentation/trace/ftrace-design.rst
index a8e22e0db63c..6893399157f0 100644
--- a/Documentation/trace/ftrace-design.rst
+++ b/Documentation/trace/ftrace-design.rst
@@ -229,14 +229,6 @@ Adding support for it is easy: just define the macro in asm/ftrace.h and
pass the return address pointer as the 'retp' argument to
ftrace_push_return_trace().
-HAVE_FTRACE_NMI_ENTER
----------------------
-
-If you can't trace NMI functions, then skip this option.
-
-<details to be filled>
-
-
HAVE_SYSCALL_TRACEPOINTS
------------------------
diff --git a/Documentation/trace/ftrace-uses.rst b/Documentation/trace/ftrace-uses.rst
index 2a05e770618a..e198854ace79 100644
--- a/Documentation/trace/ftrace-uses.rst
+++ b/Documentation/trace/ftrace-uses.rst
@@ -30,8 +30,8 @@ The ftrace context
This requires extra care to what can be done inside a callback. A callback
can be called outside the protective scope of RCU.
-The ftrace infrastructure has some protections against recursions and RCU
-but one must still be very careful how they use the callbacks.
+There are helper functions to help against recursion, and making sure
+RCU is watching. These are explained below.
The ftrace_ops structure
@@ -55,17 +55,17 @@ an ftrace_ops with ftrace:
Both .flags and .private are optional. Only .func is required.
-To enable tracing call:
+To enable tracing call::
-.. c:function:: register_ftrace_function(&ops);
+ register_ftrace_function(&ops);
-To disable tracing call:
+To disable tracing call::
-.. c:function:: unregister_ftrace_function(&ops);
+ unregister_ftrace_function(&ops);
-The above is defined by including the header:
+The above is defined by including the header::
-.. c:function:: #include <linux/ftrace.h>
+ #include <linux/ftrace.h>
The registered callback will start being called some time after the
register_ftrace_function() is called and before it returns. The exact time
@@ -108,6 +108,58 @@ The prototype of the callback function is as follows (as of v4.14):
at the start of the function where ftrace was tracing. Otherwise it
either contains garbage, or NULL.
+Protect your callback
+=====================
+
+As functions can be called from anywhere, and it is possible that a function
+called by a callback may also be traced, and call that same callback,
+recursion protection must be used. There are two helper functions that
+can help in this regard. If you start your code with:
+
+.. code-block:: c
+
+ int bit;
+
+ bit = ftrace_test_recursion_trylock(ip, parent_ip);
+ if (bit < 0)
+ return;
+
+and end it with:
+
+.. code-block:: c
+
+ ftrace_test_recursion_unlock(bit);
+
+The code in between will be safe to use, even if it ends up calling a
+function that the callback is tracing. Note, on success,
+ftrace_test_recursion_trylock() will disable preemption, and the
+ftrace_test_recursion_unlock() will enable it again (if it was previously
+enabled). The instruction pointer (ip) and its parent (parent_ip) is passed to
+ftrace_test_recursion_trylock() to record where the recursion happened
+(if CONFIG_FTRACE_RECORD_RECURSION is set).
+
+Alternatively, if the FTRACE_OPS_FL_RECURSION flag is set on the ftrace_ops
+(as explained below), then a helper trampoline will be used to test
+for recursion for the callback and no recursion test needs to be done.
+But this is at the expense of a slightly more overhead from an extra
+function call.
+
+If your callback accesses any data or critical section that requires RCU
+protection, it is best to make sure that RCU is "watching", otherwise
+that data or critical section will not be protected as expected. In this
+case add:
+
+.. code-block:: c
+
+ if (!rcu_is_watching())
+ return;
+
+Alternatively, if the FTRACE_OPS_FL_RCU flag is set on the ftrace_ops
+(as explained below), then a helper trampoline will be used to test
+for rcu_is_watching for the callback and no other test needs to be done.
+But this is at the expense of a slightly more overhead from an extra
+function call.
+
The ftrace FLAGS
================
@@ -128,26 +180,20 @@ FTRACE_OPS_FL_SAVE_REGS_IF_SUPPORTED
will not fail with this flag set. But the callback must check if
regs is NULL or not to determine if the architecture supports it.
-FTRACE_OPS_FL_RECURSION_SAFE
- By default, a wrapper is added around the callback to
- make sure that recursion of the function does not occur. That is,
- if a function that is called as a result of the callback's execution
- is also traced, ftrace will prevent the callback from being called
- again. But this wrapper adds some overhead, and if the callback is
- safe from recursion, it can set this flag to disable the ftrace
- protection.
-
- Note, if this flag is set, and recursion does occur, it could cause
- the system to crash, and possibly reboot via a triple fault.
-
- It is OK if another callback traces a function that is called by a
- callback that is marked recursion safe. Recursion safe callbacks
- must never trace any function that are called by the callback
- itself or any nested functions that those functions call.
-
- If this flag is set, it is possible that the callback will also
- be called with preemption enabled (when CONFIG_PREEMPTION is set),
- but this is not guaranteed.
+FTRACE_OPS_FL_RECURSION
+ By default, it is expected that the callback can handle recursion.
+ But if the callback is not that worried about overhead, then
+ setting this bit will add the recursion protection around the
+ callback by calling a helper function that will do the recursion
+ protection and only call the callback if it did not recurse.
+
+ Note, if this flag is not set, and recursion does occur, it could
+ cause the system to crash, and possibly reboot via a triple fault.
+
+ Note, if this flag is set, then the callback will always be called
+ with preemption disabled. If it is not set, then it is possible
+ (but not guaranteed) that the callback will be called in
+ preemptable context.
FTRACE_OPS_FL_IPMODIFY
Requires FTRACE_OPS_FL_SAVE_REGS set. If the callback is to "hijack"
diff --git a/Documentation/trace/ftrace.rst b/Documentation/trace/ftrace.rst
index d2b5657ed33e..7e7b8ec17934 100644
--- a/Documentation/trace/ftrace.rst
+++ b/Documentation/trace/ftrace.rst
@@ -34,13 +34,13 @@ Throughout the kernel is hundreds of static event points that
can be enabled via the tracefs file system to see what is
going on in certain parts of the kernel.
-See events.txt for more information.
+See events.rst for more information.
Implementation Details
----------------------
-See :doc:`ftrace-design` for details for arch porters and such.
+See Documentation/trace/ftrace-design.rst for details for arch porters and such.
The File System
@@ -95,7 +95,8 @@ of ftrace. Here is a list of some of the key files:
current_tracer:
This is used to set or display the current tracer
- that is configured.
+ that is configured. Changing the current tracer clears
+ the ring buffer content as well as the "snapshot" buffer.
available_tracers:
@@ -124,9 +125,13 @@ of ftrace. Here is a list of some of the key files:
trace:
This file holds the output of the trace in a human
- readable format (described below). Note, tracing is temporarily
- disabled when the file is open for reading. Once all readers
- are closed, tracing is re-enabled.
+ readable format (described below). Opening this file for
+ writing with the O_TRUNC flag clears the ring buffer content.
+ Note, this file is not a consumer. If tracing is off
+ (no tracer running, or tracing_on is zero), it will produce
+ the same output each time it is read. When tracing is on,
+ it may produce inconsistent results as it tries to read
+ the entire buffer without consuming it.
trace_pipe:
@@ -140,9 +145,7 @@ of ftrace. Here is a list of some of the key files:
will not be read again with a sequential read. The
"trace" file is static, and if the tracer is not
adding more data, it will display the same
- information every time it is read. Unlike the
- "trace" file, opening this file for reading will not
- temporarily disable tracing.
+ information every time it is read.
trace_options:
@@ -177,6 +180,21 @@ of ftrace. Here is a list of some of the key files:
Only active when the file contains a number greater than 0.
(in microseconds)
+ buffer_percent:
+
+ This is the watermark for how much the ring buffer needs to be filled
+ before a waiter is woken up. That is, if an application calls a
+ blocking read syscall on one of the per_cpu trace_pipe_raw files, it
+ will block until the given amount of data specified by buffer_percent
+ is in the ring buffer before it wakes the reader up. This also
+ controls how the splice system calls are blocked on this file::
+
+ 0 - means to wake up as soon as there is any data in the ring buffer.
+ 50 - means to wake up when roughly half of the ring buffer sub-buffers
+ are full.
+ 100 - means to block until the ring buffer is totally full and is
+ about to start overwriting the older data.
+
buffer_size_kb:
This sets or displays the number of kilobytes each CPU
@@ -185,7 +203,8 @@ of ftrace. Here is a list of some of the key files:
CPU buffer and not total size of all buffers. The
trace buffers are allocated in pages (blocks of memory
that the kernel uses for allocation, usually 4 KB in size).
- If the last page allocated has room for more bytes
+ A few extra pages may be allocated to accommodate buffer management
+ meta-data. If the last page allocated has room for more bytes
than requested, the rest of the page will be used,
making the actual allocation bigger than requested or shown.
( Note, the size may not be a multiple of the page size
@@ -199,6 +218,27 @@ of ftrace. Here is a list of some of the key files:
This displays the total combined size of all the trace buffers.
+ buffer_subbuf_size_kb:
+
+ This sets or displays the sub buffer size. The ring buffer is broken up
+ into several same size "sub buffers". An event can not be bigger than
+ the size of the sub buffer. Normally, the sub buffer is the size of the
+ architecture's page (4K on x86). The sub buffer also contains meta data
+ at the start which also limits the size of an event. That means when
+ the sub buffer is a page size, no event can be larger than the page
+ size minus the sub buffer meta data.
+
+ Note, the buffer_subbuf_size_kb is a way for the user to specify the
+ minimum size of the subbuffer. The kernel may make it bigger due to the
+ implementation details, or simply fail the operation if the kernel can
+ not handle the request.
+
+ Changing the sub buffer size allows for events to be larger than the
+ page size.
+
+ Note: When changing the sub-buffer size, tracing is stopped and any
+ data in the ring buffer and the snapshot buffer will be discarded.
+
free_buffer:
If a process is performing tracing, and the ring buffer should be
@@ -235,7 +275,7 @@ of ftrace. Here is a list of some of the key files:
This interface also allows for commands to be used. See the
"Filter commands" section for more details.
- As a speed up, since processing strings can't be quite expensive
+ As a speed up, since processing strings can be quite expensive
and requires a check of all functions registered to tracing, instead
an index can be written into this file. A number (starting with "1")
written will instead select the same corresponding at the line position
@@ -259,6 +299,20 @@ of ftrace. Here is a list of some of the key files:
traced by the function tracer as well. This option will also
cause PIDs of tasks that exit to be removed from the file.
+ set_ftrace_notrace_pid:
+
+ Have the function tracer ignore threads whose PID are listed in
+ this file.
+
+ If the "function-fork" option is set, then when a task whose
+ PID is listed in this file forks, the child's PID will
+ automatically be added to this file, and the child will not be
+ traced by the function tracer as well. This option will also
+ cause PIDs of tasks that exit to be removed from the file.
+
+ If a PID is in both this file and "set_ftrace_pid", then this
+ file takes precedence, and the thread will not be traced.
+
set_event_pid:
Have the events only trace a task with a PID listed in this file.
@@ -270,6 +324,19 @@ of ftrace. Here is a list of some of the key files:
cause the PIDs of tasks to be removed from this file when the task
exits.
+ set_event_notrace_pid:
+
+ Have the events not trace a task with a PID listed in this file.
+ Note, sched_switch and sched_wakeup will trace threads not listed
+ in this file, even if a thread's PID is in the file if the
+ sched_switch or sched_wakeup events also trace a thread that should
+ be traced.
+
+ To have the PIDs of children of tasks with their PID in this file
+ added on fork, enable the "event-fork" option. That option will also
+ cause the PIDs of tasks to be removed from this file when the task
+ exits.
+
set_graph_function:
Functions listed in this file will cause the function graph
@@ -293,6 +360,12 @@ of ftrace. Here is a list of some of the key files:
"set_graph_function", or "set_graph_notrace".
(See the section "dynamic ftrace" below for more details.)
+ available_filter_functions_addrs:
+
+ Similar to available_filter_functions, but with address displayed
+ for each function. The displayed address is the patch-site address
+ and can differ from /proc/kallsyms address.
+
dyn_ftrace_total_info:
This file is for debugging purposes. The number of functions that
@@ -319,15 +392,40 @@ of ftrace. Here is a list of some of the key files:
an 'I' will be displayed on the same line as the function that
can be overridden.
+ If a non ftrace trampoline is attached (BPF) a 'D' will be displayed.
+ Note, normal ftrace trampolines can also be attached, but only one
+ "direct" trampoline can be attached to a given function at a time.
+
+ Some architectures can not call direct trampolines, but instead have
+ the ftrace ops function located above the function entry point. In
+ such cases an 'O' will be displayed.
+
+ If a function had either the "ip modify" or a "direct" call attached to
+ it in the past, a 'M' will be shown. This flag is never cleared. It is
+ used to know if a function was every modified by the ftrace infrastructure,
+ and can be used for debugging.
+
If the architecture supports it, it will also show what callback
is being directly called by the function. If the count is greater
than 1 it most likely will be ftrace_ops_list_func().
- If the callback of the function jumps to a trampoline that is
- specific to a the callback and not the standard trampoline,
+ If the callback of a function jumps to a trampoline that is
+ specific to the callback and which is not the standard trampoline,
its address will be printed as well as the function that the
trampoline calls.
+ touched_functions:
+
+ This file contains all the functions that ever had a function callback
+ to it via the ftrace infrastructure. It has the same format as
+ enabled_functions but shows all functions that have every been
+ traced.
+
+ To see any function that has every been modified by "ip modify" or a
+ direct trampoline, one can perform the following command:
+
+ grep ' M ' /sys/kernel/tracing/touched_functions
+
function_profile_enabled:
When set it will enable all functions with either the function
@@ -345,11 +443,11 @@ of ftrace. Here is a list of some of the key files:
kprobe_events:
- Enable dynamic trace points. See kprobetrace.txt.
+ Enable dynamic trace points. See kprobetrace.rst.
kprobe_profile:
- Dynamic trace points stats. See kprobetrace.txt.
+ Dynamic trace points stats. See kprobetrace.rst.
max_graph_depth:
@@ -382,7 +480,7 @@ of ftrace. Here is a list of some of the key files:
By default, 128 comms are saved (see "saved_cmdlines" above). To
increase or decrease the amount of comms that are cached, echo
- in a the number of comms to cache, into this file.
+ the number of comms to cache into this file.
saved_tgids:
@@ -486,10 +584,25 @@ of ftrace. Here is a list of some of the key files:
processing should be able to handle them. See comments in the
ktime_get_boot_fast_ns() function for more information.
+ tai:
+ This is the tai clock (CLOCK_TAI) and is derived from the wall-
+ clock time. However, this clock does not experience
+ discontinuities and backwards jumps caused by NTP inserting leap
+ seconds. Since the clock access is designed for use in tracing,
+ side effects are possible. The clock access may yield wrong
+ readouts in case the internal TAI offset is updated e.g., caused
+ by setting the system time or using adjtimex() with an offset.
+ These effects are rare and post processing should be able to
+ handle them. See comments in the ktime_get_tai_fast_ns()
+ function for more information.
+
To set a clock, simply echo the clock name into this file::
# echo global > trace_clock
+ Setting a clock clears the ring buffer content as well as the
+ "snapshot" buffer.
+
trace_marker:
This is a very useful file for synchronizing user space
@@ -518,7 +631,7 @@ of ftrace. Here is a list of some of the key files:
start::
- trace_fd = open("trace_marker", WR_ONLY);
+ trace_fd = open("trace_marker", O_WRONLY);
Note: Writing into the trace_marker file can also initiate triggers
that are written into /sys/kernel/tracing/events/ftrace/print/trigger
@@ -527,14 +640,14 @@ of ftrace. Here is a list of some of the key files:
trace_marker_raw:
- This is similar to trace_marker above, but is meant for for binary data
+ This is similar to trace_marker above, but is meant for binary data
to be written to it, where a tool can be used to parse the data
from trace_pipe_raw.
uprobe_events:
Add dynamic tracepoints in programs.
- See uprobetracer.txt
+ See uprobetracer.rst
uprobe_profile:
@@ -555,19 +668,19 @@ of ftrace. Here is a list of some of the key files:
files at various levels that can enable the tracepoints
when a "1" is written to them.
- See events.txt for more information.
+ See events.rst for more information.
set_event:
By echoing in the event into this file, will enable that event.
- See events.txt for more information.
+ See events.rst for more information.
available_events:
A list of events that can be enabled in tracing.
- See events.txt for more information.
+ See events.rst for more information.
timestamp_mode:
@@ -784,10 +897,10 @@ Error conditions
The extended error information and usage takes the form shown in
this example::
- # echo xxx > /sys/kernel/debug/tracing/events/sched/sched_wakeup/trigger
+ # echo xxx > /sys/kernel/tracing/events/sched/sched_wakeup/trigger
echo: write error: Invalid argument
- # cat /sys/kernel/debug/tracing/error_log
+ # cat /sys/kernel/tracing/error_log
[ 5348.887237] location: error: Couldn't yyy: zzz
Command: xxx
^
@@ -797,7 +910,7 @@ Error conditions
To clear the error log, echo the empty string into it::
- # echo > /sys/kernel/debug/tracing/error_log
+ # echo > /sys/kernel/tracing/error_log
Examples of using the tracer
----------------------------
@@ -981,6 +1094,7 @@ To see what is available, simply cat the file::
nohex
nobin
noblock
+ nofields
trace_printk
annotate
nouserstacktrace
@@ -1064,6 +1178,11 @@ Here are the available options:
block
When set, reading trace_pipe will not block when polled.
+ fields
+ Print the fields as described by their types. This is a better
+ option than using hex, bin or raw, as it gives a better parsing
+ of the content of the event.
+
trace_printk
Can disable trace_printk() from writing into the buffer.
@@ -1119,6 +1238,18 @@ Here are the available options:
the trace displays additional information about the
latency, as described in "Latency trace format".
+ pause-on-trace
+ When set, opening the trace file for read, will pause
+ writing to the ring buffer (as if tracing_on was set to zero).
+ This simulates the original behavior of the trace file.
+ When the file is closed, tracing will be enabled again.
+
+ hash-ptr
+ When set, "%p" in the event printk format displays the
+ hashed pointer value instead of real address.
+ This will be useful if you want to find out which hashed
+ value is corresponding to the real value in trace log.
+
record-cmd
When any event or tracer is enabled, a hook is enabled
in the sched_switch trace point to fill comm cache
@@ -1170,6 +1301,8 @@ Here are the available options:
tasks fork. Also, when tasks with PIDs in set_event_pid exit,
their PIDs will be removed from the file.
+ This affects PIDs listed in set_event_notrace_pid as well.
+
function-trace
The latency tracers will enable function tracing
if this option is enabled (default it is). When
@@ -1184,6 +1317,8 @@ Here are the available options:
set_ftrace_pid exit, their PIDs will be removed from the
file.
+ This affects PIDs in set_ftrace_notrace_pid as well.
+
display-graph
When set, the latency tracers (irqsoff, wakeup, etc) will
use function graph tracing instead of function tracing.
@@ -1266,6 +1401,19 @@ Options for function_graph tracer:
only a closing curly bracket "}" is displayed for
the return of a function.
+ funcgraph-retval
+ When set, the return value of each traced function
+ will be printed after an equal sign "=". By default
+ this is off.
+
+ funcgraph-retval-hex
+ When set, the return value will always be printed
+ in hexadecimal format. If the option is not set and
+ the return value is an error code, it will be printed
+ in signed decimal format; otherwise it will also be
+ printed in hexadecimal format. By default, this option
+ is off.
+
sleep-time
When running function graph tracer, to include
the time a task schedules out in its function.
@@ -1350,7 +1498,7 @@ an example::
=> x86_64_start_reservations
=> x86_64_start_kernel
-Here we see that that we had a latency of 16 microseconds (which is
+Here we see that we had a latency of 16 microseconds (which is
very good). The _raw_spin_lock_irq in run_timer_softirq disabled
interrupts. The difference between the 16 and the displayed
timestamp 25us occurred because the clock was incremented
@@ -1409,7 +1557,7 @@ function-trace, we get a much larger output::
=> __blk_run_queue_uncond
=> __blk_run_queue
=> blk_queue_bio
- => generic_make_request
+ => submit_bio_noacct
=> submit_bio
=> submit_bh
=> __ext3_get_inode_loc
@@ -1480,7 +1628,7 @@ display-graph option::
=> remove_vma
=> exit_mmap
=> mmput
- => flush_old_exec
+ => begin_new_exec
=> load_elf_binary
=> search_binary_handler
=> __do_execve_file.isra.32
@@ -1694,7 +1842,7 @@ tracers.
=> __blk_run_queue_uncond
=> __blk_run_queue
=> blk_queue_bio
- => generic_make_request
+ => submit_bio_noacct
=> submit_bio
=> submit_bh
=> ext3_bread
@@ -2120,6 +2268,8 @@ periodically make a CPU constantly busy with interrupts disabled.
# cat trace
# tracer: hwlat
#
+ # entries-in-buffer/entries-written: 13/13 #P:8
+ #
# _-----=> irqs-off
# / _----=> need-resched
# | / _---=> hardirq/softirq
@@ -2127,12 +2277,18 @@ periodically make a CPU constantly busy with interrupts disabled.
# ||| / delay
# TASK-PID CPU# |||| TIMESTAMP FUNCTION
# | | | |||| | |
- <...>-3638 [001] d... 19452.055471: #1 inner/outer(us): 12/14 ts:1499801089.066141940
- <...>-3638 [003] d... 19454.071354: #2 inner/outer(us): 11/9 ts:1499801091.082164365
- <...>-3638 [002] dn.. 19461.126852: #3 inner/outer(us): 12/9 ts:1499801098.138150062
- <...>-3638 [001] d... 19488.340960: #4 inner/outer(us): 8/12 ts:1499801125.354139633
- <...>-3638 [003] d... 19494.388553: #5 inner/outer(us): 8/12 ts:1499801131.402150961
- <...>-3638 [003] d... 19501.283419: #6 inner/outer(us): 0/12 ts:1499801138.297435289 nmi-total:4 nmi-count:1
+ <...>-1729 [001] d... 678.473449: #1 inner/outer(us): 11/12 ts:1581527483.343962693 count:6
+ <...>-1729 [004] d... 689.556542: #2 inner/outer(us): 16/9 ts:1581527494.889008092 count:1
+ <...>-1729 [005] d... 714.756290: #3 inner/outer(us): 16/16 ts:1581527519.678961629 count:5
+ <...>-1729 [001] d... 718.788247: #4 inner/outer(us): 9/17 ts:1581527523.889012713 count:1
+ <...>-1729 [002] d... 719.796341: #5 inner/outer(us): 13/9 ts:1581527524.912872606 count:1
+ <...>-1729 [006] d... 844.787091: #6 inner/outer(us): 9/12 ts:1581527649.889048502 count:2
+ <...>-1729 [003] d... 849.827033: #7 inner/outer(us): 18/9 ts:1581527654.889013793 count:1
+ <...>-1729 [007] d... 853.859002: #8 inner/outer(us): 9/12 ts:1581527658.889065736 count:1
+ <...>-1729 [001] d... 855.874978: #9 inner/outer(us): 9/11 ts:1581527660.861991877 count:1
+ <...>-1729 [001] d... 863.938932: #10 inner/outer(us): 9/11 ts:1581527668.970010500 count:1 nmi-total:7 nmi-count:1
+ <...>-1729 [007] d... 878.050780: #11 inner/outer(us): 9/12 ts:1581527683.385002600 count:1 nmi-total:5 nmi-count:1
+ <...>-1729 [007] d... 886.114702: #12 inner/outer(us): 9/12 ts:1581527691.385001600 count:1
The above output is somewhat the same in the header. All events will have
@@ -2142,7 +2298,7 @@ interrupts disabled 'd'. Under the FUNCTION title there is:
This is the count of events recorded that were greater than the
tracing_threshold (See below).
- inner/outer(us): 12/14
+ inner/outer(us): 11/11
This shows two numbers as "inner latency" and "outer latency". The test
runs in a loop checking a timestamp twice. The latency detected within
@@ -2150,11 +2306,15 @@ interrupts disabled 'd'. Under the FUNCTION title there is:
after the previous timestamp and the next timestamp in the loop is
the "outer latency".
- ts:1499801089.066141940
+ ts:1581527483.343962693
+
+ The absolute timestamp that the first latency was recorded in the window.
- The absolute timestamp that the event happened.
+ count:6
- nmi-total:4 nmi-count:1
+ The number of times a latency was detected during the window.
+
+ nmi-total:7 nmi-count:1
On architectures that support it, if an NMI comes in during the
test, the time spent in NMI is reported in "nmi-total" (in
@@ -2380,11 +2540,10 @@ Or this simple script!
#!/bin/bash
tracefs=`sed -ne 's/^tracefs \(.*\) tracefs.*/\1/p' /proc/mounts`
- echo nop > $tracefs/tracing/current_tracer
- echo 0 > $tracefs/tracing/tracing_on
- echo $$ > $tracefs/tracing/set_ftrace_pid
- echo function > $tracefs/tracing/current_tracer
- echo 1 > $tracefs/tracing/tracing_on
+ echo 0 > $tracefs/tracing_on
+ echo $$ > $tracefs/set_ftrace_pid
+ echo function > $tracefs/current_tracer
+ echo 1 > $tracefs/tracing_on
exec "$@"
@@ -2451,7 +2610,7 @@ want, depending on your needs.
- The cpu number on which the function executed is default
enabled. It is sometimes better to only trace one cpu (see
- tracing_cpu_mask file) or you might sometimes see unordered
+ tracing_cpumask file) or you might sometimes see unordered
function calls while cpu tracing switch.
- hide: echo nofuncgraph-cpu > trace_options
@@ -2600,6 +2759,119 @@ It is default disabled.
0) 1.757 us | } /* kmem_cache_free() */
0) 2.861 us | } /* putname() */
+The return value of each traced function can be displayed after
+an equal sign "=". When encountering system call failures, it
+can be very helpful to quickly locate the function that first
+returns an error code.
+
+ - hide: echo nofuncgraph-retval > trace_options
+ - show: echo funcgraph-retval > trace_options
+
+ Example with funcgraph-retval::
+
+ 1) | cgroup_migrate() {
+ 1) 0.651 us | cgroup_migrate_add_task(); /* = 0xffff93fcfd346c00 */
+ 1) | cgroup_migrate_execute() {
+ 1) | cpu_cgroup_can_attach() {
+ 1) | cgroup_taskset_first() {
+ 1) 0.732 us | cgroup_taskset_next(); /* = 0xffff93fc8fb20000 */
+ 1) 1.232 us | } /* cgroup_taskset_first = 0xffff93fc8fb20000 */
+ 1) 0.380 us | sched_rt_can_attach(); /* = 0x0 */
+ 1) 2.335 us | } /* cpu_cgroup_can_attach = -22 */
+ 1) 4.369 us | } /* cgroup_migrate_execute = -22 */
+ 1) 7.143 us | } /* cgroup_migrate = -22 */
+
+The above example shows that the function cpu_cgroup_can_attach
+returned the error code -22 firstly, then we can read the code
+of this function to get the root cause.
+
+When the option funcgraph-retval-hex is not set, the return value can
+be displayed in a smart way. Specifically, if it is an error code,
+it will be printed in signed decimal format, otherwise it will
+printed in hexadecimal format.
+
+ - smart: echo nofuncgraph-retval-hex > trace_options
+ - hexadecimal: echo funcgraph-retval-hex > trace_options
+
+ Example with funcgraph-retval-hex::
+
+ 1) | cgroup_migrate() {
+ 1) 0.651 us | cgroup_migrate_add_task(); /* = 0xffff93fcfd346c00 */
+ 1) | cgroup_migrate_execute() {
+ 1) | cpu_cgroup_can_attach() {
+ 1) | cgroup_taskset_first() {
+ 1) 0.732 us | cgroup_taskset_next(); /* = 0xffff93fc8fb20000 */
+ 1) 1.232 us | } /* cgroup_taskset_first = 0xffff93fc8fb20000 */
+ 1) 0.380 us | sched_rt_can_attach(); /* = 0x0 */
+ 1) 2.335 us | } /* cpu_cgroup_can_attach = 0xffffffea */
+ 1) 4.369 us | } /* cgroup_migrate_execute = 0xffffffea */
+ 1) 7.143 us | } /* cgroup_migrate = 0xffffffea */
+
+At present, there are some limitations when using the funcgraph-retval
+option, and these limitations will be eliminated in the future:
+
+- Even if the function return type is void, a return value will still
+ be printed, and you can just ignore it.
+
+- Even if return values are stored in multiple registers, only the
+ value contained in the first register will be recorded and printed.
+ To illustrate, in the x86 architecture, eax and edx are used to store
+ a 64-bit return value, with the lower 32 bits saved in eax and the
+ upper 32 bits saved in edx. However, only the value stored in eax
+ will be recorded and printed.
+
+- In certain procedure call standards, such as arm64's AAPCS64, when a
+ type is smaller than a GPR, it is the responsibility of the consumer
+ to perform the narrowing, and the upper bits may contain UNKNOWN values.
+ Therefore, it is advisable to check the code for such cases. For instance,
+ when using a u8 in a 64-bit GPR, bits [63:8] may contain arbitrary values,
+ especially when larger types are truncated, whether explicitly or implicitly.
+ Here are some specific cases to illustrate this point:
+
+ **Case One**:
+
+ The function narrow_to_u8 is defined as follows::
+
+ u8 narrow_to_u8(u64 val)
+ {
+ // implicitly truncated
+ return val;
+ }
+
+ It may be compiled to::
+
+ narrow_to_u8:
+ < ... ftrace instrumentation ... >
+ RET
+
+ If you pass 0x123456789abcdef to this function and want to narrow it,
+ it may be recorded as 0x123456789abcdef instead of 0xef.
+
+ **Case Two**:
+
+ The function error_if_not_4g_aligned is defined as follows::
+
+ int error_if_not_4g_aligned(u64 val)
+ {
+ if (val & GENMASK(31, 0))
+ return -EINVAL;
+
+ return 0;
+ }
+
+ It could be compiled to::
+
+ error_if_not_4g_aligned:
+ CBNZ w0, .Lnot_aligned
+ RET // bits [31:0] are zero, bits
+ // [63:32] are UNKNOWN
+ .Lnot_aligned:
+ MOV x0, #-EINVAL
+ RET
+
+ When passing 0x2_0000_0000 to it, the return value may be recorded as
+ 0x2_0000_0000 instead of 0.
+
You can put some comments on specific functions by using
trace_printk() For example, if you want to put a comment inside
the __might_sleep() function, you just have to include
@@ -2700,7 +2972,7 @@ listed in:
put_prev_task_idle
kmem_cache_create
pick_next_task_rt
- get_online_cpus
+ cpus_read_lock
pick_next_task_fair
mutex_lock
[...]
@@ -2867,7 +3139,7 @@ Produces::
bash-1994 [000] .... 4342.324898: ima_get_action <-process_measurement
bash-1994 [000] .... 4342.324898: ima_match_policy <-ima_get_action
bash-1994 [000] .... 4342.324899: do_truncate <-do_last
- bash-1994 [000] .... 4342.324899: should_remove_suid <-do_truncate
+ bash-1994 [000] .... 4342.324899: setattr_should_drop_suidgid <-do_truncate
bash-1994 [000] .... 4342.324899: notify_change <-do_truncate
bash-1994 [000] .... 4342.324900: current_fs_time <-notify_change
bash-1994 [000] .... 4342.324900: current_kernel_time <-current_fs_time
@@ -3309,7 +3581,7 @@ one of the latency tracers, you will get the following results.
Instances
---------
-In the tracefs tracing directory is a directory called "instances".
+In the tracefs tracing directory, there is a directory called "instances".
This directory can have new directories created inside of it using
mkdir, and removing directories with rmdir. The directory created
with mkdir in this directory will already contain files and other
@@ -3324,7 +3596,7 @@ directories after it is created.
As you can see, the new directory looks similar to the tracing directory
itself. In fact, it is very similar, except that the buffer and
-events are agnostic from the main director, or from any other
+events are agnostic from the main directory, or from any other
instances that are created.
The files in the new directory work just like the files with the
@@ -3437,7 +3709,7 @@ directories, the rmdir will fail with EBUSY.
Stack trace
-----------
Since the kernel has a fixed sized stack, it is important not to
-waste it in functions. A kernel developer must be conscience of
+waste it in functions. A kernel developer must be conscious of
what they allocate on the stack. If they add too much, the system
can be in danger of a stack overflow, and corruption will occur,
usually leading to a system panic.
diff --git a/Documentation/trace/hisi-ptt.rst b/Documentation/trace/hisi-ptt.rst
new file mode 100644
index 000000000000..989255eb5622
--- /dev/null
+++ b/Documentation/trace/hisi-ptt.rst
@@ -0,0 +1,304 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+======================================
+HiSilicon PCIe Tune and Trace device
+======================================
+
+Introduction
+============
+
+HiSilicon PCIe tune and trace device (PTT) is a PCIe Root Complex
+integrated Endpoint (RCiEP) device, providing the capability
+to dynamically monitor and tune the PCIe link's events (tune),
+and trace the TLP headers (trace). The two functions are independent,
+but is recommended to use them together to analyze and enhance the
+PCIe link's performance.
+
+On Kunpeng 930 SoC, the PCIe Root Complex is composed of several
+PCIe cores. Each PCIe core includes several Root Ports and a PTT
+RCiEP, like below. The PTT device is capable of tuning and
+tracing the links of the PCIe core.
+::
+
+ +--------------Core 0-------+
+ | | [ PTT ] |
+ | | [Root Port]---[Endpoint]
+ | | [Root Port]---[Endpoint]
+ | | [Root Port]---[Endpoint]
+ Root Complex |------Core 1-------+
+ | | [ PTT ] |
+ | | [Root Port]---[ Switch ]---[Endpoint]
+ | | [Root Port]---[Endpoint] `-[Endpoint]
+ | | [Root Port]---[Endpoint]
+ +---------------------------+
+
+The PTT device driver registers one PMU device for each PTT device.
+The name of each PTT device is composed of 'hisi_ptt' prefix with
+the id of the SICL and the Core where it locates. The Kunpeng 930
+SoC encapsulates multiple CPU dies (SCCL, Super CPU Cluster) and
+IO dies (SICL, Super I/O Cluster), where there's one PCIe Root
+Complex for each SICL.
+::
+
+ /sys/devices/hisi_ptt<sicl_id>_<core_id>
+
+Tune
+====
+
+PTT tune is designed for monitoring and adjusting PCIe link parameters (events).
+Currently we support events in 2 classes. The scope of the events
+covers the PCIe core to which the PTT device belongs.
+
+Each event is presented as a file under $(PTT PMU dir)/tune, and
+a simple open/read/write/close cycle will be used to tune the event.
+::
+
+ $ cd /sys/devices/hisi_ptt<sicl_id>_<core_id>/tune
+ $ ls
+ qos_tx_cpl qos_tx_np qos_tx_p
+ tx_path_rx_req_alloc_buf_level
+ tx_path_tx_req_alloc_buf_level
+ $ cat qos_tx_dp
+ 1
+ $ echo 2 > qos_tx_dp
+ $ cat qos_tx_dp
+ 2
+
+Current value (numerical value) of the event can be simply read
+from the file, and the desired value written to the file to tune.
+
+1. Tx Path QoS Control
+------------------------
+
+The following files are provided to tune the QoS of the tx path of
+the PCIe core.
+
+- qos_tx_cpl: weight of Tx completion TLPs
+- qos_tx_np: weight of Tx non-posted TLPs
+- qos_tx_p: weight of Tx posted TLPs
+
+The weight influences the proportion of certain packets on the PCIe link.
+For example, for the storage scenario, increase the proportion
+of the completion packets on the link to enhance the performance as
+more completions are consumed.
+
+The available tune data of these events is [0, 1, 2].
+Writing a negative value will return an error, and out of range
+values will be converted to 2. Note that the event value just
+indicates a probable level, but is not precise.
+
+2. Tx Path Buffer Control
+-------------------------
+
+Following files are provided to tune the buffer of tx path of the PCIe core.
+
+- rx_alloc_buf_level: watermark of Rx requested
+- tx_alloc_buf_level: watermark of Tx requested
+
+These events influence the watermark of the buffer allocated for each
+type. Rx means the inbound while Tx means outbound. The packets will
+be stored in the buffer first and then transmitted either when the
+watermark reached or when timed out. For a busy direction, you should
+increase the related buffer watermark to avoid frequently posting and
+thus enhance the performance. In most cases just keep the default value.
+
+The available tune data of above events is [0, 1, 2].
+Writing a negative value will return an error, and out of range
+values will be converted to 2. Note that the event value just
+indicates a probable level, but is not precise.
+
+Trace
+=====
+
+PTT trace is designed for dumping the TLP headers to the memory, which
+can be used to analyze the transactions and usage condition of the PCIe
+Link. You can choose to filter the traced headers by either Requester ID,
+or those downstream of a set of Root Ports on the same core of the PTT
+device. It's also supported to trace the headers of certain type and of
+certain direction.
+
+You can use the perf command `perf record` to set the parameters, start
+trace and get the data. It's also supported to decode the trace
+data with `perf report`. The control parameters for trace is inputted
+as event code for each events, which will be further illustrated later.
+An example usage is like
+::
+
+ $ perf record -e hisi_ptt0_2/filter=0x80001,type=1,direction=1,
+ format=1/ -- sleep 5
+
+This will trace the TLP headers downstream root port 0000:00:10.1 (event
+code for event 'filter' is 0x80001) with type of posted TLP requests,
+direction of inbound and traced data format of 8DW.
+
+1. Filter
+---------
+
+The TLP headers to trace can be filtered by the Root Ports or the Requester ID
+of the Endpoint, which are located on the same core of the PTT device. You can
+set the filter by specifying the `filter` parameter which is required to start
+the trace. The parameter value is 20 bit. Bit 19 indicates the filter type.
+1 for Root Port filter and 0 for Requester filter. Bit[15:0] indicates the
+filter value. The value for a Root Port is a mask of the core port id which is
+calculated from its PCI Slot ID as (slotid & 7) * 2. The value for a Requester
+is the Requester ID (Device ID of the PCIe function). Bit[18:16] is currently
+reserved for extension.
+
+For example, if the desired filter is Endpoint function 0000:01:00.1 the filter
+value will be 0x00101. If the desired filter is Root Port 0000:00:10.0 then
+then filter value is calculated as 0x80001.
+
+The driver also presents every supported Root Port and Requester filter through
+sysfs. Each filter will be an individual file with name of its related PCIe
+device name (domain:bus:device.function). The files of Root Port filters are
+under $(PTT PMU dir)/root_port_filters and files of Requester filters
+are under $(PTT PMU dir)/requester_filters.
+
+Note that multiple Root Ports can be specified at one time, but only one
+Endpoint function can be specified in one trace. Specifying both Root Port
+and function at the same time is not supported. Driver maintains a list of
+available filters and will check the invalid inputs.
+
+The available filters will be dynamically updated, which means you will always
+get correct filter information when hotplug events happen, or when you manually
+remove/rescan the devices.
+
+2. Type
+-------
+
+You can trace the TLP headers of certain types by specifying the `type`
+parameter, which is required to start the trace. The parameter value is
+8 bit. Current supported types and related values are shown below:
+
+- 8'b00000001: posted requests (P)
+- 8'b00000010: non-posted requests (NP)
+- 8'b00000100: completions (CPL)
+
+You can specify multiple types when tracing inbound TLP headers, but can only
+specify one when tracing outbound TLP headers.
+
+3. Direction
+------------
+
+You can trace the TLP headers from certain direction, which is relative
+to the Root Port or the PCIe core, by specifying the `direction` parameter.
+This is optional and the default parameter is inbound. The parameter value
+is 4 bit. When the desired format is 4DW, directions and related values
+supported are shown below:
+
+- 4'b0000: inbound TLPs (P, NP, CPL)
+- 4'b0001: outbound TLPs (P, NP, CPL)
+- 4'b0010: outbound TLPs (P, NP, CPL) and inbound TLPs (P, NP, CPL B)
+- 4'b0011: outbound TLPs (P, NP, CPL) and inbound TLPs (CPL A)
+
+When the desired format is 8DW, directions and related values supported are
+shown below:
+
+- 4'b0000: reserved
+- 4'b0001: outbound TLPs (P, NP, CPL)
+- 4'b0010: inbound TLPs (P, NP, CPL B)
+- 4'b0011: inbound TLPs (CPL A)
+
+Inbound completions are classified into two types:
+
+- completion A (CPL A): completion of CHI/DMA/Native non-posted requests, except for CPL B
+- completion B (CPL B): completion of DMA remote2local and P2P non-posted requests
+
+4. Format
+--------------
+
+You can change the format of the traced TLP headers by specifying the
+`format` parameter. The default format is 4DW. The parameter value is 4 bit.
+Current supported formats and related values are shown below:
+
+- 4'b0000: 4DW length per TLP header
+- 4'b0001: 8DW length per TLP header
+
+The traced TLP header format is different from the PCIe standard.
+
+When using the 8DW data format, the entire TLP header is logged
+(Header DW0-3 shown below). For example, the TLP header for Memory
+Reads with 64-bit addresses is shown in PCIe r5.0, Figure 2-17;
+the header for Configuration Requests is shown in Figure 2.20, etc.
+
+In addition, 8DW trace buffer entries contain a timestamp and
+possibly a prefix for a PASID TLP prefix (see Figure 6-20, PCIe r5.0).
+Otherwise this field will be all 0.
+
+The bit[31:11] of DW0 is always 0x1fffff, which can be
+used to distinguish the data format. 8DW format is like
+::
+
+ bits [ 31:11 ][ 10:0 ]
+ |---------------------------------------|-------------------|
+ DW0 [ 0x1fffff ][ Reserved (0x7ff) ]
+ DW1 [ Prefix ]
+ DW2 [ Header DW0 ]
+ DW3 [ Header DW1 ]
+ DW4 [ Header DW2 ]
+ DW5 [ Header DW3 ]
+ DW6 [ Reserved (0x0) ]
+ DW7 [ Time ]
+
+When using the 4DW data format, DW0 of the trace buffer entry
+contains selected fields of DW0 of the TLP, together with a
+timestamp. DW1-DW3 of the trace buffer entry contain DW1-DW3
+directly from the TLP header.
+
+4DW format is like
+::
+
+ bits [31:30] [ 29:25 ][24][23][22][21][ 20:11 ][ 10:0 ]
+ |-----|---------|---|---|---|---|-------------|-------------|
+ DW0 [ Fmt ][ Type ][T9][T8][TH][SO][ Length ][ Time ]
+ DW1 [ Header DW1 ]
+ DW2 [ Header DW2 ]
+ DW3 [ Header DW3 ]
+
+5. Memory Management
+--------------------
+
+The traced TLP headers will be written to the memory allocated
+by the driver. The hardware accepts 4 DMA address with same size,
+and writes the buffer sequentially like below. If DMA addr 3 is
+finished and the trace is still on, it will return to addr 0.
+::
+
+ +->[DMA addr 0]->[DMA addr 1]->[DMA addr 2]->[DMA addr 3]-+
+ +---------------------------------------------------------+
+
+Driver will allocate each DMA buffer of 4MiB. The finished buffer
+will be copied to the perf AUX buffer allocated by the perf core.
+Once the AUX buffer is full while the trace is still on, driver
+will commit the AUX buffer first and then apply for a new one with
+the same size. The size of AUX buffer is default to 16MiB. User can
+adjust the size by specifying the `-m` parameter of the perf command.
+
+6. Decoding
+-----------
+
+You can decode the traced data with `perf report -D` command (currently
+only support to dump the raw trace data). The traced data will be decoded
+according to the format described previously (take 8DW as an example):
+::
+
+ [...perf headers and other information]
+ . ... HISI PTT data: size 4194304 bytes
+ . 00000000: 00 00 00 00 Prefix
+ . 00000004: 01 00 00 60 Header DW0
+ . 00000008: 0f 1e 00 01 Header DW1
+ . 0000000c: 04 00 00 00 Header DW2
+ . 00000010: 40 00 81 02 Header DW3
+ . 00000014: 33 c0 04 00 Time
+ . 00000020: 00 00 00 00 Prefix
+ . 00000024: 01 00 00 60 Header DW0
+ . 00000028: 0f 1e 00 01 Header DW1
+ . 0000002c: 04 00 00 00 Header DW2
+ . 00000030: 40 00 81 02 Header DW3
+ . 00000034: 02 00 00 00 Time
+ . 00000040: 00 00 00 00 Prefix
+ . 00000044: 01 00 00 60 Header DW0
+ . 00000048: 0f 1e 00 01 Header DW1
+ . 0000004c: 04 00 00 00 Header DW2
+ . 00000050: 40 00 81 02 Header DW3
+ [...]
diff --git a/Documentation/trace/histogram-design.rst b/Documentation/trace/histogram-design.rst
new file mode 100644
index 000000000000..5765eb3e9efa
--- /dev/null
+++ b/Documentation/trace/histogram-design.rst
@@ -0,0 +1,2115 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+======================
+Histogram Design Notes
+======================
+
+:Author: Tom Zanussi <zanussi@kernel.org>
+
+This document attempts to provide a description of how the ftrace
+histograms work and how the individual pieces map to the data
+structures used to implement them in trace_events_hist.c and
+tracing_map.c.
+
+Note: All the ftrace histogram command examples assume the working
+ directory is the ftrace /tracing directory. For example::
+
+ # cd /sys/kernel/tracing
+
+Also, the histogram output displayed for those commands will be
+generally be truncated - only enough to make the point is displayed.
+
+'hist_debug' trace event files
+==============================
+
+If the kernel is compiled with CONFIG_HIST_TRIGGERS_DEBUG set, an
+event file named 'hist_debug' will appear in each event's
+subdirectory. This file can be read at any time and will display some
+of the hist trigger internals described in this document. Specific
+examples and output will be described in test cases below.
+
+Basic histograms
+================
+
+First, basic histograms. Below is pretty much the simplest thing you
+can do with histograms - create one with a single key on a single
+event and cat the output::
+
+ # echo 'hist:keys=pid' >> events/sched/sched_waking/trigger
+
+ # cat events/sched/sched_waking/hist
+
+ { pid: 18249 } hitcount: 1
+ { pid: 13399 } hitcount: 1
+ { pid: 17973 } hitcount: 1
+ { pid: 12572 } hitcount: 1
+ ...
+ { pid: 10 } hitcount: 921
+ { pid: 18255 } hitcount: 1444
+ { pid: 25526 } hitcount: 2055
+ { pid: 5257 } hitcount: 2055
+ { pid: 27367 } hitcount: 2055
+ { pid: 1728 } hitcount: 2161
+
+ Totals:
+ Hits: 21305
+ Entries: 183
+ Dropped: 0
+
+What this does is create a histogram on the sched_waking event using
+pid as a key and with a single value, hitcount, which even if not
+explicitly specified, exists for every histogram regardless.
+
+The hitcount value is a per-bucket value that's automatically
+incremented on every hit for the given key, which in this case is the
+pid.
+
+So in this histogram, there's a separate bucket for each pid, and each
+bucket contains a value for that bucket, counting the number of times
+sched_waking was called for that pid.
+
+Each histogram is represented by a hist_data struct.
+
+To keep track of each key and value field in the histogram, hist_data
+keeps an array of these fields named fields[]. The fields[] array is
+an array containing struct hist_field representations of each
+histogram val and key in the histogram (variables are also included
+here, but are discussed later). So for the above histogram we have one
+key and one value; in this case the one value is the hitcount value,
+which all histograms have, regardless of whether they define that
+value or not, which the above histogram does not.
+
+Each struct hist_field contains a pointer to the ftrace_event_field
+from the event's trace_event_file along with various bits related to
+that such as the size, offset, type, and a hist_field_fn_t function,
+which is used to grab the field's data from the ftrace event buffer
+(in most cases - some hist_fields such as hitcount don't directly map
+to an event field in the trace buffer - in these cases the function
+implementation gets its value from somewhere else). The flags field
+indicates which type of field it is - key, value, variable, variable
+reference, etc., with value being the default.
+
+The other important hist_data data structure in addition to the
+fields[] array is the tracing_map instance created for the histogram,
+which is held in the .map member. The tracing_map implements the
+lock-free hash table used to implement histograms (see
+kernel/trace/tracing_map.h for much more discussion about the
+low-level data structures implementing the tracing_map). For the
+purposes of this discussion, the tracing_map contains a number of
+buckets, each bucket corresponding to a particular tracing_map_elt
+object hashed by a given histogram key.
+
+Below is a diagram the first part of which describes the hist_data and
+associated key and value fields for the histogram described above. As
+you can see, there are two fields in the fields array, one val field
+for the hitcount and one key field for the pid key.
+
+Below that is a diagram of a run-time snapshot of what the tracing_map
+might look like for a given run. It attempts to show the
+relationships between the hist_data fields and the tracing_map
+elements for a couple hypothetical keys and values.::
+
+ +------------------+
+ | hist_data |
+ +------------------+ +----------------+
+ | .fields[] |---->| val = hitcount |----------------------------+
+ +----------------+ +----------------+ |
+ | .map | | .size | |
+ +----------------+ +--------------+ |
+ | .offset | |
+ +--------------+ |
+ | .fn() | |
+ +--------------+ |
+ . |
+ . |
+ . |
+ +----------------+ <--- n_vals |
+ | key = pid |----------------------------|--+
+ +----------------+ | |
+ | .size | | |
+ +--------------+ | |
+ | .offset | | |
+ +--------------+ | |
+ | .fn() | | |
+ +----------------+ <--- n_fields | |
+ | unused | | |
+ +----------------+ | |
+ | | | |
+ +--------------+ | |
+ | | | |
+ +--------------+ | |
+ | | | |
+ +--------------+ | |
+ n_keys = n_fields - n_vals | |
+
+The hist_data n_vals and n_fields delineate the extent of the fields[] | |
+array and separate keys from values for the rest of the code. | |
+
+Below is a run-time representation of the tracing_map part of the | |
+histogram, with pointers from various parts of the fields[] array | |
+to corresponding parts of the tracing_map. | |
+
+The tracing_map consists of an array of tracing_map_entrys and a set | |
+of preallocated tracing_map_elts (abbreviated below as map_entry and | |
+map_elt). The total number of map_entrys in the hist_data.map array = | |
+map->max_elts (actually map->map_size but only max_elts of those are | |
+used. This is a property required by the map_insert() algorithm). | |
+
+If a map_entry is unused, meaning no key has yet hashed into it, its | |
+.key value is 0 and its .val pointer is NULL. Once a map_entry has | |
+been claimed, the .key value contains the key's hash value and the | |
+.val member points to a map_elt containing the full key and an entry | |
+for each key or value in the map_elt.fields[] array. There is an | |
+entry in the map_elt.fields[] array corresponding to each hist_field | |
+in the histogram, and this is where the continually aggregated sums | |
+corresponding to each histogram value are kept. | |
+
+The diagram attempts to show the relationship between the | |
+hist_data.fields[] and the map_elt.fields[] with the links drawn | |
+between diagrams::
+
+ +-----------+ | |
+ | hist_data | | |
+ +-----------+ | |
+ | .fields | | |
+ +---------+ +-----------+ | |
+ | .map |---->| map_entry | | |
+ +---------+ +-----------+ | |
+ | .key |---> 0 | |
+ +---------+ | |
+ | .val |---> NULL | |
+ +-----------+ | |
+ | map_entry | | |
+ +-----------+ | |
+ | .key |---> pid = 999 | |
+ +---------+ +-----------+ | |
+ | .val |--->| map_elt | | |
+ +---------+ +-----------+ | |
+ . | .key |---> full key * | |
+ . +---------+ +---------------+ | |
+ . | .fields |--->| .sum (val) |<-+ |
+ +-----------+ +---------+ | 2345 | | |
+ | map_entry | +---------------+ | |
+ +-----------+ | .offset (key) |<----+
+ | .key |---> 0 | 0 | | |
+ +---------+ +---------------+ | |
+ | .val |---> NULL . | |
+ +-----------+ . | |
+ | map_entry | . | |
+ +-----------+ +---------------+ | |
+ | .key | | .sum (val) or | | |
+ +---------+ +---------+ | .offset (key) | | |
+ | .val |--->| map_elt | +---------------+ | |
+ +-----------+ +---------+ | .sum (val) or | | |
+ | map_entry | | .offset (key) | | |
+ +-----------+ +---------------+ | |
+ | .key |---> pid = 4444 | |
+ +---------+ +-----------+ | |
+ | .val | | map_elt | | |
+ +---------+ +-----------+ | |
+ | .key |---> full key * | |
+ +---------+ +---------------+ | |
+ | .fields |--->| .sum (val) |<-+ |
+ +---------+ | 65523 | |
+ +---------------+ |
+ | .offset (key) |<----+
+ | 0 |
+ +---------------+
+ .
+ .
+ .
+ +---------------+
+ | .sum (val) or |
+ | .offset (key) |
+ +---------------+
+ | .sum (val) or |
+ | .offset (key) |
+ +---------------+
+
+Abbreviations used in the diagrams::
+
+ hist_data = struct hist_trigger_data
+ hist_data.fields = struct hist_field
+ fn = hist_field_fn_t
+ map_entry = struct tracing_map_entry
+ map_elt = struct tracing_map_elt
+ map_elt.fields = struct tracing_map_field
+
+Whenever a new event occurs and it has a hist trigger associated with
+it, event_hist_trigger() is called. event_hist_trigger() first deals
+with the key: for each subkey in the key (in the above example, there
+is just one subkey corresponding to pid), the hist_field that
+represents that subkey is retrieved from hist_data.fields[] and the
+hist_field_fn_t fn() associated with that field, along with the
+field's size and offset, is used to grab that subkey's data from the
+current trace record.
+
+Once the complete key has been retrieved, it's used to look that key
+up in the tracing_map. If there's no tracing_map_elt associated with
+that key, an empty one is claimed and inserted in the map for the new
+key. In either case, the tracing_map_elt associated with that key is
+returned.
+
+Once a tracing_map_elt available, hist_trigger_elt_update() is called.
+As the name implies, this updates the element, which basically means
+updating the element's fields. There's a tracing_map_field associated
+with each key and value in the histogram, and each of these correspond
+to the key and value hist_fields created when the histogram was
+created. hist_trigger_elt_update() goes through each value hist_field
+and, as for the keys, uses the hist_field's fn() and size and offset
+to grab the field's value from the current trace record. Once it has
+that value, it simply adds that value to that field's
+continually-updated tracing_map_field.sum member. Some hist_field
+fn()s, such as for the hitcount, don't actually grab anything from the
+trace record (the hitcount fn() just increments the counter sum by 1),
+but the idea is the same.
+
+Once all the values have been updated, hist_trigger_elt_update() is
+done and returns. Note that there are also tracing_map_fields for
+each subkey in the key, but hist_trigger_elt_update() doesn't look at
+them or update anything - those exist only for sorting, which can
+happen later.
+
+Basic histogram test
+--------------------
+
+This is a good example to try. It produces 3 value fields and 2 key
+fields in the output::
+
+ # echo 'hist:keys=common_pid,call_site.sym:values=bytes_req,bytes_alloc,hitcount' >> events/kmem/kmalloc/trigger
+
+To see the debug data, cat the kmem/kmalloc's 'hist_debug' file. It
+will show the trigger info of the histogram it corresponds to, along
+with the address of the hist_data associated with the histogram, which
+will become useful in later examples. It then displays the number of
+total hist_fields associated with the histogram along with a count of
+how many of those correspond to keys and how many correspond to values.
+
+It then goes on to display details for each field, including the
+field's flags and the position of each field in the hist_data's
+fields[] array, which is useful information for verifying that things
+internally appear correct or not, and which again will become even
+more useful in further examples::
+
+ # cat events/kmem/kmalloc/hist_debug
+
+ # event histogram
+ #
+ # trigger info: hist:keys=common_pid,call_site.sym:vals=hitcount,bytes_req,bytes_alloc:sort=hitcount:size=2048 [active]
+ #
+
+ hist_data: 000000005e48c9a5
+
+ n_vals: 3
+ n_keys: 2
+ n_fields: 5
+
+ val fields:
+
+ hist_data->fields[0]:
+ flags:
+ VAL: HIST_FIELD_FL_HITCOUNT
+ type: u64
+ size: 8
+ is_signed: 0
+
+ hist_data->fields[1]:
+ flags:
+ VAL: normal u64 value
+ ftrace_event_field name: bytes_req
+ type: size_t
+ size: 8
+ is_signed: 0
+
+ hist_data->fields[2]:
+ flags:
+ VAL: normal u64 value
+ ftrace_event_field name: bytes_alloc
+ type: size_t
+ size: 8
+ is_signed: 0
+
+ key fields:
+
+ hist_data->fields[3]:
+ flags:
+ HIST_FIELD_FL_KEY
+ ftrace_event_field name: common_pid
+ type: int
+ size: 8
+ is_signed: 1
+
+ hist_data->fields[4]:
+ flags:
+ HIST_FIELD_FL_KEY
+ ftrace_event_field name: call_site
+ type: unsigned long
+ size: 8
+ is_signed: 0
+
+The commands below can be used to clean things up for the next test::
+
+ # echo '!hist:keys=common_pid,call_site.sym:values=bytes_req,bytes_alloc,hitcount' >> events/kmem/kmalloc/trigger
+
+Variables
+=========
+
+Variables allow data from one hist trigger to be saved by one hist
+trigger and retrieved by another hist trigger. For example, a trigger
+on the sched_waking event can capture a timestamp for a particular
+pid, and later a sched_switch event that switches to that pid event
+can grab the timestamp and use it to calculate a time delta between
+the two events::
+
+ # echo 'hist:keys=pid:ts0=common_timestamp.usecs' >>
+ events/sched/sched_waking/trigger
+
+ # echo 'hist:keys=next_pid:wakeup_lat=common_timestamp.usecs-$ts0' >>
+ events/sched/sched_switch/trigger
+
+In terms of the histogram data structures, variables are implemented
+as another type of hist_field and for a given hist trigger are added
+to the hist_data.fields[] array just after all the val fields. To
+distinguish them from the existing key and val fields, they're given a
+new flag type, HIST_FIELD_FL_VAR (abbreviated FL_VAR) and they also
+make use of a new .var.idx field member in struct hist_field, which
+maps them to an index in a new map_elt.vars[] array added to the
+map_elt specifically designed to store and retrieve variable values.
+The diagram below shows those new elements and adds a new variable
+entry, ts0, corresponding to the ts0 variable in the sched_waking
+trigger above.
+
+sched_waking histogram
+----------------------::
+
+ +------------------+
+ | hist_data |<-------------------------------------------------------+
+ +------------------+ +-------------------+ |
+ | .fields[] |-->| val = hitcount | |
+ +----------------+ +-------------------+ |
+ | .map | | .size | |
+ +----------------+ +-----------------+ |
+ | .offset | |
+ +-----------------+ |
+ | .fn() | |
+ +-----------------+ |
+ | .flags | |
+ +-----------------+ |
+ | .var.idx | |
+ +-------------------+ |
+ | var = ts0 | |
+ +-------------------+ |
+ | .size | |
+ +-----------------+ |
+ | .offset | |
+ +-----------------+ |
+ | .fn() | |
+ +-----------------+ |
+ | .flags & FL_VAR | |
+ +-----------------+ |
+ | .var.idx |----------------------------+-+ |
+ +-----------------+ | | |
+ . | | |
+ . | | |
+ . | | |
+ +-------------------+ <--- n_vals | | |
+ | key = pid | | | |
+ +-------------------+ | | |
+ | .size | | | |
+ +-----------------+ | | |
+ | .offset | | | |
+ +-----------------+ | | |
+ | .fn() | | | |
+ +-----------------+ | | |
+ | .flags & FL_KEY | | | |
+ +-----------------+ | | |
+ | .var.idx | | | |
+ +-------------------+ <--- n_fields | | |
+ | unused | | | |
+ +-------------------+ | | |
+ | | | | |
+ +-----------------+ | | |
+ | | | | |
+ +-----------------+ | | |
+ | | | | |
+ +-----------------+ | | |
+ | | | | |
+ +-----------------+ | | |
+ | | | | |
+ +-----------------+ | | |
+ n_keys = n_fields - n_vals | | |
+ | | |
+
+This is very similar to the basic case. In the above diagram, we can | | |
+see a new .flags member has been added to the struct hist_field | | |
+struct, and a new entry added to hist_data.fields representing the ts0 | | |
+variable. For a normal val hist_field, .flags is just 0 (modulo | | |
+modifier flags), but if the value is defined as a variable, the .flags | | |
+contains a set FL_VAR bit. | | |
+
+As you can see, the ts0 entry's .var.idx member contains the index | | |
+into the tracing_map_elts' .vars[] array containing variable values. | | |
+This idx is used whenever the value of the variable is set or read. | | |
+The map_elt.vars idx assigned to the given variable is assigned and | | |
+saved in .var.idx by create_tracing_map_fields() after it calls | | |
+tracing_map_add_var(). | | |
+
+Below is a representation of the histogram at run-time, which | | |
+populates the map, along with correspondence to the above hist_data and | | |
+hist_field data structures. | | |
+
+The diagram attempts to show the relationship between the | | |
+hist_data.fields[] and the map_elt.fields[] and map_elt.vars[] with | | |
+the links drawn between diagrams. For each of the map_elts, you can | | |
+see that the .fields[] members point to the .sum or .offset of a key | | |
+or val and the .vars[] members point to the value of a variable. The | | |
+arrows between the two diagrams show the linkages between those | | |
+tracing_map members and the field definitions in the corresponding | | |
+hist_data fields[] members.::
+
+ +-----------+ | | |
+ | hist_data | | | |
+ +-----------+ | | |
+ | .fields | | | |
+ +---------+ +-----------+ | | |
+ | .map |---->| map_entry | | | |
+ +---------+ +-----------+ | | |
+ | .key |---> 0 | | |
+ +---------+ | | |
+ | .val |---> NULL | | |
+ +-----------+ | | |
+ | map_entry | | | |
+ +-----------+ | | |
+ | .key |---> pid = 999 | | |
+ +---------+ +-----------+ | | |
+ | .val |--->| map_elt | | | |
+ +---------+ +-----------+ | | |
+ . | .key |---> full key * | | |
+ . +---------+ +---------------+ | | |
+ . | .fields |--->| .sum (val) | | | |
+ . +---------+ | 2345 | | | |
+ . +--| .vars | +---------------+ | | |
+ . | +---------+ | .offset (key) | | | |
+ . | | 0 | | | |
+ . | +---------------+ | | |
+ . | . | | |
+ . | . | | |
+ . | . | | |
+ . | +---------------+ | | |
+ . | | .sum (val) or | | | |
+ . | | .offset (key) | | | |
+ . | +---------------+ | | |
+ . | | .sum (val) or | | | |
+ . | | .offset (key) | | | |
+ . | +---------------+ | | |
+ . | | | |
+ . +---------------->+---------------+ | | |
+ . | ts0 |<--+ | |
+ . | 113345679876 | | | |
+ . +---------------+ | | |
+ . | unused | | | |
+ . | | | | |
+ . +---------------+ | | |
+ . . | | |
+ . . | | |
+ . . | | |
+ . +---------------+ | | |
+ . | unused | | | |
+ . | | | | |
+ . +---------------+ | | |
+ . | unused | | | |
+ . | | | | |
+ . +---------------+ | | |
+ . | | |
+ +-----------+ | | |
+ | map_entry | | | |
+ +-----------+ | | |
+ | .key |---> pid = 4444 | | |
+ +---------+ +-----------+ | | |
+ | .val |--->| map_elt | | | |
+ +---------+ +-----------+ | | |
+ . | .key |---> full key * | | |
+ . +---------+ +---------------+ | | |
+ . | .fields |--->| .sum (val) | | | |
+ +---------+ | 2345 | | | |
+ +--| .vars | +---------------+ | | |
+ | +---------+ | .offset (key) | | | |
+ | | 0 | | | |
+ | +---------------+ | | |
+ | . | | |
+ | . | | |
+ | . | | |
+ | +---------------+ | | |
+ | | .sum (val) or | | | |
+ | | .offset (key) | | | |
+ | +---------------+ | | |
+ | | .sum (val) or | | | |
+ | | .offset (key) | | | |
+ | +---------------+ | | |
+ | | | |
+ | +---------------+ | | |
+ +---------------->| ts0 |<--+ | |
+ | 213499240729 | | |
+ +---------------+ | |
+ | unused | | |
+ | | | |
+ +---------------+ | |
+ . | |
+ . | |
+ . | |
+ +---------------+ | |
+ | unused | | |
+ | | | |
+ +---------------+ | |
+ | unused | | |
+ | | | |
+ +---------------+ | |
+
+For each used map entry, there's a map_elt pointing to an array of | |
+.vars containing the current value of the variables associated with | |
+that histogram entry. So in the above, the timestamp associated with | |
+pid 999 is 113345679876, and the timestamp variable in the same | |
+.var.idx for pid 4444 is 213499240729. | |
+
+sched_switch histogram | |
+---------------------- | |
+
+The sched_switch histogram paired with the above sched_waking | |
+histogram is shown below. The most important aspect of the | |
+sched_switch histogram is that it references a variable on the | |
+sched_waking histogram above. | |
+
+The histogram diagram is very similar to the others so far displayed, | |
+but it adds variable references. You can see the normal hitcount and | |
+key fields along with a new wakeup_lat variable implemented in the | |
+same way as the sched_waking ts0 variable, but in addition there's an | |
+entry with the new FL_VAR_REF (short for HIST_FIELD_FL_VAR_REF) flag. | |
+
+Associated with the new var ref field are a couple of new hist_field | |
+members, var.hist_data and var_ref_idx. For a variable reference, the | |
+var.hist_data goes with the var.idx, which together uniquely identify | |
+a particular variable on a particular histogram. The var_ref_idx is | |
+just the index into the var_ref_vals[] array that caches the values of | |
+each variable whenever a hist trigger is updated. Those resulting | |
+values are then finally accessed by other code such as trace action | |
+code that uses the var_ref_idx values to assign param values. | |
+
+The diagram below describes the situation for the sched_switch | |
+histogram referred to before::
+
+ # echo 'hist:keys=next_pid:wakeup_lat=common_timestamp.usecs-$ts0' >> | |
+ events/sched/sched_switch/trigger | |
+ | |
+ +------------------+ | |
+ | hist_data | | |
+ +------------------+ +-----------------------+ | |
+ | .fields[] |-->| val = hitcount | | |
+ +----------------+ +-----------------------+ | |
+ | .map | | .size | | |
+ +----------------+ +---------------------+ | |
+ +--| .var_refs[] | | .offset | | |
+ | +----------------+ +---------------------+ | |
+ | | .fn() | | |
+ | var_ref_vals[] +---------------------+ | |
+ | +-------------+ | .flags | | |
+ | | $ts0 |<---+ +---------------------+ | |
+ | +-------------+ | | .var.idx | | |
+ | | | | +---------------------+ | |
+ | +-------------+ | | .var.hist_data | | |
+ | | | | +---------------------+ | |
+ | +-------------+ | | .var_ref_idx | | |
+ | | | | +-----------------------+ | |
+ | +-------------+ | | var = wakeup_lat | | |
+ | . | +-----------------------+ | |
+ | . | | .size | | |
+ | . | +---------------------+ | |
+ | +-------------+ | | .offset | | |
+ | | | | +---------------------+ | |
+ | +-------------+ | | .fn() | | |
+ | | | | +---------------------+ | |
+ | +-------------+ | | .flags & FL_VAR | | |
+ | | +---------------------+ | |
+ | | | .var.idx | | |
+ | | +---------------------+ | |
+ | | | .var.hist_data | | |
+ | | +---------------------+ | |
+ | | | .var_ref_idx | | |
+ | | +---------------------+ | |
+ | | . | |
+ | | . | |
+ | | . | |
+ | | +-----------------------+ <--- n_vals | |
+ | | | key = pid | | |
+ | | +-----------------------+ | |
+ | | | .size | | |
+ | | +---------------------+ | |
+ | | | .offset | | |
+ | | +---------------------+ | |
+ | | | .fn() | | |
+ | | +---------------------+ | |
+ | | | .flags | | |
+ | | +---------------------+ | |
+ | | | .var.idx | | |
+ | | +-----------------------+ <--- n_fields | |
+ | | | unused | | |
+ | | +-----------------------+ | |
+ | | | | | |
+ | | +---------------------+ | |
+ | | | | | |
+ | | +---------------------+ | |
+ | | | | | |
+ | | +---------------------+ | |
+ | | | | | |
+ | | +---------------------+ | |
+ | | | | | |
+ | | +---------------------+ | |
+ | | n_keys = n_fields - n_vals | |
+ | | | |
+ | | | |
+ | | +-----------------------+ | |
+ +---------------------->| var_ref = $ts0 | | |
+ | +-----------------------+ | |
+ | | .size | | |
+ | +---------------------+ | |
+ | | .offset | | |
+ | +---------------------+ | |
+ | | .fn() | | |
+ | +---------------------+ | |
+ | | .flags & FL_VAR_REF | | |
+ | +---------------------+ | |
+ | | .var.idx |--------------------------+ |
+ | +---------------------+ |
+ | | .var.hist_data |----------------------------+
+ | +---------------------+
+ +---| .var_ref_idx |
+ +---------------------+
+
+Abbreviations used in the diagrams::
+
+ hist_data = struct hist_trigger_data
+ hist_data.fields = struct hist_field
+ fn = hist_field_fn_t
+ FL_KEY = HIST_FIELD_FL_KEY
+ FL_VAR = HIST_FIELD_FL_VAR
+ FL_VAR_REF = HIST_FIELD_FL_VAR_REF
+
+When a hist trigger makes use of a variable, a new hist_field is
+created with flag HIST_FIELD_FL_VAR_REF. For a VAR_REF field, the
+var.idx and var.hist_data take the same values as the referenced
+variable, as well as the referenced variable's size, type, and
+is_signed values. The VAR_REF field's .name is set to the name of the
+variable it references. If a variable reference was created using the
+explicit system.event.$var_ref notation, the hist_field's system and
+event_name variables are also set.
+
+So, in order to handle an event for the sched_switch histogram,
+because we have a reference to a variable on another histogram, we
+need to resolve all variable references first. This is done via the
+resolve_var_refs() calls made from event_hist_trigger(). What this
+does is grabs the var_refs[] array from the hist_data representing the
+sched_switch histogram. For each one of those, the referenced
+variable's var.hist_data along with the current key is used to look up
+the corresponding tracing_map_elt in that histogram. Once found, the
+referenced variable's var.idx is used to look up the variable's value
+using tracing_map_read_var(elt, var.idx), which yields the value of
+the variable for that element, ts0 in the case above. Note that both
+the hist_fields representing both the variable and the variable
+reference have the same var.idx, so this is straightforward.
+
+Variable and variable reference test
+------------------------------------
+
+This example creates a variable on the sched_waking event, ts0, and
+uses it in the sched_switch trigger. The sched_switch trigger also
+creates its own variable, wakeup_lat, but nothing yet uses it::
+
+ # echo 'hist:keys=pid:ts0=common_timestamp.usecs' >> events/sched/sched_waking/trigger
+
+ # echo 'hist:keys=next_pid:wakeup_lat=common_timestamp.usecs-$ts0' >> events/sched/sched_switch/trigger
+
+Looking at the sched_waking 'hist_debug' output, in addition to the
+normal key and value hist_fields, in the val fields section we see a
+field with the HIST_FIELD_FL_VAR flag, which indicates that that field
+represents a variable. Note that in addition to the variable name,
+contained in the var.name field, it includes the var.idx, which is the
+index into the tracing_map_elt.vars[] array of the actual variable
+location. Note also that the output shows that variables live in the
+same part of the hist_data->fields[] array as normal values::
+
+ # cat events/sched/sched_waking/hist_debug
+
+ # event histogram
+ #
+ # trigger info: hist:keys=pid:vals=hitcount:ts0=common_timestamp.usecs:sort=hitcount:size=2048:clock=global [active]
+ #
+
+ hist_data: 000000009536f554
+
+ n_vals: 2
+ n_keys: 1
+ n_fields: 3
+
+ val fields:
+
+ hist_data->fields[0]:
+ flags:
+ VAL: HIST_FIELD_FL_HITCOUNT
+ type: u64
+ size: 8
+ is_signed: 0
+
+ hist_data->fields[1]:
+ flags:
+ HIST_FIELD_FL_VAR
+ var.name: ts0
+ var.idx (into tracing_map_elt.vars[]): 0
+ type: u64
+ size: 8
+ is_signed: 0
+
+ key fields:
+
+ hist_data->fields[2]:
+ flags:
+ HIST_FIELD_FL_KEY
+ ftrace_event_field name: pid
+ type: pid_t
+ size: 8
+ is_signed: 1
+
+Moving on to the sched_switch trigger hist_debug output, in addition
+to the unused wakeup_lat variable, we see a new section displaying
+variable references. Variable references are displayed in a separate
+section because in addition to being logically separate from
+variables and values, they actually live in a separate hist_data
+array, var_refs[].
+
+In this example, the sched_switch trigger has a reference to a
+variable on the sched_waking trigger, $ts0. Looking at the details,
+we can see that the var.hist_data value of the referenced variable
+matches the previously displayed sched_waking trigger, and the var.idx
+value matches the previously displayed var.idx value for that
+variable. Also displayed is the var_ref_idx value for that variable
+reference, which is where the value for that variable is cached for
+use when the trigger is invoked::
+
+ # cat events/sched/sched_switch/hist_debug
+
+ # event histogram
+ #
+ # trigger info: hist:keys=next_pid:vals=hitcount:wakeup_lat=common_timestamp.usecs-$ts0:sort=hitcount:size=2048:clock=global [active]
+ #
+
+ hist_data: 00000000f4ee8006
+
+ n_vals: 2
+ n_keys: 1
+ n_fields: 3
+
+ val fields:
+
+ hist_data->fields[0]:
+ flags:
+ VAL: HIST_FIELD_FL_HITCOUNT
+ type: u64
+ size: 8
+ is_signed: 0
+
+ hist_data->fields[1]:
+ flags:
+ HIST_FIELD_FL_VAR
+ var.name: wakeup_lat
+ var.idx (into tracing_map_elt.vars[]): 0
+ type: u64
+ size: 0
+ is_signed: 0
+
+ key fields:
+
+ hist_data->fields[2]:
+ flags:
+ HIST_FIELD_FL_KEY
+ ftrace_event_field name: next_pid
+ type: pid_t
+ size: 8
+ is_signed: 1
+
+ variable reference fields:
+
+ hist_data->var_refs[0]:
+ flags:
+ HIST_FIELD_FL_VAR_REF
+ name: ts0
+ var.idx (into tracing_map_elt.vars[]): 0
+ var.hist_data: 000000009536f554
+ var_ref_idx (into hist_data->var_refs[]): 0
+ type: u64
+ size: 8
+ is_signed: 0
+
+The commands below can be used to clean things up for the next test::
+
+ # echo '!hist:keys=next_pid:wakeup_lat=common_timestamp.usecs-$ts0' >> events/sched/sched_switch/trigger
+
+ # echo '!hist:keys=pid:ts0=common_timestamp.usecs' >> events/sched/sched_waking/trigger
+
+Actions and Handlers
+====================
+
+Adding onto the previous example, we will now do something with that
+wakeup_lat variable, namely send it and another field as a synthetic
+event.
+
+The onmatch() action below basically says that whenever we have a
+sched_switch event, if we have a matching sched_waking event, in this
+case if we have a pid in the sched_waking histogram that matches the
+next_pid field on this sched_switch event, we retrieve the
+variables specified in the wakeup_latency() trace action, and use
+them to generate a new wakeup_latency event into the trace stream.
+
+Note that the way the trace handlers such as wakeup_latency() (which
+could equivalently be written trace(wakeup_latency,$wakeup_lat,next_pid)
+are implemented, the parameters specified to the trace handler must be
+variables. In this case, $wakeup_lat is obviously a variable, but
+next_pid isn't, since it's just naming a field in the sched_switch
+trace event. Since this is something that almost every trace() and
+save() action does, a special shortcut is implemented to allow field
+names to be used directly in those cases. How it works is that under
+the covers, a temporary variable is created for the named field, and
+this variable is what is actually passed to the trace handler. In the
+code and documentation, this type of variable is called a 'field
+variable'.
+
+Fields on other trace event's histograms can be used as well. In that
+case we have to generate a new histogram and an unfortunately named
+'synthetic_field' (the use of synthetic here has nothing to do with
+synthetic events) and use that special histogram field as a variable.
+
+The diagram below illustrates the new elements described above in the
+context of the sched_switch histogram using the onmatch() handler and
+the trace() action.
+
+First, we define the wakeup_latency synthetic event::
+
+ # echo 'wakeup_latency u64 lat; pid_t pid' >> synthetic_events
+
+Next, the sched_waking hist trigger as before::
+
+ # echo 'hist:keys=pid:ts0=common_timestamp.usecs' >>
+ events/sched/sched_waking/trigger
+
+Finally, we create a hist trigger on the sched_switch event that
+generates a wakeup_latency() trace event. In this case we pass
+next_pid into the wakeup_latency synthetic event invocation, which
+means it will be automatically converted into a field variable::
+
+ # echo 'hist:keys=next_pid:wakeup_lat=common_timestamp.usecs-$ts0: \
+ onmatch(sched.sched_waking).wakeup_latency($wakeup_lat,next_pid)' >>
+ /sys/kernel/tracing/events/sched/sched_switch/trigger
+
+The diagram for the sched_switch event is similar to previous examples
+but shows the additional field_vars[] array for hist_data and shows
+the linkages between the field_vars and the variables and references
+created to implement the field variables. The details are discussed
+below::
+
+ +------------------+
+ | hist_data |
+ +------------------+ +-----------------------+
+ | .fields[] |-->| val = hitcount |
+ +----------------+ +-----------------------+
+ | .map | | .size |
+ +----------------+ +---------------------+
+ +---| .field_vars[] | | .offset |
+ | +----------------+ +---------------------+
+ |+--| .var_refs[] | | .offset |
+ || +----------------+ +---------------------+
+ || | .fn() |
+ || var_ref_vals[] +---------------------+
+ || +-------------+ | .flags |
+ || | $ts0 |<---+ +---------------------+
+ || +-------------+ | | .var.idx |
+ || | $next_pid |<-+ | +---------------------+
+ || +-------------+ | | | .var.hist_data |
+ ||+>| $wakeup_lat | | | +---------------------+
+ ||| +-------------+ | | | .var_ref_idx |
+ ||| | | | | +-----------------------+
+ ||| +-------------+ | | | var = wakeup_lat |
+ ||| . | | +-----------------------+
+ ||| . | | | .size |
+ ||| . | | +---------------------+
+ ||| +-------------+ | | | .offset |
+ ||| | | | | +---------------------+
+ ||| +-------------+ | | | .fn() |
+ ||| | | | | +---------------------+
+ ||| +-------------+ | | | .flags & FL_VAR |
+ ||| | | +---------------------+
+ ||| | | | .var.idx |
+ ||| | | +---------------------+
+ ||| | | | .var.hist_data |
+ ||| | | +---------------------+
+ ||| | | | .var_ref_idx |
+ ||| | | +---------------------+
+ ||| | | .
+ ||| | | .
+ ||| | | .
+ ||| | | .
+ ||| +--------------+ | | .
+ +-->| field_var | | | .
+ || +--------------+ | | .
+ || | var | | | .
+ || +------------+ | | .
+ || | val | | | .
+ || +--------------+ | | .
+ || | field_var | | | .
+ || +--------------+ | | .
+ || | var | | | .
+ || +------------+ | | .
+ || | val | | | .
+ || +------------+ | | .
+ || . | | .
+ || . | | .
+ || . | | +-----------------------+ <--- n_vals
+ || +--------------+ | | | key = pid |
+ || | field_var | | | +-----------------------+
+ || +--------------+ | | | .size |
+ || | var |--+| +---------------------+
+ || +------------+ ||| | .offset |
+ || | val |-+|| +---------------------+
+ || +------------+ ||| | .fn() |
+ || ||| +---------------------+
+ || ||| | .flags |
+ || ||| +---------------------+
+ || ||| | .var.idx |
+ || ||| +---------------------+ <--- n_fields
+ || |||
+ || ||| n_keys = n_fields - n_vals
+ || ||| +-----------------------+
+ || |+->| var = next_pid |
+ || | | +-----------------------+
+ || | | | .size |
+ || | | +---------------------+
+ || | | | .offset |
+ || | | +---------------------+
+ || | | | .flags & FL_VAR |
+ || | | +---------------------+
+ || | | | .var.idx |
+ || | | +---------------------+
+ || | | | .var.hist_data |
+ || | | +-----------------------+
+ || +-->| val for next_pid |
+ || | | +-----------------------+
+ || | | | .size |
+ || | | +---------------------+
+ || | | | .offset |
+ || | | +---------------------+
+ || | | | .fn() |
+ || | | +---------------------+
+ || | | | .flags |
+ || | | +---------------------+
+ || | | | |
+ || | | +---------------------+
+ || | |
+ || | |
+ || | | +-----------------------+
+ +|------------------|-|>| var_ref = $ts0 |
+ | | | +-----------------------+
+ | | | | .size |
+ | | | +---------------------+
+ | | | | .offset |
+ | | | +---------------------+
+ | | | | .fn() |
+ | | | +---------------------+
+ | | | | .flags & FL_VAR_REF |
+ | | | +---------------------+
+ | | +---| .var_ref_idx |
+ | | +-----------------------+
+ | | | var_ref = $next_pid |
+ | | +-----------------------+
+ | | | .size |
+ | | +---------------------+
+ | | | .offset |
+ | | +---------------------+
+ | | | .fn() |
+ | | +---------------------+
+ | | | .flags & FL_VAR_REF |
+ | | +---------------------+
+ | +-----| .var_ref_idx |
+ | +-----------------------+
+ | | var_ref = $wakeup_lat |
+ | +-----------------------+
+ | | .size |
+ | +---------------------+
+ | | .offset |
+ | +---------------------+
+ | | .fn() |
+ | +---------------------+
+ | | .flags & FL_VAR_REF |
+ | +---------------------+
+ +------------------------| .var_ref_idx |
+ +---------------------+
+
+As you can see, for a field variable, two hist_fields are created: one
+representing the variable, in this case next_pid, and one to actually
+get the value of the field from the trace stream, like a normal val
+field does. These are created separately from normal variable
+creation and are saved in the hist_data->field_vars[] array. See
+below for how these are used. In addition, a reference hist_field is
+also created, which is needed to reference the field variables such as
+$next_pid variable in the trace() action.
+
+Note that $wakeup_lat is also a variable reference, referencing the
+value of the expression common_timestamp-$ts0, and so also needs to
+have a hist field entry representing that reference created.
+
+When hist_trigger_elt_update() is called to get the normal key and
+value fields, it also calls update_field_vars(), which goes through
+each field_var created for the histogram, and available from
+hist_data->field_vars and calls val->fn() to get the data from the
+current trace record, and then uses the var's var.idx to set the
+variable at the var.idx offset in the appropriate tracing_map_elt's
+variable at elt->vars[var.idx].
+
+Once all the variables have been updated, resolve_var_refs() can be
+called from event_hist_trigger(), and not only can our $ts0 and
+$next_pid references be resolved but the $wakeup_lat reference as
+well. At this point, the trace() action can simply access the values
+assembled in the var_ref_vals[] array and generate the trace event.
+
+The same process occurs for the field variables associated with the
+save() action.
+
+Abbreviations used in the diagram::
+
+ hist_data = struct hist_trigger_data
+ hist_data.fields = struct hist_field
+ field_var = struct field_var
+ fn = hist_field_fn_t
+ FL_KEY = HIST_FIELD_FL_KEY
+ FL_VAR = HIST_FIELD_FL_VAR
+ FL_VAR_REF = HIST_FIELD_FL_VAR_REF
+
+trace() action field variable test
+----------------------------------
+
+This example adds to the previous test example by finally making use
+of the wakeup_lat variable, but in addition also creates a couple of
+field variables that then are all passed to the wakeup_latency() trace
+action via the onmatch() handler.
+
+First, we create the wakeup_latency synthetic event::
+
+ # echo 'wakeup_latency u64 lat; pid_t pid; char comm[16]' >> synthetic_events
+
+Next, the sched_waking trigger from previous examples::
+
+ # echo 'hist:keys=pid:ts0=common_timestamp.usecs' >> events/sched/sched_waking/trigger
+
+Finally, as in the previous test example, we calculate and assign the
+wakeup latency using the $ts0 reference from the sched_waking trigger
+to the wakeup_lat variable, and finally use it along with a couple
+sched_switch event fields, next_pid and next_comm, to generate a
+wakeup_latency trace event. The next_pid and next_comm event fields
+are automatically converted into field variables for this purpose::
+
+ # echo 'hist:keys=next_pid:wakeup_lat=common_timestamp.usecs-$ts0:onmatch(sched.sched_waking).wakeup_latency($wakeup_lat,next_pid,next_comm)' >> /sys/kernel/tracing/events/sched/sched_switch/trigger
+
+The sched_waking hist_debug output shows the same data as in the
+previous test example::
+
+ # cat events/sched/sched_waking/hist_debug
+
+ # event histogram
+ #
+ # trigger info: hist:keys=pid:vals=hitcount:ts0=common_timestamp.usecs:sort=hitcount:size=2048:clock=global [active]
+ #
+
+ hist_data: 00000000d60ff61f
+
+ n_vals: 2
+ n_keys: 1
+ n_fields: 3
+
+ val fields:
+
+ hist_data->fields[0]:
+ flags:
+ VAL: HIST_FIELD_FL_HITCOUNT
+ type: u64
+ size: 8
+ is_signed: 0
+
+ hist_data->fields[1]:
+ flags:
+ HIST_FIELD_FL_VAR
+ var.name: ts0
+ var.idx (into tracing_map_elt.vars[]): 0
+ type: u64
+ size: 8
+ is_signed: 0
+
+ key fields:
+
+ hist_data->fields[2]:
+ flags:
+ HIST_FIELD_FL_KEY
+ ftrace_event_field name: pid
+ type: pid_t
+ size: 8
+ is_signed: 1
+
+The sched_switch hist_debug output shows the same key and value fields
+as in the previous test example - note that wakeup_lat is still in the
+val fields section, but that the new field variables are not there -
+although the field variables are variables, they're held separately in
+the hist_data's field_vars[] array. Although the field variables and
+the normal variables are located in separate places, you can see that
+the actual variable locations for those variables in the
+tracing_map_elt.vars[] do have increasing indices as expected:
+wakeup_lat takes the var.idx = 0 slot, while the field variables for
+next_pid and next_comm have values var.idx = 1, and var.idx = 2. Note
+also that those are the same values displayed for the variable
+references corresponding to those variables in the variable reference
+fields section. Since there are two triggers and thus two hist_data
+addresses, those addresses also need to be accounted for when doing
+the matching - you can see that the first variable refers to the 0
+var.idx on the previous hist trigger (see the hist_data address
+associated with that trigger), while the second variable refers to the
+0 var.idx on the sched_switch hist trigger, as do all the remaining
+variable references.
+
+Finally, the action tracking variables section just shows the system
+and event name for the onmatch() handler::
+
+ # cat events/sched/sched_switch/hist_debug
+
+ # event histogram
+ #
+ # trigger info: hist:keys=next_pid:vals=hitcount:wakeup_lat=common_timestamp.usecs-$ts0:sort=hitcount:size=2048:clock=global:onmatch(sched.sched_waking).wakeup_latency($wakeup_lat,next_pid,next_comm) [active]
+ #
+
+ hist_data: 0000000008f551b7
+
+ n_vals: 2
+ n_keys: 1
+ n_fields: 3
+
+ val fields:
+
+ hist_data->fields[0]:
+ flags:
+ VAL: HIST_FIELD_FL_HITCOUNT
+ type: u64
+ size: 8
+ is_signed: 0
+
+ hist_data->fields[1]:
+ flags:
+ HIST_FIELD_FL_VAR
+ var.name: wakeup_lat
+ var.idx (into tracing_map_elt.vars[]): 0
+ type: u64
+ size: 0
+ is_signed: 0
+
+ key fields:
+
+ hist_data->fields[2]:
+ flags:
+ HIST_FIELD_FL_KEY
+ ftrace_event_field name: next_pid
+ type: pid_t
+ size: 8
+ is_signed: 1
+
+ variable reference fields:
+
+ hist_data->var_refs[0]:
+ flags:
+ HIST_FIELD_FL_VAR_REF
+ name: ts0
+ var.idx (into tracing_map_elt.vars[]): 0
+ var.hist_data: 00000000d60ff61f
+ var_ref_idx (into hist_data->var_refs[]): 0
+ type: u64
+ size: 8
+ is_signed: 0
+
+ hist_data->var_refs[1]:
+ flags:
+ HIST_FIELD_FL_VAR_REF
+ name: wakeup_lat
+ var.idx (into tracing_map_elt.vars[]): 0
+ var.hist_data: 0000000008f551b7
+ var_ref_idx (into hist_data->var_refs[]): 1
+ type: u64
+ size: 0
+ is_signed: 0
+
+ hist_data->var_refs[2]:
+ flags:
+ HIST_FIELD_FL_VAR_REF
+ name: next_pid
+ var.idx (into tracing_map_elt.vars[]): 1
+ var.hist_data: 0000000008f551b7
+ var_ref_idx (into hist_data->var_refs[]): 2
+ type: pid_t
+ size: 4
+ is_signed: 0
+
+ hist_data->var_refs[3]:
+ flags:
+ HIST_FIELD_FL_VAR_REF
+ name: next_comm
+ var.idx (into tracing_map_elt.vars[]): 2
+ var.hist_data: 0000000008f551b7
+ var_ref_idx (into hist_data->var_refs[]): 3
+ type: char[16]
+ size: 256
+ is_signed: 0
+
+ field variables:
+
+ hist_data->field_vars[0]:
+
+ field_vars[0].var:
+ flags:
+ HIST_FIELD_FL_VAR
+ var.name: next_pid
+ var.idx (into tracing_map_elt.vars[]): 1
+
+ field_vars[0].val:
+ ftrace_event_field name: next_pid
+ type: pid_t
+ size: 4
+ is_signed: 1
+
+ hist_data->field_vars[1]:
+
+ field_vars[1].var:
+ flags:
+ HIST_FIELD_FL_VAR
+ var.name: next_comm
+ var.idx (into tracing_map_elt.vars[]): 2
+
+ field_vars[1].val:
+ ftrace_event_field name: next_comm
+ type: char[16]
+ size: 256
+ is_signed: 0
+
+ action tracking variables (for onmax()/onchange()/onmatch()):
+
+ hist_data->actions[0].match_data.event_system: sched
+ hist_data->actions[0].match_data.event: sched_waking
+
+The commands below can be used to clean things up for the next test::
+
+ # echo '!hist:keys=next_pid:wakeup_lat=common_timestamp.usecs-$ts0:onmatch(sched.sched_waking).wakeup_latency($wakeup_lat,next_pid,next_comm)' >> /sys/kernel/tracing/events/sched/sched_switch/trigger
+
+ # echo '!hist:keys=pid:ts0=common_timestamp.usecs' >> events/sched/sched_waking/trigger
+
+ # echo '!wakeup_latency u64 lat; pid_t pid; char comm[16]' >> synthetic_events
+
+action_data and the trace() action
+----------------------------------
+
+As mentioned above, when the trace() action generates a synthetic
+event, all the parameters to the synthetic event either already are
+variables or are converted into variables (via field variables), and
+finally all those variable values are collected via references to them
+into a var_ref_vals[] array.
+
+The values in the var_ref_vals[] array, however, don't necessarily
+follow the same ordering as the synthetic event params. To address
+that, struct action_data contains another array, var_ref_idx[] that
+maps the trace action params to the var_ref_vals[] values. Below is a
+diagram illustrating that for the wakeup_latency() synthetic event::
+
+ +------------------+ wakeup_latency()
+ | action_data | event params var_ref_vals[]
+ +------------------+ +-----------------+ +-----------------+
+ | .var_ref_idx[] |--->| $wakeup_lat idx |---+ | |
+ +----------------+ +-----------------+ | +-----------------+
+ | .synth_event | | $next_pid idx |---|-+ | $wakeup_lat val |
+ +----------------+ +-----------------+ | | +-----------------+
+ . | +->| $next_pid val |
+ . | +-----------------+
+ . | .
+ +-----------------+ | .
+ | | | .
+ +-----------------+ | +-----------------+
+ +--->| $wakeup_lat val |
+ +-----------------+
+
+Basically, how this ends up getting used in the synthetic event probe
+function, trace_event_raw_event_synth(), is as follows::
+
+ for each field i in .synth_event
+ val_idx = .var_ref_idx[i]
+ val = var_ref_vals[val_idx]
+
+action_data and the onXXX() handlers
+------------------------------------
+
+The hist trigger onXXX() actions other than onmatch(), such as onmax()
+and onchange(), also make use of and internally create hidden
+variables. This information is contained in the
+action_data.track_data struct, and is also visible in the hist_debug
+output as will be described in the example below.
+
+Typically, the onmax() or onchange() handlers are used in conjunction
+with the save() and snapshot() actions. For example::
+
+ # echo 'hist:keys=next_pid:wakeup_lat=common_timestamp.usecs-$ts0: \
+ onmax($wakeup_lat).save(next_comm,prev_pid,prev_prio,prev_comm)' >>
+ /sys/kernel/tracing/events/sched/sched_switch/trigger
+
+or::
+
+ # echo 'hist:keys=next_pid:wakeup_lat=common_timestamp.usecs-$ts0: \
+ onmax($wakeup_lat).snapshot()' >>
+ /sys/kernel/tracing/events/sched/sched_switch/trigger
+
+save() action field variable test
+---------------------------------
+
+For this example, instead of generating a synthetic event, the save()
+action is used to save field values whenever an onmax() handler
+detects that a new max latency has been hit. As in the previous
+example, the values being saved are also field values, but in this
+case, are kept in a separate hist_data array named save_vars[].
+
+As in previous test examples, we set up the sched_waking trigger::
+
+ # echo 'hist:keys=pid:ts0=common_timestamp.usecs' >> events/sched/sched_waking/trigger
+
+In this case, however, we set up the sched_switch trigger to save some
+sched_switch field values whenever we hit a new maximum latency. For
+both the onmax() handler and save() action, variables will be created,
+which we can use the hist_debug files to examine::
+
+ # echo 'hist:keys=next_pid:wakeup_lat=common_timestamp.usecs-$ts0:onmax($wakeup_lat).save(next_comm,prev_pid,prev_prio,prev_comm)' >> events/sched/sched_switch/trigger
+
+The sched_waking hist_debug output shows the same data as in the
+previous test examples::
+
+ # cat events/sched/sched_waking/hist_debug
+
+ #
+ # trigger info: hist:keys=pid:vals=hitcount:ts0=common_timestamp.usecs:sort=hitcount:size=2048:clock=global [active]
+ #
+
+ hist_data: 00000000e6290f48
+
+ n_vals: 2
+ n_keys: 1
+ n_fields: 3
+
+ val fields:
+
+ hist_data->fields[0]:
+ flags:
+ VAL: HIST_FIELD_FL_HITCOUNT
+ type: u64
+ size: 8
+ is_signed: 0
+
+ hist_data->fields[1]:
+ flags:
+ HIST_FIELD_FL_VAR
+ var.name: ts0
+ var.idx (into tracing_map_elt.vars[]): 0
+ type: u64
+ size: 8
+ is_signed: 0
+
+ key fields:
+
+ hist_data->fields[2]:
+ flags:
+ HIST_FIELD_FL_KEY
+ ftrace_event_field name: pid
+ type: pid_t
+ size: 8
+ is_signed: 1
+
+The output of the sched_switch trigger shows the same val and key
+values as before, but also shows a couple new sections.
+
+First, the action tracking variables section now shows the
+actions[].track_data information describing the special tracking
+variables and references used to track, in this case, the running
+maximum value. The actions[].track_data.var_ref member contains the
+reference to the variable being tracked, in this case the $wakeup_lat
+variable. In order to perform the onmax() handler function, there
+also needs to be a variable that tracks the current maximum by getting
+updated whenever a new maximum is hit. In this case, we can see that
+an auto-generated variable named ' __max' has been created and is
+visible in the actions[].track_data.track_var variable.
+
+Finally, in the new 'save action variables' section, we can see that
+the 4 params to the save() function have resulted in 4 field variables
+being created for the purposes of saving the values of the named
+fields when the max is hit. These variables are kept in a separate
+save_vars[] array off of hist_data, so are displayed in a separate
+section::
+
+ # cat events/sched/sched_switch/hist_debug
+
+ # event histogram
+ #
+ # trigger info: hist:keys=next_pid:vals=hitcount:wakeup_lat=common_timestamp.usecs-$ts0:sort=hitcount:size=2048:clock=global:onmax($wakeup_lat).save(next_comm,prev_pid,prev_prio,prev_comm) [active]
+ #
+
+ hist_data: 0000000057bcd28d
+
+ n_vals: 2
+ n_keys: 1
+ n_fields: 3
+
+ val fields:
+
+ hist_data->fields[0]:
+ flags:
+ VAL: HIST_FIELD_FL_HITCOUNT
+ type: u64
+ size: 8
+ is_signed: 0
+
+ hist_data->fields[1]:
+ flags:
+ HIST_FIELD_FL_VAR
+ var.name: wakeup_lat
+ var.idx (into tracing_map_elt.vars[]): 0
+ type: u64
+ size: 0
+ is_signed: 0
+
+ key fields:
+
+ hist_data->fields[2]:
+ flags:
+ HIST_FIELD_FL_KEY
+ ftrace_event_field name: next_pid
+ type: pid_t
+ size: 8
+ is_signed: 1
+
+ variable reference fields:
+
+ hist_data->var_refs[0]:
+ flags:
+ HIST_FIELD_FL_VAR_REF
+ name: ts0
+ var.idx (into tracing_map_elt.vars[]): 0
+ var.hist_data: 00000000e6290f48
+ var_ref_idx (into hist_data->var_refs[]): 0
+ type: u64
+ size: 8
+ is_signed: 0
+
+ hist_data->var_refs[1]:
+ flags:
+ HIST_FIELD_FL_VAR_REF
+ name: wakeup_lat
+ var.idx (into tracing_map_elt.vars[]): 0
+ var.hist_data: 0000000057bcd28d
+ var_ref_idx (into hist_data->var_refs[]): 1
+ type: u64
+ size: 0
+ is_signed: 0
+
+ action tracking variables (for onmax()/onchange()/onmatch()):
+
+ hist_data->actions[0].track_data.var_ref:
+ flags:
+ HIST_FIELD_FL_VAR_REF
+ name: wakeup_lat
+ var.idx (into tracing_map_elt.vars[]): 0
+ var.hist_data: 0000000057bcd28d
+ var_ref_idx (into hist_data->var_refs[]): 1
+ type: u64
+ size: 0
+ is_signed: 0
+
+ hist_data->actions[0].track_data.track_var:
+ flags:
+ HIST_FIELD_FL_VAR
+ var.name: __max
+ var.idx (into tracing_map_elt.vars[]): 1
+ type: u64
+ size: 8
+ is_signed: 0
+
+ save action variables (save() params):
+
+ hist_data->save_vars[0]:
+
+ save_vars[0].var:
+ flags:
+ HIST_FIELD_FL_VAR
+ var.name: next_comm
+ var.idx (into tracing_map_elt.vars[]): 2
+
+ save_vars[0].val:
+ ftrace_event_field name: next_comm
+ type: char[16]
+ size: 256
+ is_signed: 0
+
+ hist_data->save_vars[1]:
+
+ save_vars[1].var:
+ flags:
+ HIST_FIELD_FL_VAR
+ var.name: prev_pid
+ var.idx (into tracing_map_elt.vars[]): 3
+
+ save_vars[1].val:
+ ftrace_event_field name: prev_pid
+ type: pid_t
+ size: 4
+ is_signed: 1
+
+ hist_data->save_vars[2]:
+
+ save_vars[2].var:
+ flags:
+ HIST_FIELD_FL_VAR
+ var.name: prev_prio
+ var.idx (into tracing_map_elt.vars[]): 4
+
+ save_vars[2].val:
+ ftrace_event_field name: prev_prio
+ type: int
+ size: 4
+ is_signed: 1
+
+ hist_data->save_vars[3]:
+
+ save_vars[3].var:
+ flags:
+ HIST_FIELD_FL_VAR
+ var.name: prev_comm
+ var.idx (into tracing_map_elt.vars[]): 5
+
+ save_vars[3].val:
+ ftrace_event_field name: prev_comm
+ type: char[16]
+ size: 256
+ is_signed: 0
+
+The commands below can be used to clean things up for the next test::
+
+ # echo '!hist:keys=next_pid:wakeup_lat=common_timestamp.usecs-$ts0:onmax($wakeup_lat).save(next_comm,prev_pid,prev_prio,prev_comm)' >> events/sched/sched_switch/trigger
+
+ # echo '!hist:keys=pid:ts0=common_timestamp.usecs' >> events/sched/sched_waking/trigger
+
+A couple special cases
+======================
+
+While the above covers the basics of the histogram internals, there
+are a couple of special cases that should be discussed, since they
+tend to create even more confusion. Those are field variables on other
+histograms, and aliases, both described below through example tests
+using the hist_debug files.
+
+Test of field variables on other histograms
+-------------------------------------------
+
+This example is similar to the previous examples, but in this case,
+the sched_switch trigger references a hist trigger field on another
+event, namely the sched_waking event. In order to accomplish this, a
+field variable is created for the other event, but since an existing
+histogram can't be used, as existing histograms are immutable, a new
+histogram with a matching variable is created and used, and we'll see
+that reflected in the hist_debug output shown below.
+
+First, we create the wakeup_latency synthetic event. Note the
+addition of the prio field::
+
+ # echo 'wakeup_latency u64 lat; pid_t pid; int prio' >> synthetic_events
+
+As in previous test examples, we set up the sched_waking trigger::
+
+ # echo 'hist:keys=pid:ts0=common_timestamp.usecs' >> events/sched/sched_waking/trigger
+
+Here we set up a hist trigger on sched_switch to send a wakeup_latency
+event using an onmatch handler naming the sched_waking event. Note
+that the third param being passed to the wakeup_latency() is prio,
+which is a field name that needs to have a field variable created for
+it. There isn't however any prio field on the sched_switch event so
+it would seem that it wouldn't be possible to create a field variable
+for it. The matching sched_waking event does have a prio field, so it
+should be possible to make use of it for this purpose. The problem
+with that is that it's not currently possible to define a new variable
+on an existing histogram, so it's not possible to add a new prio field
+variable to the existing sched_waking histogram. It is however
+possible to create an additional new 'matching' sched_waking histogram
+for the same event, meaning that it uses the same key and filters, and
+define the new prio field variable on that.
+
+Here's the sched_switch trigger::
+
+ # echo 'hist:keys=next_pid:wakeup_lat=common_timestamp.usecs-$ts0:onmatch(sched.sched_waking).wakeup_latency($wakeup_lat,next_pid,prio)' >> events/sched/sched_switch/trigger
+
+And here's the output of the hist_debug information for the
+sched_waking hist trigger. Note that there are two histograms
+displayed in the output: the first is the normal sched_waking
+histogram we've seen in the previous examples, and the second is the
+special histogram we created to provide the prio field variable.
+
+Looking at the second histogram below, we see a variable with the name
+synthetic_prio. This is the field variable created for the prio field
+on that sched_waking histogram::
+
+ # cat events/sched/sched_waking/hist_debug
+
+ # event histogram
+ #
+ # trigger info: hist:keys=pid:vals=hitcount:ts0=common_timestamp.usecs:sort=hitcount:size=2048:clock=global [active]
+ #
+
+ hist_data: 00000000349570e4
+
+ n_vals: 2
+ n_keys: 1
+ n_fields: 3
+
+ val fields:
+
+ hist_data->fields[0]:
+ flags:
+ VAL: HIST_FIELD_FL_HITCOUNT
+ type: u64
+ size: 8
+ is_signed: 0
+
+ hist_data->fields[1]:
+ flags:
+ HIST_FIELD_FL_VAR
+ var.name: ts0
+ var.idx (into tracing_map_elt.vars[]): 0
+ type: u64
+ size: 8
+ is_signed: 0
+
+ key fields:
+
+ hist_data->fields[2]:
+ flags:
+ HIST_FIELD_FL_KEY
+ ftrace_event_field name: pid
+ type: pid_t
+ size: 8
+ is_signed: 1
+
+
+ # event histogram
+ #
+ # trigger info: hist:keys=pid:vals=hitcount:synthetic_prio=prio:sort=hitcount:size=2048 [active]
+ #
+
+ hist_data: 000000006920cf38
+
+ n_vals: 2
+ n_keys: 1
+ n_fields: 3
+
+ val fields:
+
+ hist_data->fields[0]:
+ flags:
+ VAL: HIST_FIELD_FL_HITCOUNT
+ type: u64
+ size: 8
+ is_signed: 0
+
+ hist_data->fields[1]:
+ flags:
+ HIST_FIELD_FL_VAR
+ ftrace_event_field name: prio
+ var.name: synthetic_prio
+ var.idx (into tracing_map_elt.vars[]): 0
+ type: int
+ size: 4
+ is_signed: 1
+
+ key fields:
+
+ hist_data->fields[2]:
+ flags:
+ HIST_FIELD_FL_KEY
+ ftrace_event_field name: pid
+ type: pid_t
+ size: 8
+ is_signed: 1
+
+Looking at the sched_switch histogram below, we can see a reference to
+the synthetic_prio variable on sched_waking, and looking at the
+associated hist_data address we see that it is indeed associated with
+the new histogram. Note also that the other references are to a
+normal variable, wakeup_lat, and to a normal field variable, next_pid,
+the details of which are in the field variables section::
+
+ # cat events/sched/sched_switch/hist_debug
+
+ # event histogram
+ #
+ # trigger info: hist:keys=next_pid:vals=hitcount:wakeup_lat=common_timestamp.usecs-$ts0:sort=hitcount:size=2048:clock=global:onmatch(sched.sched_waking).wakeup_latency($wakeup_lat,next_pid,prio) [active]
+ #
+
+ hist_data: 00000000a73b67df
+
+ n_vals: 2
+ n_keys: 1
+ n_fields: 3
+
+ val fields:
+
+ hist_data->fields[0]:
+ flags:
+ VAL: HIST_FIELD_FL_HITCOUNT
+ type: u64
+ size: 8
+ is_signed: 0
+
+ hist_data->fields[1]:
+ flags:
+ HIST_FIELD_FL_VAR
+ var.name: wakeup_lat
+ var.idx (into tracing_map_elt.vars[]): 0
+ type: u64
+ size: 0
+ is_signed: 0
+
+ key fields:
+
+ hist_data->fields[2]:
+ flags:
+ HIST_FIELD_FL_KEY
+ ftrace_event_field name: next_pid
+ type: pid_t
+ size: 8
+ is_signed: 1
+
+ variable reference fields:
+
+ hist_data->var_refs[0]:
+ flags:
+ HIST_FIELD_FL_VAR_REF
+ name: ts0
+ var.idx (into tracing_map_elt.vars[]): 0
+ var.hist_data: 00000000349570e4
+ var_ref_idx (into hist_data->var_refs[]): 0
+ type: u64
+ size: 8
+ is_signed: 0
+
+ hist_data->var_refs[1]:
+ flags:
+ HIST_FIELD_FL_VAR_REF
+ name: wakeup_lat
+ var.idx (into tracing_map_elt.vars[]): 0
+ var.hist_data: 00000000a73b67df
+ var_ref_idx (into hist_data->var_refs[]): 1
+ type: u64
+ size: 0
+ is_signed: 0
+
+ hist_data->var_refs[2]:
+ flags:
+ HIST_FIELD_FL_VAR_REF
+ name: next_pid
+ var.idx (into tracing_map_elt.vars[]): 1
+ var.hist_data: 00000000a73b67df
+ var_ref_idx (into hist_data->var_refs[]): 2
+ type: pid_t
+ size: 4
+ is_signed: 0
+
+ hist_data->var_refs[3]:
+ flags:
+ HIST_FIELD_FL_VAR_REF
+ name: synthetic_prio
+ var.idx (into tracing_map_elt.vars[]): 0
+ var.hist_data: 000000006920cf38
+ var_ref_idx (into hist_data->var_refs[]): 3
+ type: int
+ size: 4
+ is_signed: 1
+
+ field variables:
+
+ hist_data->field_vars[0]:
+
+ field_vars[0].var:
+ flags:
+ HIST_FIELD_FL_VAR
+ var.name: next_pid
+ var.idx (into tracing_map_elt.vars[]): 1
+
+ field_vars[0].val:
+ ftrace_event_field name: next_pid
+ type: pid_t
+ size: 4
+ is_signed: 1
+
+ action tracking variables (for onmax()/onchange()/onmatch()):
+
+ hist_data->actions[0].match_data.event_system: sched
+ hist_data->actions[0].match_data.event: sched_waking
+
+The commands below can be used to clean things up for the next test::
+
+ # echo '!hist:keys=next_pid:wakeup_lat=common_timestamp.usecs-$ts0:onmatch(sched.sched_waking).wakeup_latency($wakeup_lat,next_pid,prio)' >> events/sched/sched_switch/trigger
+
+ # echo '!hist:keys=pid:ts0=common_timestamp.usecs' >> events/sched/sched_waking/trigger
+
+ # echo '!wakeup_latency u64 lat; pid_t pid; int prio' >> synthetic_events
+
+Alias test
+----------
+
+This example is very similar to previous examples, but demonstrates
+the alias flag.
+
+First, we create the wakeup_latency synthetic event::
+
+ # echo 'wakeup_latency u64 lat; pid_t pid; char comm[16]' >> synthetic_events
+
+Next, we create a sched_waking trigger similar to previous examples,
+but in this case we save the pid in the waking_pid variable::
+
+ # echo 'hist:keys=pid:waking_pid=pid:ts0=common_timestamp.usecs' >> events/sched/sched_waking/trigger
+
+For the sched_switch trigger, instead of using $waking_pid directly in
+the wakeup_latency synthetic event invocation, we create an alias of
+$waking_pid named $woken_pid, and use that in the synthetic event
+invocation instead::
+
+ # echo 'hist:keys=next_pid:woken_pid=$waking_pid:wakeup_lat=common_timestamp.usecs-$ts0:onmatch(sched.sched_waking).wakeup_latency($wakeup_lat,$woken_pid,next_comm)' >> events/sched/sched_switch/trigger
+
+Looking at the sched_waking hist_debug output, in addition to the
+normal fields, we can see the waking_pid variable::
+
+ # cat events/sched/sched_waking/hist_debug
+
+ # event histogram
+ #
+ # trigger info: hist:keys=pid:vals=hitcount:waking_pid=pid,ts0=common_timestamp.usecs:sort=hitcount:size=2048:clock=global [active]
+ #
+
+ hist_data: 00000000a250528c
+
+ n_vals: 3
+ n_keys: 1
+ n_fields: 4
+
+ val fields:
+
+ hist_data->fields[0]:
+ flags:
+ VAL: HIST_FIELD_FL_HITCOUNT
+ type: u64
+ size: 8
+ is_signed: 0
+
+ hist_data->fields[1]:
+ flags:
+ HIST_FIELD_FL_VAR
+ ftrace_event_field name: pid
+ var.name: waking_pid
+ var.idx (into tracing_map_elt.vars[]): 0
+ type: pid_t
+ size: 4
+ is_signed: 1
+
+ hist_data->fields[2]:
+ flags:
+ HIST_FIELD_FL_VAR
+ var.name: ts0
+ var.idx (into tracing_map_elt.vars[]): 1
+ type: u64
+ size: 8
+ is_signed: 0
+
+ key fields:
+
+ hist_data->fields[3]:
+ flags:
+ HIST_FIELD_FL_KEY
+ ftrace_event_field name: pid
+ type: pid_t
+ size: 8
+ is_signed: 1
+
+The sched_switch hist_debug output shows that a variable named
+woken_pid has been created but that it also has the
+HIST_FIELD_FL_ALIAS flag set. It also has the HIST_FIELD_FL_VAR flag
+set, which is why it appears in the val field section.
+
+Despite that implementation detail, an alias variable is actually more
+like a variable reference; in fact it can be thought of as a reference
+to a reference. The implementation copies the var_ref->fn() from the
+variable reference being referenced, in this case, the waking_pid
+fn(), which is hist_field_var_ref() and makes that the fn() of the
+alias. The hist_field_var_ref() fn() requires the var_ref_idx of the
+variable reference it's using, so waking_pid's var_ref_idx is also
+copied to the alias. The end result is that when the value of alias
+is retrieved, in the end it just does the same thing the original
+reference would have done and retrieves the same value from the
+var_ref_vals[] array. You can verify this in the output by noting
+that the var_ref_idx of the alias, in this case woken_pid, is the same
+as the var_ref_idx of the reference, waking_pid, in the variable
+reference fields section.
+
+Additionally, once it gets that value, since it is also a variable, it
+then saves that value into its var.idx. So the var.idx of the
+woken_pid alias is 0, which it fills with the value from var_ref_idx 0
+when its fn() is called to update itself. You'll also notice that
+there's a woken_pid var_ref in the variable refs section. That is the
+reference to the woken_pid alias variable, and you can see that it
+retrieves the value from the same var.idx as the woken_pid alias, 0,
+and then in turn saves that value in its own var_ref_idx slot, 3, and
+the value at this position is finally what gets assigned to the
+$woken_pid slot in the trace event invocation::
+
+ # cat events/sched/sched_switch/hist_debug
+
+ # event histogram
+ #
+ # trigger info: hist:keys=next_pid:vals=hitcount:woken_pid=$waking_pid,wakeup_lat=common_timestamp.usecs-$ts0:sort=hitcount:size=2048:clock=global:onmatch(sched.sched_waking).wakeup_latency($wakeup_lat,$woken_pid,next_comm) [active]
+ #
+
+ hist_data: 0000000055d65ed0
+
+ n_vals: 3
+ n_keys: 1
+ n_fields: 4
+
+ val fields:
+
+ hist_data->fields[0]:
+ flags:
+ VAL: HIST_FIELD_FL_HITCOUNT
+ type: u64
+ size: 8
+ is_signed: 0
+
+ hist_data->fields[1]:
+ flags:
+ HIST_FIELD_FL_VAR
+ HIST_FIELD_FL_ALIAS
+ var.name: woken_pid
+ var.idx (into tracing_map_elt.vars[]): 0
+ var_ref_idx (into hist_data->var_refs[]): 0
+ type: pid_t
+ size: 4
+ is_signed: 1
+
+ hist_data->fields[2]:
+ flags:
+ HIST_FIELD_FL_VAR
+ var.name: wakeup_lat
+ var.idx (into tracing_map_elt.vars[]): 1
+ type: u64
+ size: 0
+ is_signed: 0
+
+ key fields:
+
+ hist_data->fields[3]:
+ flags:
+ HIST_FIELD_FL_KEY
+ ftrace_event_field name: next_pid
+ type: pid_t
+ size: 8
+ is_signed: 1
+
+ variable reference fields:
+
+ hist_data->var_refs[0]:
+ flags:
+ HIST_FIELD_FL_VAR_REF
+ name: waking_pid
+ var.idx (into tracing_map_elt.vars[]): 0
+ var.hist_data: 00000000a250528c
+ var_ref_idx (into hist_data->var_refs[]): 0
+ type: pid_t
+ size: 4
+ is_signed: 1
+
+ hist_data->var_refs[1]:
+ flags:
+ HIST_FIELD_FL_VAR_REF
+ name: ts0
+ var.idx (into tracing_map_elt.vars[]): 1
+ var.hist_data: 00000000a250528c
+ var_ref_idx (into hist_data->var_refs[]): 1
+ type: u64
+ size: 8
+ is_signed: 0
+
+ hist_data->var_refs[2]:
+ flags:
+ HIST_FIELD_FL_VAR_REF
+ name: wakeup_lat
+ var.idx (into tracing_map_elt.vars[]): 1
+ var.hist_data: 0000000055d65ed0
+ var_ref_idx (into hist_data->var_refs[]): 2
+ type: u64
+ size: 0
+ is_signed: 0
+
+ hist_data->var_refs[3]:
+ flags:
+ HIST_FIELD_FL_VAR_REF
+ name: woken_pid
+ var.idx (into tracing_map_elt.vars[]): 0
+ var.hist_data: 0000000055d65ed0
+ var_ref_idx (into hist_data->var_refs[]): 3
+ type: pid_t
+ size: 4
+ is_signed: 1
+
+ hist_data->var_refs[4]:
+ flags:
+ HIST_FIELD_FL_VAR_REF
+ name: next_comm
+ var.idx (into tracing_map_elt.vars[]): 2
+ var.hist_data: 0000000055d65ed0
+ var_ref_idx (into hist_data->var_refs[]): 4
+ type: char[16]
+ size: 256
+ is_signed: 0
+
+ field variables:
+
+ hist_data->field_vars[0]:
+
+ field_vars[0].var:
+ flags:
+ HIST_FIELD_FL_VAR
+ var.name: next_comm
+ var.idx (into tracing_map_elt.vars[]): 2
+
+ field_vars[0].val:
+ ftrace_event_field name: next_comm
+ type: char[16]
+ size: 256
+ is_signed: 0
+
+ action tracking variables (for onmax()/onchange()/onmatch()):
+
+ hist_data->actions[0].match_data.event_system: sched
+ hist_data->actions[0].match_data.event: sched_waking
+
+The commands below can be used to clean things up for the next test::
+
+ # echo '!hist:keys=next_pid:woken_pid=$waking_pid:wakeup_lat=common_timestamp.usecs-$ts0:onmatch(sched.sched_waking).wakeup_latency($wakeup_lat,$woken_pid,next_comm)' >> events/sched/sched_switch/trigger
+
+ # echo '!hist:keys=pid:ts0=common_timestamp.usecs' >> events/sched/sched_waking/trigger
+
+ # echo '!wakeup_latency u64 lat; pid_t pid; char comm[16]' >> synthetic_events
diff --git a/Documentation/trace/histogram.rst b/Documentation/trace/histogram.rst
index 8408670d0328..3c9b263de9c2 100644
--- a/Documentation/trace/histogram.rst
+++ b/Documentation/trace/histogram.rst
@@ -25,7 +25,7 @@ Documentation written by Tom Zanussi
hist:keys=<field1[,field2,...]>[:values=<field1[,field2,...]>]
[:sort=<field1[,field2,...]>][:size=#entries][:pause][:continue]
- [:clear][:name=histname1][:<handler>.<action>] [if <filter>]
+ [:clear][:name=histname1][:nohitcount][:<handler>.<action>] [if <filter>]
When a matching event is hit, an entry is added to a hash table
using the key(s) and value(s) named. Keys and values correspond to
@@ -35,11 +35,11 @@ Documentation written by Tom Zanussi
in place of an explicit value field - this is simply a count of
event hits. If 'values' isn't specified, an implicit 'hitcount'
value will be automatically created and used as the only value.
- Keys can be any field, or the special string 'stacktrace', which
+ Keys can be any field, or the special string 'common_stacktrace', which
will use the event's kernel stacktrace as the key. The keywords
'keys' or 'key' can be used to specify keys, and the keywords
'values', 'vals', or 'val' can be used to specify values. Compound
- keys consisting of up to two fields can be specified by the 'keys'
+ keys consisting of up to three fields can be specified by the 'keys'
keyword. Hashing a compound key produces a unique entry in the
table for each unique combination of component keys, and can be
useful for providing more fine-grained summaries of event data.
@@ -54,7 +54,7 @@ Documentation written by Tom Zanussi
'compatible' if the fields named in the trigger share the same
number and type of fields and those fields also have the same names.
Note that any two events always share the compatible 'hitcount' and
- 'stacktrace' fields and can therefore be combined using those
+ 'common_stacktrace' fields and can therefore be combined using those
fields, however pointless that may be.
'hist' triggers add a 'hist' file to each event's subdirectory.
@@ -70,15 +70,19 @@ Documentation written by Tom Zanussi
modified by appending any of the following modifiers to the field
name:
- =========== ==========================================
- .hex display a number as a hex value
- .sym display an address as a symbol
- .sym-offset display an address as a symbol and offset
- .syscall display a syscall id as a system call name
- .execname display a common_pid as a program name
- .log2 display log2 value rather than raw number
- .usecs display a common_timestamp in microseconds
- =========== ==========================================
+ ============= =================================================
+ .hex display a number as a hex value
+ .sym display an address as a symbol
+ .sym-offset display an address as a symbol and offset
+ .syscall display a syscall id as a system call name
+ .execname display a common_pid as a program name
+ .log2 display log2 value rather than raw number
+ .buckets=size display grouping of values rather than raw number
+ .usecs display a common_timestamp in microseconds
+ .percent display a number of percentage value
+ .graph display a bar-graph of a value
+ .stacktrace display as a stacktrace (must by a long[] type)
+ ============= =================================================
Note that in general the semantics of a given field aren't
interpreted when applying a modifier to it, but there are some
@@ -99,12 +103,12 @@ Documentation written by Tom Zanussi
trigger, read its current contents, and then turn it off::
# echo 'hist:keys=skbaddr.hex:vals=len' > \
- /sys/kernel/debug/tracing/events/net/netif_rx/trigger
+ /sys/kernel/tracing/events/net/netif_rx/trigger
- # cat /sys/kernel/debug/tracing/events/net/netif_rx/hist
+ # cat /sys/kernel/tracing/events/net/netif_rx/hist
# echo '!hist:keys=skbaddr.hex:vals=len' > \
- /sys/kernel/debug/tracing/events/net/netif_rx/trigger
+ /sys/kernel/tracing/events/net/netif_rx/trigger
The trigger file itself can be read to show the details of the
currently attached hist trigger. This information is also displayed
@@ -136,6 +140,12 @@ Documentation written by Tom Zanussi
existing trigger, rather than via the '>' operator, which will cause
the trigger to be removed through truncation.
+ The 'nohitcount' (or NOHC) parameter will suppress display of
+ raw hitcount in the histogram. This option requires at least one
+ value field which is not a 'raw hitcount'. For example,
+ 'hist:...:vals=hitcount:nohitcount' is rejected, but
+ 'hist:...:vals=hitcount.percent:nohitcount' is OK.
+
- enable_hist/disable_hist
The enable_hist and disable_hist triggers can be used to have one
@@ -160,13 +170,13 @@ Documentation written by Tom Zanussi
aggregation on and off when conditions of interest are hit::
# echo 'hist:keys=skbaddr.hex:vals=len:pause' > \
- /sys/kernel/debug/tracing/events/net/netif_receive_skb/trigger
+ /sys/kernel/tracing/events/net/netif_receive_skb/trigger
# echo 'enable_hist:net:netif_receive_skb if filename==/usr/bin/wget' > \
- /sys/kernel/debug/tracing/events/sched/sched_process_exec/trigger
+ /sys/kernel/tracing/events/sched/sched_process_exec/trigger
# echo 'disable_hist:net:netif_receive_skb if comm==wget' > \
- /sys/kernel/debug/tracing/events/sched/sched_process_exit/trigger
+ /sys/kernel/tracing/events/sched/sched_process_exit/trigger
The above sets up an initially paused hist trigger which is unpaused
and starts aggregating events when a given program is executed, and
@@ -191,7 +201,7 @@ Documentation written by Tom Zanussi
with the event, in nanoseconds. May be
modified by .usecs to have timestamps
interpreted as microseconds.
- cpu int the cpu on which the event occurred.
+ common_cpu int the cpu on which the event occurred.
====================== ==== =======================================
Extended error information
@@ -209,7 +219,7 @@ Extended error information
event. The fields that can be used for the hist trigger are listed
in the kmalloc event's format file::
- # cat /sys/kernel/debug/tracing/events/kmem/kmalloc/format
+ # cat /sys/kernel/tracing/events/kmem/kmalloc/format
name: kmalloc
ID: 374
format:
@@ -228,8 +238,8 @@ Extended error information
that lists the total number of bytes requested for each function in
the kernel that made one or more calls to kmalloc::
- # echo 'hist:key=call_site:val=bytes_req' > \
- /sys/kernel/debug/tracing/events/kmem/kmalloc/trigger
+ # echo 'hist:key=call_site:val=bytes_req.buckets=32' > \
+ /sys/kernel/tracing/events/kmem/kmalloc/trigger
This tells the tracing system to create a 'hist' trigger using the
call_site field of the kmalloc event as the key for the table, which
@@ -243,7 +253,7 @@ Extended error information
file in the kmalloc event's subdirectory (for readability, a number
of entries have been omitted)::
- # cat /sys/kernel/debug/tracing/events/kmem/kmalloc/hist
+ # cat /sys/kernel/tracing/events/kmem/kmalloc/hist
# trigger info: hist:keys=call_site:vals=bytes_req:sort=hitcount:size=2048 [active]
{ call_site: 18446744072106379007 } hitcount: 1 bytes_req: 176
@@ -283,7 +293,7 @@ Extended error information
the trigger info, which can also be displayed by reading the
'trigger' file::
- # cat /sys/kernel/debug/tracing/events/kmem/kmalloc/trigger
+ # cat /sys/kernel/tracing/events/kmem/kmalloc/trigger
hist:keys=call_site:vals=bytes_req:sort=hitcount:size=2048 [active]
At the end of the output are a few lines that display the overall
@@ -314,7 +324,7 @@ Extended error information
command history and re-execute it with a '!' prepended::
# echo '!hist:key=call_site:val=bytes_req' > \
- /sys/kernel/debug/tracing/events/kmem/kmalloc/trigger
+ /sys/kernel/tracing/events/kmem/kmalloc/trigger
Finally, notice that the call_site as displayed in the output above
isn't really very useful. It's an address, but normally addresses
@@ -322,9 +332,9 @@ Extended error information
value, simply append '.hex' to the field name in the trigger::
# echo 'hist:key=call_site.hex:val=bytes_req' > \
- /sys/kernel/debug/tracing/events/kmem/kmalloc/trigger
+ /sys/kernel/tracing/events/kmem/kmalloc/trigger
- # cat /sys/kernel/debug/tracing/events/kmem/kmalloc/hist
+ # cat /sys/kernel/tracing/events/kmem/kmalloc/hist
# trigger info: hist:keys=call_site.hex:vals=bytes_req:sort=hitcount:size=2048 [active]
{ call_site: ffffffffa026b291 } hitcount: 1 bytes_req: 433
@@ -367,9 +377,9 @@ Extended error information
trigger::
# echo 'hist:key=call_site.sym:val=bytes_req' > \
- /sys/kernel/debug/tracing/events/kmem/kmalloc/trigger
+ /sys/kernel/tracing/events/kmem/kmalloc/trigger
- # cat /sys/kernel/debug/tracing/events/kmem/kmalloc/hist
+ # cat /sys/kernel/tracing/events/kmem/kmalloc/hist
# trigger info: hist:keys=call_site.sym:vals=bytes_req:sort=hitcount:size=2048 [active]
{ call_site: [ffffffff810adcb9] syslog_print_all } hitcount: 1 bytes_req: 1024
@@ -411,15 +421,15 @@ Extended error information
Because the default sort key above is 'hitcount', the above shows a
the list of call_sites by increasing hitcount, so that at the bottom
we see the functions that made the most kmalloc calls during the
- run. If instead we we wanted to see the top kmalloc callers in
+ run. If instead we wanted to see the top kmalloc callers in
terms of the number of bytes requested rather than the number of
calls, and we wanted the top caller to appear at the top, we can use
the 'sort' parameter, along with the 'descending' modifier::
# echo 'hist:key=call_site.sym:val=bytes_req:sort=bytes_req.descending' > \
- /sys/kernel/debug/tracing/events/kmem/kmalloc/trigger
+ /sys/kernel/tracing/events/kmem/kmalloc/trigger
- # cat /sys/kernel/debug/tracing/events/kmem/kmalloc/hist
+ # cat /sys/kernel/tracing/events/kmem/kmalloc/hist
# trigger info: hist:keys=call_site.sym:vals=bytes_req:sort=bytes_req.descending:size=2048 [active]
{ call_site: [ffffffffa046041c] i915_gem_execbuffer2 [i915] } hitcount: 2186 bytes_req: 3397464
@@ -458,9 +468,9 @@ Extended error information
name, just use 'sym-offset' instead::
# echo 'hist:key=call_site.sym-offset:val=bytes_req:sort=bytes_req.descending' > \
- /sys/kernel/debug/tracing/events/kmem/kmalloc/trigger
+ /sys/kernel/tracing/events/kmem/kmalloc/trigger
- # cat /sys/kernel/debug/tracing/events/kmem/kmalloc/hist
+ # cat /sys/kernel/tracing/events/kmem/kmalloc/hist
# trigger info: hist:keys=call_site.sym-offset:vals=bytes_req:sort=bytes_req.descending:size=2048 [active]
{ call_site: [ffffffffa046041c] i915_gem_execbuffer2+0x6c/0x2c0 [i915] } hitcount: 4569 bytes_req: 3163720
@@ -497,9 +507,9 @@ Extended error information
allocated in a descending order::
# echo 'hist:keys=call_site.sym:values=bytes_req,bytes_alloc:sort=bytes_alloc.descending' > \
- /sys/kernel/debug/tracing/events/kmem/kmalloc/trigger
+ /sys/kernel/tracing/events/kmem/kmalloc/trigger
- # cat /sys/kernel/debug/tracing/events/kmem/kmalloc/hist
+ # cat /sys/kernel/tracing/events/kmem/kmalloc/hist
# trigger info: hist:keys=call_site.sym:vals=bytes_req,bytes_alloc:sort=bytes_alloc.descending:size=2048 [active]
{ call_site: [ffffffffa046041c] i915_gem_execbuffer2 [i915] } hitcount: 7403 bytes_req: 4084360 bytes_alloc: 5958016
@@ -537,10 +547,10 @@ Extended error information
the hist trigger display symbolic call_sites, we can have the hist
trigger additionally display the complete set of kernel stack traces
that led to each call_site. To do that, we simply use the special
- value 'stacktrace' for the key parameter::
+ value 'common_stacktrace' for the key parameter::
- # echo 'hist:keys=stacktrace:values=bytes_req,bytes_alloc:sort=bytes_alloc' > \
- /sys/kernel/debug/tracing/events/kmem/kmalloc/trigger
+ # echo 'hist:keys=common_stacktrace:values=bytes_req,bytes_alloc:sort=bytes_alloc' > \
+ /sys/kernel/tracing/events/kmem/kmalloc/trigger
The above trigger will use the kernel stack trace in effect when an
event is triggered as the key for the hash table. This allows the
@@ -550,10 +560,10 @@ Extended error information
every callpath in the system that led up to a kmalloc (in this case
every callpath to a kmalloc for a kernel compile)::
- # cat /sys/kernel/debug/tracing/events/kmem/kmalloc/hist
- # trigger info: hist:keys=stacktrace:vals=bytes_req,bytes_alloc:sort=bytes_alloc:size=2048 [active]
+ # cat /sys/kernel/tracing/events/kmem/kmalloc/hist
+ # trigger info: hist:keys=common_stacktrace:vals=bytes_req,bytes_alloc:sort=bytes_alloc:size=2048 [active]
- { stacktrace:
+ { common_stacktrace:
__kmalloc_track_caller+0x10b/0x1a0
kmemdup+0x20/0x50
hidraw_report_event+0x8a/0x120 [hid]
@@ -571,7 +581,7 @@ Extended error information
cpu_startup_entry+0x315/0x3e0
rest_init+0x7c/0x80
} hitcount: 3 bytes_req: 21 bytes_alloc: 24
- { stacktrace:
+ { common_stacktrace:
__kmalloc_track_caller+0x10b/0x1a0
kmemdup+0x20/0x50
hidraw_report_event+0x8a/0x120 [hid]
@@ -586,7 +596,7 @@ Extended error information
do_IRQ+0x5a/0xf0
ret_from_intr+0x0/0x30
} hitcount: 3 bytes_req: 21 bytes_alloc: 24
- { stacktrace:
+ { common_stacktrace:
kmem_cache_alloc_trace+0xeb/0x150
aa_alloc_task_context+0x27/0x40
apparmor_cred_prepare+0x1f/0x50
@@ -598,7 +608,7 @@ Extended error information
.
.
.
- { stacktrace:
+ { common_stacktrace:
__kmalloc+0x11b/0x1b0
i915_gem_execbuffer2+0x6c/0x2c0 [i915]
drm_ioctl+0x349/0x670 [drm]
@@ -606,7 +616,7 @@ Extended error information
SyS_ioctl+0x81/0xa0
system_call_fastpath+0x12/0x6a
} hitcount: 17726 bytes_req: 13944120 bytes_alloc: 19593808
- { stacktrace:
+ { common_stacktrace:
__kmalloc+0x11b/0x1b0
load_elf_phdrs+0x76/0xa0
load_elf_binary+0x102/0x1650
@@ -615,7 +625,7 @@ Extended error information
SyS_execve+0x3a/0x50
return_from_execve+0x0/0x23
} hitcount: 33348 bytes_req: 17152128 bytes_alloc: 20226048
- { stacktrace:
+ { common_stacktrace:
kmem_cache_alloc_trace+0xeb/0x150
apparmor_file_alloc_security+0x27/0x40
security_file_alloc+0x16/0x20
@@ -626,7 +636,7 @@ Extended error information
SyS_open+0x1e/0x20
system_call_fastpath+0x12/0x6a
} hitcount: 4766422 bytes_req: 9532844 bytes_alloc: 38131376
- { stacktrace:
+ { common_stacktrace:
__kmalloc+0x11b/0x1b0
seq_buf_alloc+0x1b/0x50
seq_read+0x2cc/0x370
@@ -649,9 +659,9 @@ Extended error information
keeps a per-process sum of total bytes read::
# echo 'hist:key=common_pid.execname:val=count:sort=count.descending' > \
- /sys/kernel/debug/tracing/events/syscalls/sys_enter_read/trigger
+ /sys/kernel/tracing/events/syscalls/sys_enter_read/trigger
- # cat /sys/kernel/debug/tracing/events/syscalls/sys_enter_read/hist
+ # cat /sys/kernel/tracing/events/syscalls/sys_enter_read/hist
# trigger info: hist:keys=common_pid.execname:vals=count:sort=count.descending:size=2048 [active]
{ common_pid: gnome-terminal [ 3196] } hitcount: 280 count: 1093512
@@ -690,9 +700,9 @@ Extended error information
counts for the system during the run::
# echo 'hist:key=id.syscall:val=hitcount' > \
- /sys/kernel/debug/tracing/events/raw_syscalls/sys_enter/trigger
+ /sys/kernel/tracing/events/raw_syscalls/sys_enter/trigger
- # cat /sys/kernel/debug/tracing/events/raw_syscalls/sys_enter/hist
+ # cat /sys/kernel/tracing/events/raw_syscalls/sys_enter/hist
# trigger info: hist:keys=id.syscall:vals=hitcount:sort=hitcount:size=2048 [active]
{ id: sys_fsync [ 74] } hitcount: 1
@@ -744,9 +754,9 @@ Extended error information
hitcount sum as the secondary key::
# echo 'hist:key=id.syscall,common_pid.execname:val=hitcount:sort=id,hitcount' > \
- /sys/kernel/debug/tracing/events/raw_syscalls/sys_enter/trigger
+ /sys/kernel/tracing/events/raw_syscalls/sys_enter/trigger
- # cat /sys/kernel/debug/tracing/events/raw_syscalls/sys_enter/hist
+ # cat /sys/kernel/tracing/events/raw_syscalls/sys_enter/hist
# trigger info: hist:keys=id.syscall,common_pid.execname:vals=hitcount:sort=id.syscall,hitcount:size=2048 [active]
{ id: sys_read [ 0], common_pid: rtkit-daemon [ 1877] } hitcount: 1
@@ -794,9 +804,9 @@ Extended error information
can use that to filter out all the other syscalls::
# echo 'hist:key=id.syscall,common_pid.execname:val=hitcount:sort=id,hitcount if id == 16' > \
- /sys/kernel/debug/tracing/events/raw_syscalls/sys_enter/trigger
+ /sys/kernel/tracing/events/raw_syscalls/sys_enter/trigger
- # cat /sys/kernel/debug/tracing/events/raw_syscalls/sys_enter/hist
+ # cat /sys/kernel/tracing/events/raw_syscalls/sys_enter/hist
# trigger info: hist:keys=id.syscall,common_pid.execname:vals=hitcount:sort=id.syscall,hitcount:size=2048 if id == 16 [active]
{ id: sys_ioctl [ 16], common_pid: gmain [ 2769] } hitcount: 1
@@ -837,9 +847,9 @@ Extended error information
each process::
# echo 'hist:key=common_pid.execname,size:val=hitcount:sort=common_pid,size' > \
- /sys/kernel/debug/tracing/events/syscalls/sys_enter_recvfrom/trigger
+ /sys/kernel/tracing/events/syscalls/sys_enter_recvfrom/trigger
- # cat /sys/kernel/debug/tracing/events/syscalls/sys_enter_recvfrom/hist
+ # cat /sys/kernel/tracing/events/syscalls/sys_enter_recvfrom/hist
# trigger info: hist:keys=common_pid.execname,size:vals=hitcount:sort=common_pid.execname,size:size=2048 [active]
{ common_pid: smbd [ 784], size: 4 } hitcount: 1
@@ -890,9 +900,9 @@ Extended error information
much smaller number, say 256::
# echo 'hist:key=child_comm:val=hitcount:size=256' > \
- /sys/kernel/debug/tracing/events/sched/sched_process_fork/trigger
+ /sys/kernel/tracing/events/sched/sched_process_fork/trigger
- # cat /sys/kernel/debug/tracing/events/sched/sched_process_fork/hist
+ # cat /sys/kernel/tracing/events/sched/sched_process_fork/hist
# trigger info: hist:keys=child_comm:vals=hitcount:sort=hitcount:size=256 [active]
{ child_comm: dconf worker } hitcount: 1
@@ -926,9 +936,9 @@ Extended error information
displays as [paused]::
# echo 'hist:key=child_comm:val=hitcount:size=256:pause' >> \
- /sys/kernel/debug/tracing/events/sched/sched_process_fork/trigger
+ /sys/kernel/tracing/events/sched/sched_process_fork/trigger
- # cat /sys/kernel/debug/tracing/events/sched/sched_process_fork/hist
+ # cat /sys/kernel/tracing/events/sched/sched_process_fork/hist
# trigger info: hist:keys=child_comm:vals=hitcount:sort=hitcount:size=256 [paused]
{ child_comm: dconf worker } hitcount: 1
@@ -963,9 +973,9 @@ Extended error information
again, and the data has changed::
# echo 'hist:key=child_comm:val=hitcount:size=256:cont' >> \
- /sys/kernel/debug/tracing/events/sched/sched_process_fork/trigger
+ /sys/kernel/tracing/events/sched/sched_process_fork/trigger
- # cat /sys/kernel/debug/tracing/events/sched/sched_process_fork/hist
+ # cat /sys/kernel/tracing/events/sched/sched_process_fork/hist
# trigger info: hist:keys=child_comm:vals=hitcount:sort=hitcount:size=256 [active]
{ child_comm: dconf worker } hitcount: 1
@@ -1016,8 +1026,8 @@ Extended error information
First we set up an initially paused stacktrace trigger on the
netif_receive_skb event::
- # echo 'hist:key=stacktrace:vals=len:pause' > \
- /sys/kernel/debug/tracing/events/net/netif_receive_skb/trigger
+ # echo 'hist:key=common_stacktrace:vals=len:pause' > \
+ /sys/kernel/tracing/events/net/netif_receive_skb/trigger
Next, we set up an 'enable_hist' trigger on the sched_process_exec
event, with an 'if filename==/usr/bin/wget' filter. The effect of
@@ -1028,7 +1038,7 @@ Extended error information
hash table keyed on stacktrace::
# echo 'enable_hist:net:netif_receive_skb if filename==/usr/bin/wget' > \
- /sys/kernel/debug/tracing/events/sched/sched_process_exec/trigger
+ /sys/kernel/tracing/events/sched/sched_process_exec/trigger
The aggregation continues until the netif_receive_skb is paused
again, which is what the following disable_hist event does by
@@ -1036,7 +1046,7 @@ Extended error information
filter 'comm==wget'::
# echo 'disable_hist:net:netif_receive_skb if comm==wget' > \
- /sys/kernel/debug/tracing/events/sched/sched_process_exit/trigger
+ /sys/kernel/tracing/events/sched/sched_process_exit/trigger
Whenever a process exits and the comm field of the disable_hist
trigger filter matches 'comm==wget', the netif_receive_skb hist
@@ -1049,10 +1059,10 @@ Extended error information
$ wget https://www.kernel.org/pub/linux/kernel/v3.x/patch-3.19.xz
- # cat /sys/kernel/debug/tracing/events/net/netif_receive_skb/hist
- # trigger info: hist:keys=stacktrace:vals=len:sort=hitcount:size=2048 [paused]
+ # cat /sys/kernel/tracing/events/net/netif_receive_skb/hist
+ # trigger info: hist:keys=common_stacktrace:vals=len:sort=hitcount:size=2048 [paused]
- { stacktrace:
+ { common_stacktrace:
__netif_receive_skb_core+0x46d/0x990
__netif_receive_skb+0x18/0x60
netif_receive_skb_internal+0x23/0x90
@@ -1069,7 +1079,7 @@ Extended error information
kthread+0xd2/0xf0
ret_from_fork+0x42/0x70
} hitcount: 85 len: 28884
- { stacktrace:
+ { common_stacktrace:
__netif_receive_skb_core+0x46d/0x990
__netif_receive_skb+0x18/0x60
netif_receive_skb_internal+0x23/0x90
@@ -1087,7 +1097,7 @@ Extended error information
irq_thread+0x11f/0x150
kthread+0xd2/0xf0
} hitcount: 98 len: 664329
- { stacktrace:
+ { common_stacktrace:
__netif_receive_skb_core+0x46d/0x990
__netif_receive_skb+0x18/0x60
process_backlog+0xa8/0x150
@@ -1105,7 +1115,7 @@ Extended error information
inet_sendmsg+0x64/0xa0
sock_sendmsg+0x3d/0x50
} hitcount: 115 len: 13030
- { stacktrace:
+ { common_stacktrace:
__netif_receive_skb_core+0x46d/0x990
__netif_receive_skb+0x18/0x60
netif_receive_skb_internal+0x23/0x90
@@ -1132,14 +1142,14 @@ Extended error information
into the histogram. In order to avoid having to set everything up
again, we can just clear the histogram first::
- # echo 'hist:key=stacktrace:vals=len:clear' >> \
- /sys/kernel/debug/tracing/events/net/netif_receive_skb/trigger
+ # echo 'hist:key=common_stacktrace:vals=len:clear' >> \
+ /sys/kernel/tracing/events/net/netif_receive_skb/trigger
Just to verify that it is in fact cleared, here's what we now see in
the hist file::
- # cat /sys/kernel/debug/tracing/events/net/netif_receive_skb/hist
- # trigger info: hist:keys=stacktrace:vals=len:sort=hitcount:size=2048 [paused]
+ # cat /sys/kernel/tracing/events/net/netif_receive_skb/hist
+ # trigger info: hist:keys=common_stacktrace:vals=len:sort=hitcount:size=2048 [paused]
Totals:
Hits: 0
@@ -1153,21 +1163,21 @@ Extended error information
sched_process_exit events as such::
# echo 'enable_event:net:netif_receive_skb if filename==/usr/bin/wget' > \
- /sys/kernel/debug/tracing/events/sched/sched_process_exec/trigger
+ /sys/kernel/tracing/events/sched/sched_process_exec/trigger
# echo 'disable_event:net:netif_receive_skb if comm==wget' > \
- /sys/kernel/debug/tracing/events/sched/sched_process_exit/trigger
+ /sys/kernel/tracing/events/sched/sched_process_exit/trigger
If you read the trigger files for the sched_process_exec and
sched_process_exit triggers, you should see two triggers for each:
one enabling/disabling the hist aggregation and the other
enabling/disabling the logging of events::
- # cat /sys/kernel/debug/tracing/events/sched/sched_process_exec/trigger
+ # cat /sys/kernel/tracing/events/sched/sched_process_exec/trigger
enable_event:net:netif_receive_skb:unlimited if filename==/usr/bin/wget
enable_hist:net:netif_receive_skb:unlimited if filename==/usr/bin/wget
- # cat /sys/kernel/debug/tracing/events/sched/sched_process_exit/trigger
+ # cat /sys/kernel/tracing/events/sched/sched_process_exit/trigger
enable_event:net:netif_receive_skb:unlimited if comm==wget
disable_hist:net:netif_receive_skb:unlimited if comm==wget
@@ -1183,7 +1193,7 @@ Extended error information
saw in the last run, but this time you should also see the
individual events in the trace file::
- # cat /sys/kernel/debug/tracing/trace
+ # cat /sys/kernel/tracing/trace
# tracer: nop
#
@@ -1217,15 +1227,15 @@ Extended error information
other things::
# echo 'hist:keys=skbaddr.hex:vals=len if len < 0' >> \
- /sys/kernel/debug/tracing/events/net/netif_receive_skb/trigger
+ /sys/kernel/tracing/events/net/netif_receive_skb/trigger
# echo 'hist:keys=skbaddr.hex:vals=len if len > 4096' >> \
- /sys/kernel/debug/tracing/events/net/netif_receive_skb/trigger
+ /sys/kernel/tracing/events/net/netif_receive_skb/trigger
# echo 'hist:keys=skbaddr.hex:vals=len if len == 256' >> \
- /sys/kernel/debug/tracing/events/net/netif_receive_skb/trigger
+ /sys/kernel/tracing/events/net/netif_receive_skb/trigger
# echo 'hist:keys=skbaddr.hex:vals=len' >> \
- /sys/kernel/debug/tracing/events/net/netif_receive_skb/trigger
+ /sys/kernel/tracing/events/net/netif_receive_skb/trigger
# echo 'hist:keys=len:vals=common_preempt_count' >> \
- /sys/kernel/debug/tracing/events/net/netif_receive_skb/trigger
+ /sys/kernel/tracing/events/net/netif_receive_skb/trigger
The above set of commands create four triggers differing only in
their filters, along with a completely different though fairly
@@ -1237,7 +1247,7 @@ Extended error information
Displaying the contents of the 'hist' file for the event shows the
contents of all five histograms::
- # cat /sys/kernel/debug/tracing/events/net/netif_receive_skb/hist
+ # cat /sys/kernel/tracing/events/net/netif_receive_skb/hist
# event histogram
#
@@ -1358,15 +1368,15 @@ Extended error information
field in the shared 'foo' histogram data::
# echo 'hist:name=foo:keys=skbaddr.hex:vals=len' > \
- /sys/kernel/debug/tracing/events/net/netif_receive_skb/trigger
+ /sys/kernel/tracing/events/net/netif_receive_skb/trigger
# echo 'hist:name=foo:keys=skbaddr.hex:vals=len' > \
- /sys/kernel/debug/tracing/events/net/netif_rx/trigger
+ /sys/kernel/tracing/events/net/netif_rx/trigger
You can see that they're updating common histogram data by reading
each event's hist files at the same time::
- # cat /sys/kernel/debug/tracing/events/net/netif_receive_skb/hist;
- cat /sys/kernel/debug/tracing/events/net/netif_rx/hist
+ # cat /sys/kernel/tracing/events/net/netif_receive_skb/hist;
+ cat /sys/kernel/tracing/events/net/netif_rx/hist
# event histogram
#
@@ -1475,32 +1485,32 @@ Extended error information
And here's an example that shows how to combine histogram data from
any two events even if they don't share any 'compatible' fields
- other than 'hitcount' and 'stacktrace'. These commands create a
+ other than 'hitcount' and 'common_stacktrace'. These commands create a
couple of triggers named 'bar' using those fields::
- # echo 'hist:name=bar:key=stacktrace:val=hitcount' > \
- /sys/kernel/debug/tracing/events/sched/sched_process_fork/trigger
- # echo 'hist:name=bar:key=stacktrace:val=hitcount' > \
- /sys/kernel/debug/tracing/events/net/netif_rx/trigger
+ # echo 'hist:name=bar:key=common_stacktrace:val=hitcount' > \
+ /sys/kernel/tracing/events/sched/sched_process_fork/trigger
+ # echo 'hist:name=bar:key=common_stacktrace:val=hitcount' > \
+ /sys/kernel/tracing/events/net/netif_rx/trigger
And displaying the output of either shows some interesting if
somewhat confusing output::
- # cat /sys/kernel/debug/tracing/events/sched/sched_process_fork/hist
- # cat /sys/kernel/debug/tracing/events/net/netif_rx/hist
+ # cat /sys/kernel/tracing/events/sched/sched_process_fork/hist
+ # cat /sys/kernel/tracing/events/net/netif_rx/hist
# event histogram
#
- # trigger info: hist:name=bar:keys=stacktrace:vals=hitcount:sort=hitcount:size=2048 [active]
+ # trigger info: hist:name=bar:keys=common_stacktrace:vals=hitcount:sort=hitcount:size=2048 [active]
#
- { stacktrace:
- _do_fork+0x18e/0x330
+ { common_stacktrace:
+ kernel_clone+0x18e/0x330
kernel_thread+0x29/0x30
kthreadd+0x154/0x1b0
ret_from_fork+0x3f/0x70
} hitcount: 1
- { stacktrace:
+ { common_stacktrace:
netif_rx_internal+0xb2/0xd0
netif_rx_ni+0x20/0x70
dev_loopback_xmit+0xaa/0xd0
@@ -1518,7 +1528,7 @@ Extended error information
call_cpuidle+0x3b/0x60
cpu_startup_entry+0x22d/0x310
} hitcount: 1
- { stacktrace:
+ { common_stacktrace:
netif_rx_internal+0xb2/0xd0
netif_rx_ni+0x20/0x70
dev_loopback_xmit+0xaa/0xd0
@@ -1533,7 +1543,7 @@ Extended error information
SyS_sendto+0xe/0x10
entry_SYSCALL_64_fastpath+0x12/0x6a
} hitcount: 2
- { stacktrace:
+ { common_stacktrace:
netif_rx_internal+0xb2/0xd0
netif_rx+0x1c/0x60
loopback_xmit+0x6c/0xb0
@@ -1551,7 +1561,7 @@ Extended error information
sock_sendmsg+0x38/0x50
___sys_sendmsg+0x14e/0x270
} hitcount: 76
- { stacktrace:
+ { common_stacktrace:
netif_rx_internal+0xb2/0xd0
netif_rx+0x1c/0x60
loopback_xmit+0x6c/0xb0
@@ -1569,7 +1579,7 @@ Extended error information
sock_sendmsg+0x38/0x50
___sys_sendmsg+0x269/0x270
} hitcount: 77
- { stacktrace:
+ { common_stacktrace:
netif_rx_internal+0xb2/0xd0
netif_rx+0x1c/0x60
loopback_xmit+0x6c/0xb0
@@ -1587,8 +1597,8 @@ Extended error information
sock_sendmsg+0x38/0x50
SYSC_sendto+0xef/0x170
} hitcount: 88
- { stacktrace:
- _do_fork+0x18e/0x330
+ { common_stacktrace:
+ kernel_clone+0x18e/0x330
SyS_clone+0x19/0x20
entry_SYSCALL_64_fastpath+0x12/0x6a
} hitcount: 244
@@ -1762,6 +1772,23 @@ using the same key and variable from yet another event::
# echo 'hist:key=pid:wakeupswitch_lat=$wakeup_lat+$switchtime_lat ...' >> event3/trigger
+Expressions support the use of addition, subtraction, multiplication and
+division operators (+-\*/).
+
+Note if division by zero cannot be detected at parse time (i.e. the
+divisor is not a constant), the result will be -1.
+
+Numeric constants can also be used directly in an expression::
+
+ # echo 'hist:keys=next_pid:timestamp_secs=common_timestamp/1000000 ...' >> event/trigger
+
+or assigned to a variable and referenced in a subsequent expression::
+
+ # echo 'hist:keys=next_pid:us_per_sec=1000000 ...' >> event/trigger
+ # echo 'hist:keys=next_pid:timestamp_secs=common_timestamp/$us_per_sec ...' >> event/trigger
+
+Variables can even hold stacktraces, which are useful with synthetic events.
+
2.2.2 Synthetic Events
----------------------
@@ -1776,6 +1803,24 @@ consisting of the name of the new event along with one or more
variables and their types, which can be any valid field type,
separated by semicolons, to the tracing/synthetic_events file.
+See synth_field_size() for available types.
+
+If field_name contains [n], the field is considered to be a static array.
+
+If field_names contains[] (no subscript), the field is considered to
+be a dynamic array, which will only take as much space in the event as
+is required to hold the array.
+
+A string field can be specified using either the static notation:
+
+ char name[32];
+
+Or the dynamic:
+
+ char name[];
+
+The size limit for either is 256.
+
For instance, the following creates a new event named 'wakeup_latency'
with 3 fields: lat, pid, and prio. Each of those fields is simply a
variable reference to a variable on another event::
@@ -1784,19 +1829,19 @@ variable reference to a variable on another event::
u64 lat; \
pid_t pid; \
int prio' >> \
- /sys/kernel/debug/tracing/synthetic_events
+ /sys/kernel/tracing/synthetic_events
Reading the tracing/synthetic_events file lists all the currently
defined synthetic events, in this case the event defined above::
- # cat /sys/kernel/debug/tracing/synthetic_events
+ # cat /sys/kernel/tracing/synthetic_events
wakeup_latency u64 lat; pid_t pid; int prio
An existing synthetic event definition can be removed by prepending
the command that defined it with a '!'::
# echo '!wakeup_latency u64 lat pid_t pid int prio' >> \
- /sys/kernel/debug/tracing/synthetic_events
+ /sys/kernel/tracing/synthetic_events
At this point, there isn't yet an actual 'wakeup_latency' event
instantiated in the event subsystem - for this to happen, a 'hist
@@ -1805,19 +1850,249 @@ and variables defined on other events (see Section 2.2.3 below on
how that is done using hist trigger 'onmatch' action). Once that is
done, the 'wakeup_latency' synthetic event instance is created.
-A histogram can now be defined for the new synthetic event::
-
- # echo 'hist:keys=pid,prio,lat.log2:sort=pid,lat' >> \
- /sys/kernel/debug/tracing/events/synthetic/wakeup_latency/trigger
-
The new event is created under the tracing/events/synthetic/ directory
and looks and behaves just like any other event::
- # ls /sys/kernel/debug/tracing/events/synthetic/wakeup_latency
+ # ls /sys/kernel/tracing/events/synthetic/wakeup_latency
enable filter format hist id trigger
+A histogram can now be defined for the new synthetic event::
+
+ # echo 'hist:keys=pid,prio,lat.log2:sort=lat' >> \
+ /sys/kernel/tracing/events/synthetic/wakeup_latency/trigger
+
+The above shows the latency "lat" in a power of 2 grouping.
+
Like any other event, once a histogram is enabled for the event, the
-output can be displayed by reading the event's 'hist' file.
+output can be displayed by reading the event's 'hist' file::
+
+ # cat /sys/kernel/tracing/events/synthetic/wakeup_latency/hist
+
+ # event histogram
+ #
+ # trigger info: hist:keys=pid,prio,lat.log2:vals=hitcount:sort=lat.log2:size=2048 [active]
+ #
+
+ { pid: 2035, prio: 9, lat: ~ 2^2 } hitcount: 43
+ { pid: 2034, prio: 9, lat: ~ 2^2 } hitcount: 60
+ { pid: 2029, prio: 9, lat: ~ 2^2 } hitcount: 965
+ { pid: 2034, prio: 120, lat: ~ 2^2 } hitcount: 9
+ { pid: 2033, prio: 120, lat: ~ 2^2 } hitcount: 5
+ { pid: 2030, prio: 9, lat: ~ 2^2 } hitcount: 335
+ { pid: 2030, prio: 120, lat: ~ 2^2 } hitcount: 10
+ { pid: 2032, prio: 120, lat: ~ 2^2 } hitcount: 1
+ { pid: 2035, prio: 120, lat: ~ 2^2 } hitcount: 2
+ { pid: 2031, prio: 9, lat: ~ 2^2 } hitcount: 176
+ { pid: 2028, prio: 120, lat: ~ 2^2 } hitcount: 15
+ { pid: 2033, prio: 9, lat: ~ 2^2 } hitcount: 91
+ { pid: 2032, prio: 9, lat: ~ 2^2 } hitcount: 125
+ { pid: 2029, prio: 120, lat: ~ 2^2 } hitcount: 4
+ { pid: 2031, prio: 120, lat: ~ 2^2 } hitcount: 3
+ { pid: 2029, prio: 120, lat: ~ 2^3 } hitcount: 2
+ { pid: 2035, prio: 9, lat: ~ 2^3 } hitcount: 41
+ { pid: 2030, prio: 120, lat: ~ 2^3 } hitcount: 1
+ { pid: 2032, prio: 9, lat: ~ 2^3 } hitcount: 32
+ { pid: 2031, prio: 9, lat: ~ 2^3 } hitcount: 44
+ { pid: 2034, prio: 9, lat: ~ 2^3 } hitcount: 40
+ { pid: 2030, prio: 9, lat: ~ 2^3 } hitcount: 29
+ { pid: 2033, prio: 9, lat: ~ 2^3 } hitcount: 31
+ { pid: 2029, prio: 9, lat: ~ 2^3 } hitcount: 31
+ { pid: 2028, prio: 120, lat: ~ 2^3 } hitcount: 18
+ { pid: 2031, prio: 120, lat: ~ 2^3 } hitcount: 2
+ { pid: 2028, prio: 120, lat: ~ 2^4 } hitcount: 1
+ { pid: 2029, prio: 9, lat: ~ 2^4 } hitcount: 4
+ { pid: 2031, prio: 120, lat: ~ 2^7 } hitcount: 1
+ { pid: 2032, prio: 120, lat: ~ 2^7 } hitcount: 1
+
+ Totals:
+ Hits: 2122
+ Entries: 30
+ Dropped: 0
+
+
+The latency values can also be grouped linearly by a given size with
+the ".buckets" modifier and specify a size (in this case groups of 10)::
+
+ # echo 'hist:keys=pid,prio,lat.buckets=10:sort=lat' >> \
+ /sys/kernel/tracing/events/synthetic/wakeup_latency/trigger
+
+ # event histogram
+ #
+ # trigger info: hist:keys=pid,prio,lat.buckets=10:vals=hitcount:sort=lat.buckets=10:size=2048 [active]
+ #
+
+ { pid: 2067, prio: 9, lat: ~ 0-9 } hitcount: 220
+ { pid: 2068, prio: 9, lat: ~ 0-9 } hitcount: 157
+ { pid: 2070, prio: 9, lat: ~ 0-9 } hitcount: 100
+ { pid: 2067, prio: 120, lat: ~ 0-9 } hitcount: 6
+ { pid: 2065, prio: 120, lat: ~ 0-9 } hitcount: 2
+ { pid: 2066, prio: 120, lat: ~ 0-9 } hitcount: 2
+ { pid: 2069, prio: 9, lat: ~ 0-9 } hitcount: 122
+ { pid: 2069, prio: 120, lat: ~ 0-9 } hitcount: 8
+ { pid: 2070, prio: 120, lat: ~ 0-9 } hitcount: 1
+ { pid: 2068, prio: 120, lat: ~ 0-9 } hitcount: 7
+ { pid: 2066, prio: 9, lat: ~ 0-9 } hitcount: 365
+ { pid: 2064, prio: 120, lat: ~ 0-9 } hitcount: 35
+ { pid: 2065, prio: 9, lat: ~ 0-9 } hitcount: 998
+ { pid: 2071, prio: 9, lat: ~ 0-9 } hitcount: 85
+ { pid: 2065, prio: 9, lat: ~ 10-19 } hitcount: 2
+ { pid: 2064, prio: 120, lat: ~ 10-19 } hitcount: 2
+
+ Totals:
+ Hits: 2112
+ Entries: 16
+ Dropped: 0
+
+To save stacktraces, create a synthetic event with a field of type "unsigned long[]"
+or even just "long[]". For example, to see how long a task is blocked in an
+uninterruptible state::
+
+ # cd /sys/kernel/tracing
+ # echo 's:block_lat pid_t pid; u64 delta; unsigned long[] stack;' > dynamic_events
+ # echo 'hist:keys=next_pid:ts=common_timestamp.usecs,st=common_stacktrace if prev_state == 2' >> events/sched/sched_switch/trigger
+ # echo 'hist:keys=prev_pid:delta=common_timestamp.usecs-$ts,s=$st:onmax($delta).trace(block_lat,prev_pid,$delta,$s)' >> events/sched/sched_switch/trigger
+ # echo 1 > events/synthetic/block_lat/enable
+ # cat trace
+
+ # tracer: nop
+ #
+ # entries-in-buffer/entries-written: 2/2 #P:8
+ #
+ # _-----=> irqs-off/BH-disabled
+ # / _----=> need-resched
+ # | / _---=> hardirq/softirq
+ # || / _--=> preempt-depth
+ # ||| / _-=> migrate-disable
+ # |||| / delay
+ # TASK-PID CPU# ||||| TIMESTAMP FUNCTION
+ # | | | ||||| | |
+ <idle>-0 [005] d..4. 521.164922: block_lat: pid=0 delta=8322 stack=STACK:
+ => __schedule+0x448/0x7b0
+ => schedule+0x5a/0xb0
+ => io_schedule+0x42/0x70
+ => bit_wait_io+0xd/0x60
+ => __wait_on_bit+0x4b/0x140
+ => out_of_line_wait_on_bit+0x91/0xb0
+ => jbd2_journal_commit_transaction+0x1679/0x1a70
+ => kjournald2+0xa9/0x280
+ => kthread+0xe9/0x110
+ => ret_from_fork+0x2c/0x50
+
+ <...>-2 [004] d..4. 525.184257: block_lat: pid=2 delta=76 stack=STACK:
+ => __schedule+0x448/0x7b0
+ => schedule+0x5a/0xb0
+ => schedule_timeout+0x11a/0x150
+ => wait_for_completion_killable+0x144/0x1f0
+ => __kthread_create_on_node+0xe7/0x1e0
+ => kthread_create_on_node+0x51/0x70
+ => create_worker+0xcc/0x1a0
+ => worker_thread+0x2ad/0x380
+ => kthread+0xe9/0x110
+ => ret_from_fork+0x2c/0x50
+
+A synthetic event that has a stacktrace field may use it as a key in
+histogram::
+
+ # echo 'hist:keys=delta.buckets=100,stack.stacktrace:sort=delta' > events/synthetic/block_lat/trigger
+ # cat events/synthetic/block_lat/hist
+
+ # event histogram
+ #
+ # trigger info: hist:keys=delta.buckets=100,stack.stacktrace:vals=hitcount:sort=delta.buckets=100:size=2048 [active]
+ #
+ { delta: ~ 0-99, stack.stacktrace __schedule+0xa19/0x1520
+ schedule+0x6b/0x110
+ io_schedule+0x46/0x80
+ bit_wait_io+0x11/0x80
+ __wait_on_bit+0x4e/0x120
+ out_of_line_wait_on_bit+0x8d/0xb0
+ __wait_on_buffer+0x33/0x40
+ jbd2_journal_commit_transaction+0x155a/0x19b0
+ kjournald2+0xab/0x270
+ kthread+0xfa/0x130
+ ret_from_fork+0x29/0x50
+ } hitcount: 1
+ { delta: ~ 0-99, stack.stacktrace __schedule+0xa19/0x1520
+ schedule+0x6b/0x110
+ io_schedule+0x46/0x80
+ rq_qos_wait+0xd0/0x170
+ wbt_wait+0x9e/0xf0
+ __rq_qos_throttle+0x25/0x40
+ blk_mq_submit_bio+0x2c3/0x5b0
+ __submit_bio+0xff/0x190
+ submit_bio_noacct_nocheck+0x25b/0x2b0
+ submit_bio_noacct+0x20b/0x600
+ submit_bio+0x28/0x90
+ ext4_bio_write_page+0x1e0/0x8c0
+ mpage_submit_page+0x60/0x80
+ mpage_process_page_bufs+0x16c/0x180
+ mpage_prepare_extent_to_map+0x23f/0x530
+ } hitcount: 1
+ { delta: ~ 0-99, stack.stacktrace __schedule+0xa19/0x1520
+ schedule+0x6b/0x110
+ schedule_hrtimeout_range_clock+0x97/0x110
+ schedule_hrtimeout_range+0x13/0x20
+ usleep_range_state+0x65/0x90
+ __intel_wait_for_register+0x1c1/0x230 [i915]
+ intel_psr_wait_for_idle_locked+0x171/0x2a0 [i915]
+ intel_pipe_update_start+0x169/0x360 [i915]
+ intel_update_crtc+0x112/0x490 [i915]
+ skl_commit_modeset_enables+0x199/0x600 [i915]
+ intel_atomic_commit_tail+0x7c4/0x1080 [i915]
+ intel_atomic_commit_work+0x12/0x20 [i915]
+ process_one_work+0x21c/0x3f0
+ worker_thread+0x50/0x3e0
+ kthread+0xfa/0x130
+ } hitcount: 3
+ { delta: ~ 0-99, stack.stacktrace __schedule+0xa19/0x1520
+ schedule+0x6b/0x110
+ schedule_timeout+0x11e/0x160
+ __wait_for_common+0x8f/0x190
+ wait_for_completion+0x24/0x30
+ __flush_work.isra.0+0x1cc/0x360
+ flush_work+0xe/0x20
+ drm_mode_rmfb+0x18b/0x1d0 [drm]
+ drm_mode_rmfb_ioctl+0x10/0x20 [drm]
+ drm_ioctl_kernel+0xb8/0x150 [drm]
+ drm_ioctl+0x243/0x560 [drm]
+ __x64_sys_ioctl+0x92/0xd0
+ do_syscall_64+0x59/0x90
+ entry_SYSCALL_64_after_hwframe+0x72/0xdc
+ } hitcount: 1
+ { delta: ~ 0-99, stack.stacktrace __schedule+0xa19/0x1520
+ schedule+0x6b/0x110
+ schedule_timeout+0x87/0x160
+ __wait_for_common+0x8f/0x190
+ wait_for_completion_timeout+0x1d/0x30
+ drm_atomic_helper_wait_for_flip_done+0x57/0x90 [drm_kms_helper]
+ intel_atomic_commit_tail+0x8ce/0x1080 [i915]
+ intel_atomic_commit_work+0x12/0x20 [i915]
+ process_one_work+0x21c/0x3f0
+ worker_thread+0x50/0x3e0
+ kthread+0xfa/0x130
+ ret_from_fork+0x29/0x50
+ } hitcount: 1
+ { delta: ~ 100-199, stack.stacktrace __schedule+0xa19/0x1520
+ schedule+0x6b/0x110
+ schedule_hrtimeout_range_clock+0x97/0x110
+ schedule_hrtimeout_range+0x13/0x20
+ usleep_range_state+0x65/0x90
+ pci_set_low_power_state+0x17f/0x1f0
+ pci_set_power_state+0x49/0x250
+ pci_finish_runtime_suspend+0x4a/0x90
+ pci_pm_runtime_suspend+0xcb/0x1b0
+ __rpm_callback+0x48/0x120
+ rpm_callback+0x67/0x70
+ rpm_suspend+0x167/0x780
+ rpm_idle+0x25a/0x380
+ pm_runtime_work+0x93/0xc0
+ process_one_work+0x21c/0x3f0
+ } hitcount: 1
+
+ Totals:
+ Hits: 10
+ Entries: 7
+ Dropped: 0
2.2.3 Hist trigger 'handlers' and 'actions'
-------------------------------------------
@@ -1918,9 +2193,9 @@ The following commonly-used handler.action pairs are available:
event::
# echo 'wakeup_new_test pid_t pid' >> \
- /sys/kernel/debug/tracing/synthetic_events
+ /sys/kernel/tracing/synthetic_events
- # cat /sys/kernel/debug/tracing/synthetic_events
+ # cat /sys/kernel/tracing/synthetic_events
wakeup_new_test pid_t pid
The following hist trigger both defines the missing testpid
@@ -1931,26 +2206,26 @@ The following commonly-used handler.action pairs are available:
# echo 'hist:keys=$testpid:testpid=pid:onmatch(sched.sched_wakeup_new).\
wakeup_new_test($testpid) if comm=="cyclictest"' >> \
- /sys/kernel/debug/tracing/events/sched/sched_wakeup_new/trigger
+ /sys/kernel/tracing/events/sched/sched_wakeup_new/trigger
- Or, equivalently, using the 'trace' keyword syntax:
+ Or, equivalently, using the 'trace' keyword syntax::
- # echo 'hist:keys=$testpid:testpid=pid:onmatch(sched.sched_wakeup_new).\
- trace(wakeup_new_test,$testpid) if comm=="cyclictest"' >> \
- /sys/kernel/debug/tracing/events/sched/sched_wakeup_new/trigger
+ # echo 'hist:keys=$testpid:testpid=pid:onmatch(sched.sched_wakeup_new).\
+ trace(wakeup_new_test,$testpid) if comm=="cyclictest"' >> \
+ /sys/kernel/tracing/events/sched/sched_wakeup_new/trigger
Creating and displaying a histogram based on those events is now
just a matter of using the fields and new synthetic event in the
tracing/events/synthetic directory, as usual::
# echo 'hist:keys=pid:sort=pid' >> \
- /sys/kernel/debug/tracing/events/synthetic/wakeup_new_test/trigger
+ /sys/kernel/tracing/events/synthetic/wakeup_new_test/trigger
Running 'cyclictest' should cause wakeup_new events to generate
wakeup_new_test synthetic events which should result in histogram
output in the wakeup_new_test event's hist file::
- # cat /sys/kernel/debug/tracing/events/synthetic/wakeup_new_test/hist
+ # cat /sys/kernel/tracing/events/synthetic/wakeup_new_test/hist
A more typical usage would be to use two events to calculate a
latency. The following example uses a set of hist triggers to
@@ -1959,14 +2234,14 @@ The following commonly-used handler.action pairs are available:
First, we define a 'wakeup_latency' synthetic event::
# echo 'wakeup_latency u64 lat; pid_t pid; int prio' >> \
- /sys/kernel/debug/tracing/synthetic_events
+ /sys/kernel/tracing/synthetic_events
Next, we specify that whenever we see a sched_waking event for a
cyclictest thread, save the timestamp in a 'ts0' variable::
# echo 'hist:keys=$saved_pid:saved_pid=pid:ts0=common_timestamp.usecs \
if comm=="cyclictest"' >> \
- /sys/kernel/debug/tracing/events/sched/sched_waking/trigger
+ /sys/kernel/tracing/events/sched/sched_waking/trigger
Then, when the corresponding thread is actually scheduled onto the
CPU by a sched_switch event (saved_pid matches next_pid), calculate
@@ -1976,19 +2251,19 @@ The following commonly-used handler.action pairs are available:
# echo 'hist:keys=next_pid:wakeup_lat=common_timestamp.usecs-$ts0:\
onmatch(sched.sched_waking).wakeup_latency($wakeup_lat,\
$saved_pid,next_prio) if next_comm=="cyclictest"' >> \
- /sys/kernel/debug/tracing/events/sched/sched_switch/trigger
+ /sys/kernel/tracing/events/sched/sched_switch/trigger
We also need to create a histogram on the wakeup_latency synthetic
event in order to aggregate the generated synthetic event data::
# echo 'hist:keys=pid,prio,lat:sort=pid,lat' >> \
- /sys/kernel/debug/tracing/events/synthetic/wakeup_latency/trigger
+ /sys/kernel/tracing/events/synthetic/wakeup_latency/trigger
Finally, once we've run cyclictest to actually generate some
events, we can see the output by looking at the wakeup_latency
synthetic event's hist file::
- # cat /sys/kernel/debug/tracing/events/synthetic/wakeup_latency/hist
+ # cat /sys/kernel/tracing/events/synthetic/wakeup_latency/hist
- onmax(var).save(field,.. .)
@@ -2014,19 +2289,19 @@ The following commonly-used handler.action pairs are available:
# echo 'hist:keys=pid:ts0=common_timestamp.usecs \
if comm=="cyclictest"' >> \
- /sys/kernel/debug/tracing/events/sched/sched_waking/trigger
+ /sys/kernel/tracing/events/sched/sched_waking/trigger
# echo 'hist:keys=next_pid:\
wakeup_lat=common_timestamp.usecs-$ts0:\
onmax($wakeup_lat).save(next_comm,prev_pid,prev_prio,prev_comm) \
if next_comm=="cyclictest"' >> \
- /sys/kernel/debug/tracing/events/sched/sched_switch/trigger
+ /sys/kernel/tracing/events/sched/sched_switch/trigger
When the histogram is displayed, the max value and the saved
values corresponding to the max are displayed following the rest
of the fields::
- # cat /sys/kernel/debug/tracing/events/sched/sched_switch/hist
+ # cat /sys/kernel/tracing/events/sched/sched_switch/hist
{ next_pid: 2255 } hitcount: 239
common_timestamp-ts0: 0
max: 27
@@ -2070,48 +2345,48 @@ The following commonly-used handler.action pairs are available:
resulting latency, stored in wakeup_lat, exceeds the current
maximum latency, a snapshot is taken. As part of the setup, all
the scheduler events are also enabled, which are the events that
- will show up in the snapshot when it is taken at some point:
+ will show up in the snapshot when it is taken at some point::
- # echo 1 > /sys/kernel/debug/tracing/events/sched/enable
+ # echo 1 > /sys/kernel/tracing/events/sched/enable
- # echo 'hist:keys=pid:ts0=common_timestamp.usecs \
- if comm=="cyclictest"' >> \
- /sys/kernel/debug/tracing/events/sched/sched_waking/trigger
+ # echo 'hist:keys=pid:ts0=common_timestamp.usecs \
+ if comm=="cyclictest"' >> \
+ /sys/kernel/tracing/events/sched/sched_waking/trigger
- # echo 'hist:keys=next_pid:wakeup_lat=common_timestamp.usecs-$ts0: \
- onmax($wakeup_lat).save(next_prio,next_comm,prev_pid,prev_prio, \
- prev_comm):onmax($wakeup_lat).snapshot() \
- if next_comm=="cyclictest"' >> \
- /sys/kernel/debug/tracing/events/sched/sched_switch/trigger
+ # echo 'hist:keys=next_pid:wakeup_lat=common_timestamp.usecs-$ts0: \
+ onmax($wakeup_lat).save(next_prio,next_comm,prev_pid,prev_prio, \
+ prev_comm):onmax($wakeup_lat).snapshot() \
+ if next_comm=="cyclictest"' >> \
+ /sys/kernel/tracing/events/sched/sched_switch/trigger
When the histogram is displayed, for each bucket the max value
and the saved values corresponding to the max are displayed
following the rest of the fields.
If a snapshot was taken, there is also a message indicating that,
- along with the value and event that triggered the global maximum:
+ along with the value and event that triggered the global maximum::
- # cat /sys/kernel/debug/tracing/events/sched/sched_switch/hist
- { next_pid: 2101 } hitcount: 200
- max: 52 next_prio: 120 next_comm: cyclictest \
- prev_pid: 0 prev_prio: 120 prev_comm: swapper/6
+ # cat /sys/kernel/tracing/events/sched/sched_switch/hist
+ { next_pid: 2101 } hitcount: 200
+ max: 52 next_prio: 120 next_comm: cyclictest \
+ prev_pid: 0 prev_prio: 120 prev_comm: swapper/6
- { next_pid: 2103 } hitcount: 1326
- max: 572 next_prio: 19 next_comm: cyclictest \
- prev_pid: 0 prev_prio: 120 prev_comm: swapper/1
+ { next_pid: 2103 } hitcount: 1326
+ max: 572 next_prio: 19 next_comm: cyclictest \
+ prev_pid: 0 prev_prio: 120 prev_comm: swapper/1
- { next_pid: 2102 } hitcount: 1982 \
- max: 74 next_prio: 19 next_comm: cyclictest \
- prev_pid: 0 prev_prio: 120 prev_comm: swapper/5
+ { next_pid: 2102 } hitcount: 1982 \
+ max: 74 next_prio: 19 next_comm: cyclictest \
+ prev_pid: 0 prev_prio: 120 prev_comm: swapper/5
- Snapshot taken (see tracing/snapshot). Details:
- triggering value { onmax($wakeup_lat) }: 572 \
- triggered by event with key: { next_pid: 2103 }
+ Snapshot taken (see tracing/snapshot). Details:
+ triggering value { onmax($wakeup_lat) }: 572 \
+ triggered by event with key: { next_pid: 2103 }
- Totals:
- Hits: 3508
- Entries: 3
- Dropped: 0
+ Totals:
+ Hits: 3508
+ Entries: 3
+ Dropped: 0
In the above case, the event that triggered the global maximum has
the key with next_pid == 2103. If you look at the bucket that has
@@ -2126,7 +2401,7 @@ The following commonly-used handler.action pairs are available:
sched_switch events, which should match the time displayed in the
global maximum)::
- # cat /sys/kernel/debug/tracing/snapshot
+ # cat /sys/kernel/tracing/snapshot
<...>-2103 [005] d..3 309.873125: sched_switch: prev_comm=cyclictest prev_pid=2103 prev_prio=19 prev_state=D ==> next_comm=swapper/5 next_pid=0 next_prio=120
<idle>-0 [005] d.h3 309.873611: sched_waking: comm=cyclictest pid=2102 prio=19 target_cpu=005
@@ -2189,15 +2464,15 @@ The following commonly-used handler.action pairs are available:
$cwnd variable. If the value has changed, a snapshot is taken.
As part of the setup, all the scheduler and tcp events are also
enabled, which are the events that will show up in the snapshot
- when it is taken at some point:
+ when it is taken at some point::
- # echo 1 > /sys/kernel/debug/tracing/events/sched/enable
- # echo 1 > /sys/kernel/debug/tracing/events/tcp/enable
+ # echo 1 > /sys/kernel/tracing/events/sched/enable
+ # echo 1 > /sys/kernel/tracing/events/tcp/enable
- # echo 'hist:keys=dport:cwnd=snd_cwnd: \
- onchange($cwnd).save(snd_wnd,srtt,rcv_wnd): \
- onchange($cwnd).snapshot()' >> \
- /sys/kernel/debug/tracing/events/tcp/tcp_probe/trigger
+ # echo 'hist:keys=dport:cwnd=snd_cwnd: \
+ onchange($cwnd).save(snd_wnd,srtt,rcv_wnd): \
+ onchange($cwnd).snapshot()' >> \
+ /sys/kernel/tracing/events/tcp/tcp_probe/trigger
When the histogram is displayed, for each bucket the tracked value
and the saved values corresponding to that value are displayed
@@ -2206,7 +2481,7 @@ The following commonly-used handler.action pairs are available:
If a snapshot was taken, there is also a message indicating that,
along with the value and event that triggered the snapshot::
- # cat /sys/kernel/debug/tracing/events/tcp/tcp_probe/hist
+ # cat /sys/kernel/tracing/events/tcp/tcp_probe/hist
{ dport: 1521 } hitcount: 8
changed: 10 snd_wnd: 35456 srtt: 154262 rcv_wnd: 42112
@@ -2220,10 +2495,10 @@ The following commonly-used handler.action pairs are available:
{ dport: 443 } hitcount: 211
changed: 10 snd_wnd: 26960 srtt: 17379 rcv_wnd: 28800
- Snapshot taken (see tracing/snapshot). Details::
+ Snapshot taken (see tracing/snapshot). Details:
- triggering value { onchange($cwnd) }: 10
- triggered by event with key: { dport: 80 }
+ triggering value { onchange($cwnd) }: 10
+ triggered by event with key: { dport: 80 }
Totals:
Hits: 414
@@ -2240,7 +2515,7 @@ The following commonly-used handler.action pairs are available:
And finally, looking at the snapshot data should show at or near
the end the event that triggered the snapshot::
- # cat /sys/kernel/debug/tracing/snapshot
+ # cat /sys/kernel/tracing/snapshot
gnome-shell-1261 [006] dN.3 49.823113: sched_stat_runtime: comm=gnome-shell pid=1261 runtime=49347 [ns] vruntime=1835730389 [ns]
kworker/u16:4-773 [003] d..3 49.823114: sched_switch: prev_comm=kworker/u16:4 prev_pid=773 prev_prio=120 prev_state=R+ ==> next_comm=kworker/3:2 next_pid=135 next_prio=120
diff --git a/Documentation/trace/hwlat_detector.rst b/Documentation/trace/hwlat_detector.rst
index 5739349649c8..11b749c2a8bd 100644
--- a/Documentation/trace/hwlat_detector.rst
+++ b/Documentation/trace/hwlat_detector.rst
@@ -14,7 +14,7 @@ originally written for use by the "RT" patch since the Real Time
kernel is highly latency sensitive.
SMIs are not serviced by the Linux kernel, which means that it does not
-even know that they are occuring. SMIs are instead set up by BIOS code
+even know that they are occurring. SMIs are instead set up by BIOS code
and are serviced by BIOS code, usually for "critical" events such as
management of thermal sensors and fans. Sometimes though, SMIs are used for
other tasks and those tasks can spend an inordinate amount of time in the
@@ -76,8 +76,13 @@ in /sys/kernel/tracing:
- tracing_cpumask - the CPUs to move the hwlat thread across
- hwlat_detector/width - specified amount of time to spin within window (usecs)
- hwlat_detector/window - amount of time between (width) runs (usecs)
+ - hwlat_detector/mode - the thread mode
-The hwlat detector's kernel thread will migrate across each CPU specified in
-tracing_cpumask between each window. To limit the migration, either modify
-tracing_cpumask, or modify the hwlat kernel thread (named [hwlatd]) CPU
-affinity directly, and the migration will stop.
+By default, one hwlat detector's kernel thread will migrate across each CPU
+specified in cpumask at the beginning of a new window, in a round-robin
+fashion. This behavior can be changed by changing the thread mode,
+the available options are:
+
+ - none: do not force migration
+ - round-robin: migrate across each CPU specified in cpumask [default]
+ - per-cpu: create one thread for each cpu in tracing_cpumask
diff --git a/Documentation/trace/index.rst b/Documentation/trace/index.rst
index 04acd277c5f6..5092d6c13af5 100644
--- a/Documentation/trace/index.rst
+++ b/Documentation/trace/index.rst
@@ -9,8 +9,11 @@ Linux Tracing Technologies
tracepoint-analysis
ftrace
ftrace-uses
+ fprobe
+ kprobes
kprobetrace
uprobetracer
+ fprobetrace
tracepoints
events
events-kmem
@@ -19,8 +22,16 @@ Linux Tracing Technologies
events-msr
mmiotrace
histogram
+ histogram-design
+ boottime-trace
hwlat_detector
+ osnoise-tracer
+ timerlat-tracer
intel_th
+ ring-buffer-design
stm
sys-t
coresight/index
+ user_events
+ rv/index
+ hisi-ptt
diff --git a/Documentation/trace/intel_th.rst b/Documentation/trace/intel_th.rst
index 70b7126eaeeb..b31818d5f6c5 100644
--- a/Documentation/trace/intel_th.rst
+++ b/Documentation/trace/intel_th.rst
@@ -58,7 +58,7 @@ Bus and Subdevices
For each Intel TH device in the system a bus of its own is
created and assigned an id number that reflects the order in which TH
-devices were emumerated. All TH subdevices (devices on intel_th bus)
+devices were enumerated. All TH subdevices (devices on intel_th bus)
begin with this id: 0-gth, 0-msc0, 0-msc1, 0-pti, 0-sth, which is
followed by device's name and an optional index.
diff --git a/Documentation/trace/kprobes.rst b/Documentation/trace/kprobes.rst
new file mode 100644
index 000000000000..e1636e579c9c
--- /dev/null
+++ b/Documentation/trace/kprobes.rst
@@ -0,0 +1,788 @@
+=======================
+Kernel Probes (Kprobes)
+=======================
+
+:Author: Jim Keniston <jkenisto@us.ibm.com>
+:Author: Prasanna S Panchamukhi <prasanna.panchamukhi@gmail.com>
+:Author: Masami Hiramatsu <mhiramat@kernel.org>
+
+.. CONTENTS
+
+ 1. Concepts: Kprobes, and Return Probes
+ 2. Architectures Supported
+ 3. Configuring Kprobes
+ 4. API Reference
+ 5. Kprobes Features and Limitations
+ 6. Probe Overhead
+ 7. TODO
+ 8. Kprobes Example
+ 9. Kretprobes Example
+ 10. Deprecated Features
+ Appendix A: The kprobes debugfs interface
+ Appendix B: The kprobes sysctl interface
+ Appendix C: References
+
+Concepts: Kprobes and Return Probes
+=========================================
+
+Kprobes enables you to dynamically break into any kernel routine and
+collect debugging and performance information non-disruptively. You
+can trap at almost any kernel code address [1]_, specifying a handler
+routine to be invoked when the breakpoint is hit.
+
+.. [1] some parts of the kernel code can not be trapped, see
+ :ref:`kprobes_blacklist`)
+
+There are currently two types of probes: kprobes, and kretprobes
+(also called return probes). A kprobe can be inserted on virtually
+any instruction in the kernel. A return probe fires when a specified
+function returns.
+
+In the typical case, Kprobes-based instrumentation is packaged as
+a kernel module. The module's init function installs ("registers")
+one or more probes, and the exit function unregisters them. A
+registration function such as register_kprobe() specifies where
+the probe is to be inserted and what handler is to be called when
+the probe is hit.
+
+There are also ``register_/unregister_*probes()`` functions for batch
+registration/unregistration of a group of ``*probes``. These functions
+can speed up unregistration process when you have to unregister
+a lot of probes at once.
+
+The next four subsections explain how the different types of
+probes work and how jump optimization works. They explain certain
+things that you'll need to know in order to make the best use of
+Kprobes -- e.g., the difference between a pre_handler and
+a post_handler, and how to use the maxactive and nmissed fields of
+a kretprobe. But if you're in a hurry to start using Kprobes, you
+can skip ahead to :ref:`kprobes_archs_supported`.
+
+How Does a Kprobe Work?
+-----------------------
+
+When a kprobe is registered, Kprobes makes a copy of the probed
+instruction and replaces the first byte(s) of the probed instruction
+with a breakpoint instruction (e.g., int3 on i386 and x86_64).
+
+When a CPU hits the breakpoint instruction, a trap occurs, the CPU's
+registers are saved, and control passes to Kprobes via the
+notifier_call_chain mechanism. Kprobes executes the "pre_handler"
+associated with the kprobe, passing the handler the addresses of the
+kprobe struct and the saved registers.
+
+Next, Kprobes single-steps its copy of the probed instruction.
+(It would be simpler to single-step the actual instruction in place,
+but then Kprobes would have to temporarily remove the breakpoint
+instruction. This would open a small time window when another CPU
+could sail right past the probepoint.)
+
+After the instruction is single-stepped, Kprobes executes the
+"post_handler," if any, that is associated with the kprobe.
+Execution then continues with the instruction following the probepoint.
+
+Changing Execution Path
+-----------------------
+
+Since kprobes can probe into a running kernel code, it can change the
+register set, including instruction pointer. This operation requires
+maximum care, such as keeping the stack frame, recovering the execution
+path etc. Since it operates on a running kernel and needs deep knowledge
+of computer architecture and concurrent computing, you can easily shoot
+your foot.
+
+If you change the instruction pointer (and set up other related
+registers) in pre_handler, you must return !0 so that kprobes stops
+single stepping and just returns to the given address.
+This also means post_handler should not be called anymore.
+
+Note that this operation may be harder on some architectures which use
+TOC (Table of Contents) for function call, since you have to setup a new
+TOC for your function in your module, and recover the old one after
+returning from it.
+
+Return Probes
+-------------
+
+How Does a Return Probe Work?
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+When you call register_kretprobe(), Kprobes establishes a kprobe at
+the entry to the function. When the probed function is called and this
+probe is hit, Kprobes saves a copy of the return address, and replaces
+the return address with the address of a "trampoline." The trampoline
+is an arbitrary piece of code -- typically just a nop instruction.
+At boot time, Kprobes registers a kprobe at the trampoline.
+
+When the probed function executes its return instruction, control
+passes to the trampoline and that probe is hit. Kprobes' trampoline
+handler calls the user-specified return handler associated with the
+kretprobe, then sets the saved instruction pointer to the saved return
+address, and that's where execution resumes upon return from the trap.
+
+While the probed function is executing, its return address is
+stored in an object of type kretprobe_instance. Before calling
+register_kretprobe(), the user sets the maxactive field of the
+kretprobe struct to specify how many instances of the specified
+function can be probed simultaneously. register_kretprobe()
+pre-allocates the indicated number of kretprobe_instance objects.
+
+For example, if the function is non-recursive and is called with a
+spinlock held, maxactive = 1 should be enough. If the function is
+non-recursive and can never relinquish the CPU (e.g., via a semaphore
+or preemption), NR_CPUS should be enough. If maxactive <= 0, it is
+set to a default value: max(10, 2*NR_CPUS).
+
+It's not a disaster if you set maxactive too low; you'll just miss
+some probes. In the kretprobe struct, the nmissed field is set to
+zero when the return probe is registered, and is incremented every
+time the probed function is entered but there is no kretprobe_instance
+object available for establishing the return probe.
+
+Kretprobe entry-handler
+^^^^^^^^^^^^^^^^^^^^^^^
+
+Kretprobes also provides an optional user-specified handler which runs
+on function entry. This handler is specified by setting the entry_handler
+field of the kretprobe struct. Whenever the kprobe placed by kretprobe at the
+function entry is hit, the user-defined entry_handler, if any, is invoked.
+If the entry_handler returns 0 (success) then a corresponding return handler
+is guaranteed to be called upon function return. If the entry_handler
+returns a non-zero error then Kprobes leaves the return address as is, and
+the kretprobe has no further effect for that particular function instance.
+
+Multiple entry and return handler invocations are matched using the unique
+kretprobe_instance object associated with them. Additionally, a user
+may also specify per return-instance private data to be part of each
+kretprobe_instance object. This is especially useful when sharing private
+data between corresponding user entry and return handlers. The size of each
+private data object can be specified at kretprobe registration time by
+setting the data_size field of the kretprobe struct. This data can be
+accessed through the data field of each kretprobe_instance object.
+
+In case probed function is entered but there is no kretprobe_instance
+object available, then in addition to incrementing the nmissed count,
+the user entry_handler invocation is also skipped.
+
+.. _kprobes_jump_optimization:
+
+How Does Jump Optimization Work?
+--------------------------------
+
+If your kernel is built with CONFIG_OPTPROBES=y (currently this flag
+is automatically set 'y' on x86/x86-64, non-preemptive kernel) and
+the "debug.kprobes_optimization" kernel parameter is set to 1 (see
+sysctl(8)), Kprobes tries to reduce probe-hit overhead by using a jump
+instruction instead of a breakpoint instruction at each probepoint.
+
+Init a Kprobe
+^^^^^^^^^^^^^
+
+When a probe is registered, before attempting this optimization,
+Kprobes inserts an ordinary, breakpoint-based kprobe at the specified
+address. So, even if it's not possible to optimize this particular
+probepoint, there'll be a probe there.
+
+Safety Check
+^^^^^^^^^^^^
+
+Before optimizing a probe, Kprobes performs the following safety checks:
+
+- Kprobes verifies that the region that will be replaced by the jump
+ instruction (the "optimized region") lies entirely within one function.
+ (A jump instruction is multiple bytes, and so may overlay multiple
+ instructions.)
+
+- Kprobes analyzes the entire function and verifies that there is no
+ jump into the optimized region. Specifically:
+
+ - the function contains no indirect jump;
+ - the function contains no instruction that causes an exception (since
+ the fixup code triggered by the exception could jump back into the
+ optimized region -- Kprobes checks the exception tables to verify this);
+ - there is no near jump to the optimized region (other than to the first
+ byte).
+
+- For each instruction in the optimized region, Kprobes verifies that
+ the instruction can be executed out of line.
+
+Preparing Detour Buffer
+^^^^^^^^^^^^^^^^^^^^^^^
+
+Next, Kprobes prepares a "detour" buffer, which contains the following
+instruction sequence:
+
+- code to push the CPU's registers (emulating a breakpoint trap)
+- a call to the trampoline code which calls user's probe handlers.
+- code to restore registers
+- the instructions from the optimized region
+- a jump back to the original execution path.
+
+Pre-optimization
+^^^^^^^^^^^^^^^^
+
+After preparing the detour buffer, Kprobes verifies that none of the
+following situations exist:
+
+- The probe has a post_handler.
+- Other instructions in the optimized region are probed.
+- The probe is disabled.
+
+In any of the above cases, Kprobes won't start optimizing the probe.
+Since these are temporary situations, Kprobes tries to start
+optimizing it again if the situation is changed.
+
+If the kprobe can be optimized, Kprobes enqueues the kprobe to an
+optimizing list, and kicks the kprobe-optimizer workqueue to optimize
+it. If the to-be-optimized probepoint is hit before being optimized,
+Kprobes returns control to the original instruction path by setting
+the CPU's instruction pointer to the copied code in the detour buffer
+-- thus at least avoiding the single-step.
+
+Optimization
+^^^^^^^^^^^^
+
+The Kprobe-optimizer doesn't insert the jump instruction immediately;
+rather, it calls synchronize_rcu() for safety first, because it's
+possible for a CPU to be interrupted in the middle of executing the
+optimized region [3]_. As you know, synchronize_rcu() can ensure
+that all interruptions that were active when synchronize_rcu()
+was called are done, but only if CONFIG_PREEMPT=n. So, this version
+of kprobe optimization supports only kernels with CONFIG_PREEMPT=n [4]_.
+
+After that, the Kprobe-optimizer calls stop_machine() to replace
+the optimized region with a jump instruction to the detour buffer,
+using text_poke_smp().
+
+Unoptimization
+^^^^^^^^^^^^^^
+
+When an optimized kprobe is unregistered, disabled, or blocked by
+another kprobe, it will be unoptimized. If this happens before
+the optimization is complete, the kprobe is just dequeued from the
+optimized list. If the optimization has been done, the jump is
+replaced with the original code (except for an int3 breakpoint in
+the first byte) by using text_poke_smp().
+
+.. [3] Please imagine that the 2nd instruction is interrupted and then
+ the optimizer replaces the 2nd instruction with the jump *address*
+ while the interrupt handler is running. When the interrupt
+ returns to original address, there is no valid instruction,
+ and it causes an unexpected result.
+
+.. [4] This optimization-safety checking may be replaced with the
+ stop-machine method that ksplice uses for supporting a CONFIG_PREEMPT=y
+ kernel.
+
+NOTE for geeks:
+The jump optimization changes the kprobe's pre_handler behavior.
+Without optimization, the pre_handler can change the kernel's execution
+path by changing regs->ip and returning 1. However, when the probe
+is optimized, that modification is ignored. Thus, if you want to
+tweak the kernel's execution path, you need to suppress optimization,
+using one of the following techniques:
+
+- Specify an empty function for the kprobe's post_handler.
+
+or
+
+- Execute 'sysctl -w debug.kprobes_optimization=n'
+
+.. _kprobes_blacklist:
+
+Blacklist
+---------
+
+Kprobes can probe most of the kernel except itself. This means
+that there are some functions where kprobes cannot probe. Probing
+(trapping) such functions can cause a recursive trap (e.g. double
+fault) or the nested probe handler may never be called.
+Kprobes manages such functions as a blacklist.
+If you want to add a function into the blacklist, you just need
+to (1) include linux/kprobes.h and (2) use NOKPROBE_SYMBOL() macro
+to specify a blacklisted function.
+Kprobes checks the given probe address against the blacklist and
+rejects registering it, if the given address is in the blacklist.
+
+.. _kprobes_archs_supported:
+
+Architectures Supported
+=======================
+
+Kprobes and return probes are implemented on the following
+architectures:
+
+- i386 (Supports jump optimization)
+- x86_64 (AMD-64, EM64T) (Supports jump optimization)
+- ppc64
+- sparc64 (Return probes not yet implemented.)
+- arm
+- ppc
+- mips
+- s390
+- parisc
+- loongarch
+
+Configuring Kprobes
+===================
+
+When configuring the kernel using make menuconfig/xconfig/oldconfig,
+ensure that CONFIG_KPROBES is set to "y", look for "Kprobes" under
+"General architecture-dependent options".
+
+So that you can load and unload Kprobes-based instrumentation modules,
+make sure "Loadable module support" (CONFIG_MODULES) and "Module
+unloading" (CONFIG_MODULE_UNLOAD) are set to "y".
+
+Also make sure that CONFIG_KALLSYMS and perhaps even CONFIG_KALLSYMS_ALL
+are set to "y", since kallsyms_lookup_name() is used by the in-kernel
+kprobe address resolution code.
+
+If you need to insert a probe in the middle of a function, you may find
+it useful to "Compile the kernel with debug info" (CONFIG_DEBUG_INFO),
+so you can use "objdump -d -l vmlinux" to see the source-to-object
+code mapping.
+
+API Reference
+=============
+
+The Kprobes API includes a "register" function and an "unregister"
+function for each type of probe. The API also includes "register_*probes"
+and "unregister_*probes" functions for (un)registering arrays of probes.
+Here are terse, mini-man-page specifications for these functions and
+the associated probe handlers that you'll write. See the files in the
+samples/kprobes/ sub-directory for examples.
+
+register_kprobe
+---------------
+
+::
+
+ #include <linux/kprobes.h>
+ int register_kprobe(struct kprobe *kp);
+
+Sets a breakpoint at the address kp->addr. When the breakpoint is hit, Kprobes
+calls kp->pre_handler. After the probed instruction is single-stepped, Kprobe
+calls kp->post_handler. Any or all handlers can be NULL. If kp->flags is set
+KPROBE_FLAG_DISABLED, that kp will be registered but disabled, so, its handlers
+aren't hit until calling enable_kprobe(kp).
+
+.. note::
+
+ 1. With the introduction of the "symbol_name" field to struct kprobe,
+ the probepoint address resolution will now be taken care of by the kernel.
+ The following will now work::
+
+ kp.symbol_name = "symbol_name";
+
+ (64-bit powerpc intricacies such as function descriptors are handled
+ transparently)
+
+ 2. Use the "offset" field of struct kprobe if the offset into the symbol
+ to install a probepoint is known. This field is used to calculate the
+ probepoint.
+
+ 3. Specify either the kprobe "symbol_name" OR the "addr". If both are
+ specified, kprobe registration will fail with -EINVAL.
+
+ 4. With CISC architectures (such as i386 and x86_64), the kprobes code
+ does not validate if the kprobe.addr is at an instruction boundary.
+ Use "offset" with caution.
+
+register_kprobe() returns 0 on success, or a negative errno otherwise.
+
+User's pre-handler (kp->pre_handler)::
+
+ #include <linux/kprobes.h>
+ #include <linux/ptrace.h>
+ int pre_handler(struct kprobe *p, struct pt_regs *regs);
+
+Called with p pointing to the kprobe associated with the breakpoint,
+and regs pointing to the struct containing the registers saved when
+the breakpoint was hit. Return 0 here unless you're a Kprobes geek.
+
+User's post-handler (kp->post_handler)::
+
+ #include <linux/kprobes.h>
+ #include <linux/ptrace.h>
+ void post_handler(struct kprobe *p, struct pt_regs *regs,
+ unsigned long flags);
+
+p and regs are as described for the pre_handler. flags always seems
+to be zero.
+
+register_kretprobe
+------------------
+
+::
+
+ #include <linux/kprobes.h>
+ int register_kretprobe(struct kretprobe *rp);
+
+Establishes a return probe for the function whose address is
+rp->kp.addr. When that function returns, Kprobes calls rp->handler.
+You must set rp->maxactive appropriately before you call
+register_kretprobe(); see "How Does a Return Probe Work?" for details.
+
+register_kretprobe() returns 0 on success, or a negative errno
+otherwise.
+
+User's return-probe handler (rp->handler)::
+
+ #include <linux/kprobes.h>
+ #include <linux/ptrace.h>
+ int kretprobe_handler(struct kretprobe_instance *ri,
+ struct pt_regs *regs);
+
+regs is as described for kprobe.pre_handler. ri points to the
+kretprobe_instance object, of which the following fields may be
+of interest:
+
+- ret_addr: the return address
+- rp: points to the corresponding kretprobe object
+- task: points to the corresponding task struct
+- data: points to per return-instance private data; see "Kretprobe
+ entry-handler" for details.
+
+The regs_return_value(regs) macro provides a simple abstraction to
+extract the return value from the appropriate register as defined by
+the architecture's ABI.
+
+The handler's return value is currently ignored.
+
+unregister_*probe
+------------------
+
+::
+
+ #include <linux/kprobes.h>
+ void unregister_kprobe(struct kprobe *kp);
+ void unregister_kretprobe(struct kretprobe *rp);
+
+Removes the specified probe. The unregister function can be called
+at any time after the probe has been registered.
+
+.. note::
+
+ If the functions find an incorrect probe (ex. an unregistered probe),
+ they clear the addr field of the probe.
+
+register_*probes
+----------------
+
+::
+
+ #include <linux/kprobes.h>
+ int register_kprobes(struct kprobe **kps, int num);
+ int register_kretprobes(struct kretprobe **rps, int num);
+
+Registers each of the num probes in the specified array. If any
+error occurs during registration, all probes in the array, up to
+the bad probe, are safely unregistered before the register_*probes
+function returns.
+
+- kps/rps: an array of pointers to ``*probe`` data structures
+- num: the number of the array entries.
+
+.. note::
+
+ You have to allocate(or define) an array of pointers and set all
+ of the array entries before using these functions.
+
+unregister_*probes
+------------------
+
+::
+
+ #include <linux/kprobes.h>
+ void unregister_kprobes(struct kprobe **kps, int num);
+ void unregister_kretprobes(struct kretprobe **rps, int num);
+
+Removes each of the num probes in the specified array at once.
+
+.. note::
+
+ If the functions find some incorrect probes (ex. unregistered
+ probes) in the specified array, they clear the addr field of those
+ incorrect probes. However, other probes in the array are
+ unregistered correctly.
+
+disable_*probe
+--------------
+
+::
+
+ #include <linux/kprobes.h>
+ int disable_kprobe(struct kprobe *kp);
+ int disable_kretprobe(struct kretprobe *rp);
+
+Temporarily disables the specified ``*probe``. You can enable it again by using
+enable_*probe(). You must specify the probe which has been registered.
+
+enable_*probe
+-------------
+
+::
+
+ #include <linux/kprobes.h>
+ int enable_kprobe(struct kprobe *kp);
+ int enable_kretprobe(struct kretprobe *rp);
+
+Enables ``*probe`` which has been disabled by disable_*probe(). You must specify
+the probe which has been registered.
+
+Kprobes Features and Limitations
+================================
+
+Kprobes allows multiple probes at the same address. Also,
+a probepoint for which there is a post_handler cannot be optimized.
+So if you install a kprobe with a post_handler, at an optimized
+probepoint, the probepoint will be unoptimized automatically.
+
+In general, you can install a probe anywhere in the kernel.
+In particular, you can probe interrupt handlers. Known exceptions
+are discussed in this section.
+
+The register_*probe functions will return -EINVAL if you attempt
+to install a probe in the code that implements Kprobes (mostly
+kernel/kprobes.c and ``arch/*/kernel/kprobes.c``, but also functions such
+as do_page_fault and notifier_call_chain).
+
+If you install a probe in an inline-able function, Kprobes makes
+no attempt to chase down all inline instances of the function and
+install probes there. gcc may inline a function without being asked,
+so keep this in mind if you're not seeing the probe hits you expect.
+
+A probe handler can modify the environment of the probed function
+-- e.g., by modifying kernel data structures, or by modifying the
+contents of the pt_regs struct (which are restored to the registers
+upon return from the breakpoint). So Kprobes can be used, for example,
+to install a bug fix or to inject faults for testing. Kprobes, of
+course, has no way to distinguish the deliberately injected faults
+from the accidental ones. Don't drink and probe.
+
+Kprobes makes no attempt to prevent probe handlers from stepping on
+each other -- e.g., probing printk() and then calling printk() from a
+probe handler. If a probe handler hits a probe, that second probe's
+handlers won't be run in that instance, and the kprobe.nmissed member
+of the second probe will be incremented.
+
+As of Linux v2.6.15-rc1, multiple handlers (or multiple instances of
+the same handler) may run concurrently on different CPUs.
+
+Kprobes does not use mutexes or allocate memory except during
+registration and unregistration.
+
+Probe handlers are run with preemption disabled or interrupt disabled,
+which depends on the architecture and optimization state. (e.g.,
+kretprobe handlers and optimized kprobe handlers run without interrupt
+disabled on x86/x86-64). In any case, your handler should not yield
+the CPU (e.g., by attempting to acquire a semaphore, or waiting I/O).
+
+Since a return probe is implemented by replacing the return
+address with the trampoline's address, stack backtraces and calls
+to __builtin_return_address() will typically yield the trampoline's
+address instead of the real return address for kretprobed functions.
+(As far as we can tell, __builtin_return_address() is used only
+for instrumentation and error reporting.)
+
+If the number of times a function is called does not match the number
+of times it returns, registering a return probe on that function may
+produce undesirable results. In such a case, a line:
+kretprobe BUG!: Processing kretprobe d000000000041aa8 @ c00000000004f48c
+gets printed. With this information, one will be able to correlate the
+exact instance of the kretprobe that caused the problem. We have the
+do_exit() case covered. do_execve() and do_fork() are not an issue.
+We're unaware of other specific cases where this could be a problem.
+
+If, upon entry to or exit from a function, the CPU is running on
+a stack other than that of the current task, registering a return
+probe on that function may produce undesirable results. For this
+reason, Kprobes doesn't support return probes (or kprobes)
+on the x86_64 version of __switch_to(); the registration functions
+return -EINVAL.
+
+On x86/x86-64, since the Jump Optimization of Kprobes modifies
+instructions widely, there are some limitations to optimization. To
+explain it, we introduce some terminology. Imagine a 3-instruction
+sequence consisting of a two 2-byte instructions and one 3-byte
+instruction.
+
+::
+
+ IA
+ |
+ [-2][-1][0][1][2][3][4][5][6][7]
+ [ins1][ins2][ ins3 ]
+ [<- DCR ->]
+ [<- JTPR ->]
+
+ ins1: 1st Instruction
+ ins2: 2nd Instruction
+ ins3: 3rd Instruction
+ IA: Insertion Address
+ JTPR: Jump Target Prohibition Region
+ DCR: Detoured Code Region
+
+The instructions in DCR are copied to the out-of-line buffer
+of the kprobe, because the bytes in DCR are replaced by
+a 5-byte jump instruction. So there are several limitations.
+
+a) The instructions in DCR must be relocatable.
+b) The instructions in DCR must not include a call instruction.
+c) JTPR must not be targeted by any jump or call instruction.
+d) DCR must not straddle the border between functions.
+
+Anyway, these limitations are checked by the in-kernel instruction
+decoder, so you don't need to worry about that.
+
+Probe Overhead
+==============
+
+On a typical CPU in use in 2005, a kprobe hit takes 0.5 to 1.0
+microseconds to process. Specifically, a benchmark that hits the same
+probepoint repeatedly, firing a simple handler each time, reports 1-2
+million hits per second, depending on the architecture. A return-probe
+hit typically takes 50-75% longer than a kprobe hit.
+When you have a return probe set on a function, adding a kprobe at
+the entry to that function adds essentially no overhead.
+
+Here are sample overhead figures (in usec) for different architectures::
+
+ k = kprobe; r = return probe; kr = kprobe + return probe
+ on same function
+
+ i386: Intel Pentium M, 1495 MHz, 2957.31 bogomips
+ k = 0.57 usec; r = 0.92; kr = 0.99
+
+ x86_64: AMD Opteron 246, 1994 MHz, 3971.48 bogomips
+ k = 0.49 usec; r = 0.80; kr = 0.82
+
+ ppc64: POWER5 (gr), 1656 MHz (SMT disabled, 1 virtual CPU per physical CPU)
+ k = 0.77 usec; r = 1.26; kr = 1.45
+
+Optimized Probe Overhead
+------------------------
+
+Typically, an optimized kprobe hit takes 0.07 to 0.1 microseconds to
+process. Here are sample overhead figures (in usec) for x86 architectures::
+
+ k = unoptimized kprobe, b = boosted (single-step skipped), o = optimized kprobe,
+ r = unoptimized kretprobe, rb = boosted kretprobe, ro = optimized kretprobe.
+
+ i386: Intel(R) Xeon(R) E5410, 2.33GHz, 4656.90 bogomips
+ k = 0.80 usec; b = 0.33; o = 0.05; r = 1.10; rb = 0.61; ro = 0.33
+
+ x86-64: Intel(R) Xeon(R) E5410, 2.33GHz, 4656.90 bogomips
+ k = 0.99 usec; b = 0.43; o = 0.06; r = 1.24; rb = 0.68; ro = 0.30
+
+TODO
+====
+
+a. SystemTap (http://sourceware.org/systemtap): Provides a simplified
+ programming interface for probe-based instrumentation. Try it out.
+b. Kernel return probes for sparc64.
+c. Support for other architectures.
+d. User-space probes.
+e. Watchpoint probes (which fire on data references).
+
+Kprobes Example
+===============
+
+See samples/kprobes/kprobe_example.c
+
+Kretprobes Example
+==================
+
+See samples/kprobes/kretprobe_example.c
+
+Deprecated Features
+===================
+
+Jprobes is now a deprecated feature. People who are depending on it should
+migrate to other tracing features or use older kernels. Please consider to
+migrate your tool to one of the following options:
+
+- Use trace-event to trace target function with arguments.
+
+ trace-event is a low-overhead (and almost no visible overhead if it
+ is off) statically defined event interface. You can define new events
+ and trace it via ftrace or any other tracing tools.
+
+ See the following urls:
+
+ - https://lwn.net/Articles/379903/
+ - https://lwn.net/Articles/381064/
+ - https://lwn.net/Articles/383362/
+
+- Use ftrace dynamic events (kprobe event) with perf-probe.
+
+ If you build your kernel with debug info (CONFIG_DEBUG_INFO=y), you can
+ find which register/stack is assigned to which local variable or arguments
+ by using perf-probe and set up new event to trace it.
+
+ See following documents:
+
+ - Documentation/trace/kprobetrace.rst
+ - Documentation/trace/events.rst
+ - tools/perf/Documentation/perf-probe.txt
+
+
+The kprobes debugfs interface
+=============================
+
+
+With recent kernels (> 2.6.20) the list of registered kprobes is visible
+under the /sys/kernel/debug/kprobes/ directory (assuming debugfs is mounted at //sys/kernel/debug).
+
+/sys/kernel/debug/kprobes/list: Lists all registered probes on the system::
+
+ c015d71a k vfs_read+0x0
+ c03dedc5 r tcp_v4_rcv+0x0
+
+The first column provides the kernel address where the probe is inserted.
+The second column identifies the type of probe (k - kprobe and r - kretprobe)
+while the third column specifies the symbol+offset of the probe.
+If the probed function belongs to a module, the module name is also
+specified. Following columns show probe status. If the probe is on
+a virtual address that is no longer valid (module init sections, module
+virtual addresses that correspond to modules that've been unloaded),
+such probes are marked with [GONE]. If the probe is temporarily disabled,
+such probes are marked with [DISABLED]. If the probe is optimized, it is
+marked with [OPTIMIZED]. If the probe is ftrace-based, it is marked with
+[FTRACE].
+
+/sys/kernel/debug/kprobes/enabled: Turn kprobes ON/OFF forcibly.
+
+Provides a knob to globally and forcibly turn registered kprobes ON or OFF.
+By default, all kprobes are enabled. By echoing "0" to this file, all
+registered probes will be disarmed, till such time a "1" is echoed to this
+file. Note that this knob just disarms and arms all kprobes and doesn't
+change each probe's disabling state. This means that disabled kprobes (marked
+[DISABLED]) will be not enabled if you turn ON all kprobes by this knob.
+
+
+The kprobes sysctl interface
+============================
+
+/proc/sys/debug/kprobes-optimization: Turn kprobes optimization ON/OFF.
+
+When CONFIG_OPTPROBES=y, this sysctl interface appears and it provides
+a knob to globally and forcibly turn jump optimization (see section
+:ref:`kprobes_jump_optimization`) ON or OFF. By default, jump optimization
+is allowed (ON). If you echo "0" to this file or set
+"debug.kprobes_optimization" to 0 via sysctl, all optimized probes will be
+unoptimized, and any new probes registered after that will not be optimized.
+
+Note that this knob *changes* the optimized state. This means that optimized
+probes (marked [OPTIMIZED]) will be unoptimized ([OPTIMIZED] tag will be
+removed). If the knob is turned on, they will be optimized again.
+
+References
+==========
+
+For additional information on Kprobes, refer to the following URLs:
+
+- https://lwn.net/Articles/132196/
+- https://www.kernel.org/doc/ols/2006/ols2006v2-pages-109-124.pdf
+
diff --git a/Documentation/trace/kprobetrace.rst b/Documentation/trace/kprobetrace.rst
index 55993055902c..a49662ccd53c 100644
--- a/Documentation/trace/kprobetrace.rst
+++ b/Documentation/trace/kprobetrace.rst
@@ -6,21 +6,21 @@ Kprobe-based Event Tracing
Overview
--------
-These events are similar to tracepoint based events. Instead of Tracepoint,
+These events are similar to tracepoint-based events. Instead of tracepoints,
this is based on kprobes (kprobe and kretprobe). So it can probe wherever
kprobes can probe (this means, all functions except those with
__kprobes/nokprobe_inline annotation and those marked NOKPROBE_SYMBOL).
-Unlike the Tracepoint based event, this can be added and removed
+Unlike the tracepoint-based event, this can be added and removed
dynamically, on the fly.
To enable this feature, build your kernel with CONFIG_KPROBE_EVENTS=y.
-Similar to the events tracer, this doesn't need to be activated via
+Similar to the event tracer, this doesn't need to be activated via
current_tracer. Instead of that, add probe points via
-/sys/kernel/debug/tracing/kprobe_events, and enable it via
-/sys/kernel/debug/tracing/events/kprobes/<EVENT>/enable.
+/sys/kernel/tracing/kprobe_events, and enable it via
+/sys/kernel/tracing/events/kprobes/<EVENT>/enable.
-You can also use /sys/kernel/debug/tracing/dynamic_events instead of
+You can also use /sys/kernel/tracing/dynamic_events instead of
kprobe_events. That interface will provide unified access to other
dynamic events too.
@@ -28,19 +28,21 @@ Synopsis of kprobe_events
-------------------------
::
- p[:[GRP/]EVENT] [MOD:]SYM[+offs]|MEMADDR [FETCHARGS] : Set a probe
- r[MAXACTIVE][:[GRP/]EVENT] [MOD:]SYM[+0] [FETCHARGS] : Set a return probe
- -:[GRP/]EVENT : Clear a probe
+ p[:[GRP/][EVENT]] [MOD:]SYM[+offs]|MEMADDR [FETCHARGS] : Set a probe
+ r[MAXACTIVE][:[GRP/][EVENT]] [MOD:]SYM[+0] [FETCHARGS] : Set a return probe
+ p[:[GRP/][EVENT]] [MOD:]SYM[+0]%return [FETCHARGS] : Set a return probe
+ -:[GRP/][EVENT] : Clear a probe
GRP : Group name. If omitted, use "kprobes" for it.
EVENT : Event name. If omitted, the event name is generated
based on SYM+offs or MEMADDR.
MOD : Module name which has given SYM.
SYM[+offs] : Symbol+offset where the probe is inserted.
+ SYM%return : Return address of the symbol
MEMADDR : Address where the probe is inserted.
MAXACTIVE : Maximum number of instances of the specified function that
can be probed simultaneously, or 0 for the default value
- as defined in Documentation/kprobes.txt section 1.3.1.
+ as defined in Documentation/trace/kprobes.rst section 1.3.1.
FETCHARGS : Arguments. Each probe can have up to 128 args.
%REG : Fetch register REG
@@ -56,32 +58,52 @@ Synopsis of kprobe_events
NAME=FETCHARG : Set NAME as the argument name of FETCHARG.
FETCHARG:TYPE : Set TYPE as the type of FETCHARG. Currently, basic types
(u8/u16/u32/u64/s8/s16/s32/s64), hexadecimal types
- (x8/x16/x32/x64), "string", "ustring" and bitfield
- are supported.
-
- (\*1) only for the probe on function entry (offs == 0).
- (\*2) only for return probe.
+ (x8/x16/x32/x64), "char", "string", "ustring", "symbol", "symstr"
+ and bitfield are supported.
+
+ (\*1) only for the probe on function entry (offs == 0). Note, this argument access
+ is best effort, because depending on the argument type, it may be passed on
+ the stack. But this only support the arguments via registers.
+ (\*2) only for return probe. Note that this is also best effort. Depending on the
+ return value type, it might be passed via a pair of registers. But this only
+ accesses one register.
(\*3) this is useful for fetching a field of data structures.
(\*4) "u" means user-space dereference. See :ref:`user_mem_access`.
+Function arguments at kretprobe
+-------------------------------
+Function arguments can be accessed at kretprobe using $arg<N> fetcharg. This
+is useful to record the function parameter and return value at once, and
+trace the difference of structure fields (for debuging a function whether it
+correctly updates the given data structure or not).
+See the :ref:`sample<fprobetrace_exit_args_sample>` in fprobe event for how
+it works.
+
+.. _kprobetrace_types:
+
Types
-----
-Several types are supported for fetch-args. Kprobe tracer will access memory
+Several types are supported for fetchargs. Kprobe tracer will access memory
by given type. Prefix 's' and 'u' means those types are signed and unsigned
respectively. 'x' prefix implies it is unsigned. Traced arguments are shown
in decimal ('s' and 'u') or hexadecimal ('x'). Without type casting, 'x32'
or 'x64' is used depends on the architecture (e.g. x86-32 uses x32, and
x86-64 uses x64).
+
These value types can be an array. To record array data, you can add '[N]'
(where N is a fixed number, less than 64) to the base type.
-E.g. 'x16[4]' means an array of x16 (2bytes hex) with 4 elements.
+E.g. 'x16[4]' means an array of x16 (2-byte hex) with 4 elements.
Note that the array can be applied to memory type fetchargs, you can not
apply it to registers/stack-entries etc. (for example, '$stack1:x8[8]' is
wrong, but '+8($stack):x8[8]' is OK.)
+
+Char type can be used to show the character value of traced arguments.
+
String type is a special type, which fetches a "null-terminated" string from
kernel space. This means it will fail and store NULL if the string container
has been paged out. "ustring" type is an alternative of string for user-space.
-See :ref:`user_mem_access` for more info..
+See :ref:`user_mem_access` for more info.
+
The string array type is a bit different from other types. For other base
types, <base-type>[1] is equal to <base-type> (e.g. +0(%di):x32[1] is same
as +0(%di):x32.) But string[1] is not equal to string. The string type itself
@@ -94,9 +116,14 @@ offset, and container-size (usually 32). The syntax is::
Symbol type('symbol') is an alias of u32 or u64 type (depends on BITS_PER_LONG)
which shows given pointer in "symbol+offset" style.
+On the other hand, symbol-string type ('symstr') converts the given address to
+"symbol+offset/symbolsize" style and stores it as a null-terminated string.
+With 'symstr' type, you can filter the event with wildcard pattern of the
+symbols, and you don't need to solve symbol name by yourself.
For $comm, the default type is "string"; any other type is invalid.
.. _user_mem_access:
+
User Memory Access
------------------
Kprobe events supports user-space memory access. For that purpose, you can use
@@ -113,8 +140,8 @@ space. 'ustring' is a shortcut way of performing the same task. That is,
Note that kprobe-event provides the user-memory access syntax but it doesn't
use it transparently. This means if you use normal dereference or string type
-for user memory, it might fail, and may always fail on some archs. The user
-has to carefully check if the target data is in kernel or user space.
+for user memory, it might fail, and may always fail on some architectures. The
+user has to carefully check if the target data is in kernel or user space.
Per-Probe Event Filtering
-------------------------
@@ -143,7 +170,7 @@ trigger:
Event Profiling
---------------
You can check the total number of probe hits and probe miss-hits via
-/sys/kernel/debug/tracing/kprobe_profile.
+/sys/kernel/tracing/kprobe_profile.
The first column is event name, the second is the number of probe hits,
the third is the number of probe miss-hits.
@@ -153,11 +180,11 @@ You can add and enable new kprobe events when booting up the kernel by
"kprobe_event=" parameter. The parameter accepts a semicolon-delimited
kprobe events, which format is similar to the kprobe_events.
The difference is that the probe definition parameters are comma-delimited
-instead of space. For example, adding myprobe event on do_sys_open like below
+instead of space. For example, adding myprobe event on do_sys_open like below::
p:myprobe do_sys_open dfd=%ax filename=%dx flags=%cx mode=+4($stack)
-should be below for kernel boot parameter (just replace spaces with comma)
+should be below for kernel boot parameter (just replace spaces with comma)::
p:myprobe,do_sys_open,dfd=%ax,filename=%dx,flags=%cx,mode=+4($stack)
@@ -167,7 +194,7 @@ Usage examples
To add a probe as a new event, write a new definition to kprobe_events
as below::
- echo 'p:myprobe do_sys_open dfd=%ax filename=%dx flags=%cx mode=+4($stack)' > /sys/kernel/debug/tracing/kprobe_events
+ echo 'p:myprobe do_sys_open dfd=%ax filename=%dx flags=%cx mode=+4($stack)' > /sys/kernel/tracing/kprobe_events
This sets a kprobe on the top of do_sys_open() function with recording
1st to 4th arguments as "myprobe" event. Note, which register/stack entry is
@@ -177,15 +204,15 @@ under tools/perf/).
As this example shows, users can choose more familiar names for each arguments.
::
- echo 'r:myretprobe do_sys_open $retval' >> /sys/kernel/debug/tracing/kprobe_events
+ echo 'r:myretprobe do_sys_open $retval' >> /sys/kernel/tracing/kprobe_events
This sets a kretprobe on the return point of do_sys_open() function with
recording return value as "myretprobe" event.
You can see the format of these events via
-/sys/kernel/debug/tracing/events/kprobes/<EVENT>/format.
+/sys/kernel/tracing/events/kprobes/<EVENT>/format.
::
- cat /sys/kernel/debug/tracing/events/kprobes/myprobe/format
+ cat /sys/kernel/tracing/events/kprobes/myprobe/format
name: myprobe
ID: 780
format:
@@ -208,7 +235,7 @@ You can see the format of these events via
You can see that the event has 4 arguments as in the expressions you specified.
::
- echo > /sys/kernel/debug/tracing/kprobe_events
+ echo > /sys/kernel/tracing/kprobe_events
This clears all probe points.
@@ -223,8 +250,8 @@ Right after definition, each event is disabled by default. For tracing these
events, you need to enable it.
::
- echo 1 > /sys/kernel/debug/tracing/events/kprobes/myprobe/enable
- echo 1 > /sys/kernel/debug/tracing/events/kprobes/myretprobe/enable
+ echo 1 > /sys/kernel/tracing/events/kprobes/myprobe/enable
+ echo 1 > /sys/kernel/tracing/events/kprobes/myretprobe/enable
Use the following command to start tracing in an interval.
::
@@ -233,10 +260,10 @@ Use the following command to start tracing in an interval.
Open something...
# echo 0 > tracing_on
-And you can see the traced information via /sys/kernel/debug/tracing/trace.
+And you can see the traced information via /sys/kernel/tracing/trace.
::
- cat /sys/kernel/debug/tracing/trace
+ cat /sys/kernel/tracing/trace
# tracer: nop
#
# TASK-PID CPU# TIMESTAMP FUNCTION
@@ -252,4 +279,3 @@ And you can see the traced information via /sys/kernel/debug/tracing/trace.
Each line shows when the kernel hits an event, and <- SYMBOL means kernel
returns from SYMBOL(e.g. "sys_open+0x1b/0x1d <- do_sys_open" means kernel
returns from do_sys_open to sys_open+0x1b).
-
diff --git a/Documentation/trace/mmiotrace.rst b/Documentation/trace/mmiotrace.rst
index 5116e8ca27b4..95b750722a13 100644
--- a/Documentation/trace/mmiotrace.rst
+++ b/Documentation/trace/mmiotrace.rst
@@ -5,7 +5,7 @@ In-kernel memory-mapped I/O tracing
Home page and links to optional user space tools:
- http://nouveau.freedesktop.org/wiki/MmioTrace
+ https://nouveau.freedesktop.org/wiki/MmioTrace
MMIO tracing was originally developed by Intel around 2003 for their Fault
Injection Test Harness. In Dec 2006 - Jan 2007, using the code from Intel,
@@ -36,11 +36,11 @@ Usage Quick Reference
::
$ mount -t debugfs debugfs /sys/kernel/debug
- $ echo mmiotrace > /sys/kernel/debug/tracing/current_tracer
- $ cat /sys/kernel/debug/tracing/trace_pipe > mydump.txt &
+ $ echo mmiotrace > /sys/kernel/tracing/current_tracer
+ $ cat /sys/kernel/tracing/trace_pipe > mydump.txt &
Start X or whatever.
- $ echo "X is up" > /sys/kernel/debug/tracing/trace_marker
- $ echo nop > /sys/kernel/debug/tracing/current_tracer
+ $ echo "X is up" > /sys/kernel/tracing/trace_marker
+ $ echo nop > /sys/kernel/tracing/current_tracer
Check for lost events.
@@ -56,11 +56,11 @@ Check that the driver you are about to trace is not loaded.
Activate mmiotrace (requires root privileges)::
- $ echo mmiotrace > /sys/kernel/debug/tracing/current_tracer
+ $ echo mmiotrace > /sys/kernel/tracing/current_tracer
Start storing the trace::
- $ cat /sys/kernel/debug/tracing/trace_pipe > mydump.txt &
+ $ cat /sys/kernel/tracing/trace_pipe > mydump.txt &
The 'cat' process should stay running (sleeping) in the background.
@@ -68,14 +68,14 @@ Load the driver you want to trace and use it. Mmiotrace will only catch MMIO
accesses to areas that are ioremapped while mmiotrace is active.
During tracing you can place comments (markers) into the trace by
-$ echo "X is up" > /sys/kernel/debug/tracing/trace_marker
+$ echo "X is up" > /sys/kernel/tracing/trace_marker
This makes it easier to see which part of the (huge) trace corresponds to
which action. It is recommended to place descriptive markers about what you
do.
Shut down mmiotrace (requires root privileges)::
- $ echo nop > /sys/kernel/debug/tracing/current_tracer
+ $ echo nop > /sys/kernel/tracing/current_tracer
The 'cat' process exits. If it does not, kill it by issuing 'fg' command and
pressing ctrl+c.
@@ -93,12 +93,12 @@ events were lost, the trace is incomplete. You should enlarge the buffers and
try again. Buffers are enlarged by first seeing how large the current buffers
are::
- $ cat /sys/kernel/debug/tracing/buffer_size_kb
+ $ cat /sys/kernel/tracing/buffer_size_kb
gives you a number. Approximately double this number and write it back, for
instance::
- $ echo 128000 > /sys/kernel/debug/tracing/buffer_size_kb
+ $ echo 128000 > /sys/kernel/tracing/buffer_size_kb
Then start again from the top.
diff --git a/Documentation/trace/osnoise-tracer.rst b/Documentation/trace/osnoise-tracer.rst
new file mode 100644
index 000000000000..140ef2533d26
--- /dev/null
+++ b/Documentation/trace/osnoise-tracer.rst
@@ -0,0 +1,180 @@
+==============
+OSNOISE Tracer
+==============
+
+In the context of high-performance computing (HPC), the Operating System
+Noise (*osnoise*) refers to the interference experienced by an application
+due to activities inside the operating system. In the context of Linux,
+NMIs, IRQs, SoftIRQs, and any other system thread can cause noise to the
+system. Moreover, hardware-related jobs can also cause noise, for example,
+via SMIs.
+
+hwlat_detector is one of the tools used to identify the most complex
+source of noise: *hardware noise*.
+
+In a nutshell, the hwlat_detector creates a thread that runs
+periodically for a given period. At the beginning of a period, the thread
+disables interrupt and starts sampling. While running, the hwlatd
+thread reads the time in a loop. As interrupts are disabled, threads,
+IRQs, and SoftIRQs cannot interfere with the hwlatd thread. Hence, the
+cause of any gap between two different reads of the time roots either on
+NMI or in the hardware itself. At the end of the period, hwlatd enables
+interrupts and reports the max observed gap between the reads. It also
+prints a NMI occurrence counter. If the output does not report NMI
+executions, the user can conclude that the hardware is the culprit for
+the latency. The hwlat detects the NMI execution by observing
+the entry and exit of a NMI.
+
+The osnoise tracer leverages the hwlat_detector by running a
+similar loop with preemption, SoftIRQs and IRQs enabled, thus allowing
+all the sources of *osnoise* during its execution. Using the same approach
+of hwlat, osnoise takes note of the entry and exit point of any
+source of interferences, increasing a per-cpu interference counter. The
+osnoise tracer also saves an interference counter for each source of
+interference. The interference counter for NMI, IRQs, SoftIRQs, and
+threads is increased anytime the tool observes these interferences' entry
+events. When a noise happens without any interference from the operating
+system level, the hardware noise counter increases, pointing to a
+hardware-related noise. In this way, osnoise can account for any
+source of interference. At the end of the period, the osnoise tracer
+prints the sum of all noise, the max single noise, the percentage of CPU
+available for the thread, and the counters for the noise sources.
+
+Usage
+-----
+
+Write the ASCII text "osnoise" into the current_tracer file of the
+tracing system (generally mounted at /sys/kernel/tracing).
+
+For example::
+
+ [root@f32 ~]# cd /sys/kernel/tracing/
+ [root@f32 tracing]# echo osnoise > current_tracer
+
+It is possible to follow the trace by reading the trace file::
+
+ [root@f32 tracing]# cat trace
+ # tracer: osnoise
+ #
+ # _-----=> irqs-off
+ # / _----=> need-resched
+ # | / _---=> hardirq/softirq
+ # || / _--=> preempt-depth MAX
+ # || / SINGLE Interference counters:
+ # |||| RUNTIME NOISE % OF CPU NOISE +-----------------------------+
+ # TASK-PID CPU# |||| TIMESTAMP IN US IN US AVAILABLE IN US HW NMI IRQ SIRQ THREAD
+ # | | | |||| | | | | | | | | | |
+ <...>-859 [000] .... 81.637220: 1000000 190 99.98100 9 18 0 1007 18 1
+ <...>-860 [001] .... 81.638154: 1000000 656 99.93440 74 23 0 1006 16 3
+ <...>-861 [002] .... 81.638193: 1000000 5675 99.43250 202 6 0 1013 25 21
+ <...>-862 [003] .... 81.638242: 1000000 125 99.98750 45 1 0 1011 23 0
+ <...>-863 [004] .... 81.638260: 1000000 1721 99.82790 168 7 0 1002 49 41
+ <...>-864 [005] .... 81.638286: 1000000 263 99.97370 57 6 0 1006 26 2
+ <...>-865 [006] .... 81.638302: 1000000 109 99.98910 21 3 0 1006 18 1
+ <...>-866 [007] .... 81.638326: 1000000 7816 99.21840 107 8 0 1016 39 19
+
+In addition to the regular trace fields (from TASK-PID to TIMESTAMP), the
+tracer prints a message at the end of each period for each CPU that is
+running an osnoise/ thread. The osnoise specific fields report:
+
+ - The RUNTIME IN US reports the amount of time in microseconds that
+ the osnoise thread kept looping reading the time.
+ - The NOISE IN US reports the sum of noise in microseconds observed
+ by the osnoise tracer during the associated runtime.
+ - The % OF CPU AVAILABLE reports the percentage of CPU available for
+ the osnoise thread during the runtime window.
+ - The MAX SINGLE NOISE IN US reports the maximum single noise observed
+ during the runtime window.
+ - The Interference counters display how many each of the respective
+ interference happened during the runtime window.
+
+Note that the example above shows a high number of HW noise samples.
+The reason being is that this sample was taken on a virtual machine,
+and the host interference is detected as a hardware interference.
+
+Tracer Configuration
+--------------------
+
+The tracer has a set of options inside the osnoise directory, they are:
+
+ - osnoise/cpus: CPUs at which a osnoise thread will execute.
+ - osnoise/period_us: the period of the osnoise thread.
+ - osnoise/runtime_us: how long an osnoise thread will look for noise.
+ - osnoise/stop_tracing_us: stop the system tracing if a single noise
+ higher than the configured value happens. Writing 0 disables this
+ option.
+ - osnoise/stop_tracing_total_us: stop the system tracing if total noise
+ higher than the configured value happens. Writing 0 disables this
+ option.
+ - tracing_threshold: the minimum delta between two time() reads to be
+ considered as noise, in us. When set to 0, the default value will
+ be used, which is currently 5 us.
+ - osnoise/options: a set of on/off options that can be enabled by
+ writing the option name to the file or disabled by writing the option
+ name preceded with the 'NO\_' prefix. For example, writing
+ NO_OSNOISE_WORKLOAD disables the OSNOISE_WORKLOAD option. The
+ special DEAFAULTS option resets all options to the default value.
+
+Tracer Options
+--------------
+
+The osnoise/options file exposes a set of on/off configuration options for
+the osnoise tracer. These options are:
+
+ - DEFAULTS: reset the options to the default value.
+ - OSNOISE_WORKLOAD: do not dispatch osnoise workload (see dedicated
+ section below).
+ - PANIC_ON_STOP: call panic() if the tracer stops. This option serves to
+ capture a vmcore.
+ - OSNOISE_PREEMPT_DISABLE: disable preemption while running the osnoise
+ workload, allowing only IRQ and hardware-related noise.
+ - OSNOISE_IRQ_DISABLE: disable IRQs while running the osnoise workload,
+ allowing only NMIs and hardware-related noise, like hwlat tracer.
+
+Additional Tracing
+------------------
+
+In addition to the tracer, a set of tracepoints were added to
+facilitate the identification of the osnoise source.
+
+ - osnoise:sample_threshold: printed anytime a noise is higher than
+ the configurable tolerance_ns.
+ - osnoise:nmi_noise: noise from NMI, including the duration.
+ - osnoise:irq_noise: noise from an IRQ, including the duration.
+ - osnoise:softirq_noise: noise from a SoftIRQ, including the
+ duration.
+ - osnoise:thread_noise: noise from a thread, including the duration.
+
+Note that all the values are *net values*. For example, if while osnoise
+is running, another thread preempts the osnoise thread, it will start a
+thread_noise duration at the start. Then, an IRQ takes place, preempting
+the thread_noise, starting a irq_noise. When the IRQ ends its execution,
+it will compute its duration, and this duration will be subtracted from
+the thread_noise, in such a way as to avoid the double accounting of the
+IRQ execution. This logic is valid for all sources of noise.
+
+Here is one example of the usage of these tracepoints::
+
+ osnoise/8-961 [008] d.h. 5789.857532: irq_noise: local_timer:236 start 5789.857529929 duration 1845 ns
+ osnoise/8-961 [008] dNh. 5789.858408: irq_noise: local_timer:236 start 5789.858404871 duration 2848 ns
+ migration/8-54 [008] d... 5789.858413: thread_noise: migration/8:54 start 5789.858409300 duration 3068 ns
+ osnoise/8-961 [008] .... 5789.858413: sample_threshold: start 5789.858404555 duration 8812 ns interferences 2
+
+In this example, a noise sample of 8 microseconds was reported in the last
+line, pointing to two interferences. Looking backward in the trace, the
+two previous entries were about the migration thread running after a
+timer IRQ execution. The first event is not part of the noise because
+it took place one millisecond before.
+
+It is worth noticing that the sum of the duration reported in the
+tracepoints is smaller than eight us reported in the sample_threshold.
+The reason roots in the overhead of the entry and exit code that happens
+before and after any interference execution. This justifies the dual
+approach: measuring thread and tracing.
+
+Running osnoise tracer without workload
+---------------------------------------
+
+By enabling the osnoise tracer with the NO_OSNOISE_WORKLOAD option set,
+the osnoise: tracepoints serve to measure the execution time of
+any type of Linux task, free from the interference of other tasks.
diff --git a/Documentation/trace/postprocess/decode_msr.py b/Documentation/trace/postprocess/decode_msr.py
index 0ab40e0db580..aa9cc7abd5c2 100644
--- a/Documentation/trace/postprocess/decode_msr.py
+++ b/Documentation/trace/postprocess/decode_msr.py
@@ -1,4 +1,4 @@
-#!/usr/bin/python
+#!/usr/bin/env python
# add symbolic names to read_msr / write_msr in trace
# decode_msr msr-index.h < trace
import sys
diff --git a/Documentation/trace/postprocess/trace-pagealloc-postprocess.pl b/Documentation/trace/postprocess/trace-pagealloc-postprocess.pl
index 0a120aae33ce..d16494c5e200 100644
--- a/Documentation/trace/postprocess/trace-pagealloc-postprocess.pl
+++ b/Documentation/trace/postprocess/trace-pagealloc-postprocess.pl
@@ -1,10 +1,10 @@
-#!/usr/bin/perl
+#!/usr/bin/env perl
# This is a POC (proof of concept or piece of crap, take your pick) for reading the
# text representation of trace output related to page allocation. It makes an attempt
# to extract some high-level information on what is going on. The accuracy of the parser
# may vary considerably
#
-# Example usage: trace-pagealloc-postprocess.pl < /sys/kernel/debug/tracing/trace_pipe
+# Example usage: trace-pagealloc-postprocess.pl < /sys/kernel/tracing/trace_pipe
# other options
# --prepend-parent Report on the parent proc and PID
# --read-procstat If the trace lacks process info, get it from /proc
@@ -94,7 +94,7 @@ sub generate_traceevent_regex {
my $regex;
# Read the event format or use the default
- if (!open (FORMAT, "/sys/kernel/debug/tracing/events/$event/format")) {
+ if (!open (FORMAT, "/sys/kernel/tracing/events/$event/format")) {
$regex = $default;
} else {
my $line;
diff --git a/Documentation/trace/postprocess/trace-vmscan-postprocess.pl b/Documentation/trace/postprocess/trace-vmscan-postprocess.pl
index 995da15b16ca..048dc0dbce64 100644
--- a/Documentation/trace/postprocess/trace-vmscan-postprocess.pl
+++ b/Documentation/trace/postprocess/trace-vmscan-postprocess.pl
@@ -1,9 +1,9 @@
-#!/usr/bin/perl
+#!/usr/bin/env perl
# This is a POC for reading the text representation of trace output related to
# page reclaim. It makes an attempt to extract some high-level information on
# what is going on. The accuracy of the parser may vary
#
-# Example usage: trace-vmscan-postprocess.pl < /sys/kernel/debug/tracing/trace_pipe
+# Example usage: trace-vmscan-postprocess.pl < /sys/kernel/tracing/trace_pipe
# other options
# --read-procstat If the trace lacks process info, get it from /proc
# --ignore-pid Aggregate processes of the same name together
@@ -107,14 +107,14 @@ GetOptions(
);
# Defaults for dynamically discovered regex's
-my $regex_direct_begin_default = 'order=([0-9]*) may_writepage=([0-9]*) gfp_flags=([A-Z_|]*)';
+my $regex_direct_begin_default = 'order=([0-9]*) gfp_flags=([A-Z_|]*)';
my $regex_direct_end_default = 'nr_reclaimed=([0-9]*)';
my $regex_kswapd_wake_default = 'nid=([0-9]*) order=([0-9]*)';
my $regex_kswapd_sleep_default = 'nid=([0-9]*)';
-my $regex_wakeup_kswapd_default = 'nid=([0-9]*) zid=([0-9]*) order=([0-9]*) gfp_flags=([A-Z_|]*)';
-my $regex_lru_isolate_default = 'isolate_mode=([0-9]*) classzone_idx=([0-9]*) order=([0-9]*) nr_requested=([0-9]*) nr_scanned=([0-9]*) nr_skipped=([0-9]*) nr_taken=([0-9]*) lru=([a-z_]*)';
+my $regex_wakeup_kswapd_default = 'nid=([0-9]*) order=([0-9]*) gfp_flags=([A-Z_|]*)';
+my $regex_lru_isolate_default = 'classzone=([0-9]*) order=([0-9]*) nr_requested=([0-9]*) nr_scanned=([0-9]*) nr_skipped=([0-9]*) nr_taken=([0-9]*) lru=([a-z_]*)';
my $regex_lru_shrink_inactive_default = 'nid=([0-9]*) nr_scanned=([0-9]*) nr_reclaimed=([0-9]*) nr_dirty=([0-9]*) nr_writeback=([0-9]*) nr_congested=([0-9]*) nr_immediate=([0-9]*) nr_activate_anon=([0-9]*) nr_activate_file=([0-9]*) nr_ref_keep=([0-9]*) nr_unmap_fail=([0-9]*) priority=([0-9]*) flags=([A-Z_|]*)';
-my $regex_lru_shrink_active_default = 'lru=([A-Z_]*) nr_scanned=([0-9]*) nr_rotated=([0-9]*) priority=([0-9]*)';
+my $regex_lru_shrink_active_default = 'lru=([A-Z_]*) nr_taken=([0-9]*) nr_active=([0-9]*) nr_deactivated=([0-9]*) nr_referenced=([0-9]*) priority=([0-9]*) flags=([A-Z_|]*)' ;
my $regex_writepage_default = 'page=([0-9a-f]*) pfn=([0-9]*) flags=([A-Z_|]*)';
# Dyanically discovered regex
@@ -140,7 +140,7 @@ sub generate_traceevent_regex {
my $regex;
# Read the event format or use the default
- if (!open (FORMAT, "/sys/kernel/debug/tracing/events/$event/format")) {
+ if (!open (FORMAT, "/sys/kernel/tracing/events/$event/format")) {
print("WARNING: Event $event format string not found\n");
return $default;
} else {
@@ -184,8 +184,7 @@ sub generate_traceevent_regex {
$regex_direct_begin = generate_traceevent_regex(
"vmscan/mm_vmscan_direct_reclaim_begin",
$regex_direct_begin_default,
- "order", "may_writepage",
- "gfp_flags");
+ "order", "gfp_flags");
$regex_direct_end = generate_traceevent_regex(
"vmscan/mm_vmscan_direct_reclaim_end",
$regex_direct_end_default,
@@ -201,11 +200,11 @@ $regex_kswapd_sleep = generate_traceevent_regex(
$regex_wakeup_kswapd = generate_traceevent_regex(
"vmscan/mm_vmscan_wakeup_kswapd",
$regex_wakeup_kswapd_default,
- "nid", "zid", "order", "gfp_flags");
+ "nid", "order", "gfp_flags");
$regex_lru_isolate = generate_traceevent_regex(
"vmscan/mm_vmscan_lru_isolate",
$regex_lru_isolate_default,
- "isolate_mode", "classzone_idx", "order",
+ "classzone", "order",
"nr_requested", "nr_scanned", "nr_skipped", "nr_taken",
"lru");
$regex_lru_shrink_inactive = generate_traceevent_regex(
@@ -218,11 +217,10 @@ $regex_lru_shrink_inactive = generate_traceevent_regex(
$regex_lru_shrink_active = generate_traceevent_regex(
"vmscan/mm_vmscan_lru_shrink_active",
$regex_lru_shrink_active_default,
- "nid", "zid",
- "lru",
- "nr_scanned", "nr_rotated", "priority");
+ "nid", "nr_taken", "nr_active", "nr_deactivated", "nr_referenced",
+ "priority", "flags");
$regex_writepage = generate_traceevent_regex(
- "vmscan/mm_vmscan_writepage",
+ "vmscan/mm_vmscan_write_folio",
$regex_writepage_default,
"page", "pfn", "flags");
@@ -371,7 +369,7 @@ EVENT_PROCESS:
print " $regex_wakeup_kswapd\n";
next;
}
- my $order = $3;
+ my $order = $2;
$perprocesspid{$process_pid}->{MM_VMSCAN_WAKEUP_KSWAPD_PERORDER}[$order]++;
} elsif ($tracepoint eq "mm_vmscan_lru_isolate") {
$details = $6;
@@ -381,18 +379,14 @@ EVENT_PROCESS:
print " $regex_lru_isolate/o\n";
next;
}
- my $isolate_mode = $1;
- my $nr_scanned = $5;
- my $file = $8;
-
- # To closer match vmstat scanning statistics, only count isolate_both
- # and isolate_inactive as scanning. isolate_active is rotation
- # isolate_inactive == 1
- # isolate_active == 2
- # isolate_both == 3
- if ($isolate_mode != 2) {
+ my $nr_scanned = $4;
+ my $lru = $7;
+
+ # To closer match vmstat scanning statistics, only count
+ # inactive lru as scanning
+ if ($lru =~ /inactive_/) {
$perprocesspid{$process_pid}->{HIGH_NR_SCANNED} += $nr_scanned;
- if ($file =~ /_file/) {
+ if ($lru =~ /_file/) {
$perprocesspid{$process_pid}->{HIGH_NR_FILE_SCANNED} += $nr_scanned;
} else {
$perprocesspid{$process_pid}->{HIGH_NR_ANON_SCANNED} += $nr_scanned;
diff --git a/Documentation/trace/ring-buffer-design.txt b/Documentation/trace/ring-buffer-design.rst
index ff747b6fa39b..c5d77fcbb5bc 100644
--- a/Documentation/trace/ring-buffer-design.txt
+++ b/Documentation/trace/ring-buffer-design.rst
@@ -1,11 +1,15 @@
- Lockless Ring Buffer Design
- ===========================
+.. SPDX-License-Identifier: GPL-2.0 OR GFDL-1.2-no-invariants-only
+
+===========================
+Lockless Ring Buffer Design
+===========================
Copyright 2009 Red Hat Inc.
- Author: Steven Rostedt <srostedt@redhat.com>
- License: The GNU Free Documentation License, Version 1.2
- (dual licensed under the GPL v2)
-Reviewers: Mathieu Desnoyers, Huang Ying, Hidetoshi Seto,
+
+:Author: Steven Rostedt <srostedt@redhat.com>
+:License: The GNU Free Documentation License, Version 1.2
+ (dual licensed under the GPL v2)
+:Reviewers: Mathieu Desnoyers, Huang Ying, Hidetoshi Seto,
and Frederic Weisbecker.
@@ -14,37 +18,50 @@ Written for: 2.6.31
Terminology used in this Document
---------------------------------
-tail - where new writes happen in the ring buffer.
+tail
+ - where new writes happen in the ring buffer.
-head - where new reads happen in the ring buffer.
+head
+ - where new reads happen in the ring buffer.
-producer - the task that writes into the ring buffer (same as writer)
+producer
+ - the task that writes into the ring buffer (same as writer)
-writer - same as producer
+writer
+ - same as producer
-consumer - the task that reads from the buffer (same as reader)
+consumer
+ - the task that reads from the buffer (same as reader)
-reader - same as consumer.
+reader
+ - same as consumer.
-reader_page - A page outside the ring buffer used solely (for the most part)
- by the reader.
+reader_page
+ - A page outside the ring buffer used solely (for the most part)
+ by the reader.
-head_page - a pointer to the page that the reader will use next
+head_page
+ - a pointer to the page that the reader will use next
-tail_page - a pointer to the page that will be written to next
+tail_page
+ - a pointer to the page that will be written to next
-commit_page - a pointer to the page with the last finished non-nested write.
+commit_page
+ - a pointer to the page with the last finished non-nested write.
-cmpxchg - hardware-assisted atomic transaction that performs the following:
+cmpxchg
+ - hardware-assisted atomic transaction that performs the following::
- A = B iff previous A == C
+ A = B if previous A == C
- R = cmpxchg(A, C, B) is saying that we replace A with B if and only if
- current A is equal to C, and we put the old (current) A into R
+ R = cmpxchg(A, C, B) is saying that we replace A with B if and only
+ if current A is equal to C, and we put the old (current)
+ A into R
- R gets the previous A regardless if A is updated with B or not.
+ R gets the previous A regardless if A is updated with B or not.
- To see if the update was successful a compare of R == C may be used.
+ To see if the update was successful a compare of ``R == C``
+ may be used.
The Generic Ring Buffer
-----------------------
@@ -64,7 +81,7 @@ No two writers can write at the same time (on the same per-cpu buffer),
but a writer may interrupt another writer, but it must finish writing
before the previous writer may continue. This is very important to the
algorithm. The writers act like a "stack". The way interrupts works
-enforces this behavior.
+enforces this behavior::
writer1 start
@@ -115,6 +132,8 @@ A sample of how the reader page is swapped: Note this does not
show the head page in the buffer, it is for demonstrating a swap
only.
+::
+
+------+
|reader| RING BUFFER
|page |
@@ -172,21 +191,22 @@ only.
It is possible that the page swapped is the commit page and the tail page,
if what is in the ring buffer is less than what is held in a buffer page.
-
- reader page commit page tail page
- | | |
- v | |
- +---+ | |
- | |<----------+ |
- | |<------------------------+
- | |------+
- +---+ |
- |
- v
- +---+ +---+ +---+ +---+
-<---| |--->| |--->| |--->| |--->
---->| |<---| |<---| |<---| |<---
- +---+ +---+ +---+ +---+
+::
+
+ reader page commit page tail page
+ | | |
+ v | |
+ +---+ | |
+ | |<----------+ |
+ | |<------------------------+
+ | |------+
+ +---+ |
+ |
+ v
+ +---+ +---+ +---+ +---+
+ <---| |--->| |--->| |--->| |--->
+ --->| |<---| |<---| |<---| |<---
+ +---+ +---+ +---+ +---+
This case is still valid for this algorithm.
When the writer leaves the page, it simply goes into the ring buffer
@@ -196,15 +216,19 @@ buffer.
The main pointers:
- reader page - The page used solely by the reader and is not part
- of the ring buffer (may be swapped in)
+ reader page
+ - The page used solely by the reader and is not part
+ of the ring buffer (may be swapped in)
- head page - the next page in the ring buffer that will be swapped
+ head page
+ - the next page in the ring buffer that will be swapped
with the reader page.
- tail page - the page where the next write will take place.
+ tail page
+ - the page where the next write will take place.
- commit page - the page that last finished a write.
+ commit page
+ - the page that last finished a write.
The commit page only is updated by the outermost writer in the
writer stack. A writer that preempts another writer will not move the
@@ -219,7 +243,7 @@ transaction. If another write happens it must finish before continuing
with the previous write.
- Write reserve:
+ Write reserve::
Buffer page
+---------+
@@ -230,7 +254,7 @@ with the previous write.
| empty |
+---------+
- Write commit:
+ Write commit::
Buffer page
+---------+
@@ -242,7 +266,7 @@ with the previous write.
+---------+
- If a write happens after the first reserve:
+ If a write happens after the first reserve::
Buffer page
+---------+
@@ -253,7 +277,7 @@ with the previous write.
|reserved |
+---------+ <--- tail pointer
- After second writer commits:
+ After second writer commits::
Buffer page
@@ -266,7 +290,7 @@ with the previous write.
|commit |
+---------+ <--- tail pointer
- When the first writer commits:
+ When the first writer commits::
Buffer page
+---------+
@@ -292,21 +316,22 @@ be several pages ahead. If the tail page catches up to the commit
page then no more writes may take place (regardless of the mode
of the ring buffer: overwrite and produce/consumer).
-The order of pages is:
+The order of pages is::
head page
commit page
tail page
-Possible scenario:
- tail page
- head page commit page |
- | | |
- v v v
- +---+ +---+ +---+ +---+
-<---| |--->| |--->| |--->| |--->
---->| |<---| |<---| |<---| |<---
- +---+ +---+ +---+ +---+
+Possible scenario::
+
+ tail page
+ head page commit page |
+ | | |
+ v v v
+ +---+ +---+ +---+ +---+
+ <---| |--->| |--->| |--->| |--->
+ --->| |<---| |<---| |<---| |<---
+ +---+ +---+ +---+ +---+
There is a special case that the head page is after either the commit page
and possibly the tail page. That is when the commit (and tail) page has been
@@ -315,24 +340,25 @@ part of the ring buffer, but the reader page is not. Whenever there
has been less than a full page that has been committed inside the ring buffer,
and a reader swaps out a page, it will be swapping out the commit page.
-
- reader page commit page tail page
- | | |
- v | |
- +---+ | |
- | |<----------+ |
- | |<------------------------+
- | |------+
- +---+ |
- |
- v
- +---+ +---+ +---+ +---+
-<---| |--->| |--->| |--->| |--->
---->| |<---| |<---| |<---| |<---
- +---+ +---+ +---+ +---+
- ^
- |
- head page
+::
+
+ reader page commit page tail page
+ | | |
+ v | |
+ +---+ | |
+ | |<----------+ |
+ | |<------------------------+
+ | |------+
+ +---+ |
+ |
+ v
+ +---+ +---+ +---+ +---+
+ <---| |--->| |--->| |--->| |--->
+ --->| |<---| |<---| |<---| |<---
+ +---+ +---+ +---+ +---+
+ ^
+ |
+ head page
In this case, the head page will not move when the tail and commit
@@ -347,42 +373,42 @@ When the tail meets the head page, if the buffer is in overwrite mode,
the head page will be pushed ahead one. If the buffer is in producer/consumer
mode, the write will fail.
-Overwrite mode:
-
- tail page
- |
- v
- +---+ +---+ +---+ +---+
-<---| |--->| |--->| |--->| |--->
---->| |<---| |<---| |<---| |<---
- +---+ +---+ +---+ +---+
- ^
- |
- head page
-
-
- tail page
- |
- v
- +---+ +---+ +---+ +---+
-<---| |--->| |--->| |--->| |--->
---->| |<---| |<---| |<---| |<---
- +---+ +---+ +---+ +---+
- ^
- |
- head page
-
-
- tail page
- |
- v
- +---+ +---+ +---+ +---+
-<---| |--->| |--->| |--->| |--->
---->| |<---| |<---| |<---| |<---
- +---+ +---+ +---+ +---+
- ^
- |
- head page
+Overwrite mode::
+
+ tail page
+ |
+ v
+ +---+ +---+ +---+ +---+
+ <---| |--->| |--->| |--->| |--->
+ --->| |<---| |<---| |<---| |<---
+ +---+ +---+ +---+ +---+
+ ^
+ |
+ head page
+
+
+ tail page
+ |
+ v
+ +---+ +---+ +---+ +---+
+ <---| |--->| |--->| |--->| |--->
+ --->| |<---| |<---| |<---| |<---
+ +---+ +---+ +---+ +---+
+ ^
+ |
+ head page
+
+
+ tail page
+ |
+ v
+ +---+ +---+ +---+ +---+
+ <---| |--->| |--->| |--->| |--->
+ --->| |<---| |<---| |<---| |<---
+ +---+ +---+ +---+ +---+
+ ^
+ |
+ head page
Note, the reader page will still point to the previous head page.
But when a swap takes place, it will use the most recent head page.
@@ -397,7 +423,7 @@ State flags are placed inside the pointer to the page. To do this,
each page must be aligned in memory by 4 bytes. This will allow the 2
least significant bits of the address to be used as flags, since
they will always be zero for the address. To get the address,
-simply mask out the flags.
+simply mask out the flags::
MASK = ~3
@@ -405,24 +431,27 @@ simply mask out the flags.
Two flags will be kept by these two bits:
- HEADER - the page being pointed to is a head page
+ HEADER
+ - the page being pointed to is a head page
- UPDATE - the page being pointed to is being updated by a writer
+ UPDATE
+ - the page being pointed to is being updated by a writer
and was or is about to be a head page.
+::
- reader page
- |
- v
- +---+
- | |------+
- +---+ |
- |
- v
- +---+ +---+ +---+ +---+
-<---| |--->| |-H->| |--->| |--->
---->| |<---| |<---| |<---| |<---
- +---+ +---+ +---+ +---+
+ reader page
+ |
+ v
+ +---+
+ | |------+
+ +---+ |
+ |
+ v
+ +---+ +---+ +---+ +---+
+ <---| |--->| |-H->| |--->| |--->
+ --->| |<---| |<---| |<---| |<---
+ +---+ +---+ +---+ +---+
The above pointer "-H->" would have the HEADER flag set. That is
@@ -430,24 +459,24 @@ the next page is the next page to be swapped out by the reader.
This pointer means the next page is the head page.
When the tail page meets the head pointer, it will use cmpxchg to
-change the pointer to the UPDATE state:
+change the pointer to the UPDATE state::
- tail page
- |
- v
- +---+ +---+ +---+ +---+
-<---| |--->| |-H->| |--->| |--->
---->| |<---| |<---| |<---| |<---
- +---+ +---+ +---+ +---+
+ tail page
+ |
+ v
+ +---+ +---+ +---+ +---+
+ <---| |--->| |-H->| |--->| |--->
+ --->| |<---| |<---| |<---| |<---
+ +---+ +---+ +---+ +---+
- tail page
- |
- v
- +---+ +---+ +---+ +---+
-<---| |--->| |-U->| |--->| |--->
---->| |<---| |<---| |<---| |<---
- +---+ +---+ +---+ +---+
+ tail page
+ |
+ v
+ +---+ +---+ +---+ +---+
+ <---| |--->| |-U->| |--->| |--->
+ --->| |<---| |<---| |<---| |<---
+ +---+ +---+ +---+ +---+
"-U->" represents a pointer in the UPDATE state.
@@ -462,7 +491,7 @@ head page does not have the HEADER flag set, the compare will fail
and the reader will need to look for the new head page and try again.
Note, the flags UPDATE and HEADER are never set at the same time.
-The reader swaps the reader page as follows:
+The reader swaps the reader page as follows::
+------+
|reader| RING BUFFER
@@ -477,7 +506,7 @@ The reader swaps the reader page as follows:
+-----H-------------+
The reader sets the reader page next pointer as HEADER to the page after
-the head page.
+the head page::
+------+
@@ -495,7 +524,7 @@ the head page.
It does a cmpxchg with the pointer to the previous head page to make it
point to the reader page. Note that the new pointer does not have the HEADER
-flag set. This action atomically moves the head page forward.
+flag set. This action atomically moves the head page forward::
+------+
|reader| RING BUFFER
@@ -511,7 +540,7 @@ flag set. This action atomically moves the head page forward.
+------------------------------------+
After the new head page is set, the previous pointer of the head page is
-updated to the reader page.
+updated to the reader page::
+------+
|reader| RING BUFFER
@@ -548,7 +577,7 @@ prev pointers may not.
Note, the way to determine a reader page is simply by examining the previous
pointer of the page. If the next pointer of the previous page does not
-point back to the original page, then the original page is a reader page:
+point back to the original page, then the original page is a reader page::
+--------+
@@ -572,54 +601,54 @@ not be able to swap the head page from the buffer, nor will it be able to
move the head page, until the writer is finished with the move.
This eliminates any races that the reader can have on the writer. The reader
-must spin, and this is why the reader cannot preempt the writer.
-
- tail page
- |
- v
- +---+ +---+ +---+ +---+
-<---| |--->| |-H->| |--->| |--->
---->| |<---| |<---| |<---| |<---
- +---+ +---+ +---+ +---+
-
- tail page
- |
- v
- +---+ +---+ +---+ +---+
-<---| |--->| |-U->| |--->| |--->
---->| |<---| |<---| |<---| |<---
- +---+ +---+ +---+ +---+
-
-The following page will be made into the new head page.
-
- tail page
- |
- v
- +---+ +---+ +---+ +---+
-<---| |--->| |-U->| |-H->| |--->
---->| |<---| |<---| |<---| |<---
- +---+ +---+ +---+ +---+
+must spin, and this is why the reader cannot preempt the writer::
+
+ tail page
+ |
+ v
+ +---+ +---+ +---+ +---+
+ <---| |--->| |-H->| |--->| |--->
+ --->| |<---| |<---| |<---| |<---
+ +---+ +---+ +---+ +---+
+
+ tail page
+ |
+ v
+ +---+ +---+ +---+ +---+
+ <---| |--->| |-U->| |--->| |--->
+ --->| |<---| |<---| |<---| |<---
+ +---+ +---+ +---+ +---+
+
+The following page will be made into the new head page::
+
+ tail page
+ |
+ v
+ +---+ +---+ +---+ +---+
+ <---| |--->| |-U->| |-H->| |--->
+ --->| |<---| |<---| |<---| |<---
+ +---+ +---+ +---+ +---+
After the new head page has been set, we can set the old head page
-pointer back to NORMAL.
+pointer back to NORMAL::
- tail page
- |
- v
- +---+ +---+ +---+ +---+
-<---| |--->| |--->| |-H->| |--->
---->| |<---| |<---| |<---| |<---
- +---+ +---+ +---+ +---+
+ tail page
+ |
+ v
+ +---+ +---+ +---+ +---+
+ <---| |--->| |--->| |-H->| |--->
+ --->| |<---| |<---| |<---| |<---
+ +---+ +---+ +---+ +---+
-After the head page has been moved, the tail page may now move forward.
+After the head page has been moved, the tail page may now move forward::
- tail page
- |
- v
- +---+ +---+ +---+ +---+
-<---| |--->| |--->| |-H->| |--->
---->| |<---| |<---| |<---| |<---
- +---+ +---+ +---+ +---+
+ tail page
+ |
+ v
+ +---+ +---+ +---+ +---+
+ <---| |--->| |--->| |-H->| |--->
+ --->| |<---| |<---| |<---| |<---
+ +---+ +---+ +---+ +---+
The above are the trivial updates. Now for the more complex scenarios.
@@ -630,26 +659,26 @@ tail page may make it all the way around the buffer and meet the commit
page. At this time, we must start dropping writes (usually with some kind
of warning to the user). But what happens if the commit was still on the
reader page? The commit page is not part of the ring buffer. The tail page
-must account for this.
-
-
- reader page commit page
- | |
- v |
- +---+ |
- | |<----------+
- | |
- | |------+
- +---+ |
- |
- v
- +---+ +---+ +---+ +---+
-<---| |--->| |-H->| |--->| |--->
---->| |<---| |<---| |<---| |<---
- +---+ +---+ +---+ +---+
- ^
- |
- tail page
+must account for this::
+
+
+ reader page commit page
+ | |
+ v |
+ +---+ |
+ | |<----------+
+ | |
+ | |------+
+ +---+ |
+ |
+ v
+ +---+ +---+ +---+ +---+
+ <---| |--->| |-H->| |--->| |--->
+ --->| |<---| |<---| |<---| |<---
+ +---+ +---+ +---+ +---+
+ ^
+ |
+ tail page
If the tail page were to simply push the head page forward, the commit when
leaving the reader page would not be pointing to the correct page.
@@ -676,7 +705,7 @@ the head page if the head page is the next page. If the head page
is not the next page, the tail page is simply updated with a cmpxchg.
Only writers move the tail page. This must be done atomically to protect
-against nested writers.
+against nested writers::
temp_page = tail_page
next_page = temp_page->next
@@ -684,54 +713,54 @@ against nested writers.
The above will update the tail page if it is still pointing to the expected
page. If this fails, a nested write pushed it forward, the current write
-does not need to push it.
-
-
- temp page
- |
- v
- tail page
- |
- v
- +---+ +---+ +---+ +---+
-<---| |--->| |--->| |--->| |--->
---->| |<---| |<---| |<---| |<---
- +---+ +---+ +---+ +---+
-
-Nested write comes in and moves the tail page forward:
-
- tail page (moved by nested writer)
- temp page |
- | |
- v v
- +---+ +---+ +---+ +---+
-<---| |--->| |--->| |--->| |--->
---->| |<---| |<---| |<---| |<---
- +---+ +---+ +---+ +---+
+does not need to push it::
+
+
+ temp page
+ |
+ v
+ tail page
+ |
+ v
+ +---+ +---+ +---+ +---+
+ <---| |--->| |--->| |--->| |--->
+ --->| |<---| |<---| |<---| |<---
+ +---+ +---+ +---+ +---+
+
+Nested write comes in and moves the tail page forward::
+
+ tail page (moved by nested writer)
+ temp page |
+ | |
+ v v
+ +---+ +---+ +---+ +---+
+ <---| |--->| |--->| |--->| |--->
+ --->| |<---| |<---| |<---| |<---
+ +---+ +---+ +---+ +---+
The above would fail the cmpxchg, but since the tail page has already
been moved forward, the writer will just try again to reserve storage
on the new tail page.
-But the moving of the head page is a bit more complex.
+But the moving of the head page is a bit more complex::
- tail page
- |
- v
- +---+ +---+ +---+ +---+
-<---| |--->| |-H->| |--->| |--->
---->| |<---| |<---| |<---| |<---
- +---+ +---+ +---+ +---+
+ tail page
+ |
+ v
+ +---+ +---+ +---+ +---+
+ <---| |--->| |-H->| |--->| |--->
+ --->| |<---| |<---| |<---| |<---
+ +---+ +---+ +---+ +---+
-The write converts the head page pointer to UPDATE.
+The write converts the head page pointer to UPDATE::
- tail page
- |
- v
- +---+ +---+ +---+ +---+
-<---| |--->| |-U->| |--->| |--->
---->| |<---| |<---| |<---| |<---
- +---+ +---+ +---+ +---+
+ tail page
+ |
+ v
+ +---+ +---+ +---+ +---+
+ <---| |--->| |-U->| |--->| |--->
+ --->| |<---| |<---| |<---| |<---
+ +---+ +---+ +---+ +---+
But if a nested writer preempts here, it will see that the next
page is a head page, but it is also nested. It will detect that
@@ -739,217 +768,216 @@ it is nested and will save that information. The detection is the
fact that it sees the UPDATE flag instead of a HEADER or NORMAL
pointer.
-The nested writer will set the new head page pointer.
+The nested writer will set the new head page pointer::
- tail page
- |
- v
- +---+ +---+ +---+ +---+
-<---| |--->| |-U->| |-H->| |--->
---->| |<---| |<---| |<---| |<---
- +---+ +---+ +---+ +---+
+ tail page
+ |
+ v
+ +---+ +---+ +---+ +---+
+ <---| |--->| |-U->| |-H->| |--->
+ --->| |<---| |<---| |<---| |<---
+ +---+ +---+ +---+ +---+
But it will not reset the update back to normal. Only the writer
that converted a pointer from HEAD to UPDATE will convert it back
-to NORMAL.
+to NORMAL::
- tail page
- |
- v
- +---+ +---+ +---+ +---+
-<---| |--->| |-U->| |-H->| |--->
---->| |<---| |<---| |<---| |<---
- +---+ +---+ +---+ +---+
+ tail page
+ |
+ v
+ +---+ +---+ +---+ +---+
+ <---| |--->| |-U->| |-H->| |--->
+ --->| |<---| |<---| |<---| |<---
+ +---+ +---+ +---+ +---+
After the nested writer finishes, the outermost writer will convert
-the UPDATE pointer to NORMAL.
+the UPDATE pointer to NORMAL::
- tail page
- |
- v
- +---+ +---+ +---+ +---+
-<---| |--->| |--->| |-H->| |--->
---->| |<---| |<---| |<---| |<---
- +---+ +---+ +---+ +---+
+ tail page
+ |
+ v
+ +---+ +---+ +---+ +---+
+ <---| |--->| |--->| |-H->| |--->
+ --->| |<---| |<---| |<---| |<---
+ +---+ +---+ +---+ +---+
It can be even more complex if several nested writes came in and moved
-the tail page ahead several pages:
+the tail page ahead several pages::
-(first writer)
+ (first writer)
- tail page
- |
- v
- +---+ +---+ +---+ +---+
-<---| |--->| |-H->| |--->| |--->
---->| |<---| |<---| |<---| |<---
- +---+ +---+ +---+ +---+
+ tail page
+ |
+ v
+ +---+ +---+ +---+ +---+
+ <---| |--->| |-H->| |--->| |--->
+ --->| |<---| |<---| |<---| |<---
+ +---+ +---+ +---+ +---+
-The write converts the head page pointer to UPDATE.
+The write converts the head page pointer to UPDATE::
- tail page
- |
- v
- +---+ +---+ +---+ +---+
-<---| |--->| |-U->| |--->| |--->
---->| |<---| |<---| |<---| |<---
- +---+ +---+ +---+ +---+
+ tail page
+ |
+ v
+ +---+ +---+ +---+ +---+
+ <---| |--->| |-U->| |--->| |--->
+ --->| |<---| |<---| |<---| |<---
+ +---+ +---+ +---+ +---+
Next writer comes in, and sees the update and sets up the new
-head page.
+head page::
-(second writer)
+ (second writer)
- tail page
- |
- v
- +---+ +---+ +---+ +---+
-<---| |--->| |-U->| |-H->| |--->
---->| |<---| |<---| |<---| |<---
- +---+ +---+ +---+ +---+
+ tail page
+ |
+ v
+ +---+ +---+ +---+ +---+
+ <---| |--->| |-U->| |-H->| |--->
+ --->| |<---| |<---| |<---| |<---
+ +---+ +---+ +---+ +---+
The nested writer moves the tail page forward. But does not set the old
-update page to NORMAL because it is not the outermost writer.
+update page to NORMAL because it is not the outermost writer::
- tail page
- |
- v
- +---+ +---+ +---+ +---+
-<---| |--->| |-U->| |-H->| |--->
---->| |<---| |<---| |<---| |<---
- +---+ +---+ +---+ +---+
+ tail page
+ |
+ v
+ +---+ +---+ +---+ +---+
+ <---| |--->| |-U->| |-H->| |--->
+ --->| |<---| |<---| |<---| |<---
+ +---+ +---+ +---+ +---+
Another writer preempts and sees the page after the tail page is a head page.
-It changes it from HEAD to UPDATE.
+It changes it from HEAD to UPDATE::
-(third writer)
+ (third writer)
- tail page
- |
- v
- +---+ +---+ +---+ +---+
-<---| |--->| |-U->| |-U->| |--->
---->| |<---| |<---| |<---| |<---
- +---+ +---+ +---+ +---+
+ tail page
+ |
+ v
+ +---+ +---+ +---+ +---+
+ <---| |--->| |-U->| |-U->| |--->
+ --->| |<---| |<---| |<---| |<---
+ +---+ +---+ +---+ +---+
-The writer will move the head page forward:
+The writer will move the head page forward::
-(third writer)
+ (third writer)
- tail page
- |
- v
- +---+ +---+ +---+ +---+
-<---| |--->| |-U->| |-U->| |-H->
---->| |<---| |<---| |<---| |<---
- +---+ +---+ +---+ +---+
+ tail page
+ |
+ v
+ +---+ +---+ +---+ +---+
+ <---| |--->| |-U->| |-U->| |-H->
+ --->| |<---| |<---| |<---| |<---
+ +---+ +---+ +---+ +---+
But now that the third writer did change the HEAD flag to UPDATE it
-will convert it to normal:
+will convert it to normal::
-(third writer)
+ (third writer)
- tail page
- |
- v
- +---+ +---+ +---+ +---+
-<---| |--->| |-U->| |--->| |-H->
---->| |<---| |<---| |<---| |<---
- +---+ +---+ +---+ +---+
+ tail page
+ |
+ v
+ +---+ +---+ +---+ +---+
+ <---| |--->| |-U->| |--->| |-H->
+ --->| |<---| |<---| |<---| |<---
+ +---+ +---+ +---+ +---+
-Then it will move the tail page, and return back to the second writer.
+Then it will move the tail page, and return back to the second writer::
-(second writer)
+ (second writer)
- tail page
- |
- v
- +---+ +---+ +---+ +---+
-<---| |--->| |-U->| |--->| |-H->
---->| |<---| |<---| |<---| |<---
- +---+ +---+ +---+ +---+
+ tail page
+ |
+ v
+ +---+ +---+ +---+ +---+
+ <---| |--->| |-U->| |--->| |-H->
+ --->| |<---| |<---| |<---| |<---
+ +---+ +---+ +---+ +---+
The second writer will fail to move the tail page because it was already
moved, so it will try again and add its data to the new tail page.
-It will return to the first writer.
+It will return to the first writer::
-(first writer)
+ (first writer)
- tail page
- |
- v
- +---+ +---+ +---+ +---+
-<---| |--->| |-U->| |--->| |-H->
---->| |<---| |<---| |<---| |<---
- +---+ +---+ +---+ +---+
+ tail page
+ |
+ v
+ +---+ +---+ +---+ +---+
+ <---| |--->| |-U->| |--->| |-H->
+ --->| |<---| |<---| |<---| |<---
+ +---+ +---+ +---+ +---+
The first writer cannot know atomically if the tail page moved
while it updates the HEAD page. It will then update the head page to
-what it thinks is the new head page.
+what it thinks is the new head page::
-(first writer)
+ (first writer)
- tail page
- |
- v
- +---+ +---+ +---+ +---+
-<---| |--->| |-U->| |-H->| |-H->
---->| |<---| |<---| |<---| |<---
- +---+ +---+ +---+ +---+
+ tail page
+ |
+ v
+ +---+ +---+ +---+ +---+
+ <---| |--->| |-U->| |-H->| |-H->
+ --->| |<---| |<---| |<---| |<---
+ +---+ +---+ +---+ +---+
Since the cmpxchg returns the old value of the pointer the first writer
will see it succeeded in updating the pointer from NORMAL to HEAD.
But as we can see, this is not good enough. It must also check to see
-if the tail page is either where it use to be or on the next page:
+if the tail page is either where it use to be or on the next page::
-(first writer)
+ (first writer)
- A B tail page
- | | |
- v v v
- +---+ +---+ +---+ +---+
-<---| |--->| |-U->| |-H->| |-H->
---->| |<---| |<---| |<---| |<---
- +---+ +---+ +---+ +---+
+ A B tail page
+ | | |
+ v v v
+ +---+ +---+ +---+ +---+
+ <---| |--->| |-U->| |-H->| |-H->
+ --->| |<---| |<---| |<---| |<---
+ +---+ +---+ +---+ +---+
If tail page != A and tail page != B, then it must reset the pointer
back to NORMAL. The fact that it only needs to worry about nested
-writers means that it only needs to check this after setting the HEAD page.
+writers means that it only needs to check this after setting the HEAD page::
-(first writer)
+ (first writer)
- A B tail page
- | | |
- v v v
- +---+ +---+ +---+ +---+
-<---| |--->| |-U->| |--->| |-H->
---->| |<---| |<---| |<---| |<---
- +---+ +---+ +---+ +---+
+ A B tail page
+ | | |
+ v v v
+ +---+ +---+ +---+ +---+
+ <---| |--->| |-U->| |--->| |-H->
+ --->| |<---| |<---| |<---| |<---
+ +---+ +---+ +---+ +---+
Now the writer can update the head page. This is also why the head page must
remain in UPDATE and only reset by the outermost writer. This prevents
-the reader from seeing the incorrect head page.
-
+the reader from seeing the incorrect head page::
-(first writer)
- A B tail page
- | | |
- v v v
- +---+ +---+ +---+ +---+
-<---| |--->| |--->| |--->| |-H->
---->| |<---| |<---| |<---| |<---
- +---+ +---+ +---+ +---+
+ (first writer)
+ A B tail page
+ | | |
+ v v v
+ +---+ +---+ +---+ +---+
+ <---| |--->| |--->| |--->| |-H->
+ --->| |<---| |<---| |<---| |<---
+ +---+ +---+ +---+ +---+
diff --git a/Documentation/trace/rv/da_monitor_instrumentation.rst b/Documentation/trace/rv/da_monitor_instrumentation.rst
new file mode 100644
index 000000000000..6c67c7b57811
--- /dev/null
+++ b/Documentation/trace/rv/da_monitor_instrumentation.rst
@@ -0,0 +1,171 @@
+Deterministic Automata Instrumentation
+======================================
+
+The RV monitor file created by dot2k, with the name "$MODEL_NAME.c"
+includes a section dedicated to instrumentation.
+
+In the example of the wip.dot monitor created on [1], it will look like::
+
+ /*
+ * This is the instrumentation part of the monitor.
+ *
+ * This is the section where manual work is required. Here the kernel events
+ * are translated into model's event.
+ *
+ */
+ static void handle_preempt_disable(void *data, /* XXX: fill header */)
+ {
+ da_handle_event_wip(preempt_disable_wip);
+ }
+
+ static void handle_preempt_enable(void *data, /* XXX: fill header */)
+ {
+ da_handle_event_wip(preempt_enable_wip);
+ }
+
+ static void handle_sched_waking(void *data, /* XXX: fill header */)
+ {
+ da_handle_event_wip(sched_waking_wip);
+ }
+
+ static int enable_wip(void)
+ {
+ int retval;
+
+ retval = da_monitor_init_wip();
+ if (retval)
+ return retval;
+
+ rv_attach_trace_probe("wip", /* XXX: tracepoint */, handle_preempt_disable);
+ rv_attach_trace_probe("wip", /* XXX: tracepoint */, handle_preempt_enable);
+ rv_attach_trace_probe("wip", /* XXX: tracepoint */, handle_sched_waking);
+
+ return 0;
+ }
+
+The comment at the top of the section explains the general idea: the
+instrumentation section translates *kernel events* into the *model's
+event*.
+
+Tracing callback functions
+--------------------------
+
+The first three functions are the starting point of the callback *handler
+functions* for each of the three events from the wip model. The developer
+does not necessarily need to use them: they are just starting points.
+
+Using the example of::
+
+ void handle_preempt_disable(void *data, /* XXX: fill header */)
+ {
+ da_handle_event_wip(preempt_disable_wip);
+ }
+
+The preempt_disable event from the model connects directly to the
+preemptirq:preempt_disable. The preemptirq:preempt_disable event
+has the following signature, from include/trace/events/preemptirq.h::
+
+ TP_PROTO(unsigned long ip, unsigned long parent_ip)
+
+Hence, the handle_preempt_disable() function will look like::
+
+ void handle_preempt_disable(void *data, unsigned long ip, unsigned long parent_ip)
+
+In this case, the kernel event translates one to one with the automata
+event, and indeed, no other change is required for this function.
+
+The next handler function, handle_preempt_enable() has the same argument
+list from the handle_preempt_disable(). The difference is that the
+preempt_enable event will be used to synchronize the system to the model.
+
+Initially, the *model* is placed in the initial state. However, the *system*
+might or might not be in the initial state. The monitor cannot start
+processing events until it knows that the system has reached the initial state.
+Otherwise, the monitor and the system could be out-of-sync.
+
+Looking at the automata definition, it is possible to see that the system
+and the model are expected to return to the initial state after the
+preempt_enable execution. Hence, it can be used to synchronize the
+system and the model at the initialization of the monitoring section.
+
+The start is informed via a special handle function, the
+"da_handle_start_event_$(MONITOR_NAME)(event)", in this case::
+
+ da_handle_start_event_wip(preempt_enable_wip);
+
+So, the callback function will look like::
+
+ void handle_preempt_enable(void *data, unsigned long ip, unsigned long parent_ip)
+ {
+ da_handle_start_event_wip(preempt_enable_wip);
+ }
+
+Finally, the "handle_sched_waking()" will look like::
+
+ void handle_sched_waking(void *data, struct task_struct *task)
+ {
+ da_handle_event_wip(sched_waking_wip);
+ }
+
+And the explanation is left for the reader as an exercise.
+
+enable and disable functions
+----------------------------
+
+dot2k automatically creates two special functions::
+
+ enable_$(MONITOR_NAME)()
+ disable_$(MONITOR_NAME)()
+
+These functions are called when the monitor is enabled and disabled,
+respectively.
+
+They should be used to *attach* and *detach* the instrumentation to the running
+system. The developer must add to the relative function all that is needed to
+*attach* and *detach* its monitor to the system.
+
+For the wip case, these functions were named::
+
+ enable_wip()
+ disable_wip()
+
+But no change was required because: by default, these functions *attach* and
+*detach* the tracepoints_to_attach, which was enough for this case.
+
+Instrumentation helpers
+-----------------------
+
+To complete the instrumentation, the *handler functions* need to be attached to a
+kernel event, at the monitoring enable phase.
+
+The RV interface also facilitates this step. For example, the macro "rv_attach_trace_probe()"
+is used to connect the wip model events to the relative kernel event. dot2k automatically
+adds "rv_attach_trace_probe()" function call for each model event in the enable phase, as
+a suggestion.
+
+For example, from the wip sample model::
+
+ static int enable_wip(void)
+ {
+ int retval;
+
+ retval = da_monitor_init_wip();
+ if (retval)
+ return retval;
+
+ rv_attach_trace_probe("wip", /* XXX: tracepoint */, handle_preempt_enable);
+ rv_attach_trace_probe("wip", /* XXX: tracepoint */, handle_sched_waking);
+ rv_attach_trace_probe("wip", /* XXX: tracepoint */, handle_preempt_disable);
+
+ return 0;
+ }
+
+The probes then need to be detached at the disable phase.
+
+[1] The wip model is presented in::
+
+ Documentation/trace/rv/deterministic_automata.rst
+
+The wip monitor is presented in::
+
+ Documentation/trace/rv/da_monitor_synthesis.rst
diff --git a/Documentation/trace/rv/da_monitor_synthesis.rst b/Documentation/trace/rv/da_monitor_synthesis.rst
new file mode 100644
index 000000000000..0a92729c8a9b
--- /dev/null
+++ b/Documentation/trace/rv/da_monitor_synthesis.rst
@@ -0,0 +1,147 @@
+Deterministic Automata Monitor Synthesis
+========================================
+
+The starting point for the application of runtime verification (RV) techniques
+is the *specification* or *modeling* of the desired (or undesired) behavior
+of the system under scrutiny.
+
+The formal representation needs to be then *synthesized* into a *monitor*
+that can then be used in the analysis of the trace of the system. The
+*monitor* connects to the system via an *instrumentation* that converts
+the events from the *system* to the events of the *specification*.
+
+
+In Linux terms, the runtime verification monitors are encapsulated inside
+the *RV monitor* abstraction. The RV monitor includes a set of instances
+of the monitor (per-cpu monitor, per-task monitor, and so on), the helper
+functions that glue the monitor to the system reference model, and the
+trace output as a reaction to event parsing and exceptions, as depicted
+below::
+
+ Linux +----- RV Monitor ----------------------------------+ Formal
+ Realm | | Realm
+ +-------------------+ +----------------+ +-----------------+
+ | Linux kernel | | Monitor | | Reference |
+ | Tracing | -> | Instance(s) | <- | Model |
+ | (instrumentation) | | (verification) | | (specification) |
+ +-------------------+ +----------------+ +-----------------+
+ | | |
+ | V |
+ | +----------+ |
+ | | Reaction | |
+ | +--+--+--+-+ |
+ | | | | |
+ | | | +-> trace output ? |
+ +------------------------|--|----------------------+
+ | +----> panic ?
+ +-------> <user-specified>
+
+DA monitor synthesis
+--------------------
+
+The synthesis of automata-based models into the Linux *RV monitor* abstraction
+is automated by the dot2k tool and the rv/da_monitor.h header file that
+contains a set of macros that automatically generate the monitor's code.
+
+dot2k
+-----
+
+The dot2k utility leverages dot2c by converting an automaton model in
+the DOT format into the C representation [1] and creating the skeleton of
+a kernel monitor in C.
+
+For example, it is possible to transform the wip.dot model present in
+[1] into a per-cpu monitor with the following command::
+
+ $ dot2k -d wip.dot -t per_cpu
+
+This will create a directory named wip/ with the following files:
+
+- wip.h: the wip model in C
+- wip.c: the RV monitor
+
+The wip.c file contains the monitor declaration and the starting point for
+the system instrumentation.
+
+Monitor macros
+--------------
+
+The rv/da_monitor.h enables automatic code generation for the *Monitor
+Instance(s)* using C macros.
+
+The benefits of the usage of macro for monitor synthesis are 3-fold as it:
+
+- Reduces the code duplication;
+- Facilitates the bug fix/improvement;
+- Avoids the case of developers changing the core of the monitor code
+ to manipulate the model in a (let's say) non-standard way.
+
+This initial implementation presents three different types of monitor instances:
+
+- ``#define DECLARE_DA_MON_GLOBAL(name, type)``
+- ``#define DECLARE_DA_MON_PER_CPU(name, type)``
+- ``#define DECLARE_DA_MON_PER_TASK(name, type)``
+
+The first declares the functions for a global deterministic automata monitor,
+the second for monitors with per-cpu instances, and the third with per-task
+instances.
+
+In all cases, the 'name' argument is a string that identifies the monitor, and
+the 'type' argument is the data type used by dot2k on the representation of
+the model in C.
+
+For example, the wip model with two states and three events can be
+stored in an 'unsigned char' type. Considering that the preemption control
+is a per-cpu behavior, the monitor declaration in the 'wip.c' file is::
+
+ DECLARE_DA_MON_PER_CPU(wip, unsigned char);
+
+The monitor is executed by sending events to be processed via the functions
+presented below::
+
+ da_handle_event_$(MONITOR_NAME)($(event from event enum));
+ da_handle_start_event_$(MONITOR_NAME)($(event from event enum));
+ da_handle_start_run_event_$(MONITOR_NAME)($(event from event enum));
+
+The function ``da_handle_event_$(MONITOR_NAME)()`` is the regular case where
+the event will be processed if the monitor is processing events.
+
+When a monitor is enabled, it is placed in the initial state of the automata.
+However, the monitor does not know if the system is in the *initial state*.
+
+The ``da_handle_start_event_$(MONITOR_NAME)()`` function is used to notify the
+monitor that the system is returning to the initial state, so the monitor can
+start monitoring the next event.
+
+The ``da_handle_start_run_event_$(MONITOR_NAME)()`` function is used to notify
+the monitor that the system is known to be in the initial state, so the
+monitor can start monitoring and monitor the current event.
+
+Using the wip model as example, the events "preempt_disable" and
+"sched_waking" should be sent to monitor, respectively, via [2]::
+
+ da_handle_event_wip(preempt_disable_wip);
+ da_handle_event_wip(sched_waking_wip);
+
+While the event "preempt_enabled" will use::
+
+ da_handle_start_event_wip(preempt_enable_wip);
+
+To notify the monitor that the system will be returning to the initial state,
+so the system and the monitor should be in sync.
+
+Final remarks
+-------------
+
+With the monitor synthesis in place using the rv/da_monitor.h and
+dot2k, the developer's work should be limited to the instrumentation
+of the system, increasing the confidence in the overall approach.
+
+[1] For details about deterministic automata format and the translation
+from one representation to another, see::
+
+ Documentation/trace/rv/deterministic_automata.rst
+
+[2] dot2k appends the monitor's name suffix to the events enums to
+avoid conflicting variables when exporting the global vmlinux.h
+use by BPF programs.
diff --git a/Documentation/trace/rv/deterministic_automata.rst b/Documentation/trace/rv/deterministic_automata.rst
new file mode 100644
index 000000000000..d0638f95a455
--- /dev/null
+++ b/Documentation/trace/rv/deterministic_automata.rst
@@ -0,0 +1,184 @@
+Deterministic Automata
+======================
+
+Formally, a deterministic automaton, denoted by G, is defined as a quintuple:
+
+ *G* = { *X*, *E*, *f*, x\ :subscript:`0`, X\ :subscript:`m` }
+
+where:
+
+- *X* is the set of states;
+- *E* is the finite set of events;
+- x\ :subscript:`0` is the initial state;
+- X\ :subscript:`m` (subset of *X*) is the set of marked (or final) states.
+- *f* : *X* x *E* -> *X* $ is the transition function. It defines the state
+ transition in the occurrence of an event from *E* in the state *X*. In the
+ special case of deterministic automata, the occurrence of the event in *E*
+ in a state in *X* has a deterministic next state from *X*.
+
+For example, a given automaton named 'wip' (wakeup in preemptive) can
+be defined as:
+
+- *X* = { ``preemptive``, ``non_preemptive``}
+- *E* = { ``preempt_enable``, ``preempt_disable``, ``sched_waking``}
+- x\ :subscript:`0` = ``preemptive``
+- X\ :subscript:`m` = {``preemptive``}
+- *f* =
+ - *f*\ (``preemptive``, ``preempt_disable``) = ``non_preemptive``
+ - *f*\ (``non_preemptive``, ``sched_waking``) = ``non_preemptive``
+ - *f*\ (``non_preemptive``, ``preempt_enable``) = ``preemptive``
+
+One of the benefits of this formal definition is that it can be presented
+in multiple formats. For example, using a *graphical representation*, using
+vertices (nodes) and edges, which is very intuitive for *operating system*
+practitioners, without any loss.
+
+The previous 'wip' automaton can also be represented as::
+
+ preempt_enable
+ +---------------------------------+
+ v |
+ #============# preempt_disable +------------------+
+ --> H preemptive H -----------------> | non_preemptive |
+ #============# +------------------+
+ ^ |
+ | sched_waking |
+ +--------------+
+
+Deterministic Automaton in C
+----------------------------
+
+In the paper "Efficient formal verification for the Linux kernel",
+the authors present a simple way to represent an automaton in C that can
+be used as regular code in the Linux kernel.
+
+For example, the 'wip' automata can be presented as (augmented with comments)::
+
+ /* enum representation of X (set of states) to be used as index */
+ enum states {
+ preemptive = 0,
+ non_preemptive,
+ state_max
+ };
+
+ #define INVALID_STATE state_max
+
+ /* enum representation of E (set of events) to be used as index */
+ enum events {
+ preempt_disable = 0,
+ preempt_enable,
+ sched_waking,
+ event_max
+ };
+
+ struct automaton {
+ char *state_names[state_max]; // X: the set of states
+ char *event_names[event_max]; // E: the finite set of events
+ unsigned char function[state_max][event_max]; // f: transition function
+ unsigned char initial_state; // x_0: the initial state
+ bool final_states[state_max]; // X_m: the set of marked states
+ };
+
+ struct automaton aut = {
+ .state_names = {
+ "preemptive",
+ "non_preemptive"
+ },
+ .event_names = {
+ "preempt_disable",
+ "preempt_enable",
+ "sched_waking"
+ },
+ .function = {
+ { non_preemptive, INVALID_STATE, INVALID_STATE },
+ { INVALID_STATE, preemptive, non_preemptive },
+ },
+ .initial_state = preemptive,
+ .final_states = { 1, 0 },
+ };
+
+The *transition function* is represented as a matrix of states (lines) and
+events (columns), and so the function *f* : *X* x *E* -> *X* can be solved
+in O(1). For example::
+
+ next_state = automaton_wip.function[curr_state][event];
+
+Graphviz .dot format
+--------------------
+
+The Graphviz open-source tool can produce the graphical representation
+of an automaton using the (textual) DOT language as the source code.
+The DOT format is widely used and can be converted to many other formats.
+
+For example, this is the 'wip' model in DOT::
+
+ digraph state_automaton {
+ {node [shape = circle] "non_preemptive"};
+ {node [shape = plaintext, style=invis, label=""] "__init_preemptive"};
+ {node [shape = doublecircle] "preemptive"};
+ {node [shape = circle] "preemptive"};
+ "__init_preemptive" -> "preemptive";
+ "non_preemptive" [label = "non_preemptive"];
+ "non_preemptive" -> "non_preemptive" [ label = "sched_waking" ];
+ "non_preemptive" -> "preemptive" [ label = "preempt_enable" ];
+ "preemptive" [label = "preemptive"];
+ "preemptive" -> "non_preemptive" [ label = "preempt_disable" ];
+ { rank = min ;
+ "__init_preemptive";
+ "preemptive";
+ }
+ }
+
+This DOT format can be transformed into a bitmap or vectorial image
+using the dot utility, or into an ASCII art using graph-easy. For
+instance::
+
+ $ dot -Tsvg -o wip.svg wip.dot
+ $ graph-easy wip.dot > wip.txt
+
+dot2c
+-----
+
+dot2c is a utility that can parse a .dot file containing an automaton as
+in the example above and automatically convert it to the C representation
+presented in [3].
+
+For example, having the previous 'wip' model into a file named 'wip.dot',
+the following command will transform the .dot file into the C
+representation (previously shown) in the 'wip.h' file::
+
+ $ dot2c wip.dot > wip.h
+
+The 'wip.h' content is the code sample in section 'Deterministic Automaton
+in C'.
+
+Remarks
+-------
+
+The automata formalism allows modeling discrete event systems (DES) in
+multiple formats, suitable for different applications/users.
+
+For example, the formal description using set theory is better suitable
+for automata operations, while the graphical format for human interpretation;
+and computer languages for machine execution.
+
+References
+----------
+
+Many textbooks cover automata formalism. For a brief introduction see::
+
+ O'Regan, Gerard. Concise guide to software engineering. Springer,
+ Cham, 2017.
+
+For a detailed description, including operations, and application on Discrete
+Event Systems (DES), see::
+
+ Cassandras, Christos G., and Stephane Lafortune, eds. Introduction to discrete
+ event systems. Boston, MA: Springer US, 2008.
+
+For the C representation in kernel, see::
+
+ De Oliveira, Daniel Bristot; Cucinotta, Tommaso; De Oliveira, Romulo
+ Silva. Efficient formal verification for the Linux kernel. In:
+ International Conference on Software Engineering and Formal Methods.
+ Springer, Cham, 2019. p. 315-332.
diff --git a/Documentation/trace/rv/index.rst b/Documentation/trace/rv/index.rst
new file mode 100644
index 000000000000..15fa966102c0
--- /dev/null
+++ b/Documentation/trace/rv/index.rst
@@ -0,0 +1,14 @@
+====================
+Runtime Verification
+====================
+
+.. toctree::
+ :maxdepth: 2
+ :glob:
+
+ runtime-verification.rst
+ deterministic_automata.rst
+ da_monitor_synthesis.rst
+ da_monitor_instrumentation.rst
+ monitor_wip.rst
+ monitor_wwnr.rst
diff --git a/Documentation/trace/rv/monitor_wip.rst b/Documentation/trace/rv/monitor_wip.rst
new file mode 100644
index 000000000000..a95763438c48
--- /dev/null
+++ b/Documentation/trace/rv/monitor_wip.rst
@@ -0,0 +1,55 @@
+Monitor wip
+===========
+
+- Name: wip - wakeup in preemptive
+- Type: per-cpu deterministic automaton
+- Author: Daniel Bristot de Oliveira <bristot@kernel.org>
+
+Description
+-----------
+
+The wakeup in preemptive (wip) monitor is a sample per-cpu monitor
+that verifies if the wakeup events always take place with
+preemption disabled::
+
+ |
+ |
+ v
+ #==================#
+ H preemptive H <+
+ #==================# |
+ | |
+ | preempt_disable | preempt_enable
+ v |
+ sched_waking +------------------+ |
+ +--------------- | | |
+ | | non_preemptive | |
+ +--------------> | | -+
+ +------------------+
+
+The wakeup event always takes place with preemption disabled because
+of the scheduler synchronization. However, because the preempt_count
+and its trace event are not atomic with regard to interrupts, some
+inconsistencies might happen. For example::
+
+ preempt_disable() {
+ __preempt_count_add(1)
+ -------> smp_apic_timer_interrupt() {
+ preempt_disable()
+ do not trace (preempt count >= 1)
+
+ wake up a thread
+
+ preempt_enable()
+ do not trace (preempt count >= 1)
+ }
+ <------
+ trace_preempt_disable();
+ }
+
+This problem was reported and discussed here:
+ https://lore.kernel.org/r/cover.1559051152.git.bristot@redhat.com/
+
+Specification
+-------------
+Grapviz Dot file in tools/verification/models/wip.dot
diff --git a/Documentation/trace/rv/monitor_wwnr.rst b/Documentation/trace/rv/monitor_wwnr.rst
new file mode 100644
index 000000000000..9f739030f826
--- /dev/null
+++ b/Documentation/trace/rv/monitor_wwnr.rst
@@ -0,0 +1,45 @@
+Monitor wwnr
+============
+
+- Name: wwrn - wakeup while not running
+- Type: per-task deterministic automaton
+- Author: Daniel Bristot de Oliveira <bristot@kernel.org>
+
+Description
+-----------
+
+This is a per-task sample monitor, with the following
+definition::
+
+ |
+ |
+ v
+ wakeup +-------------+
+ +--------- | |
+ | | not_running |
+ +--------> | | <+
+ +-------------+ |
+ | |
+ | switch_in | switch_out
+ v |
+ +-------------+ |
+ | running | -+
+ +-------------+
+
+This model is broken, the reason is that a task can be running
+in the processor without being set as RUNNABLE. Think about a
+task about to sleep::
+
+ 1: set_current_state(TASK_UNINTERRUPTIBLE);
+ 2: schedule();
+
+And then imagine an IRQ happening in between the lines one and two,
+waking the task up. BOOM, the wakeup will happen while the task is
+running.
+
+- Why do we need this model, so?
+- To test the reactors.
+
+Specification
+-------------
+Grapviz Dot file in tools/verification/models/wwnr.dot
diff --git a/Documentation/trace/rv/runtime-verification.rst b/Documentation/trace/rv/runtime-verification.rst
new file mode 100644
index 000000000000..dae78dfa7cdc
--- /dev/null
+++ b/Documentation/trace/rv/runtime-verification.rst
@@ -0,0 +1,231 @@
+====================
+Runtime Verification
+====================
+
+Runtime Verification (RV) is a lightweight (yet rigorous) method that
+complements classical exhaustive verification techniques (such as *model
+checking* and *theorem proving*) with a more practical approach for complex
+systems.
+
+Instead of relying on a fine-grained model of a system (e.g., a
+re-implementation a instruction level), RV works by analyzing the trace of the
+system's actual execution, comparing it against a formal specification of
+the system behavior.
+
+The main advantage is that RV can give precise information on the runtime
+behavior of the monitored system, without the pitfalls of developing models
+that require a re-implementation of the entire system in a modeling language.
+Moreover, given an efficient monitoring method, it is possible execute an
+*online* verification of a system, enabling the *reaction* for unexpected
+events, avoiding, for example, the propagation of a failure on safety-critical
+systems.
+
+Runtime Monitors and Reactors
+=============================
+
+A monitor is the central part of the runtime verification of a system. The
+monitor stands in between the formal specification of the desired (or
+undesired) behavior, and the trace of the actual system.
+
+In Linux terms, the runtime verification monitors are encapsulated inside the
+*RV monitor* abstraction. A *RV monitor* includes a reference model of the
+system, a set of instances of the monitor (per-cpu monitor, per-task monitor,
+and so on), and the helper functions that glue the monitor to the system via
+trace, as depicted below::
+
+ Linux +---- RV Monitor ----------------------------------+ Formal
+ Realm | | Realm
+ +-------------------+ +----------------+ +-----------------+
+ | Linux kernel | | Monitor | | Reference |
+ | Tracing | -> | Instance(s) | <- | Model |
+ | (instrumentation) | | (verification) | | (specification) |
+ +-------------------+ +----------------+ +-----------------+
+ | | |
+ | V |
+ | +----------+ |
+ | | Reaction | |
+ | +--+--+--+-+ |
+ | | | | |
+ | | | +-> trace output ? |
+ +------------------------|--|----------------------+
+ | +----> panic ?
+ +-------> <user-specified>
+
+In addition to the verification and monitoring of the system, a monitor can
+react to an unexpected event. The forms of reaction can vary from logging the
+event occurrence to the enforcement of the correct behavior to the extreme
+action of taking a system down to avoid the propagation of a failure.
+
+In Linux terms, a *reactor* is an reaction method available for *RV monitors*.
+By default, all monitors should provide a trace output of their actions,
+which is already a reaction. In addition, other reactions will be available
+so the user can enable them as needed.
+
+For further information about the principles of runtime verification and
+RV applied to Linux:
+
+ Bartocci, Ezio, et al. *Introduction to runtime verification.* In: Lectures on
+ Runtime Verification. Springer, Cham, 2018. p. 1-33.
+
+ Falcone, Ylies, et al. *A taxonomy for classifying runtime verification tools.*
+ In: International Conference on Runtime Verification. Springer, Cham, 2018. p.
+ 241-262.
+
+ De Oliveira, Daniel Bristot. *Automata-based formal analysis and
+ verification of the real-time Linux kernel.* Ph.D. Thesis, 2020.
+
+Online RV monitors
+==================
+
+Monitors can be classified as *offline* and *online* monitors. *Offline*
+monitor process the traces generated by a system after the events, generally by
+reading the trace execution from a permanent storage system. *Online* monitors
+process the trace during the execution of the system. Online monitors are said
+to be *synchronous* if the processing of an event is attached to the system
+execution, blocking the system during the event monitoring. On the other hand,
+an *asynchronous* monitor has its execution detached from the system. Each type
+of monitor has a set of advantages. For example, *offline* monitors can be
+executed on different machines but require operations to save the log to a
+file. In contrast, *synchronous online* method can react at the exact moment
+a violation occurs.
+
+Another important aspect regarding monitors is the overhead associated with the
+event analysis. If the system generates events at a frequency higher than the
+monitor's ability to process them in the same system, only the *offline*
+methods are viable. On the other hand, if the tracing of the events incurs
+on higher overhead than the simple handling of an event by a monitor, then a
+*synchronous online* monitors will incur on lower overhead.
+
+Indeed, the research presented in:
+
+ De Oliveira, Daniel Bristot; Cucinotta, Tommaso; De Oliveira, Romulo Silva.
+ *Efficient formal verification for the Linux kernel.* In: International
+ Conference on Software Engineering and Formal Methods. Springer, Cham, 2019.
+ p. 315-332.
+
+Shows that for Deterministic Automata models, the synchronous processing of
+events in-kernel causes lower overhead than saving the same events to the trace
+buffer, not even considering collecting the trace for user-space analysis.
+This motivated the development of an in-kernel interface for online monitors.
+
+For further information about modeling of Linux kernel behavior using automata,
+see:
+
+ De Oliveira, Daniel B.; De Oliveira, Romulo S.; Cucinotta, Tommaso. *A thread
+ synchronization model for the PREEMPT_RT Linux kernel.* Journal of Systems
+ Architecture, 2020, 107: 101729.
+
+The user interface
+==================
+
+The user interface resembles the tracing interface (on purpose). It is
+currently at "/sys/kernel/tracing/rv/".
+
+The following files/folders are currently available:
+
+**available_monitors**
+
+- Reading list the available monitors, one per line
+
+For example::
+
+ # cat available_monitors
+ wip
+ wwnr
+
+**available_reactors**
+
+- Reading shows the available reactors, one per line.
+
+For example::
+
+ # cat available_reactors
+ nop
+ panic
+ printk
+
+**enabled_monitors**:
+
+- Reading lists the enabled monitors, one per line
+- Writing to it enables a given monitor
+- Writing a monitor name with a '!' prefix disables it
+- Truncating the file disables all enabled monitors
+
+For example::
+
+ # cat enabled_monitors
+ # echo wip > enabled_monitors
+ # echo wwnr >> enabled_monitors
+ # cat enabled_monitors
+ wip
+ wwnr
+ # echo '!wip' >> enabled_monitors
+ # cat enabled_monitors
+ wwnr
+ # echo > enabled_monitors
+ # cat enabled_monitors
+ #
+
+Note that it is possible to enable more than one monitor concurrently.
+
+**monitoring_on**
+
+This is an on/off general switcher for monitoring. It resembles the
+"tracing_on" switcher in the trace interface.
+
+- Writing "0" stops the monitoring
+- Writing "1" continues the monitoring
+- Reading returns the current status of the monitoring
+
+Note that it does not disable enabled monitors but stop the per-entity
+monitors monitoring the events received from the system.
+
+**reacting_on**
+
+- Writing "0" prevents reactions for happening
+- Writing "1" enable reactions
+- Reading returns the current status of the reaction
+
+**monitors/**
+
+Each monitor will have its own directory inside "monitors/". There the
+monitor-specific files will be presented. The "monitors/" directory resembles
+the "events" directory on tracefs.
+
+For example::
+
+ # cd monitors/wip/
+ # ls
+ desc enable
+ # cat desc
+ wakeup in preemptive per-cpu testing monitor.
+ # cat enable
+ 0
+
+**monitors/MONITOR/desc**
+
+- Reading shows a description of the monitor *MONITOR*
+
+**monitors/MONITOR/enable**
+
+- Writing "0" disables the *MONITOR*
+- Writing "1" enables the *MONITOR*
+- Reading return the current status of the *MONITOR*
+
+**monitors/MONITOR/reactors**
+
+- List available reactors, with the select reaction for the given *MONITOR*
+ inside "[]". The default one is the nop (no operation) reactor.
+- Writing the name of a reactor enables it to the given MONITOR.
+
+For example::
+
+ # cat monitors/wip/reactors
+ [nop]
+ panic
+ printk
+ # echo panic > monitors/wip/reactors
+ # cat monitors/wip/reactors
+ nop
+ [panic]
+ printk
diff --git a/Documentation/trace/stm.rst b/Documentation/trace/stm.rst
index 99f99963e5e7..1ed49dde04fc 100644
--- a/Documentation/trace/stm.rst
+++ b/Documentation/trace/stm.rst
@@ -33,8 +33,8 @@ This policy is a tree structure containing rules (policy_node) that
have a name (string identifier) and a range of masters and channels
associated with it, located in "stp-policy" subsystem directory in
configfs. The topmost directory's name (the policy) is formatted as
-the STM device name to which this policy applies and and arbitrary
-string identifier separated by a stop. From the examle above, a rule
+the STM device name to which this policy applies and an arbitrary
+string identifier separated by a stop. From the example above, a rule
may look like this::
$ ls /config/stp-policy/dummy_stm.my-policy/user
diff --git a/Documentation/trace/timerlat-tracer.rst b/Documentation/trace/timerlat-tracer.rst
new file mode 100644
index 000000000000..53a56823e903
--- /dev/null
+++ b/Documentation/trace/timerlat-tracer.rst
@@ -0,0 +1,260 @@
+###############
+Timerlat tracer
+###############
+
+The timerlat tracer aims to help the preemptive kernel developers to
+find sources of wakeup latencies of real-time threads. Like cyclictest,
+the tracer sets a periodic timer that wakes up a thread. The thread then
+computes a *wakeup latency* value as the difference between the *current
+time* and the *absolute time* that the timer was set to expire. The main
+goal of timerlat is tracing in such a way to help kernel developers.
+
+Usage
+-----
+
+Write the ASCII text "timerlat" into the current_tracer file of the
+tracing system (generally mounted at /sys/kernel/tracing).
+
+For example::
+
+ [root@f32 ~]# cd /sys/kernel/tracing/
+ [root@f32 tracing]# echo timerlat > current_tracer
+
+It is possible to follow the trace by reading the trace file::
+
+ [root@f32 tracing]# cat trace
+ # tracer: timerlat
+ #
+ # _-----=> irqs-off
+ # / _----=> need-resched
+ # | / _---=> hardirq/softirq
+ # || / _--=> preempt-depth
+ # || /
+ # |||| ACTIVATION
+ # TASK-PID CPU# |||| TIMESTAMP ID CONTEXT LATENCY
+ # | | | |||| | | | |
+ <idle>-0 [000] d.h1 54.029328: #1 context irq timer_latency 932 ns
+ <...>-867 [000] .... 54.029339: #1 context thread timer_latency 11700 ns
+ <idle>-0 [001] dNh1 54.029346: #1 context irq timer_latency 2833 ns
+ <...>-868 [001] .... 54.029353: #1 context thread timer_latency 9820 ns
+ <idle>-0 [000] d.h1 54.030328: #2 context irq timer_latency 769 ns
+ <...>-867 [000] .... 54.030330: #2 context thread timer_latency 3070 ns
+ <idle>-0 [001] d.h1 54.030344: #2 context irq timer_latency 935 ns
+ <...>-868 [001] .... 54.030347: #2 context thread timer_latency 4351 ns
+
+
+The tracer creates a per-cpu kernel thread with real-time priority that
+prints two lines at every activation. The first is the *timer latency*
+observed at the *hardirq* context before the activation of the thread.
+The second is the *timer latency* observed by the thread. The ACTIVATION
+ID field serves to relate the *irq* execution to its respective *thread*
+execution.
+
+The *irq*/*thread* splitting is important to clarify in which context
+the unexpected high value is coming from. The *irq* context can be
+delayed by hardware-related actions, such as SMIs, NMIs, IRQs,
+or by thread masking interrupts. Once the timer happens, the delay
+can also be influenced by blocking caused by threads. For example, by
+postponing the scheduler execution via preempt_disable(), scheduler
+execution, or masking interrupts. Threads can also be delayed by the
+interference from other threads and IRQs.
+
+Tracer options
+---------------------
+
+The timerlat tracer is built on top of osnoise tracer.
+So its configuration is also done in the osnoise/ config
+directory. The timerlat configs are:
+
+ - cpus: CPUs at which a timerlat thread will execute.
+ - timerlat_period_us: the period of the timerlat thread.
+ - stop_tracing_us: stop the system tracing if a
+ timer latency at the *irq* context higher than the configured
+ value happens. Writing 0 disables this option.
+ - stop_tracing_total_us: stop the system tracing if a
+ timer latency at the *thread* context is higher than the configured
+ value happens. Writing 0 disables this option.
+ - print_stack: save the stack of the IRQ occurrence. The stack is printed
+ after the *thread context* event, or at the IRQ handler if *stop_tracing_us*
+ is hit.
+
+timerlat and osnoise
+----------------------------
+
+The timerlat can also take advantage of the osnoise: traceevents.
+For example::
+
+ [root@f32 ~]# cd /sys/kernel/tracing/
+ [root@f32 tracing]# echo timerlat > current_tracer
+ [root@f32 tracing]# echo 1 > events/osnoise/enable
+ [root@f32 tracing]# echo 25 > osnoise/stop_tracing_total_us
+ [root@f32 tracing]# tail -10 trace
+ cc1-87882 [005] d..h... 548.771078: #402268 context irq timer_latency 13585 ns
+ cc1-87882 [005] dNLh1.. 548.771082: irq_noise: local_timer:236 start 548.771077442 duration 7597 ns
+ cc1-87882 [005] dNLh2.. 548.771099: irq_noise: qxl:21 start 548.771085017 duration 7139 ns
+ cc1-87882 [005] d...3.. 548.771102: thread_noise: cc1:87882 start 548.771078243 duration 9909 ns
+ timerlat/5-1035 [005] ....... 548.771104: #402268 context thread timer_latency 39960 ns
+
+In this case, the root cause of the timer latency does not point to a
+single cause but to multiple ones. Firstly, the timer IRQ was delayed
+for 13 us, which may point to a long IRQ disabled section (see IRQ
+stacktrace section). Then the timer interrupt that wakes up the timerlat
+thread took 7597 ns, and the qxl:21 device IRQ took 7139 ns. Finally,
+the cc1 thread noise took 9909 ns of time before the context switch.
+Such pieces of evidence are useful for the developer to use other
+tracing methods to figure out how to debug and optimize the system.
+
+It is worth mentioning that the *duration* values reported
+by the osnoise: events are *net* values. For example, the
+thread_noise does not include the duration of the overhead caused
+by the IRQ execution (which indeed accounted for 12736 ns). But
+the values reported by the timerlat tracer (timerlat_latency)
+are *gross* values.
+
+The art below illustrates a CPU timeline and how the timerlat tracer
+observes it at the top and the osnoise: events at the bottom. Each "-"
+in the timelines means circa 1 us, and the time moves ==>::
+
+ External timer irq thread
+ clock latency latency
+ event 13585 ns 39960 ns
+ | ^ ^
+ v | |
+ |-------------| |
+ |-------------+-------------------------|
+ ^ ^
+ ========================================================================
+ [tmr irq] [dev irq]
+ [another thread...^ v..^ v.......][timerlat/ thread] <-- CPU timeline
+ =========================================================================
+ |-------| |-------|
+ |--^ v-------|
+ | | |
+ | | + thread_noise: 9909 ns
+ | +-> irq_noise: 6139 ns
+ +-> irq_noise: 7597 ns
+
+IRQ stacktrace
+---------------------------
+
+The osnoise/print_stack option is helpful for the cases in which a thread
+noise causes the major factor for the timer latency, because of preempt or
+irq disabled. For example::
+
+ [root@f32 tracing]# echo 500 > osnoise/stop_tracing_total_us
+ [root@f32 tracing]# echo 500 > osnoise/print_stack
+ [root@f32 tracing]# echo timerlat > current_tracer
+ [root@f32 tracing]# tail -21 per_cpu/cpu7/trace
+ insmod-1026 [007] dN.h1.. 200.201948: irq_noise: local_timer:236 start 200.201939376 duration 7872 ns
+ insmod-1026 [007] d..h1.. 200.202587: #29800 context irq timer_latency 1616 ns
+ insmod-1026 [007] dN.h2.. 200.202598: irq_noise: local_timer:236 start 200.202586162 duration 11855 ns
+ insmod-1026 [007] dN.h3.. 200.202947: irq_noise: local_timer:236 start 200.202939174 duration 7318 ns
+ insmod-1026 [007] d...3.. 200.203444: thread_noise: insmod:1026 start 200.202586933 duration 838681 ns
+ timerlat/7-1001 [007] ....... 200.203445: #29800 context thread timer_latency 859978 ns
+ timerlat/7-1001 [007] ....1.. 200.203446: <stack trace>
+ => timerlat_irq
+ => __hrtimer_run_queues
+ => hrtimer_interrupt
+ => __sysvec_apic_timer_interrupt
+ => asm_call_irq_on_stack
+ => sysvec_apic_timer_interrupt
+ => asm_sysvec_apic_timer_interrupt
+ => delay_tsc
+ => dummy_load_1ms_pd_init
+ => do_one_initcall
+ => do_init_module
+ => __do_sys_finit_module
+ => do_syscall_64
+ => entry_SYSCALL_64_after_hwframe
+
+In this case, it is possible to see that the thread added the highest
+contribution to the *timer latency* and the stack trace, saved during
+the timerlat IRQ handler, points to a function named
+dummy_load_1ms_pd_init, which had the following code (on purpose)::
+
+ static int __init dummy_load_1ms_pd_init(void)
+ {
+ preempt_disable();
+ mdelay(1);
+ preempt_enable();
+ return 0;
+
+ }
+
+User-space interface
+---------------------------
+
+Timerlat allows user-space threads to use timerlat infra-structure to
+measure scheduling latency. This interface is accessible via a per-CPU
+file descriptor inside $tracing_dir/osnoise/per_cpu/cpu$ID/timerlat_fd.
+
+This interface is accessible under the following conditions:
+
+ - timerlat tracer is enable
+ - osnoise workload option is set to NO_OSNOISE_WORKLOAD
+ - The user-space thread is affined to a single processor
+ - The thread opens the file associated with its single processor
+ - Only one thread can access the file at a time
+
+The open() syscall will fail if any of these conditions are not met.
+After opening the file descriptor, the user space can read from it.
+
+The read() system call will run a timerlat code that will arm the
+timer in the future and wait for it as the regular kernel thread does.
+
+When the timer IRQ fires, the timerlat IRQ will execute, report the
+IRQ latency and wake up the thread waiting in the read. The thread will be
+scheduled and report the thread latency via tracer - as for the kernel
+thread.
+
+The difference from the in-kernel timerlat is that, instead of re-arming
+the timer, timerlat will return to the read() system call. At this point,
+the user can run any code.
+
+If the application rereads the file timerlat file descriptor, the tracer
+will report the return from user-space latency, which is the total
+latency. If this is the end of the work, it can be interpreted as the
+response time for the request.
+
+After reporting the total latency, timerlat will restart the cycle, arm
+a timer, and go to sleep for the following activation.
+
+If at any time one of the conditions is broken, e.g., the thread migrates
+while in user space, or the timerlat tracer is disabled, the SIG_KILL
+signal will be sent to the user-space thread.
+
+Here is an basic example of user-space code for timerlat::
+
+ int main(void)
+ {
+ char buffer[1024];
+ int timerlat_fd;
+ int retval;
+ long cpu = 0; /* place in CPU 0 */
+ cpu_set_t set;
+
+ CPU_ZERO(&set);
+ CPU_SET(cpu, &set);
+
+ if (sched_setaffinity(gettid(), sizeof(set), &set) == -1)
+ return 1;
+
+ snprintf(buffer, sizeof(buffer),
+ "/sys/kernel/tracing/osnoise/per_cpu/cpu%ld/timerlat_fd",
+ cpu);
+
+ timerlat_fd = open(buffer, O_RDONLY);
+ if (timerlat_fd < 0) {
+ printf("error opening %s: %s\n", buffer, strerror(errno));
+ exit(1);
+ }
+
+ for (;;) {
+ retval = read(timerlat_fd, buffer, 1024);
+ if (retval < 0)
+ break;
+ }
+
+ close(timerlat_fd);
+ exit(0);
+ }
diff --git a/Documentation/trace/tracepoint-analysis.rst b/Documentation/trace/tracepoint-analysis.rst
index 716326b9f152..be01bf7b47e5 100644
--- a/Documentation/trace/tracepoint-analysis.rst
+++ b/Documentation/trace/tracepoint-analysis.rst
@@ -26,10 +26,10 @@ assumed that the PCL tool tools/perf has been installed and is in your path.
2.1 Standard Utilities
----------------------
-All possible events are visible from /sys/kernel/debug/tracing/events. Simply
+All possible events are visible from /sys/kernel/tracing/events. Simply
calling::
- $ find /sys/kernel/debug/tracing/events -type d
+ $ find /sys/kernel/tracing/events -type d
will give a fair indication of the number of events available.
@@ -59,7 +59,7 @@ See Documentation/trace/events.rst for a proper description on how events
can be enabled system-wide. A short example of enabling all events related
to page allocation would look something like::
- $ for i in `find /sys/kernel/debug/tracing/events -name "enable" | grep mm_`; do echo 1 > $i; done
+ $ for i in `find /sys/kernel/tracing/events -name "enable" | grep mm_`; do echo 1 > $i; done
3.2 System-Wide Event Enabling with SystemTap
---------------------------------------------
@@ -189,7 +189,7 @@ time on a system-wide basis using -a and sleep.
============================================
When events are enabled the events that are triggering can be read from
-/sys/kernel/debug/tracing/trace_pipe in human-readable format although binary
+/sys/kernel/tracing/trace_pipe in human-readable format although binary
options exist as well. By post-processing the output, further information can
be gathered on-line as appropriate. Examples of post-processing might include
diff --git a/Documentation/trace/tracepoints.rst b/Documentation/trace/tracepoints.rst
index 6e3ce3bf3593..0cb8d9ca3d60 100644
--- a/Documentation/trace/tracepoints.rst
+++ b/Documentation/trace/tracepoints.rst
@@ -146,3 +146,30 @@ with jump labels and avoid conditional branches.
define tracepoints. Check http://lwn.net/Articles/379903,
http://lwn.net/Articles/381064 and http://lwn.net/Articles/383362
for a series of articles with more details.
+
+If you require calling a tracepoint from a header file, it is not
+recommended to call one directly or to use the trace_<tracepoint>_enabled()
+function call, as tracepoints in header files can have side effects if a
+header is included from a file that has CREATE_TRACE_POINTS set, as
+well as the trace_<tracepoint>() is not that small of an inline
+and can bloat the kernel if used by other inlined functions. Instead,
+include tracepoint-defs.h and use tracepoint_enabled().
+
+In a C file::
+
+ void do_trace_foo_bar_wrapper(args)
+ {
+ trace_foo_bar(args);
+ }
+
+In the header file::
+
+ DECLARE_TRACEPOINT(foo_bar);
+
+ static inline void some_inline_function()
+ {
+ [..]
+ if (tracepoint_enabled(foo_bar))
+ do_trace_foo_bar_wrapper(args);
+ [..]
+ }
diff --git a/Documentation/trace/uprobetracer.rst b/Documentation/trace/uprobetracer.rst
index 98cde99939d7..01f6a780fb04 100644
--- a/Documentation/trace/uprobetracer.rst
+++ b/Documentation/trace/uprobetracer.rst
@@ -12,13 +12,13 @@ To enable this feature, build your kernel with CONFIG_UPROBE_EVENTS=y.
Similar to the kprobe-event tracer, this doesn't need to be activated via
current_tracer. Instead of that, add probe points via
-/sys/kernel/debug/tracing/uprobe_events, and enable it via
-/sys/kernel/debug/tracing/events/uprobes/<EVENT>/enable.
+/sys/kernel/tracing/uprobe_events, and enable it via
+/sys/kernel/tracing/events/uprobes/<EVENT>/enable.
However unlike kprobe-event tracer, the uprobe event interface expects the
user to calculate the offset of the probepoint in the object.
-You can also use /sys/kernel/debug/tracing/dynamic_events instead of
+You can also use /sys/kernel/tracing/dynamic_events instead of
uprobe_events. That interface will provide unified access to other
dynamic events too.
@@ -26,15 +26,17 @@ Synopsis of uprobe_tracer
-------------------------
::
- p[:[GRP/]EVENT] PATH:OFFSET [FETCHARGS] : Set a uprobe
- r[:[GRP/]EVENT] PATH:OFFSET [FETCHARGS] : Set a return uprobe (uretprobe)
- -:[GRP/]EVENT : Clear uprobe or uretprobe event
+ p[:[GRP/][EVENT]] PATH:OFFSET [FETCHARGS] : Set a uprobe
+ r[:[GRP/][EVENT]] PATH:OFFSET [FETCHARGS] : Set a return uprobe (uretprobe)
+ p[:[GRP/][EVENT]] PATH:OFFSET%return [FETCHARGS] : Set a return uprobe (uretprobe)
+ -:[GRP/][EVENT] : Clear uprobe or uretprobe event
GRP : Group name. If omitted, "uprobes" is the default value.
EVENT : Event name. If omitted, the event name is generated based
on PATH+OFFSET.
PATH : Path to an executable or a library.
OFFSET : Offset where the probe is inserted.
+ OFFSET%return : Offset where the return probe is inserted.
FETCHARGS : Arguments. Each probe can have up to 128 args.
%REG : Fetch register REG
@@ -53,7 +55,7 @@ Synopsis of uprobe_tracer
(\*1) only for return probe.
(\*2) this is useful for fetching a field of data structures.
- (\*3) Unlike kprobe event, "u" prefix will just be ignored, becuse uprobe
+ (\*3) Unlike kprobe event, "u" prefix will just be ignored, because uprobe
events can access only user-space memory.
Types
@@ -77,7 +79,7 @@ For $comm, the default type is "string"; any other type is invalid.
Event Profiling
---------------
You can check the total number of probe hits per event via
-/sys/kernel/debug/tracing/uprobe_profile. The first column is the filename,
+/sys/kernel/tracing/uprobe_profile. The first column is the filename,
the second is the event name, the third is the number of probe hits.
Usage examples
@@ -85,28 +87,28 @@ Usage examples
* Add a probe as a new uprobe event, write a new definition to uprobe_events
as below (sets a uprobe at an offset of 0x4245c0 in the executable /bin/bash)::
- echo 'p /bin/bash:0x4245c0' > /sys/kernel/debug/tracing/uprobe_events
+ echo 'p /bin/bash:0x4245c0' > /sys/kernel/tracing/uprobe_events
* Add a probe as a new uretprobe event::
- echo 'r /bin/bash:0x4245c0' > /sys/kernel/debug/tracing/uprobe_events
+ echo 'r /bin/bash:0x4245c0' > /sys/kernel/tracing/uprobe_events
* Unset registered event::
- echo '-:p_bash_0x4245c0' >> /sys/kernel/debug/tracing/uprobe_events
+ echo '-:p_bash_0x4245c0' >> /sys/kernel/tracing/uprobe_events
* Print out the events that are registered::
- cat /sys/kernel/debug/tracing/uprobe_events
+ cat /sys/kernel/tracing/uprobe_events
* Clear all events::
- echo > /sys/kernel/debug/tracing/uprobe_events
+ echo > /sys/kernel/tracing/uprobe_events
Following example shows how to dump the instruction pointer and %ax register
at the probed text address. Probe zfree function in /bin/zsh::
- # cd /sys/kernel/debug/tracing/
+ # cd /sys/kernel/tracing/
# cat /proc/`pgrep zsh`/maps | grep /bin/zsh | grep r-xp
00400000-0048a000 r-xp 00000000 08:03 130904 /bin/zsh
# objdump -T /bin/zsh | grep -w zfree
@@ -166,7 +168,7 @@ Also, you can disable the event by::
# echo 0 > events/uprobes/enable
-And you can see the traced information via /sys/kernel/debug/tracing/trace.
+And you can see the traced information via /sys/kernel/tracing/trace.
::
# cat trace
diff --git a/Documentation/trace/user_events.rst b/Documentation/trace/user_events.rst
new file mode 100644
index 000000000000..1d5a7626e6a6
--- /dev/null
+++ b/Documentation/trace/user_events.rst
@@ -0,0 +1,304 @@
+=========================================
+user_events: User-based Event Tracing
+=========================================
+
+:Author: Beau Belgrave
+
+Overview
+--------
+User based trace events allow user processes to create events and trace data
+that can be viewed via existing tools, such as ftrace and perf.
+To enable this feature, build your kernel with CONFIG_USER_EVENTS=y.
+
+Programs can view status of the events via
+/sys/kernel/tracing/user_events_status and can both register and write
+data out via /sys/kernel/tracing/user_events_data.
+
+Programs can also use /sys/kernel/tracing/dynamic_events to register and
+delete user based events via the u: prefix. The format of the command to
+dynamic_events is the same as the ioctl with the u: prefix applied. This
+requires CAP_PERFMON due to the event persisting, otherwise -EPERM is returned.
+
+Typically programs will register a set of events that they wish to expose to
+tools that can read trace_events (such as ftrace and perf). The registration
+process tells the kernel which address and bit to reflect if any tool has
+enabled the event and data should be written. The registration will give back
+a write index which describes the data when a write() or writev() is called
+on the /sys/kernel/tracing/user_events_data file.
+
+The structures referenced in this document are contained within the
+/include/uapi/linux/user_events.h file in the source tree.
+
+**NOTE:** *Both user_events_status and user_events_data are under the tracefs
+filesystem and may be mounted at different paths than above.*
+
+Registering
+-----------
+Registering within a user process is done via ioctl() out to the
+/sys/kernel/tracing/user_events_data file. The command to issue is
+DIAG_IOCSREG.
+
+This command takes a packed struct user_reg as an argument::
+
+ struct user_reg {
+ /* Input: Size of the user_reg structure being used */
+ __u32 size;
+
+ /* Input: Bit in enable address to use */
+ __u8 enable_bit;
+
+ /* Input: Enable size in bytes at address */
+ __u8 enable_size;
+
+ /* Input: Flags to use, if any */
+ __u16 flags;
+
+ /* Input: Address to update when enabled */
+ __u64 enable_addr;
+
+ /* Input: Pointer to string with event name, description and flags */
+ __u64 name_args;
+
+ /* Output: Index of the event to use when writing data */
+ __u32 write_index;
+ } __attribute__((__packed__));
+
+The struct user_reg requires all the above inputs to be set appropriately.
+
++ size: This must be set to sizeof(struct user_reg).
+
++ enable_bit: The bit to reflect the event status at the address specified by
+ enable_addr.
+
++ enable_size: The size of the value specified by enable_addr.
+ This must be 4 (32-bit) or 8 (64-bit). 64-bit values are only allowed to be
+ used on 64-bit kernels, however, 32-bit can be used on all kernels.
+
++ flags: The flags to use, if any.
+ Callers should first attempt to use flags and retry without flags to ensure
+ support for lower versions of the kernel. If a flag is not supported -EINVAL
+ is returned.
+
++ enable_addr: The address of the value to use to reflect event status. This
+ must be naturally aligned and write accessible within the user program.
+
++ name_args: The name and arguments to describe the event, see command format
+ for details.
+
+The following flags are currently supported.
+
++ USER_EVENT_REG_PERSIST: The event will not delete upon the last reference
+ closing. Callers may use this if an event should exist even after the
+ process closes or unregisters the event. Requires CAP_PERFMON otherwise
+ -EPERM is returned.
+
++ USER_EVENT_REG_MULTI_FORMAT: The event can contain multiple formats. This
+ allows programs to prevent themselves from being blocked when their event
+ format changes and they wish to use the same name. When this flag is used the
+ tracepoint name will be in the new format of "name.unique_id" vs the older
+ format of "name". A tracepoint will be created for each unique pair of name
+ and format. This means if several processes use the same name and format,
+ they will use the same tracepoint. If yet another process uses the same name,
+ but a different format than the other processes, it will use a different
+ tracepoint with a new unique id. Recording programs need to scan tracefs for
+ the various different formats of the event name they are interested in
+ recording. The system name of the tracepoint will also use "user_events_multi"
+ instead of "user_events". This prevents single-format event names conflicting
+ with any multi-format event names within tracefs. The unique_id is output as
+ a hex string. Recording programs should ensure the tracepoint name starts with
+ the event name they registered and has a suffix that starts with . and only
+ has hex characters. For example to find all versions of the event "test" you
+ can use the regex "^test\.[0-9a-fA-F]+$".
+
+Upon successful registration the following is set.
+
++ write_index: The index to use for this file descriptor that represents this
+ event when writing out data. The index is unique to this instance of the file
+ descriptor that was used for the registration. See writing data for details.
+
+User based events show up under tracefs like any other event under the
+subsystem named "user_events". This means tools that wish to attach to the
+events need to use /sys/kernel/tracing/events/user_events/[name]/enable
+or perf record -e user_events:[name] when attaching/recording.
+
+**NOTE:** The event subsystem name by default is "user_events". Callers should
+not assume it will always be "user_events". Operators reserve the right in the
+future to change the subsystem name per-process to accommodate event isolation.
+In addition if the USER_EVENT_REG_MULTI_FORMAT flag is used the tracepoint name
+will have a unique id appended to it and the system name will be
+"user_events_multi" as described above.
+
+Command Format
+^^^^^^^^^^^^^^
+The command string format is as follows::
+
+ name[:FLAG1[,FLAG2...]] [Field1[;Field2...]]
+
+Supported Flags
+^^^^^^^^^^^^^^^
+None yet
+
+Field Format
+^^^^^^^^^^^^
+::
+
+ type name [size]
+
+Basic types are supported (__data_loc, u32, u64, int, char, char[20], etc).
+User programs are encouraged to use clearly sized types like u32.
+
+**NOTE:** *Long is not supported since size can vary between user and kernel.*
+
+The size is only valid for types that start with a struct prefix.
+This allows user programs to describe custom structs out to tools, if required.
+
+For example, a struct in C that looks like this::
+
+ struct mytype {
+ char data[20];
+ };
+
+Would be represented by the following field::
+
+ struct mytype myname 20
+
+Deleting
+--------
+Deleting an event from within a user process is done via ioctl() out to the
+/sys/kernel/tracing/user_events_data file. The command to issue is
+DIAG_IOCSDEL.
+
+This command only requires a single string specifying the event to delete by
+its name. Delete will only succeed if there are no references left to the
+event (in both user and kernel space). User programs should use a separate file
+to request deletes than the one used for registration due to this.
+
+**NOTE:** By default events will auto-delete when there are no references left
+to the event. If programs do not want auto-delete, they must use the
+USER_EVENT_REG_PERSIST flag when registering the event. Once that flag is used
+the event exists until DIAG_IOCSDEL is invoked. Both register and delete of an
+event that persists requires CAP_PERFMON, otherwise -EPERM is returned. When
+there are multiple formats of the same event name, all events with the same
+name will be attempted to be deleted. If only a specific version is wanted to
+be deleted then the /sys/kernel/tracing/dynamic_events file should be used for
+that specific format of the event.
+
+Unregistering
+-------------
+If after registering an event it is no longer wanted to be updated then it can
+be disabled via ioctl() out to the /sys/kernel/tracing/user_events_data file.
+The command to issue is DIAG_IOCSUNREG. This is different than deleting, where
+deleting actually removes the event from the system. Unregistering simply tells
+the kernel your process is no longer interested in updates to the event.
+
+This command takes a packed struct user_unreg as an argument::
+
+ struct user_unreg {
+ /* Input: Size of the user_unreg structure being used */
+ __u32 size;
+
+ /* Input: Bit to unregister */
+ __u8 disable_bit;
+
+ /* Input: Reserved, set to 0 */
+ __u8 __reserved;
+
+ /* Input: Reserved, set to 0 */
+ __u16 __reserved2;
+
+ /* Input: Address to unregister */
+ __u64 disable_addr;
+ } __attribute__((__packed__));
+
+The struct user_unreg requires all the above inputs to be set appropriately.
+
++ size: This must be set to sizeof(struct user_unreg).
+
++ disable_bit: This must be set to the bit to disable (same bit that was
+ previously registered via enable_bit).
+
++ disable_addr: This must be set to the address to disable (same address that was
+ previously registered via enable_addr).
+
+**NOTE:** Events are automatically unregistered when execve() is invoked. During
+fork() the registered events will be retained and must be unregistered manually
+in each process if wanted.
+
+Status
+------
+When tools attach/record user based events the status of the event is updated
+in realtime. This allows user programs to only incur the cost of the write() or
+writev() calls when something is actively attached to the event.
+
+The kernel will update the specified bit that was registered for the event as
+tools attach/detach from the event. User programs simply check if the bit is set
+to see if something is attached or not.
+
+Administrators can easily check the status of all registered events by reading
+the user_events_status file directly via a terminal. The output is as follows::
+
+ Name [# Comments]
+ ...
+
+ Active: ActiveCount
+ Busy: BusyCount
+
+For example, on a system that has a single event the output looks like this::
+
+ test
+
+ Active: 1
+ Busy: 0
+
+If a user enables the user event via ftrace, the output would change to this::
+
+ test # Used by ftrace
+
+ Active: 1
+ Busy: 1
+
+Writing Data
+------------
+After registering an event the same fd that was used to register can be used
+to write an entry for that event. The write_index returned must be at the start
+of the data, then the remaining data is treated as the payload of the event.
+
+For example, if write_index returned was 1 and I wanted to write out an int
+payload of the event. Then the data would have to be 8 bytes (2 ints) in size,
+with the first 4 bytes being equal to 1 and the last 4 bytes being equal to the
+value I want as the payload.
+
+In memory this would look like this::
+
+ int index;
+ int payload;
+
+User programs might have well known structs that they wish to use to emit out
+as payloads. In those cases writev() can be used, with the first vector being
+the index and the following vector(s) being the actual event payload.
+
+For example, if I have a struct like this::
+
+ struct payload {
+ int src;
+ int dst;
+ int flags;
+ } __attribute__((__packed__));
+
+It's advised for user programs to do the following::
+
+ struct iovec io[2];
+ struct payload e;
+
+ io[0].iov_base = &write_index;
+ io[0].iov_len = sizeof(write_index);
+ io[1].iov_base = &e;
+ io[1].iov_len = sizeof(e);
+
+ writev(fd, (const struct iovec*)io, 2);
+
+**NOTE:** *The write_index is not emitted out into the trace being recorded.*
+
+Example Code
+------------
+See sample code in samples/user_events.