diff options
Diffstat (limited to 'tools/perf/Documentation/perf-record.txt')
| -rw-r--r-- | tools/perf/Documentation/perf-record.txt | 455 |
1 files changed, 407 insertions, 48 deletions
diff --git a/tools/perf/Documentation/perf-record.txt b/tools/perf/Documentation/perf-record.txt index d232b13ea713..e8b9aadbbfa5 100644 --- a/tools/perf/Documentation/perf-record.txt +++ b/tools/perf/Documentation/perf-record.txt @@ -9,7 +9,7 @@ SYNOPSIS -------- [verse] 'perf record' [-e <EVENT> | --event=EVENT] [-a] <command> -'perf record' [-e <EVENT> | --event=EVENT] [-a] -- <command> [<options>] +'perf record' [-e <EVENT> | --event=EVENT] [-a] \-- <command> [<options>] DESCRIPTION ----------- @@ -30,8 +30,14 @@ OPTIONS - a symbolic event name (use 'perf list' to list all events) - - a raw PMU event (eventsel+umask) in the form of rNNN where NNN is a - hexadecimal event descriptor. + - a raw PMU event in the form of rN where N is a hexadecimal value + that represents the raw register encoding with the layout of the + event control registers as described by entries in + /sys/bus/event_source/devices/cpu/format/*. + + - a symbolic or raw PMU event followed by an optional colon + and a list of event modifiers, e.g., cpu-cycles:p. See the + linkperf:perf-list[1] man page for details on event modifiers. - a symbolically formed PMU event like 'pmu/param1=0x3,param2/' where 'param1', 'param2', etc are defined as formats for the PMU in @@ -60,6 +66,15 @@ OPTIONS - 'name' : User defined event name. Single quotes (') may be used to escape symbols in the name from parsing by shell and tool like this: name=\'CPU_CLK_UNHALTED.THREAD:cmask=0x1\'. + - 'aux-output': Generate AUX records instead of events. This requires + that an AUX area event is also provided. + - 'aux-action': "pause" or "resume" to pause or resume an AUX + area event (the group leader) when this event occurs. + "start-paused" on an AUX area event itself, will + start in a paused state. + - 'aux-sample-size': Set sample size for AUX area sampling. If the + '--aux-sample' option has been used, set aux-sample-size=0 to disable + AUX area sampling for the event. See the linkperf:perf-list[1] man page for more parameters. @@ -94,9 +109,12 @@ OPTIONS "perf report" to view group events together. --filter=<filter>:: - Event filter. This option should follow an event selector (-e) which - selects either tracepoint event(s) or a hardware trace PMU - (e.g. Intel PT or CoreSight). + Event filter. This option should follow an event selector (-e). + If the event is a tracepoint, the filter string will be parsed by + the kernel. If the event is a hardware trace PMU (e.g. Intel PT + or CoreSight), it'll be processed as an address filter. Otherwise + it means a general filter using BPF which can be applied for any + kind of event. - tracepoint filters @@ -151,6 +169,57 @@ OPTIONS Multiple filters can be separated with space or comma. + - bpf filters + + A BPF filter can access the sample data and make a decision based on the + data. Users need to set an appropriate sample type to use the BPF + filter. BPF filters need root privilege. + + The sample data field can be specified in lower case letter. Multiple + filters can be separated with comma. For example, + + --filter 'period > 1000, cpu == 1' + or + --filter 'mem_op == load || mem_op == store, mem_lvl > l1' + + The former filter only accept samples with period greater than 1000 AND + CPU number is 1. The latter one accepts either load and store memory + operations but it should have memory level above the L1. Since the + mem_op and mem_lvl fields come from the (memory) data_source, it'd only + work with some events which set the data_source field. + + Also user should request to collect that information (with -d option in + the above case). Otherwise, the following message will be shown. + + $ sudo perf record -e cycles --filter 'mem_op == load' + Error: cycles event does not have PERF_SAMPLE_DATA_SRC + Hint: please add -d option to perf record. + failed to set filter "BPF" on event cycles with 22 (Invalid argument) + + Essentially the BPF filter expression is: + + <term> <operator> <value> (("," | "||") <term> <operator> <value>)* + + The <term> can be one of: + ip, id, tid, pid, cpu, time, addr, period, txn, weight, phys_addr, + code_pgsz, data_pgsz, weight1, weight2, weight3, ins_lat, retire_lat, + p_stage_cyc, mem_op, mem_lvl, mem_snoop, mem_remote, mem_lock, + mem_dtlb, mem_blk, mem_hops, uid, gid + + The <operator> can be one of: + ==, !=, >, >=, <, <=, & + + The <value> can be one of: + <number> (for any term) + na, load, store, pfetch, exec (for mem_op) + l1, l2, l3, l4, cxl, io, any_cache, lfb, ram, pmem (for mem_lvl) + na, none, hit, miss, hitm, fwd, peer (for mem_snoop) + remote (for mem_remote) + na, locked (for mem_locked) + na, l1_hit, l1_miss, l2_hit, l2_miss, any_hit, any_miss, walk, fault (for mem_dtlb) + na, by_data, by_addr (for mem_blk) + hops0, hops1, hops2, hops3 (for mem_hops) + --exclude-perf:: Don't record events issued by perf itself. This option should follow an event selector (-e) which selects tracepoint event(s). It adds a @@ -158,6 +227,10 @@ OPTIONS '--filter' exists, the new filter expression will be combined with them by '&&'. +--latency:: + Enable data collection for latency profiling. + Use perf report --latency for latency-centric profile. + -a:: --all-cpus:: System-wide collection from all CPUs (default if no target is specified). @@ -208,26 +281,29 @@ OPTIONS -m:: --mmap-pages=:: Number of mmap data pages (must be a power of two) or size - specification with appended unit character - B/K/M/G. The - size is rounded up to have nearest pages power of two value. - Also, by adding a comma, the number of mmap pages for AUX - area tracing can be specified. - ---group:: - Put all events in a single event group. This precedes the --event - option and remains only for backward compatibility. See --event. + specification in bytes with appended unit character - B/K/M/G. + The size is rounded up to the nearest power-of-two page value. + By adding a comma, an additional parameter with the same + semantics used for the normal mmap areas can be specified for + AUX tracing area. -g:: - Enables call-graph (stack chain/backtrace) recording. + Enables call-graph (stack chain/backtrace) recording for both + kernel space and user space. --call-graph:: Setup and enable call-graph (stack chain/backtrace) recording, - implies -g. Default is "fp". + implies -g. Default is "fp" (for user space). - Allows specifying "fp" (frame pointer) or "dwarf" - (DWARF's CFI - Call Frame Information) or "lbr" - (Hardware Last Branch Record facility) as the method to collect - the information used to show the call graphs. + The unwinding method used for kernel space is dependent on the + unwinder used by the active kernel configuration, i.e + CONFIG_UNWINDER_FRAME_POINTER (fp) or CONFIG_UNWINDER_ORC (orc) + + Any option specified here controls the method used for user space. + + Valid options are "fp" (frame pointer), "dwarf" (DWARF's CFI - + Call Frame Information) or "lbr" (Hardware Last Branch Record + facility). In some systems, where binaries are build with gcc --fomit-frame-pointer, using the "fp" method will produce bogus @@ -244,9 +320,18 @@ OPTIONS User can change the size by passing the size after comma like "--call-graph dwarf,4096". + When "fp" recording is used, perf tries to save stack entries + up to the number specified in sysctl.kernel.perf_event_max_stack + by default. User can change the number by passing it after comma + like "--call-graph fp,32". + + Also "defer" can be used with "fp" (like "--call-graph fp,defer") to + enable deferred user callchain which will collect user-space callchains + when the thread returns to the user space. + -q:: --quiet:: - Don't print any message, useful for scripting. + Don't print any warnings or messages, useful for scripting. -v:: --verbose:: @@ -259,11 +344,17 @@ OPTIONS -d:: --data:: - Record the sample virtual addresses. + Record the sample virtual addresses. Implies --sample-mem-info. --phys-data:: Record the sample physical addresses. +--data-page-size:: + Record the sampled data address data page size. + +--code-page-size:: + Record the sampled code address (ip) page size + -T:: --timestamp:: Record the sample timestamps. Use it with 'perf report -D' to see the @@ -276,6 +367,16 @@ OPTIONS --sample-cpu:: Record the sample cpu. +--sample-identifier:: + Record the sample identifier i.e. PERF_SAMPLE_IDENTIFIER bit set in + the sample_type member of the struct perf_event_attr argument to the + perf_event_open system call. + +--sample-mem-info:: + Record the sample data source information for memory operations. + It requires hardware supports and may work on specific events only. + Please consider using 'perf mem record' instead if you're not sure. + -n:: --no-samples:: Don't sample. @@ -291,6 +392,9 @@ comma-separated list with no space: 0,1. Ranges of CPUs are specified with -: 0- In per-thread mode with inheritance mode on (default), samples are captured only when the thread executes on the designated CPUs. Default is to monitor all CPUs. +User space tasks can migrate between CPUs, so when tracing selected CPUs, +a dummy event is created to track sideband for all CPUs. + -B:: --no-buildid:: Do not save the build ids of binaries in the perf.data files. This skips @@ -341,6 +445,7 @@ following filters are defined: - any_call: any function call or system call - any_ret: any function return or system call return - ind_call: any indirect branch + - ind_jmp: any indirect jump - call: direct calls, including far (to/from kernel) calls - u: only when the branch target is at the user level - k: only when the branch target is in the kernel @@ -349,7 +454,19 @@ following filters are defined: - no_tx: only when the target is not in a hardware transaction - abort_tx: only when the target is a hardware transaction abort - cond: conditional branches + - call_stack: save call stack + - no_flags: don't save branch flags e.g prediction, misprediction etc + - no_cycles: don't save branch cycles + - hw_index: save branch hardware index - save_type: save branch type during sampling in case binary is not available later + For the platforms with Intel Arch LBR support (12th-Gen+ client or + 4th-Gen Xeon+ server), the save branch type is unconditionally enabled + when the taken branch stack sampling is enabled. + - priv: save privilege state during sampling in case binary is not available later + - counter: save occurrences of the event since the last branch entry. Currently, the + feature is only supported by a newer CPU, e.g., Intel Sierra Forest and + later platforms. An error out is expected if it's used on the unsupported + kernel or CPUs. + The option requires at least one branch type among any, any_call, any_ret, ind_call, cond. @@ -360,13 +477,17 @@ is enabled for all the sampling events. The sampled branch type is the same for The various filters must be specified as a comma separated list: --branch-filter any_ret,u,k Note that this feature may not be available on all processors. +-W:: --weight:: Enable weightened sampling. An additional weight is recorded per sample and can be displayed with the weight and local_weight sort keys. This currently works for TSX abort events and some memory events in precise mode on modern Intel CPUs. --namespaces:: -Record events of type PERF_RECORD_NAMESPACES. +Record events of type PERF_RECORD_NAMESPACES. This enables 'cgroup_id' sort key. + +--all-cgroups:: +Record events of type PERF_RECORD_CGROUP. This enables 'cgroup' sort key. --transaction:: Record transaction flags for transaction related events. @@ -379,8 +500,11 @@ if combined with -a or -C options. -D:: --delay=:: -After starting the program, wait msecs before measuring. This is useful to -filter out the startup phase of the program, which is often very different. +After starting the program, wait msecs before measuring (-1: start with events +disabled), or enable events only for specified ranges of msecs (e.g. +-D 10-20,30-40 means wait 10 msecs, enable for 10 msecs, wait 10 msecs, enable +for 10 msecs, then stop). Note, delaying enabling of events is useful to filter +out the startup phase of the program, which is often very different. -I:: --intr-regs:: @@ -392,7 +516,8 @@ symbolic names, e.g. on x86, ax, si. To list the available registers use --intr-regs=ax,bx. The list of register is architecture dependent. --user-regs:: -Capture user registers at sample time. Same arguments as -I. +Similar to -I, but capture user registers at sample time. To list the available +user registers use --user-regs=\?. --running-time:: Record running and enabled time for read events (:S) @@ -407,9 +532,21 @@ CLOCK_BOOTTIME, CLOCK_REALTIME and CLOCK_TAI. -S:: --snapshot:: Select AUX area tracing Snapshot Mode. This option is valid only with an -AUX area tracing event. Optionally the number of bytes to capture per -snapshot can be specified. In Snapshot Mode, trace data is captured only when -signal SIGUSR2 is received. +AUX area tracing event. Optionally, certain snapshot capturing parameters +can be specified in a string that follows this option: + + - 'e': take one last snapshot on exit; guarantees that there is at least one + snapshot in the output file; + - <size>: if the PMU supports this, specify the desired snapshot size. + +In Snapshot Mode trace data is captured only when signal SIGUSR2 is received +and on exit if the above 'e' option is given. + +--aux-sample[=OPTIONS]:: +Select AUX area sampling. At least one of the events selected by the -e option +must be an AUX area event. Samples on other events will be created containing +data from the AUX area. Optionally sample size may be specified, otherwise it +defaults to 4KiB. --proc-map-timeout:: When processing pre-existing threads /proc/XXX/mmap, it may take a long time, @@ -418,15 +555,9 @@ This option sets the time out limit. The default value is 500 ms. --switch-events:: Record context switch events i.e. events of type PERF_RECORD_SWITCH or -PERF_RECORD_SWITCH_CPU_WIDE. - ---clang-path=PATH:: -Path to clang binary to use for compiling BPF scriptlets. -(enabled when BPF support is on) - ---clang-opt=OPTIONS:: -Options passed to clang when compiling BPF scriptlets. -(enabled when BPF support is on) +PERF_RECORD_SWITCH_CPU_WIDE. In some cases (e.g. Intel PT, CoreSight or Arm SPE) +switch events will be enabled automatically, which can be suppressed by +by the option --no-switch-events. --vmlinux=PATH:: Specify vmlinux path which has debuginfo. @@ -435,17 +566,63 @@ Specify vmlinux path which has debuginfo. --buildid-all:: Record build-id of all DSOs regardless whether it's actually hit or not. +--buildid-mmap:: +Legacy record build-id in map events option which is now the +default. Behaves indentically to --no-buildid. Disable with +--no-buildid-mmap. + --aio[=n]:: Use <n> control blocks in asynchronous (Posix AIO) trace writing mode (default: 1, max: 4). Asynchronous mode is supported only when linking Perf tool with libc library providing implementation for Posix AIO API. +--affinity=mode:: +Set affinity mask of trace reading thread according to the policy defined by 'mode' value: + + - node - thread affinity mask is set to NUMA node cpu mask of the processed mmap buffer + - cpu - thread affinity mask is set to cpu of the processed mmap buffer + +--mmap-flush=number:: + +Specify minimal number of bytes that is extracted from mmap data pages and +processed for output. One can specify the number using B/K/M/G suffixes. + +The maximal allowed value is a quarter of the size of mmaped data pages. + +The default option value is 1 byte which means that every time that the output +writing thread finds some new data in the mmaped buffer the data is extracted, +possibly compressed (-z) and written to the output, perf.data or pipe. + +Larger data chunks are compressed more effectively in comparison to smaller +chunks so extraction of larger chunks from the mmap data pages is preferable +from the perspective of output size reduction. + +Also at some cases executing less output write syscalls with bigger data size +can take less time than executing more output write syscalls with smaller data +size thus lowering runtime profiling overhead. + +-z:: +--compression-level[=n]:: +Produce compressed trace using specified level n (default: 1 - fastest compression, +22 - smallest trace) + --all-kernel:: Configure all used events to run in kernel space. --all-user:: Configure all used events to run in user space. +--kernel-callchains:: +Collect callchains only from kernel space. I.e. this option sets +perf_event_attr.exclude_callchain_user to 1. + +--user-callchains:: +Collect callchains only from user space. I.e. this option sets +perf_event_attr.exclude_callchain_kernel to 1. + +Don't use both --kernel-callchains and --user-callchains at the same time or no +callchains will be collected. + --timestamp-filename Append timestamp to output file name. @@ -455,16 +632,17 @@ Record timestamp boundary (time of first/last samples). --switch-output[=mode]:: Generate multiple perf.data files, timestamp prefixed, switching to a new one based on 'mode' value: - "signal" - when receiving a SIGUSR2 (default value) or - <size> - when reaching the size threshold, size is expected to - be a number with appended unit character - B/K/M/G - <time> - when reaching the time threshold, size is expected to - be a number with appended unit character - s/m/h/d - Note: the precision of the size threshold hugely depends - on your configuration - the number and size of your ring - buffers (-m). It is generally more precise for higher sizes - (like >5M), for lower values expect different sizes. + - "signal" - when receiving a SIGUSR2 (default value) or + - <size> - when reaching the size threshold, size is expected to + be a number with appended unit character - B/K/M/G + - <time> - when reaching the time threshold, size is expected to + be a number with appended unit character - s/m/h/d + + Note: the precision of the size threshold hugely depends + on your configuration - the number and size of your ring + buffers (-m). It is generally more precise for higher sizes + (like >5M), for lower values expect different sizes. A possible use case is to, given an external event, slice the perf.data file that gets then processed, possibly via a perf script, to decide if that @@ -476,6 +654,23 @@ overhead. You can still switch them on with: --switch-output --no-no-buildid --no-no-buildid-cache +--switch-output-event:: +Events that will cause the switch of the perf.data file, auto-selecting +--switch-output=signal, the results are similar as internally the side band +thread will also send a SIGUSR2 to the main one. + +Uses the same syntax as --event, it will just not be recorded, serving only to +switch the perf.data file as soon as the --switch-output event is processed by +a separate sideband thread. + +This sideband thread is also used to other purposes, like processing the +PERF_RECORD_BPF_EVENT records as they happen, asking the kernel for extra BPF +information, etc. + +--switch-max-files=N:: + +When rotating perf.data with --switch-output, only keep N files. + --dry-run:: Parse options then exit. --dry-run can be used to detect errors in cmdline options. @@ -483,6 +678,23 @@ options. 'perf record --dry-run -e' can act as a BPF script compiler if llvm.dump-obj in config file is set to true. +--synth=TYPE:: +Collect and synthesize given type of events (comma separated). Note that +this option controls the synthesis from the /proc filesystem which represent +task status for pre-existing threads. + +Kernel (and some other) events are recorded regardless of the +choice in this option. For example, --synth=no would have MMAP events for +kernel and modules. + +Available types are: + + - 'task' - synthesize FORK and COMM events for each task + - 'mmap' - synthesize MMAP events for each process (implies 'task') + - 'cgroup' - synthesize CGROUP events for each cgroup + - 'all' - synthesize all events (default) + - 'no' - do not synthesize any of the above events + --tail-synthesize:: Instead of collecting non-sample events (for example, fork, comm, mmap) at the beginning of record, collect them during finalizing an output file. @@ -505,6 +717,153 @@ config terms. For example: 'cycles/overwrite/' and 'instructions/no-overwrite/'. Implies --tail-synthesize. +--kcore:: +Make a copy of /proc/kcore and place it into a directory with the perf data file. + +--max-size=<size>:: +Limit the sample data max size, <size> is expected to be a number with +appended unit character - B/K/M/G + +--num-thread-synthesize:: + The number of threads to run when synthesizing events for existing processes. + By default, the number of threads equals 1. + +ifdef::HAVE_LIBPFM[] +--pfm-events events:: +Select a PMU event using libpfm4 syntax (see http://perfmon2.sf.net) +including support for event filters. For example '--pfm-events +inst_retired:any_p:u:c=1:i'. More than one event can be passed to the +option using the comma separator. Hardware events and generic hardware +events cannot be mixed together. The latter must be used with the -e +option. The -e option and this one can be mixed and matched. Events +can be grouped using the {} notation. +endif::HAVE_LIBPFM[] + +--control=fifo:ctl-fifo[,ack-fifo]:: +--control=fd:ctl-fd[,ack-fd]:: +ctl-fifo / ack-fifo are opened and used as ctl-fd / ack-fd as follows. +Listen on ctl-fd descriptor for command to control measurement. + +Available commands: + + - 'enable' : enable events + - 'disable' : disable events + - 'enable name' : enable event 'name' + - 'disable name' : disable event 'name' + - 'snapshot' : AUX area tracing snapshot). + - 'stop' : stop perf record + - 'ping' : ping + - 'evlist [-v|-g|-F] : display all events + + -F Show just the sample frequency used for each event. + -v Show all fields. + -g Show event group information. + +Measurements can be started with events disabled using --delay=-1 option. Optionally +send control command completion ('ack\n') to ack-fd descriptor to synchronize with the +controlling process. Example of bash shell script to enable and disable events during +measurements: + + #!/bin/bash + + ctl_dir=/tmp/ + + ctl_fifo=${ctl_dir}perf_ctl.fifo + test -p ${ctl_fifo} && unlink ${ctl_fifo} + mkfifo ${ctl_fifo} + exec {ctl_fd}<>${ctl_fifo} + + ctl_ack_fifo=${ctl_dir}perf_ctl_ack.fifo + test -p ${ctl_ack_fifo} && unlink ${ctl_ack_fifo} + mkfifo ${ctl_ack_fifo} + exec {ctl_fd_ack}<>${ctl_ack_fifo} + + perf record -D -1 -e cpu-cycles -a \ + --control fd:${ctl_fd},${ctl_fd_ack} \ + -- sleep 30 & + perf_pid=$! + + sleep 5 && echo 'enable' >&${ctl_fd} && read -u ${ctl_fd_ack} e1 && echo "enabled(${e1})" + sleep 10 && echo 'disable' >&${ctl_fd} && read -u ${ctl_fd_ack} d1 && echo "disabled(${d1})" + + exec {ctl_fd_ack}>&- + unlink ${ctl_ack_fifo} + + exec {ctl_fd}>&- + unlink ${ctl_fifo} + + wait -n ${perf_pid} + exit $? + +--threads=<spec>:: +Write collected trace data into several data files using parallel threads. +<spec> value can be user defined list of masks. Masks separated by colon +define CPUs to be monitored by a thread and affinity mask of that thread +is separated by slash: + + <cpus mask 1>/<affinity mask 1>:<cpus mask 2>/<affinity mask 2>:... + +CPUs or affinity masks must not overlap with other corresponding masks. +Invalid CPUs are ignored, but masks containing only invalid CPUs are not +allowed. + +For example user specification like the following: + + 0,2-4/2-4:1,5-7/5-7 + +specifies parallel threads layout that consists of two threads, +the first thread monitors CPUs 0 and 2-4 with the affinity mask 2-4, +the second monitors CPUs 1 and 5-7 with the affinity mask 5-7. + +<spec> value can also be a string meaning predefined parallel threads +layout: + + - cpu - create new data streaming thread for every monitored cpu + - core - create new thread to monitor CPUs grouped by a core + - package - create new thread to monitor CPUs grouped by a package + - numa - create new threed to monitor CPUs grouped by a NUMA domain + +Predefined layouts can be used on systems with large number of CPUs in +order not to spawn multiple per-cpu streaming threads but still avoid LOST +events in data directory files. Option specified with no or empty value +defaults to CPU layout. Masks defined or provided by the option value are +filtered through the mask provided by -C option. + +--debuginfod[=URLs]:: + Specify debuginfod URL to be used when cacheing perf.data binaries, + it follows the same syntax as the DEBUGINFOD_URLS variable, like: + + http://192.168.122.174:8002 + + If the URLs is not specified, the value of DEBUGINFOD_URLS + system environment variable is used. + +--off-cpu:: + Enable off-cpu profiling with BPF. The BPF program will collect + task scheduling information with (user) stacktrace and save them + as sample data of a software event named "offcpu-time". The + sample period will have the time the task slept in nanoseconds. + + Note that BPF can collect stack traces using frame pointer ("fp") + only, as of now. So the applications built without the frame + pointer might see bogus addresses. + + off-cpu profiling consists two types of samples: direct samples, which + share the same behavior as regular samples, and the accumulated + samples, stored in BPF stack trace map, presented after all the regular + samples. + +--off-cpu-thresh:: + Once a task's off-cpu time reaches this threshold (in milliseconds), it + generates a direct off-cpu sample. The default is 500ms. + +--setup-filter=<action>:: + Prepare BPF filter to be used by regular users. The action should be + either "pin" or "unpin". The filter can be used after it's pinned. + + +include::intel-hybrid.txt[] + SEE ALSO -------- -linkperf:perf-stat[1], linkperf:perf-list[1] +linkperf:perf-stat[1], linkperf:perf-list[1], linkperf:perf-intel-pt[1] |
