Diffstat (limited to 'tools/perf/Documentation/perf-report.txt')
| -rw-r--r-- | tools/perf/Documentation/perf-report.txt | 274 |
1 files changed, 244 insertions, 30 deletions
diff --git a/tools/perf/Documentation/perf-report.txt b/tools/perf/Documentation/perf-report.txt index 9fa84617181e..acef3ff4178e 100644 --- a/tools/perf/Documentation/perf-report.txt +++ b/tools/perf/Documentation/perf-report.txt @@ -27,7 +27,7 @@ OPTIONS -q:: --quiet:: - Do not show any message. (Suppress -v) + Do not show any warnings or messages. (Suppress -v) -n:: --show-nr-samples:: @@ -44,7 +44,7 @@ OPTIONS --comms=:: Only consider symbols in these comms. CSV that understands file://filename entries. This option will affect the percentage of - the overhead column. See --percentage for more info. + the overhead and latency columns. See --percentage for more info. --pid=:: Only show events for given process ID (comma separated list). @@ -54,12 +54,12 @@ OPTIONS --dsos=:: Only consider symbols in these dsos. CSV that understands file://filename entries. This option will affect the percentage of - the overhead column. See --percentage for more info. + the overhead and latency columns. See --percentage for more info. -S:: --symbols=:: Only consider these symbols. CSV that understands file://filename entries. This option will affect the percentage of - the overhead column. See --percentage for more info. + the overhead and latency columns. See --percentage for more info. --symbol-filter=:: Only show symbols that match (partially) with this filter. @@ -68,45 +68,84 @@ OPTIONS --hide-unresolved:: Only display entries resolved to a symbol. +--parallelism:: + Only consider these parallelism levels. Parallelism level is the number + of threads that actively run on CPUs at the time of sample. The flag + accepts single number, comma-separated list, and ranges (for example: + "1", "7,8", "1,64-128"). This is useful in understanding what a program + is doing during sequential/low-parallelism phases as compared to + high-parallelism phases. This option will affect the percentage of + the overhead and latency columns. See --percentage for more info. 
+ Also see the `CPU and latency overheads' section for more details. + +--latency:: + Show latency-centric profile rather than the default + CPU-consumption-centric profile + (requires perf record --latency flag). + -s:: --sort=:: Sort histogram entries by given key(s) - multiple keys can be specified in CSV format. Following sort keys are available: pid, comm, dso, symbol, parent, cpu, socket, srcline, weight, - local_weight, cgroup_id. + local_weight, cgroup_id, addr. Each key has following meaning: - comm: command (name) of the task which can be read via /proc/<pid>/comm - pid: command and tid of the task + - tgid: command and tgid of the task - dso: name of library or module executed at the time of sample + - dso_size: size of library or module executed at the time of sample - symbol: name of function executed at the time of sample - symbol_size: size of function executed at the time of sample - parent: name of function matched to the parent regex filter. Unmatched entries are displayed as "[other]". - cpu: cpu number the task ran at the time of sample - socket: processor socket number the task ran at the time of sample + - parallelism: number of running threads at the time of sample - srcline: filename and line number executed at the time of sample. The DWARF debugging info must be provided. - - srcfile: file name of the source file of the same. Requires dwarf + - srcfile: file name of the source file of the samples. Requires dwarf information. - weight: Event specific weight, e.g. memory latency or transaction abort cost. This is the global weight. - local_weight: Local weight version of the weight above. - cgroup_id: ID derived from cgroup namespace device and inode numbers. + - cgroup: cgroup pathname in the cgroupfs. - transaction: Transaction abort flags. 
- - overhead: Overhead percentage of sample - - overhead_sys: Overhead percentage of sample running in system mode - - overhead_us: Overhead percentage of sample running in user mode - - overhead_guest_sys: Overhead percentage of sample running in system mode + - overhead: CPU overhead percentage of sample. + - latency: latency (wall-clock) overhead percentage of sample. + See the `CPU and latency overheads' section for more details. + - overhead_sys: CPU overhead percentage of sample running in system mode + - overhead_us: CPU overhead percentage of sample running in user mode + - overhead_guest_sys: CPU overhead percentage of sample running in system mode on guest machine - - overhead_guest_us: Overhead percentage of sample running in user mode on + - overhead_guest_us: CPU overhead percentage of sample running in user mode on guest machine - sample: Number of sample - period: Raw number of event count of sample - - By default, comm, dso and symbol keys are used. - (i.e. --sort comm,dso,symbol) + - time: Separate the samples by time stamp with the resolution specified by + --time-quantum (default 100ms). Specify with overhead and before it. + - code_page_size: the code page size of sampled code address (ip) + - ins_lat: Instruction latency in core cycles. This is the global instruction + latency + - local_ins_lat: Local instruction latency version + - p_stage_cyc: On powerpc, this presents the number of cycles spent in a + pipeline stage. And currently supported only on powerpc. + - addr: (Full) virtual address of the sampled instruction + - retire_lat: On X86, this reports pipeline stall of this instruction compared + to the previous instruction in cycles. And currently supported only on X86 + - simd: Flags describing a SIMD operation. "e" for empty Arm SVE predicate. "p" for partial Arm SVE predicate + - type: Data type of sample memory access. + - typeoff: Offset in the data type of sample memory access. + - symoff: Offset in the symbol. 
+ - weight1: Average value of event specific weight (1st field of weight_struct). + - weight2: Average value of event specific weight (2nd field of weight_struct). + - weight3: Average value of event specific weight (3rd field of weight_struct). + + By default, overhead, comm, dso and symbol keys are used. + (i.e. --sort overhead,comm,dso,symbol). If --branch-stack option is used, following sort keys are also available: @@ -125,9 +164,17 @@ OPTIONS And default sort keys are changed to comm, dso_from, symbol_from, dso_to and symbol_to, see '--branch-stack'. + When the sort key symbol is specified, columns "IPC" and "IPC Coverage" + are enabled automatically. Column "IPC" reports the average IPC per function + and column "IPC coverage" reports the percentage of instructions with + sampled IPC in this function. IPC means Instruction Per Cycle. If it's low, + it indicates there may be a performance bottleneck when the function is + executed, such as a memory access bottleneck. If a function has high overhead + and low IPC, it's worth further analyzing it to optimize its performance. + If the --mem-mode option is used, the following sort keys are also available (incompatible with --branch-stack): - symbol_daddr, dso_daddr, locked, tlb, mem, snoop, dcacheline. + symbol_daddr, dso_daddr, locked, tlb, mem, snoop, dcacheline, blocked. 
- symbol_daddr: name of data symbol being executed on at the time of sample - dso_daddr: name of library or module containing the data being executed @@ -137,9 +184,13 @@ OPTIONS - mem: type of memory access for the data at the time of the sample - snoop: type of snoop (if any) for the data at the time of the sample - dcacheline: the cacheline the data address is on at the time of the sample + - phys_daddr: physical address of data being executed on at the time of sample + - data_page_size: the data page size of data being executed on at the time of sample + - blocked: reason of blocked load access for the data at the time of the sample And the default sort keys are changed to local_weight, mem, sym, dso, - symbol_daddr, dso_daddr, snoop, tlb, locked, see '--mem-mode'. + symbol_daddr, dso_daddr, snoop, tlb, locked, blocked, local_ins_lat, + see '--mem-mode'. If the data file has tracepoint event(s), following (dynamic) sort keys are also available: @@ -169,7 +220,11 @@ OPTIONS --fields=:: Specify output field - multiple keys can be specified in CSV format. Following fields are available: - overhead, overhead_sys, overhead_us, overhead_children, sample and period. + overhead, latency, overhead_sys, overhead_us, overhead_children, sample, + period, weight1, weight2, weight3, ins_lat, p_stage_cyc and retire_lat. + The last 3 names are alias for the corresponding weights. When the weight + fields are used, they will show the average value of the weight. + Also it can contain any sort key(s). By default, every sort keys not specified in -F will be appended @@ -204,6 +259,9 @@ OPTIONS --dump-raw-trace:: Dump raw trace in ASCII. +--disable-order:: + Disable raw trace ordering. + -g:: --call-graph=<print_type,threshold[,print_limit],order,sort_key[,branch],value>:: Display call chains using type, min percent threshold, print limit, @@ -242,7 +300,7 @@ OPTIONS Usually more convenient to use --branch-history for this. 
value can be: - - percent: diplay overhead percent (default) + - percent: display overhead percent (default) - period: display event period - count: display event count @@ -250,7 +308,7 @@ OPTIONS Accumulate callchain of children to parent entry so that then can show up in the output. The output will have a new "Children" column and will be sorted on the data. It requires callchains are recorded. - See the `overhead calculation' section for more details. Enabled by + See the `Overhead calculation' section for more details. Enabled by default, disable with --no-children. --max-stack:: @@ -295,6 +353,9 @@ OPTIONS --vmlinux=<file>:: vmlinux pathname +--ignore-vmlinux:: + Ignore vmlinux files. + --kallsyms=<file>:: kallsyms pathname @@ -349,11 +410,34 @@ OPTIONS This allows to examine the path the program took to each sample. The data collection must have used -b (or -j) and -g. + Also show with some branch flags that can be: + - Predicted: display the average percentage of predicated branches. + (predicated number / total number) + - Abort: display the number of tsx aborted branches. + - Cycles: cycles in basic block. + + - iterations: display the average number of iterations in callchain list. + +--addr2line=<path>:: + Path to addr2line binary. + --objdump=<path>:: Path to objdump binary. +--prefix=PREFIX:: +--prefix-strip=N:: + Remove first N entries from source file path names in executables + and add PREFIX. This allows to display source code compiled on systems + with different file system layout. + --group:: - Show event group information together. + Show event group information together. It forces group output also + if there are no groups defined in data file. + +--group-sort-idx:: + Sort the output by the event at the index n in group. If n is invalid, + sort by the first event. It can support multiple groups with different + amount of events. WARNING: This should be used on grouped events. --demangle:: Demangle symbol names to human readable form. 
It's enabled by default, @@ -366,7 +450,7 @@ OPTIONS Use the data addresses of samples in addition to instruction addresses to build the histograms. To generate meaningful output, the perf.data file must have been obtained using perf record -d -W and using a - special event -e cpu/mem-loads/ or -e cpu/mem-stores/. See + special event -e cpu/mem-loads/p or -e cpu/mem-stores/p. See 'perf mem' for simpler access. --percent-limit:: @@ -377,9 +461,9 @@ OPTIONS --call-graph option for details. --percentage:: - Determine how to display the overhead percentage of filtered entries. - Filters can be applied by --comms, --dsos and/or --symbols options and - Zoom operations on the TUI (thread, dso, etc). + Determine how to display the CPU and latency overhead percentage + of filtered entries. Filters can be applied by --comms, --dsos, --symbols + and/or --parallelism options and Zoom operations on the TUI (thread, dso, etc). "relative" means it's relative to filtered entries only so that the sum of shown entries will be always 100%. "absolute" means it retains @@ -396,10 +480,48 @@ OPTIONS --time:: Only analyze samples within given time window: <start>,<stop>. Times - have the format seconds.microseconds. If start is not given (i.e., time + have the format seconds.nanoseconds. If start is not given (i.e. time string is ',x.y') then analysis starts at the beginning of the file. If - stop time is not given (i.e, time string is 'x.y,') then analysis goes - to end of file. + stop time is not given (i.e. time string is 'x.y,') then analysis goes + to end of file. Multiple ranges can be separated by spaces, which + requires the argument to be quoted e.g. --time "1234.567,1234.789 1235," + + Also support time percent with multiple time ranges. Time string is + 'a%/n,b%/m,...' or 'a%-b%,c%-%d,...'. 
+ + For example: + Select the second 10% time slice: + + perf report --time 10%/2 + + Select from 0% to 10% time slice: + + perf report --time 0%-10% + + Select the first and second 10% time slices: + + perf report --time 10%/1,10%/2 + + Select from 0% to 10% and 30% to 40% slices: + + perf report --time 0%-10%,30%-40% + +--switch-on EVENT_NAME:: + Only consider events after this event is found. + + This may be interesting to measure a workload only after some initialization + phase is over, i.e. insert a perf probe at that point and then using this + option with that probe. + +--switch-off EVENT_NAME:: + Stop considering events after this event is found. + +--show-on-off-events:: + Show the --switch-on/off events too. This has no effect in 'perf report' now + but probably we'll make the default not to show the switch-on/off events + on the --group mode and if there is only one event besides the off/on ones, + go straight to the histogram browser, just like 'perf report' with no events + explicitly specified does. --itrace:: Options for decoding instruction tracing data. The options are: @@ -422,21 +544,113 @@ include::itrace.txt[] This option extends the perf report to show reference callgraphs, which collected by reference event, in no callgraph event. +--stitch-lbr:: + Show callgraph with stitched LBRs, which may have more complete + callgraph. The perf.data file must have been obtained using + perf record --call-graph lbr. + Disabled by default. In common cases with call stack overflows, + it can recreate better call stacks than the default lbr call stack + output. But this approach is not foolproof. There can be cases + where it creates incorrect call stacks from incorrect matches. + The known limitations include exception handing such as + setjmp/longjmp will have calls/returns not match. 
+ --socket-filter:: Only report the samples on the processor socket that match with this filter +--samples=N:: + Save N individual samples for each histogram entry to show context in perf + report tui browser. + --raw-trace:: When displaying traceevent output, do not use print fmt or plugins. +-H:: --hierarchy:: - Enable hierarchical output. + Enable hierarchical output. In the hierarchy mode, each sort key groups + samples based on the criteria and then sub-divide it using the lower + level sort key. + + For example: + In normal output: + + perf report -s dso,sym + # Overhead Shared Object Symbol + 50.00% [kernel.kallsyms] [k] kfunc1 + 20.00% perf [.] foo + 15.00% [kernel.kallsyms] [k] kfunc2 + 10.00% perf [.] bar + 5.00% libc.so [.] libcall + + In hierarchy output: + + perf report -s dso,sym --hierarchy + # Overhead Shared Object / Symbol + 65.00% [kernel.kallsyms] + 50.00% [k] kfunc1 + 15.00% [k] kfunc2 + 30.00% perf + 20.00% [.] foo + 10.00% [.] bar + 5.00% libc.so + 5.00% [.] libcall --inline:: If a callgraph address belongs to an inlined function, the inline stack - will be printed. Each entry is function name or file/line. + will be printed. Each entry is function name or file/line. Enabled by + default, disable with --no-inline. + +--mmaps:: + Show --tasks output plus mmap information in a format similar to + /proc/<PID>/maps. + + Please note that not all mmaps are stored, options affecting which ones + are include 'perf record --data', for instance. + +--ns:: + Show time stamps in nanoseconds. + +--stats:: + Display overall events statistics without any further processing. + (like the one at the end of the perf report -D command) + +--tasks:: + Display monitored tasks stored in perf data. Displaying pid/tid/ppid + plus the command string aligned to distinguish parent and child tasks. 
+ +--percent-type:: + Set annotation percent type from following choices: + global-period, local-period, global-hits, local-hits + + The local/global keywords set if the percentage is computed + in the scope of the function (local) or the whole data (global). + The period/hits keywords set the base the percentage is computed + on - the samples period or the number of samples (hits). + +--time-quantum:: + Configure time quantum for time sort key. Default 100ms. + Accepts s, us, ms, ns units. + +--total-cycles:: + When --total-cycles is specified, it supports sorting for all blocks by + 'Sampled Cycles%'. This is useful to concentrate on the globally hottest + blocks. In output, there are some new columns: + + 'Sampled Cycles%' - block sampled cycles aggregation / total sampled cycles + 'Sampled Cycles' - block sampled cycles aggregation + 'Avg Cycles%' - block average sampled cycles / sum of total block average + sampled cycles + 'Avg Cycles' - block average sampled cycles + 'Branch Counter' - block branch counter histogram (with -v showing the number) + +--skip-empty:: + Do not print 0 results in the --stat output. + +include::cpu-and-latency-overheads.txt[] include::callchain-overhead-calculation.txt[] SEE ALSO -------- -linkperf:perf-stat[1], linkperf:perf-annotate[1] +linkperf:perf-stat[1], linkperf:perf-annotate[1], linkperf:perf-record[1], +linkperf:perf-intel-pt[1]
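The --hierarchy text added by this patch says that each sort key groups samples and the next key sub-divides each group. The following is a minimal Python sketch of that grouping, using only the numbers from the patch's own "normal output" example. It is an illustration, not perf's implementation: the names FLAT_ROWS, build_hierarchy and format_hierarchy are hypothetical, and the column spacing is an assumption.

```python
from collections import defaultdict

# Flat (dso, symbol, overhead %) rows copied from the "normal output"
# example in the patch (perf report -s dso,sym).
FLAT_ROWS = [
    ("[kernel.kallsyms]", "[k] kfunc1", 50.0),
    ("perf", "[.] foo", 20.0),
    ("[kernel.kallsyms]", "[k] kfunc2", 15.0),
    ("perf", "[.] bar", 10.0),
    ("libc.so", "[.] libcall", 5.0),
]


def build_hierarchy(rows):
    """Group rows by the first key (dso), then list the second key (symbol)
    under each group. Both levels are sorted by overhead, highest first."""
    children = defaultdict(list)
    for dso, sym, pct in rows:
        children[dso].append((sym, pct))

    tree = []
    for dso, syms in children.items():
        syms.sort(key=lambda item: item[1], reverse=True)
        # A parent's overhead is the sum of the overheads of its children.
        tree.append((dso, sum(pct for _, pct in syms), syms))
    tree.sort(key=lambda node: node[1], reverse=True)
    return tree


def format_hierarchy(tree):
    """Render the tree with children indented under their parent
    (hypothetical layout, not perf's exact column widths)."""
    lines = []
    for dso, total, syms in tree:
        lines.append(f"{total:6.2f}%  {dso}")
        for sym, pct in syms:
            lines.append(f"{pct:6.2f}%    {sym}")
    return lines


if __name__ == "__main__":
    print("\n".join(format_hierarchy(build_hierarchy(FLAT_ROWS))))
```

Run on the example rows, the per-dso totals come out as 65%, 30% and 5%, matching the parent lines of the hierarchy output shown in the patch.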
