summaryrefslogtreecommitdiff
path: root/tools/perf/Documentation/perf-mem.txt
diff options
context:
space:
mode:
Diffstat (limited to 'tools/perf/Documentation/perf-mem.txt')
-rw-r--r--tools/perf/Documentation/perf-mem.txt176
1 files changed, 143 insertions, 33 deletions
diff --git a/tools/perf/Documentation/perf-mem.txt b/tools/perf/Documentation/perf-mem.txt
index 19862572e3f2..4d164836d094 100644
--- a/tools/perf/Documentation/perf-mem.txt
+++ b/tools/perf/Documentation/perf-mem.txt
@@ -21,22 +21,17 @@ and stores are sampled. Use the -t option to limit to loads or stores.
Note that on Intel systems the memory latency reported is the use-latency,
not the pure load (or store latency). Use latency includes any pipeline
-queueing delays in addition to the memory subsystem latency.
+queuing delays in addition to the memory subsystem latency.
On Arm64 this uses SPE to sample load and store operations, therefore hardware
and kernel support is required. See linkperf:perf-arm-spe[1] for a setup guide.
Due to the statistical nature of SPE sampling, not every memory operation will
be sampled.
-OPTIONS
--------
-<command>...::
- Any command you can specify in a shell.
-
--i::
---input=<file>::
- Input file name.
+On AMD this use IBS Op PMU to sample load-store operations.
+COMMON OPTIONS
+--------------
-f::
--force::
Don't do ownership validation
@@ -45,24 +40,9 @@ OPTIONS
--type=<type>::
Select the memory operation type: load or store (default: load,store)
--D::
---dump-raw-samples::
- Dump the raw decoded samples on the screen in a format that is easy to parse with
- one sample per line.
-
--x::
---field-separator=<separator>::
- Specify the field separator used when dump raw samples (-D option). By default,
- The separator is the space character.
-
--C::
---cpu=<cpu>::
- Monitor only on the list of CPUs provided. Multiple CPUs can be provided as a
- comma-separated list with no space: 0,1. Ranges of CPUs are specified with -: 0-2. Default
- is to monitor all CPUS.
--U::
---hide-unresolved::
- Only display entries resolved to a symbol.
+-v::
+--verbose::
+ Be more verbose (show counter open errors, etc)
-p::
--phys-data::
@@ -73,6 +53,9 @@ OPTIONS
RECORD OPTIONS
--------------
+<command>...::
+ Any command you can specify in a shell.
+
-e::
--event <event>::
Event selector. Use 'perf mem record -e list' to list available events.
@@ -85,17 +68,144 @@ RECORD OPTIONS
--all-user::
Configure all used events to run in user space.
--v::
---verbose::
- Be more verbose (show counter open errors, etc)
-
--ldlat <n>::
- Specify desired latency for loads event. Supported on Intel and Arm64
- processors only. Ignored on other archs.
+ Specify desired latency for loads event. Supported on Intel, Arm64 and
+ some AMD processors. Ignored on other archs.
+
+ On supported AMD processors:
+ - /sys/bus/event_source/devices/ibs_op/caps/ldlat file contains '1'.
+ - Supported latency values are 128 to 2048 (both inclusive).
+ - Latency value which is a multiple of 128 incurs a little less profiling
+ overhead compared to other values.
+ - Load latency filtering is disabled by default.
+
+REPORT OPTIONS
+--------------
+-i::
+--input=<file>::
+ Input file name.
+
+-C::
+--cpu=<cpu>::
+ Monitor only on the list of CPUs provided. Multiple CPUs can be provided as a
+ comma-separated list with no space: 0,1. Ranges of CPUs are specified with -
+ like 0-2. Default is to monitor all CPUS.
+
+-D::
+--dump-raw-samples::
+ Dump the raw decoded samples on the screen in a format that is easy to parse with
+ one sample per line.
+
+-s::
+--sort=<key>::
+ Group result by given key(s) - multiple keys can be specified
+ in CSV format. The keys are specific to memory samples are:
+ symbol_daddr, symbol_iaddr, dso_daddr, locked, tlb, mem, snoop,
+ dcacheline, phys_daddr, data_page_size, blocked.
+
+ - symbol_daddr: name of data symbol being executed on at the time of sample
+ - symbol_iaddr: name of code symbol being executed on at the time of sample
+ - dso_daddr: name of library or module containing the data being executed
+ on at the time of the sample
+ - locked: whether the bus was locked at the time of the sample
+ - tlb: type of tlb access for the data at the time of the sample
+ - mem: type of memory access for the data at the time of the sample
+ - snoop: type of snoop (if any) for the data at the time of the sample
+ - dcacheline: the cacheline the data address is on at the time of the sample
+ - phys_daddr: physical address of data being executed on at the time of sample
+ - data_page_size: the data page size of data being executed on at the time of sample
+ - blocked: reason of blocked load access for the data at the time of the sample
+
+ And the default sort keys are changed to local_weight, mem, sym, dso,
+ symbol_daddr, dso_daddr, snoop, tlb, locked, blocked, local_ins_lat.
+
+-F::
+--fields=::
+ Specify output field - multiple keys can be specified in CSV format.
+ Please see linkperf:perf-report[1] for details.
+
+ In addition to the default fields, 'perf mem report' will provide the
+ following fields to break down sample periods.
+
+ - op: operation in the sample instruction (load, store, prefetch, ...)
+ - cache: location in CPU cache (L1, L2, ...) where the sample hit
+ - mem: location in memory or other places the sample hit
+ - dtlb: location in Data TLB (L1, L2) where the sample hit
+ - snoop: snoop result for the sampled data access
+
+ Please take a look at the OUTPUT FIELD SELECTION section for caveats.
+
+-T::
+--type-profile::
+ Show data-type profile result instead of code symbols. This requires
+ the debug information and it will change the default sort keys to:
+ mem, snoop, tlb, type.
+
+-U::
+--hide-unresolved::
+ Only display entries resolved to a symbol.
+
+-x::
+--field-separator=<separator>::
+ Specify the field separator used when dump raw samples (-D option). By default,
+ The separator is the space character.
In addition, for report all perf report options are valid, and for record
all perf record options.
+OVERHEAD CALCULATION
+--------------------
+Unlike linkperf:perf-report[1], which calculates overhead from the actual
+sample period, perf-mem overhead is calculated using sample weight. E.g.
+there are two samples in perf.data file, both with the same sample period,
+but one sample with weight 180 and the other with weight 20:
+
+ $ perf script -F period,data_src,weight,ip,sym
+ 100000 629080842 |OP LOAD|LVL L3 hit|... 20 7e69b93ca524 strcmp
+ 100000 1a29081042 |OP LOAD|LVL RAM hit|... 180 ffffffff82429168 memcpy
+
+ $ perf report -F overhead,symbol
+ 50% [.] strcmp
+ 50% [k] memcpy
+
+ $ perf mem report -F overhead,symbol
+ 90% [k] memcpy
+ 10% [.] strcmp
+
+OUTPUT FIELD SELECTION
+----------------------
+"perf mem report" adds a number of new output fields specific to data source
+information in the sample. Some of them have the same name with the existing
+sort keys ("mem" and "snoop"). So unlike other fields and sort keys, they'll
+behave differently when it's used by -F/--fields or -s/--sort.
+
+Using those two as output fields will aggregate samples altogether and show
+breakdown.
+
+ $ perf mem report -F mem,snoop
+ ...
+ # ------ Memory ------- --- Snoop ----
+ # RAM Uncach Other HitM Other
+ # ..................... ..............
+ #
+ 3.5% 0.0% 96.5% 25.1% 74.9%
+
+But using the same name for sort keys will aggregate samples for each type
+separately.
+
+ $ perf mem report -s mem,snoop
+ # Overhead Samples Memory access Snoop
+ # ........ ............ ....................................... ............
+ #
+ 47.99% 1509 L2 hit N/A
+ 25.08% 338 core, same node Any cache hit HitM
+ 10.24% 54374 N/A N/A
+ 6.77% 35938 L1 hit N/A
+ 6.39% 101 core, same node Any cache hit N/A
+ 3.50% 69 RAM hit N/A
+ 0.03% 158 LFB/MAB hit N/A
+ 0.00% 2 Uncached hit N/A
+
SEE ALSO
--------
linkperf:perf-record[1], linkperf:perf-report[1], linkperf:perf-arm-spe[1]