Diffstat (limited to 'tools/perf/Documentation/perf-c2c.txt')
-rw-r--r--  tools/perf/Documentation/perf-c2c.txt | 132
1 file changed, 101 insertions(+), 31 deletions(-)
diff --git a/tools/perf/Documentation/perf-c2c.txt b/tools/perf/Documentation/perf-c2c.txt
index 095aebdc5bb7..40b0f71a2c44 100644
--- a/tools/perf/Documentation/perf-c2c.txt
+++ b/tools/perf/Documentation/perf-c2c.txt
@@ -9,7 +9,7 @@ SYNOPSIS
 --------
 [verse]
 'perf c2c record' [<options>] <command>
-'perf c2c record' [<options>] -- [<record command options>] <command>
+'perf c2c record' [<options>] \-- [<record command options>] <command>
 'perf c2c report' [<options>]
 
 DESCRIPTION
@@ -19,8 +19,16 @@ C2C stands for Cache To Cache.
 The perf c2c tool provides means for Shared Data C2C/HITM analysis. It allows
 you to track down the cacheline contentions.
 
-The tool is based on x86's load latency and precise store facility events
-provided by Intel CPUs. These events provide:
+On Intel, the tool is based on load latency and precise store facility events
+provided by Intel CPUs. On PowerPC, the tool uses random instruction sampling
+with thresholding feature. On AMD, the tool uses IBS op pmu (due to hardware
+limitations, perf c2c is not supported on Zen3 cpus). On Arm64 it uses SPE to
+sample load and store operations, therefore hardware and kernel support is
+required. See linkperf:perf-arm-spe[1] for a setup guide. Due to the
+statistical nature of Arm SPE sampling, not every memory operation will be
+sampled.
+
+These events provide:
 - memory address of the access
 - type of the access (load and store details)
 - latency (in cycles) of the load access
@@ -37,7 +45,7 @@ RECORD OPTIONS
 --------------
 -e::
 --event=::
-	Select the PMU event. Use 'perf mem record -e list'
+	Select the PMU event. Use 'perf c2c record -e list'
 	to list available events.
 
 -v::
@@ -46,7 +54,15 @@ RECORD OPTIONS
 
 -l::
 --ldlat::
-	Configure mem-loads latency.
+	Configure mem-loads latency. Supported on Intel, Arm64 and some AMD
+	processors. Ignored on other archs.
+
+	On supported AMD processors:
+	- /sys/bus/event_source/devices/ibs_op/caps/ldlat file contains '1'.
+	- Supported latency values are 128 to 2048 (both inclusive).
+	- A latency value which is a multiple of 128 incurs slightly less
+	  profiling overhead than other values.
+	- Load latency filtering is disabled by default.
 
 -k::
 --all-kernel::
@@ -106,7 +122,33 @@ REPORT OPTIONS
 
 -d::
 --display::
-	Switch to HITM type (rmt, lcl) to display and sort on. Total HITMs as default.
+	Switch to HITM type (rmt, lcl) or peer snooping type (peer) to display
+	and sort on. Total HITMs (tot) as default, except Arm64 uses peer mode
+	as default.
+
+--stitch-lbr::
+	Show callgraph with stitched LBRs, which may have more complete
+	callgraph. The perf.data file must have been obtained using
+	perf c2c record --call-graph lbr.
+	Disabled by default. In common cases with call stack overflows,
+	it can recreate better call stacks than the default lbr call stack
+	output. But this approach is not foolproof. There can be cases
+	where it creates incorrect call stacks from incorrect matches.
+	The known limitations include exception handling, such as
+	setjmp/longjmp, whose calls/returns will not match.
+
+--double-cl::
+	Group the detection of shared cacheline events into double cacheline
+	granularity. Some architectures have an Adjacent Cacheline Prefetch
+	feature, which causes cacheline sharing to behave like the cacheline
+	size is doubled.
+
+-M::
+--disassembler-style=::
+	Set disassembler style for objdump.
+
+--objdump=<path>::
+	Path to objdump binary.
 
 C2C RECORD
 ----------
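As a quick usage sketch tying together the record and report options documented above (./my_workload is a placeholder for any command; the flags are the ones described in the hunks above):

  # filter loads by latency; on AMD IBS, a multiple of 128 adds the least overhead
  perf c2c record -l 256 -- ./my_workload

  # sort the report on remote HITMs and group at double-cacheline granularity
  perf c2c report -d rmt --double-cl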
@@ -119,11 +161,20 @@ Following perf record options are configured by default:
   -W,-d,--phys-data,--sample-cpu
 
 Unless specified otherwise with '-e' option, following events are monitored by
-default:
+default on Intel:
 
   cpu/mem-loads,ldlat=30/P
   cpu/mem-stores/P
 
+following on AMD:
+
+  ibs_op//
+
+and following on PowerPC:
+
+  cpu/mem-loads/
+  cpu/mem-stores/
+
 User can pass any 'perf record' option behind '--' mark, like (to enable
 callchains and system wide monitoring):
 
@@ -155,42 +206,57 @@ For each cacheline in the 1) list we display following data:
   Cacheline
   - cacheline address (hex number)
 
-  Total records
-  - sum of all cachelines accesses
-
-  Rmt/Lcl Hitm
+  Rmt/Lcl Hitm (Display with HITM types)
   - cacheline percentage of all Remote/Local HITM accesses
 
-  LLC Load Hitm - Total, Lcl, Rmt
+  Peer Snoop (Display with peer type)
+  - cacheline percentage of all peer accesses
+
+  LLC Load Hitm - Total, LclHitm, RmtHitm (For display with HITM types)
   - count of Total/Local/Remote load HITMs
 
-  Store Reference - Total, L1Hit, L1Miss
-    Total - all store accesses
-    L1Hit - store accesses that hit L1
-    L1Hit - store accesses that missed L1
+  Load Peer - Total, Local, Remote (For display with peer type)
+  - count of Total/Local/Remote loads from peer cache or DRAM
 
-  Load Dram
-  - count of local and remote DRAM accesses
-
-  LLC Ld Miss
-  - count of all accesses that missed LLC
+  Total records
+  - sum of all cacheline accesses
 
-  Total Loads
+  Total loads
   - sum of all load accesses
 
+  Total stores
+  - sum of all store accesses
+
+  Store Reference - L1Hit, L1Miss, N/A
+    L1Hit - store accesses that hit L1
+    L1Miss - store accesses that missed L1
+    N/A - store accesses whose memory level is not available
+
   Core Load Hit - FB, L1, L2
   - count of load hits in FB (Fill Buffer), L1 and L2 cache
 
-  LLC Load Hit - Llc, Rmt
-  - count of LLC and Remote load hits
+  LLC Load Hit - LlcHit, LclHitm
+  - count of LLC load accesses, includes LLC hits and LLC HITMs
+
+  RMT Load Hit - RmtHit, RmtHitm
+  - count of remote load accesses, includes remote hits and remote HITMs;
+    on Arm neoverse cores, RmtHit is used to account for remote accesses,
+    including remote DRAM or any upward cache level in the remote node
+
+  Load Dram - Lcl, Rmt
+  - count of local and remote DRAM accesses
 
 For each offset in the 2) list we display following data:
 
-  HITM - Rmt, Lcl
+  HITM - Rmt, Lcl (Display with HITM types)
   - % of Remote/Local HITM accesses for given offset within cacheline
 
-  Store Refs - L1 Hit, L1 Miss
-  - % of store accesses that hit/missed L1 for given offset within cacheline
+  Peer Snoop - Rmt, Lcl (Display with peer type)
+  - % of Remote/Local peer accesses for given offset within cacheline
+
+  Store Refs - L1 Hit, L1 Miss, N/A
+  - % of store accesses that hit L1, missed L1 or have no available memory
+    level, for given offset within cacheline
 
   Data address - Offset
   - offset address
@@ -204,9 +270,12 @@ For each offset in the 2) list we display following data:
   Code address
   - code address responsible for the accesses
 
-  cycles - rmt hitm, lcl hitm, load
+  cycles - rmt hitm, lcl hitm, load (Display with HITM types)
   - sum of cycles for given accesses - Remote/Local HITM and generic load
 
+  cycles - rmt peer, lcl peer, load (Display with peer type)
+  - sum of cycles for given accesses - Remote/Local peer load and generic load
+
   cpu cnt
   - number of cpus that participated on the access
 
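A sketch of a session combining the default events above with the report options documented earlier (./my_workload is a placeholder; LBR call stacks assume a CPU with LBR support):

  # record with LBR call stacks in addition to the default memory events
  perf c2c record --call-graph lbr -- ./my_workload

  # browse contended cachelines with stitched LBR call stacks
  perf c2c report --stitch-lbr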
@@ -228,7 +297,8 @@ The 'Node' field displays nodes that accesses given cacheline offset.
 Its output comes in 3 flavors:
   - node IDs separated by ','
   - node IDs with stats for each ID, in following format:
-      Node{cpus %hitms %stores}
+      Node{cpus %hitms %stores} (Display with HITM types)
+      Node{cpus %peers %stores} (Display with peer type)
   - node IDs with list of affected CPUs in following format:
     Node{cpu list}
 
@@ -240,7 +310,7 @@ COALESCE
 User can specify how to sort offsets for cacheline.
 
 Following fields are available and governs the final
-output fields set for caheline offsets output:
+output fields set for cacheline offsets output:
 
   tid  - coalesced by process TIDs
   pid  - coalesced by process PIDs
@@ -287,4 +357,4 @@ Check Joe's blog on c2c tool for detailed use case explanation:
 
 SEE ALSO
 --------
-linkperf:perf-record[1], linkperf:perf-mem[1]
+linkperf:perf-record[1], linkperf:perf-mem[1], linkperf:perf-arm-spe[1]
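The COALESCE fields above are passed to perf c2c report via its -c/--coalesce option (a usage sketch; pid,iaddr is the default):

  # group cacheline-offset rows by thread rather than by process
  perf c2c report -c tid,iaddr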
