summaryrefslogtreecommitdiff
AgeCommit message (Collapse)Author
2018-08-09make sure that __dentry_kill() always invalidates d_seq, unhashed or notAl Viro
RCU pathwalk relies upon the assumption that anything that changes ->d_inode of a dentry will invalidate its ->d_seq. That's almost true - the one exception is that the final dput() of already unhashed dentry does *not* touch ->d_seq at all. Unhashing does, though, so for anything we'd found by RCU dcache lookup we are fine. Unfortunately, we can *start* with an unhashed dentry or jump into it. We could try and be careful in the (few) places where that could happen. Or we could just make the final dput() invalidate the damn thing, unhashed or not. The latter is much simpler and easier to backport, so let's do it that way. Reported-by: "Dae R. Jeong" <threeearcat@gmail.com> Cc: stable@vger.kernel.org Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-08-09fix __legitimize_mnt()/mntput() raceAl Viro
__legitimize_mnt() has two problems - one is that in case of success the check of mount_lock is not ordered wrt preceding increment of refcount, making it possible to have successful __legitimize_mnt() on one CPU just before the otherwise final mntpu() on another, with __legitimize_mnt() not seeing mntput() taking the lock and mntput() not seeing the increment done by __legitimize_mnt(). Solved by a pair of barriers. Another is that failure of __legitimize_mnt() on the second read_seqretry() leaves us with reference that'll need to be dropped by caller; however, if that races with final mntput() we can end up with caller dropping rcu_read_lock() and doing mntput() to release that reference - with the first mntput() having freed the damn thing just as rcu_read_lock() had been dropped. Solution: in "do mntput() yourself" failure case grab mount_lock, check if MNT_DOOMED has been set by racing final mntput() that has missed our increment and if it has - undo the increment and treat that as "failure, caller doesn't need to drop anything" case. It's not easy to hit - the final mntput() has to come right after the first read_seqretry() in __legitimize_mnt() *and* manage to miss the increment done by __legitimize_mnt() before the second read_seqretry() in there. The things that are almost impossible to hit on bare hardware are not impossible on SMP KVM, though... Reported-by: Oleg Nesterov <oleg@redhat.com> Fixes: 48a066e72d97 ("RCU'd vsfmounts") Cc: stable@vger.kernel.org Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-08-09fix mntput/mntput raceAl Viro
mntput_no_expire() does the calculation of total refcount under mount_lock; unfortunately, the decrement (as well as all increments) are done outside of it, leading to false positives in the "are we dropping the last reference" test. Consider the following situation: * mnt is a lazy-umounted mount, kept alive by two opened files. One of those files gets closed. Total refcount of mnt is 2. On CPU 42 mntput(mnt) (called from __fput()) drops one reference, decrementing component * After it has looked at component #0, the process on CPU 0 does mntget(), incrementing component #0, gets preempted and gets to run again - on CPU 69. There it does mntput(), which drops the reference (component #69) and proceeds to spin on mount_lock. * On CPU 42 our first mntput() finishes counting. It observes the decrement of component #69, but not the increment of component #0. As the result, the total it gets is not 1 as it should've been - it's 0. At which point we decide that vfsmount needs to be killed and proceed to free it and shut the filesystem down. However, there's still another opened file on that filesystem, with reference to (now freed) vfsmount, etc. and we are screwed. It's not a wide race, but it can be reproduced with artificial slowdown of the mnt_get_count() loop, and it should be easier to hit on SMP KVM setups. Fix consists of moving the refcount decrement under mount_lock; the tricky part is that we want (and can) keep the fast case (i.e. mount that still has non-NULL ->mnt_ns) entirely out of mount_lock. All places that zero mnt->mnt_ns are dropping some reference to mnt and they call synchronize_rcu() before that mntput(). IOW, if mntput() observes (under rcu_read_lock()) a non-NULL ->mnt_ns, it is guaranteed that there is another reference yet to be dropped. Reported-by: Jann Horn <jannh@google.com> Tested-by: Jann Horn <jannh@google.com> Fixes: 48a066e72d97 ("RCU'd vsfmounts") Cc: stable@vger.kernel.org Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-08-09Merge branch 'bpf-fix-cpu-and-devmap-teardown'Daniel Borkmann
Jesper Dangaard Brouer says: ==================== Removing entries from cpumap and devmap, goes through a number of syncronization steps to make sure no new xdp_frames can be enqueued. But there is a small chance, that xdp_frames remains which have not been flushed/processed yet. Flushing these during teardown, happens from RCU context and not as usual under RX NAPI context. The optimization introduced in commt 389ab7f01af9 ("xdp: introduce xdp_return_frame_rx_napi"), missed that the flush operation can also be called from RCU context. Thus, we cannot always use the xdp_return_frame_rx_napi call, which take advantage of the protection provided by XDP RX running under NAPI protection. The samples/bpf xdp_redirect_cpu have a --stress-mode, that is adjusted to easier reproduce (verified by Red Hat QA). ==================== Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-08-09xdp: fix bug in devmap teardown code pathJesper Dangaard Brouer
Like cpumap teardown, the devmap teardown code also flush remaining xdp_frames, via bq_xmit_all() in case map entry is removed. The code can call xdp_return_frame_rx_napi, from the the wrong context, in-case ndo_xdp_xmit() fails. Fixes: 389ab7f01af9 ("xdp: introduce xdp_return_frame_rx_napi") Fixes: 735fc4054b3a ("xdp: change ndo_xdp_xmit API to support bulking") Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-08-09samples/bpf: xdp_redirect_cpu adjustment to reproduce teardown race easierJesper Dangaard Brouer
The teardown race in cpumap is really hard to reproduce. These changes makes it easier to reproduce, for QA. The --stress-mode now have a case of a very small queue size of 8, that helps to trigger teardown flush to encounter a full queue, which results in calling xdp_return_frame API, in a non-NAPI protect context. Also increase MAX_CPUS, as my QA department have larger machines than me. Tested-by: Jean-Tsung Hsiao <jhsiao@redhat.com> Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-08-09xdp: fix bug in cpumap teardown code pathJesper Dangaard Brouer
When removing a cpumap entry, a number of syncronization steps happen. Eventually the teardown code __cpu_map_entry_free is invoked from/via call_rcu. The teardown code __cpu_map_entry_free() flushes remaining xdp_frames, by invoking bq_flush_to_queue, which calls xdp_return_frame_rx_napi(). The issues is that the teardown code is not running in the RX NAPI code path. Thus, it is not allowed to invoke the NAPI variant of xdp_return_frame. This bug was found and triggered by using the --stress-mode option to the samples/bpf program xdp_redirect_cpu. It is hard to trigger, because the ptr_ring have to be full and cpumap bulk queue max contains 8 packets, and a remote CPU is racing to empty the ptr_ring queue. Fixes: 389ab7f01af9 ("xdp: introduce xdp_return_frame_rx_napi") Tested-by: Jean-Tsung Hsiao <jhsiao@redhat.com> Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-08-09x86/relocs: Add __end_rodata_aligned to S_RELJoerg Roedel
This new symbol needs to be in the workaround-list for buggy binutils, otherwise the build with gcc-4.6 fails. Fixes: 39d668e04eda ('x86/mm/pti: Make pti_clone_kernel_text() compile on 32 bit') Reported-by: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Joerg Roedel <jroedel@suse.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Sedat Dilek <sedat.dilek@gmail.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Linux-Next Mailing List <linux-next@vger.kernel.org> Link: https://lkml.kernel.org/r/20180809094449.ddmnrkz7qkvo3j2x@suse.de
2018-08-09perf probe powerpc: Fix trace event post-processingSandipan Das
In some cases, a symbol may have multiple aliases. Attempting to add an entry probe for such symbols results in a probe being added at an incorrect location while it fails altogether for return probes. This is only applicable for binaries with debug information. During the arch-dependent post-processing, the offset from the start of the symbol at which the probe is to be attached is determined and added to the start address of the symbol to get the probe's location. In case there are multiple aliases, this offset gets added multiple times for each alias of the symbol and we end up with an incorrect probe location. This can be verified on a powerpc64le system as shown below. $ nm /lib/modules/$(uname -r)/build/vmlinux | grep "sys_open$" ... c000000000414290 T __se_sys_open c000000000414290 T sys_open $ objdump -d /lib/modules/$(uname -r)/build/vmlinux | grep -A 10 "<__se_sys_open>:" c000000000414290 <__se_sys_open>: c000000000414290: 19 01 4c 3c addis r2,r12,281 c000000000414294: 70 c4 42 38 addi r2,r2,-15248 c000000000414298: a6 02 08 7c mflr r0 c00000000041429c: e8 ff a1 fb std r29,-24(r1) c0000000004142a0: f0 ff c1 fb std r30,-16(r1) c0000000004142a4: f8 ff e1 fb std r31,-8(r1) c0000000004142a8: 10 00 01 f8 std r0,16(r1) c0000000004142ac: c1 ff 21 f8 stdu r1,-64(r1) c0000000004142b0: 78 23 9f 7c mr r31,r4 c0000000004142b4: 78 1b 7e 7c mr r30,r3 For both the entry probe and the return probe, the probe location should be _text+4276888 (0xc000000000414298). Since another alias exists for 'sys_open', the post-processing code will end up adding the offset (8 for powerpc64le) twice and perf will attempt to add the probe at _text+4276896 (0xc0000000004142a0) instead. Before: # perf probe -v -a sys_open probe-definition(0): sys_open symbol:sys_open file:(null) line:0 offset:0 return:0 lazy:(null) 0 arguments Looking at the vmlinux_path (8 entries long) Using /lib/modules/4.18.0-rc8+/build/vmlinux for symbols Open Debuginfo file: /lib/modules/4.18.0-rc8+/build/vmlinux Try to find probe point from debuginfo. Symbol sys_open address found : c000000000414290 Matched function: __se_sys_open [2ad03a0] Probe point found: __se_sys_open+0 Found 1 probe_trace_events. Opening /sys/kernel/debug/tracing/kprobe_events write=1 Writing event: p:probe/sys_open _text+4276896 Added new event: probe:sys_open (on sys_open) ... # perf probe -v -a sys_open%return $retval probe-definition(0): sys_open%return symbol:sys_open file:(null) line:0 offset:0 return:1 lazy:(null) 0 arguments Looking at the vmlinux_path (8 entries long) Using /lib/modules/4.18.0-rc8+/build/vmlinux for symbols Open Debuginfo file: /lib/modules/4.18.0-rc8+/build/vmlinux Try to find probe point from debuginfo. Symbol sys_open address found : c000000000414290 Matched function: __se_sys_open [2ad03a0] Probe point found: __se_sys_open+0 Found 1 probe_trace_events. Opening /sys/kernel/debug/tracing/README write=0 Opening /sys/kernel/debug/tracing/kprobe_events write=1 Parsing probe_events: p:probe/sys_open _text+4276896 Group:probe Event:sys_open probe:p Writing event: r:probe/sys_open__return _text+4276896 Failed to write event: Invalid argument Error: Failed to add events. Reason: Invalid argument (Code: -22) After: # perf probe -v -a sys_open probe-definition(0): sys_open symbol:sys_open file:(null) line:0 offset:0 return:0 lazy:(null) 0 arguments Looking at the vmlinux_path (8 entries long) Using /lib/modules/4.18.0-rc8+/build/vmlinux for symbols Open Debuginfo file: /lib/modules/4.18.0-rc8+/build/vmlinux Try to find probe point from debuginfo. Symbol sys_open address found : c000000000414290 Matched function: __se_sys_open [2ad03a0] Probe point found: __se_sys_open+0 Found 1 probe_trace_events. Opening /sys/kernel/debug/tracing/kprobe_events write=1 Writing event: p:probe/sys_open _text+4276888 Added new event: probe:sys_open (on sys_open) ... # perf probe -v -a sys_open%return $retval probe-definition(0): sys_open%return symbol:sys_open file:(null) line:0 offset:0 return:1 lazy:(null) 0 arguments Looking at the vmlinux_path (8 entries long) Using /lib/modules/4.18.0-rc8+/build/vmlinux for symbols Open Debuginfo file: /lib/modules/4.18.0-rc8+/build/vmlinux Try to find probe point from debuginfo. Symbol sys_open address found : c000000000414290 Matched function: __se_sys_open [2ad03a0] Probe point found: __se_sys_open+0 Found 1 probe_trace_events. Opening /sys/kernel/debug/tracing/README write=0 Opening /sys/kernel/debug/tracing/kprobe_events write=1 Parsing probe_events: p:probe/sys_open _text+4276888 Group:probe Event:sys_open probe:p Writing event: r:probe/sys_open__return _text+4276888 Added new event: probe:sys_open__return (on sys_open%return) ... Reported-by: Aneesh Kumar <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Sandipan Das <sandipan@linux.ibm.com> Acked-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com> Cc: Aneesh Kumar <aneesh.kumar@linux.vnet.ibm.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Ravi Bangoria <ravi.bangoria@linux.ibm.com> Fixes: 99e608b5954c ("perf probe ppc64le: Fix probe location when using DWARF") Link: http://lkml.kernel.org/r/20180809161929.35058-1-sandipan@linux.ibm.com Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2018-08-09Merge branch 'linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6 Pull crypto fix from Herbert Xu: "This fixes a performance regression in arm64 NEON crypto as well as a crash in x86 aegis/morus on unsupported CPUs" * 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6: crypto: x86/aegis,morus - Fix and simplify CPUID checks crypto: arm64 - revert NEON yield for fast AEAD implementations
2018-08-09Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netLinus Torvalds
Pull networking fixes from David Miller: 1) The real fix for the ipv6 route metric leak Sabrina was seeing, from Cong Wang. 2) Fix syzbot triggers AF_PACKET v3 ring buffer insufficient room conditions, from Willem de Bruijn. 3) vsock can reinitialize active work struct, fix from Cong Wang. 4) RXRPC keepalive generator can wedge a cpu, fix from David Howells. 5) Fix locking in AF_SMC ioctl, from Ursula Braun. * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: dsa: slave: eee: Allow ports to use phylink net/smc: move sock lock in smc_ioctl() net/smc: allow sysctl rmem and wmem defaults for servers net/smc: no shutdown in state SMC_LISTEN net: aquantia: Fix IFF_ALLMULTI flag functionality rxrpc: Fix the keepalive generator [ver #2] net/mlx5e: Cleanup of dcbnl related fields net/mlx5e: Properly check if hairpin is possible between two functions vhost: reset metadata cache when initializing new IOTLB llc: use refcount_inc_not_zero() for llc_sap_find() dccp: fix undefined behavior with 'cwnd' shift in ccid2_cwnd_restart() tipc: fix an interrupt unsafe locking scenario vsock: split dwork to avoid reinitializations net: thunderx: check for failed allocation lmac->dmacs cxgb4: mk_act_open_req() buggers ->{local, peer}_ip on big-endian hosts packet: refine ring v3 block size test to hold one frame ip6_tunnel: use the right value for ipv4 min mtu check in ip6_tnl_xmit ipv6: fix double refcount of fib6_metrics
2018-08-09i2c: xlp9xx: Fix case where SSIF read transaction completes earlyGeorge Cherian
During ipmi stress tests we see occasional failure of transactions at the boot time. This happens in the case of a I2C_M_RECV_LEN transactions, when the read transfer completes (with the initial read length of 34) before the driver gets a chance to handle interrupts. The current driver code expects at least 2 interrupts for I2C_M_RECV_LEN transactions. The length is updated during the first interrupt, and the buffer contents are only copied during subsequent interrupts. In case of just one interrupt, we will complete the transaction without copying out the bytes from RX fifo. Update the code to drain the RX fifo after the length update, so that the transaction completes correctly in all cases. Signed-off-by: George Cherian <george.cherian@cavium.com> Signed-off-by: Wolfram Sang <wsa@the-dreams.de> Cc: stable@kernel.org
2018-08-08dsa: slave: eee: Allow ports to use phylinkAndrew Lunn
For a port to be able to use EEE, both the MAC and the PHY must support EEE. A phy can be provided by both a phydev or phylink. Verify at least one of these exist, not just phydev. Fixes: aab9c4067d23 ("net: dsa: Plug in PHYLINK support") Signed-off-by: Andrew Lunn <andrew@lunn.ch> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-08Merge branch 'smc-fixes'David S. Miller
Ursula Braun says: ==================== net/smc: fixes 2018-08-08 here are small fixes for SMC: The first patch makes sure, shutdown code is not executed for sockets in state SMC_LISTEN. The second patch resets send and receive buffer values for accepted sockets, since TCP buffer size optimizations for the internal CLC socket should not be forwarded to the outer SMC socket. The third patch solves a race between connect and ioctl reported by syzbot. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-08net/smc: move sock lock in smc_ioctl()Ursula Braun
When an SMC socket is connecting it is decided whether fallback to TCP is needed. To avoid races between connect and ioctl move the sock lock before the use_fallback check. Reported-by: syzbot+5b2cece1a8ecb2ca77d8@syzkaller.appspotmail.com Reported-by: syzbot+19557374321ca3710990@syzkaller.appspotmail.com Fixes: 1992d99882af ("net/smc: take sock lock in smc_ioctl()") Signed-off-by: Ursula Braun <ubraun@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-08net/smc: allow sysctl rmem and wmem defaults for serversUrsula Braun
Without setsockopt SO_SNDBUF and SO_RCVBUF settings, the sysctl defaults net.ipv4.tcp_wmem and net.ipv4.tcp_rmem should be the base for the sizes of the SMC sndbuf and rcvbuf. Any TCP buffer size optimizations for servers should be ignored. Signed-off-by: Ursula Braun <ubraun@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-08net/smc: no shutdown in state SMC_LISTENUrsula Braun
Invoking shutdown for a socket in state SMC_LISTEN does not make sense. Nevertheless programs like syzbot fuzzing the kernel may try to do this. For SMC this means a socket refcounting problem. This patch makes sure a shutdown call for an SMC socket in state SMC_LISTEN simply returns with -ENOTCONN. Signed-off-by: Ursula Braun <ubraun@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-08net: aquantia: Fix IFF_ALLMULTI flag functionalityDmitry Bogdanov
It was noticed that NIC always pass all multicast traffic to the host regardless of IFF_ALLMULTI flag on the interface. The rule in MC Filter Table in NIC, that is configured to accept any multicast packets, is turning on if IFF_MULTICAST flag is set on the interface. It leads to passing all multicast traffic to the host. This fix changes the condition to turn on that rule by checking IFF_ALLMULTI flag as it should. Fixes: b21f502f84be ("net:ethernet:aquantia: Fix for multicast filter handling.") Signed-off-by: Dmitry Bogdanov <dmitry.bogdanov@aquantia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-08rxrpc: Fix the keepalive generator [ver #2]David Howells
AF_RXRPC has a keepalive message generator that generates a message for a peer ~20s after the last transmission to that peer to keep firewall ports open. The implementation is incorrect in the following ways: (1) It mixes up ktime_t and time64_t types. (2) It uses ktime_get_real(), the output of which may jump forward or backward due to adjustments to the time of day. (3) If the current time jumps forward too much or jumps backwards, the generator function will crank the base of the time ring round one slot at a time (ie. a 1s period) until it catches up, spewing out VERSION packets as it goes. Fix the problem by: (1) Only using time64_t. There's no need for sub-second resolution. (2) Use ktime_get_seconds() rather than ktime_get_real() so that time isn't perceived to go backwards. (3) Simplifying rxrpc_peer_keepalive_worker() by splitting it into two parts: (a) The "worker" function that manages the buckets and the timer. (b) The "dispatch" function that takes the pending peers and potentially transmits a keepalive packet before putting them back in the ring into the slot appropriate to the revised last-Tx time. (4) Taking everything that's pending out of the ring and splicing it into a temporary collector list for processing. In the case that there's been a significant jump forward, the ring gets entirely emptied and then the time base can be warped forward before the peers are processed. The warping can't happen if the ring isn't empty because the slot a peer is in is keepalive-time dependent, relative to the base time. (5) Limit the number of iterations of the bucket array when scanning it. (6) Set the timer to skip any empty slots as there's no point waking up if there's nothing to do yet. This can be triggered by an incoming call from a server after a reboot with AF_RXRPC and AFS built into the kernel causing a peer record to be set up before userspace is started. The system clock is then adjusted by userspace, thereby potentially causing the keepalive generator to have a meltdown - which leads to a message like: watchdog: BUG: soft lockup - CPU#0 stuck for 23s! [kworker/0:1:23] ... Workqueue: krxrpcd rxrpc_peer_keepalive_worker EIP: lock_acquire+0x69/0x80 ... Call Trace: ? rxrpc_peer_keepalive_worker+0x5e/0x350 ? _raw_spin_lock_bh+0x29/0x60 ? rxrpc_peer_keepalive_worker+0x5e/0x350 ? rxrpc_peer_keepalive_worker+0x5e/0x350 ? __lock_acquire+0x3d3/0x870 ? process_one_work+0x110/0x340 ? process_one_work+0x166/0x340 ? process_one_work+0x110/0x340 ? worker_thread+0x39/0x3c0 ? kthread+0xdb/0x110 ? cancel_delayed_work+0x90/0x90 ? kthread_stop+0x70/0x70 ? ret_from_fork+0x19/0x24 Fixes: ace45bec6d77 ("rxrpc: Fix firewall route keepalive") Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-08Merge branch 'mlx5-fixes'David S. Miller
Saeed Mahameed says: ==================== Mellanox, mlx5e fixes 2018-08-07 I know it is late into 4.18 release, and this is why I am submitting only two mlx5e ethernet fixes. The first one from Or, is needed for -stable and it fixes hairpin for "same device" check. The second fix is a non risk fix from Huy which cleans up and improves error return value reporting for dcbnl_ieee_setapp. For -stable v4.16 - net/mlx5e: Properly check if hairpin is possible between two functions ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-08net/mlx5e: Cleanup of dcbnl related fieldsHuy Nguyen
Remove unused netdev_registered_init/remove in en.h Return ENOSUPPORT if the check MLX5_DSCP_SUPPORTED fails. Remove extra white space Fixes: 2a5e7a1344f4 ("net/mlx5e: Add dcbnl dscp to priority support") Signed-off-by: Huy Nguyen <huyn@mellanox.com> Cc: Yuval Shaia <yuval.shaia@oracle.com> Reviewed-by: Parav Pandit <parav@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-08net/mlx5e: Properly check if hairpin is possible between two functionsOr Gerlitz
The current check relies on function BDF addresses and can get us wrong e.g when two VFs are assigned into a VM and the PCI v-address is set by the hypervisor. Fixes: 5c65c564c962 ('net/mlx5e: Support offloading TC NIC hairpin flows') Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com> Reported-by: Alaa Hleihel <alaa@mellanox.com> Tested-by: Alaa Hleihel <alaa@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-08parisc: Define mb() and add memory barriers to assembler unlock sequencesJohn David Anglin
For years I thought all parisc machines executed loads and stores in order. However, Jeff Law recently indicated on gcc-patches that this is not correct. There are various degrees of out-of-order execution all the way back to the PA7xxx processor series (hit-under-miss). The PA8xxx series has full out-of-order execution for both integer operations, and loads and stores. This is described in the following article: http://web.archive.org/web/20040214092531/http://www.cpus.hp.com/technical_references/advperf.shtml For this reason, we need to define mb() and to insert a memory barrier before the store unlocking spinlocks. This ensures that all memory accesses are complete prior to unlocking. The ldcw instruction performs the same function on entry. Signed-off-by: John David Anglin <dave.anglin@bell.net> Cc: stable@vger.kernel.org # 4.0+ Signed-off-by: Helge Deller <deller@gmx.de>
2018-08-08parisc: Enable CONFIG_MLONGCALLS by defaultHelge Deller
Enable the -mlong-calls compiler option by default, because otherwise in most cases linking the vmlinux binary fails due to truncations of R_PARISC_PCREL22F relocations. This fixes building the 64-bit defconfig. Cc: stable@vger.kernel.org # 4.0+ Signed-off-by: Helge Deller <deller@gmx.de>
2018-08-08Merge branch 'sockmap-fixes'Alexei Starovoitov
Daniel Borkmann says: ==================== Two sockmap fixes in bpf_tcp_sendmsg(), and one fix for the sockmap kernel selftest. Thanks! ==================== Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-08-08bpf, sockmap: fix cork timeout for select due to epipeDaniel Borkmann
I ran into the same issue as a009f1f396d0 ("selftests/bpf: test_sockmap, timing improvements") where I had a broken pipe error on the socket due to remote end timing out on select and then shutting down it's sockets while the other side was still sending. We may need to do a bigger rework in general on the test_sockmap.c, but for now increase it to a more suitable timeout. Fixes: a18fda1a62c3 ("bpf: reduce runtime of test_sockmap tests") Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: John Fastabend <john.fastabend@gmail.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-08-08bpf, sockmap: fix leak in bpf_tcp_sendmsg wait for mem pathDaniel Borkmann
In bpf_tcp_sendmsg() the sk_alloc_sg() may fail. In the case of ENOMEM, it may also mean that we've partially filled the scatterlist entries with pages. Later jumping to sk_stream_wait_memory() we could further fail with an error for several reasons, however we miss to call free_start_sg() if the local sk_msg_buff was used. Fixes: 4f738adba30a ("bpf: create tcp_bpf_ulp allowing BPF to monitor socket TX/RX data") Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: John Fastabend <john.fastabend@gmail.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-08-08bpf, sockmap: fix bpf_tcp_sendmsg sock error handlingDaniel Borkmann
While working on bpf_tcp_sendmsg() code, I noticed that when a sk->sk_err is set we error out with err = sk->sk_err. However this is problematic since sk->sk_err is a positive error value and therefore we will neither go into sk_stream_error() nor will we report an error back to user space. I had this case with EPIPE and user space was thinking sendmsg() succeeded since EPIPE is a positive value, thinking we submitted 32 bytes. Fix it by negating the sk->sk_err value. Fixes: 4f738adba30a ("bpf: create tcp_bpf_ulp allowing BPF to monitor socket TX/RX data") Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: John Fastabend <john.fastabend@gmail.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-08-08perf map: Optimize maps__fixup_overlappings()Konstantin Khlebnikov
This function splits and removes overlapping areas. Maps in tree are ordered by start address thus we could find first overlap and stop if next map does not overlap. Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Acked-by: Jiri Olsa <jolsa@kernel.org> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/153365189407.435244.7234821822450484712.stgit@buzz Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2018-08-08perf map: Synthesize maps only for thread group leaderKonstantin Khlebnikov
Threads share map_groups, all map events are merged into it. Thus we could send mmaps only for thread group leader. Otherwise it took ages to attach and record something from processes with many vmas and threads. Thread group leader could be already dead, but it seems perf cannot handle this case anyway. Testing dummy: #include <stdio.h> #include <stdlib.h> #include <sys/mman.h> #include <pthread.h> #include <unistd.h> void *thread(void *arg) { pause(); } int main(int argc, char **argv) { int threads = 10000; int vmas = 50000; pthread_t th; for (int i = 0; i < threads; i++) pthread_create(&th, NULL, thread, NULL); for (int i = 0; i < vmas; i++) mmap(NULL, 4096, (i & 1) ? PROT_READ : PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0); sleep(60); return 0; } Comment by Jiri Olsa: We actualy synthesize the group leader (if we found one) for the thread even if it's not present in the thread_map, so the process maps are always in data. Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Acked-by: Jiri Olsa <jolsa@kernel.org> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/153363294102.396323.6277944760215058174.stgit@buzz Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2018-08-08perf trace: Wire up the augmented syscalls with the syscalls:sys_enter_FOO ↵Arnaldo Carvalho de Melo
beautifier We just check that the evsel is the one we associated with the bpf-output event associated with the "__augmented_syscalls__" eBPF map, to show that the formatting is done properly: # perf trace -e perf/tools/perf/examples/bpf/augmented_syscalls.c,openat cat /etc/passwd > /dev/null 0.000 ( ): __augmented_syscalls__:dfd: CWD, filename: 0x43e06da8, flags: CLOEXEC 0.006 ( ): syscalls:sys_enter_openat:dfd: CWD, filename: 0x43e06da8, flags: CLOEXEC 0.007 ( 0.004 ms): cat/11486 openat(dfd: CWD, filename: 0x43e06da8, flags: CLOEXEC ) = 3 0.029 ( ): __augmented_syscalls__:dfd: CWD, filename: 0x4400ece0, flags: CLOEXEC 0.030 ( ): syscalls:sys_enter_openat:dfd: CWD, filename: 0x4400ece0, flags: CLOEXEC 0.031 ( 0.004 ms): cat/11486 openat(dfd: CWD, filename: 0x4400ece0, flags: CLOEXEC ) = 3 0.249 ( ): __augmented_syscalls__:dfd: CWD, filename: 0xc3700d6 0.250 ( ): syscalls:sys_enter_openat:dfd: CWD, filename: 0xc3700d6 0.252 ( 0.003 ms): cat/11486 openat(dfd: CWD, filename: 0xc3700d6 ) = 3 # Now we just need to get the full blown enter/exit handlers to check if the evsel being processed is the augmented_syscalls one to go pick the pointer payloads from the end of the payload. We also need to state somehow what is the layout for multi pointer arg syscalls. Also handy would be to have a BTF file with the struct definitions used in syscalls, compact, generated at kernel built time and available for use in eBPF programs. Till we get there we can go on doing some manual coupling of the most relevant syscalls with some hand built beautifiers. Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: David Ahern <dsahern@gmail.com> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Wang Nan <wangnan0@huawei.com> Link: https://lkml.kernel.org/n/tip-r6ba5izrml82nwfmwcp7jpkm@git.kernel.org Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2018-08-08perf trace: Setup the augmented syscalls bpf-output event fieldsArnaldo Carvalho de Melo
The payload that is put in place by the eBPF script attached to syscalls:sys_enter_openat (and other syscalls with pointers, in the future) can be consumed by the existing sys_enter beautifiers if evsel->priv is setup with a struct syscall_tp with struct tp_fields for the 'syscall_id' and 'args' fields expected by the beautifiers, this patch does just that. Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: David Ahern <dsahern@gmail.com> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Wang Nan <wangnan0@huawei.com> Link: https://lkml.kernel.org/n/tip-xfjyog8oveg2fjys9r1yy1es@git.kernel.org Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2018-08-08perf bpf: Make bpf__setup_output_event() return the bpf-output eventArnaldo Carvalho de Melo
We're calling it to setup that event, and we'll need it later to decide if the bpf-output event we're handling is the one setup for a specific purpose, return it using ERR_PTR, etc. Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: David Ahern <dsahern@gmail.com> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Wang Nan <wangnan0@huawei.com> Link: https://lkml.kernel.org/n/tip-zhachv7il2n1lopt9aonwhu7@git.kernel.org Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2018-08-08perf trace: Handle "bpf-output" events associated with ↵Arnaldo Carvalho de Melo
"__augmented_syscalls__" BPF map Add an example BPF script that writes syscalls:sys_enter_openat raw tracepoint payloads augmented with the first 64 bytes of the "filename" syscall pointer arg. Then catch it and print it just like with things written to the "__bpf_stdout__" map associated with a PERF_COUNT_SW_BPF_OUTPUT software event, by just letting the default tracepoint handler in 'perf trace', trace__event_handler(), to use bpf_output__fprintf(trace, sample), just like it does with all other PERF_COUNT_SW_BPF_OUTPUT events, i.e. just do a dump on the payload, so that we can check if what is being printed has at least the first 64 bytes of the "filename" arg: The augmented_syscalls.c eBPF script: # cat tools/perf/examples/bpf/augmented_syscalls.c // SPDX-License-Identifier: GPL-2.0 #include <stdio.h> struct bpf_map SEC("maps") __augmented_syscalls__ = { .type = BPF_MAP_TYPE_PERF_EVENT_ARRAY, .key_size = sizeof(int), .value_size = sizeof(u32), .max_entries = __NR_CPUS__, }; struct syscall_enter_openat_args { unsigned long long common_tp_fields; long syscall_nr; long dfd; char *filename_ptr; long flags; long mode; }; struct augmented_enter_openat_args { struct syscall_enter_openat_args args; char filename[64]; }; int syscall_enter(openat)(struct syscall_enter_openat_args *args) { struct augmented_enter_openat_args augmented_args; probe_read(&augmented_args.args, sizeof(augmented_args.args), args); probe_read_str(&augmented_args.filename, sizeof(augmented_args.filename), args->filename_ptr); perf_event_output(args, &__augmented_syscalls__, BPF_F_CURRENT_CPU, &augmented_args, sizeof(augmented_args)); return 1; } license(GPL); # So it will just prepare a raw_syscalls:sys_enter payload for the "openat" syscall. This will eventually be done for all syscalls with pointer args, globally or just when the user asks, using some spec, which args of which syscalls it wants "expanded" this way, we'll probably start with just all the syscalls that have char * pointers with familiar names, the ones we already handle with the probe:vfs_getname kprobe if it is in place hooking the kernel getname_flags() function used to copy from user the paths. Running it we get: # perf trace -e perf/tools/perf/examples/bpf/augmented_syscalls.c,openat cat /etc/passwd > /dev/null 0.000 ( ): __augmented_syscalls__:X?.C......................`\..................../etc/ld.so.cache..#......,....ao.k...............k......1."......... 0.006 ( ): syscalls:sys_enter_openat:dfd: CWD, filename: 0x5c600da8, flags: CLOEXEC 0.008 ( 0.005 ms): cat/31292 openat(dfd: CWD, filename: 0x5c600da8, flags: CLOEXEC ) = 3 0.036 ( ): __augmented_syscalls__:X?.C.......................\..................../lib64/libc.so.6......... .\....#........?.......=.C..../."......... 0.037 ( ): syscalls:sys_enter_openat:dfd: CWD, filename: 0x5c808ce0, flags: CLOEXEC 0.039 ( 0.007 ms): cat/31292 openat(dfd: CWD, filename: 0x5c808ce0, flags: CLOEXEC ) = 3 0.323 ( ): __augmented_syscalls__:X?.C.....................P....................../etc/passwd......>.C....@................>.C.....,....ao.>.C........ 0.325 ( ): syscalls:sys_enter_openat:dfd: CWD, filename: 0xe8be50d6 0.327 ( 0.004 ms): cat/31292 openat(dfd: CWD, filename: 0xe8be50d6 ) = 3 # We need to go on optimizing this to avoid seding trash or zeroes in the pointer content payload, using the return from bpf_probe_read_str(), but to keep things simple at this stage and make incremental progress, lets leave it at that for now. Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: David Ahern <dsahern@gmail.com> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Wang Nan <wangnan0@huawei.com> Link: https://lkml.kernel.org/n/tip-g360n1zbj6bkbk6q0qo11c28@git.kernel.org Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2018-08-08perf bpf: Add wrappers to BPF_FUNC_probe_read(_str) functionsArnaldo Carvalho de Melo
Will be used shortly in the augmented syscalls work together with a PERF_COUNT_SW_BPF_OUTPUT software event to insert syscalls + pointer contents in the perf ring buffer, to be consumed by 'perf trace' beautifiers. Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: David Ahern <dsahern@gmail.com> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Wang Nan <wangnan0@huawei.com> Link: https://lkml.kernel.org/n/tip-ajlkpz4cd688ulx1u30htkj3@git.kernel.org Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2018-08-08perf bpf: Add bpf__setup_output_event() strerror() counterpartArnaldo Carvalho de Melo
That is just bpf__strerror_setup_stdout() renamed to the more general "setup_output_event" method, keep the existing stdout() as a wrapper. Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: David Ahern <dsahern@gmail.com> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Wang Nan <wangnan0@huawei.com> Link: https://lkml.kernel.org/n/tip-nwnveo428qn0b48axj50vkc7@git.kernel.org Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2018-08-08perf bpf: Generalize bpf__setup_stdout()Arnaldo Carvalho de Melo
We will use it to set up other bpf-output events, for instance to generate augmented syscall entry tracepoints with pointer contents. Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: David Ahern <dsahern@gmail.com> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Wang Nan <wangnan0@huawei.com> Link: https://lkml.kernel.org/n/tip-4r7kw0nsyi4vyz6xm1tzx6a3@git.kernel.org Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2018-08-08perf bpf: Make bpf__for_each_stdout_map() genericArnaldo Carvalho de Melo
By passing a 'name' arg, that will eventually be used to setup more "bpf-output" events, e.g. to create a event where to create raw_syscalls like events that in addition to the syscall arguments will also copy the pointer contents being passed from/to userspace. Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: David Ahern <dsahern@gmail.com> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Wang Nan <wangnan0@huawei.com> Link: https://lkml.kernel.org/n/tip-talrnxps9p3qozk3aeh91fgv@git.kernel.org Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2018-08-08perf bpf: Add bpf/stdio.h wrapper to bpf_perf_event_output functionArnaldo Carvalho de Melo
That, together with the map __bpf_output__ that is already handled by 'perf trace' to print that event's contents as strings provides a debugging facility, to show it in use, print a simple string everytime the syscalls:sys_enter_openat() syscall tracepoint is hit: # cat tools/perf/examples/bpf/hello.c #include <stdio.h> int syscall_enter(openat)(void *args) { puts("Hello, world\n"); return 0; } license(GPL); # # perf trace -e openat,tools/perf/examples/bpf/hello.c cat /etc/passwd > /dev/null 0.016 ( ): __bpf_stdout__:Hello, world 0.018 ( 0.010 ms): cat/9079 openat(dfd: CWD, filename: /etc/ld.so.cache, flags: CLOEXEC) = 3 0.057 ( ): __bpf_stdout__:Hello, world 0.059 ( 0.011 ms): cat/9079 openat(dfd: CWD, filename: /lib64/libc.so.6, flags: CLOEXEC) = 3 0.417 ( ): __bpf_stdout__:Hello, world 0.419 ( 0.009 ms): cat/9079 openat(dfd: CWD, filename: /etc/passwd) = 3 # This is part of an ongoing experimentation on making eBPF scripts as consumed by perf to be as concise as possible and using familiar concepts such as stdio.h functions, that end up just wrapping the existing BPF functions, trying to hide as much boilerplate as possible while using just conventions and C preprocessor tricks. Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: David Ahern <dsahern@gmail.com> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Wang Nan <wangnan0@huawei.com> Link: https://lkml.kernel.org/n/tip-4tiaqlx5crf0fwpe7a6j84x7@git.kernel.org Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2018-08-08perf bpf: Add struct bpf_map structArnaldo Carvalho de Melo
A helper structure used by eBPF C program to describe map attributes to elf_bpf loader, to be used initially by the special __bpf_stdout__ map used to print strings into the perf ring buffer in BPF scripts, e.g.: Using the upcoming stdio.h and puts() macros to use the __bpf_stdout__ map to add strings to the ring buffer: # cat tools/perf/examples/bpf/hello.c #include <stdio.h> int syscall_enter(openat)(void *args) { puts("Hello, world\n"); return 0; } license(GPL); # # cat ~/.perfconfig [llvm] dump-obj = true # perf trace -e openat,tools/perf/examples/bpf/hello.c/call-graph=dwarf/ cat /etc/passwd > /dev/null LLVM: dumping tools/perf/examples/bpf/hello.o 0.016 ( ): __bpf_stdout__:Hello, world 0.018 ( 0.010 ms): cat/9079 openat(dfd: CWD, filename: /etc/ld.so.cache, flags: CLOEXEC ) = 3 0.057 ( ): __bpf_stdout__:Hello, world 0.059 ( 0.011 ms): cat/9079 openat(dfd: CWD, filename: /lib64/libc.so.6, flags: CLOEXEC ) = 3 0.417 ( ): __bpf_stdout__:Hello, world 0.419 ( 0.009 ms): cat/9079 openat(dfd: CWD, filename: /etc/passwd ) = 3 # # file tools/perf/examples/bpf/hello.o tools/perf/examples/bpf/hello.o: ELF 64-bit LSB relocatable, *unknown arch 0xf7* version 1 (SYSV), not stripped # readelf -SW tools/perf/examples/bpf/hello.o There are 10 section headers, starting at offset 0x208: Section Headers: [Nr] Name Type Address Off Size ES Flg Lk Inf Al [ 0] NULL 0000000000000000 000000 000000 00 0 0 0 [ 1] .strtab STRTAB 0000000000000000 000188 00007f 00 0 0 1 [ 2] .text PROGBITS 0000000000000000 000040 000000 00 AX 0 0 4 [ 3] syscalls:sys_enter_openat PROGBITS 0000000000000000 000040 000088 00 AX 0 0 8 [ 4] .relsyscalls:sys_enter_openat REL 0000000000000000 000178 000010 10 9 3 8 [ 5] maps PROGBITS 0000000000000000 0000c8 00001c 00 WA 0 0 4 [ 6] .rodata.str1.1 PROGBITS 0000000000000000 0000e4 00000e 01 AMS 0 0 1 [ 7] license PROGBITS 0000000000000000 0000f2 000004 00 WA 0 0 1 [ 8] version PROGBITS 0000000000000000 0000f8 000004 00 WA 0 0 4 [ 9] .symtab SYMTAB 0000000000000000 000100 000078 18 1 1 8 Key to Flags: W (write), A (alloc), X (execute), M (merge), S (strings), I (info), L (link order), O (extra OS processing required), G (group), T (TLS), C (compressed), x (unknown), o (OS specific), E (exclude), p (processor specific) # readelf -s tools/perf/examples/bpf/hello.o Symbol table '.symtab' contains 5 entries: Num: Value Size Type Bind Vis Ndx Name 0: 0000000000000000 0 NOTYPE LOCAL DEFAULT UND 1: 0000000000000000 0 NOTYPE GLOBAL DEFAULT 5 __bpf_stdout__ 2: 0000000000000000 0 NOTYPE GLOBAL DEFAULT 7 _license 3: 0000000000000000 0 NOTYPE GLOBAL DEFAULT 8 _version 4: 0000000000000000 0 NOTYPE GLOBAL DEFAULT 3 syscall_enter_openat # Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: David Ahern <dsahern@gmail.com> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Wang Nan <wangnan0@huawei.com> Link: https://lkml.kernel.org/n/tip-81fg60om2ifnatsybzwmiga3@git.kernel.org Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2018-08-08perf report: Add --percent-type optionJiri Olsa
Set annotation percent type from following choices: global-period, local-period, global-hits, local-hits With following report option setup the percent type will be passed to annotation browser: $ perf report --percent-type period-local The local/global keywords set if the percentage is computed in the scope of the function (local) or the whole data (global). The period/hits keywords set the base the percentage is computed on - the samples period or the number of samples (hits). Signed-off-by: Jiri Olsa <jolsa@kernel.org> Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: David Ahern <dsahern@gmail.com> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Stephane Eranian <eranian@google.com> Link: http://lkml.kernel.org/r/20180804130521.11408-21-jolsa@kernel.org Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2018-08-08perf annotate: Add --percent-type optionJiri Olsa
Add --percent-type option to set annotation percent type from following choices: global-period, local-period, global-hits, local-hits Examples: $ perf annotate --percent-type period-local --stdio | head -1 Percent | Source code ... es, percent: local period) $ perf annotate --percent-type hits-local --stdio | head -1 Percent | Source code ... es, percent: local hits) $ perf annotate --percent-type hits-global --stdio | head -1 Percent | Source code ... es, percent: global hits) $ perf annotate --percent-type period-global --stdio | head -1 Percent | Source code ... es, percent: global period) The local/global keywords set if the percentage is computed in the scope of the function (local) or the whole data (global). The period/hits keywords set the base the percentage is computed on - the samples period or the number of samples (hits). Signed-off-by: Jiri Olsa <jolsa@kernel.org> Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: David Ahern <dsahern@gmail.com> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Stephane Eranian <eranian@google.com> Link: http://lkml.kernel.org/r/20180804130521.11408-20-jolsa@kernel.org Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2018-08-08perf annotate: Display percent type in stdio outputJiri Olsa
In following patches we will allow to switch percent type even for stdio annotation outputs. Adding the percent type value into the annotation outputs title. $ perf annotate --stdio Percent | Sou ... instructions:u } (2805 samples, percent: local period) --------------------------- ... ------------------------------------------------------ ... $ perf annotate --stdio2 Samples: 2K of events 'anon ... count (approx.): 156525487, [percent: local period] safe_write.c() /usr/bin/yes Percent ... Signed-off-by: Jiri Olsa <jolsa@kernel.org> Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: David Ahern <dsahern@gmail.com> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Stephane Eranian <eranian@google.com> Link: http://lkml.kernel.org/r/20180804130521.11408-19-jolsa@kernel.org Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2018-08-08perf annotate: Make local period the default percent typeJiri Olsa
Currently we display the percentages in annotation output based on number of samples hits. Switching it to period based percentage by default, because it corresponds more to the time spent on the line. Signed-off-by: Jiri Olsa <jolsa@kernel.org> Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: David Ahern <dsahern@gmail.com> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Stephane Eranian <eranian@google.com> Link: http://lkml.kernel.org/r/20180804130521.11408-18-jolsa@kernel.org Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2018-08-08perf annotate: Add support to toggle percent typeJiri Olsa
Add new key bindings to toggle percent type/base in annotation UI browser: 'p' to switch between local and global percent type 'b' to switch between hits and perdio percent base Add the following help messages to the UI browser '?' window: ... p Toggle percent type [local/global] b Toggle percent base [period/hits] ... Signed-off-by: Jiri Olsa <jolsa@kernel.org> Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: David Ahern <dsahern@gmail.com> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Stephane Eranian <eranian@google.com> Link: http://lkml.kernel.org/r/20180804130521.11408-17-jolsa@kernel.org [ Moved percent_type to be the last arg to sym_title(), its an arg to what is being formmated (buf, size) ] Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2018-08-08perf annotate: Pass browser percent_type in annotate_browser__calc_percent()Jiri Olsa
Pass browser percent_type in annotate_browser__calc_percent(). Signed-off-by: Jiri Olsa <jolsa@kernel.org> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: David Ahern <dsahern@gmail.com> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Stephane Eranian <eranian@google.com> Link: http://lkml.kernel.org/r/20180804130521.11408-16-jolsa@kernel.org Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2018-08-08perf annotate: Pass 'struct annotation_options' to map_symbol__annotation_dump()Jiri Olsa
Pass 'struct annotation_options' to map_symbol__annotation_dump(), to carry on and pass the percent_type value. Signed-off-by: Jiri Olsa <jolsa@kernel.org> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: David Ahern <dsahern@gmail.com> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Stephane Eranian <eranian@google.com> Link: http://lkml.kernel.org/r/20180804130521.11408-15-jolsa@kernel.org Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2018-08-08perf annotate: Pass struct annotation_options to symbol__calc_lines()Jiri Olsa
Pass struct annotation_options to symbol__calc_lines(), to carry on and pass the percent_type value. Signed-off-by: Jiri Olsa <jolsa@kernel.org> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: David Ahern <dsahern@gmail.com> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Stephane Eranian <eranian@google.com> Link: http://lkml.kernel.org/r/20180804130521.11408-14-jolsa@kernel.org Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2018-08-08perf annotate: Add percent_type to struct annotation_optionsJiri Olsa
It will be used to carry user selection of percent type for annotation output. Passing the percent_type to the annotation_line__print function as the first step and making it default to current percentage type (PERCENT_HITS_LOCAL) value. Signed-off-by: Jiri Olsa <jolsa@kernel.org> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: David Ahern <dsahern@gmail.com> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Stephane Eranian <eranian@google.com> Link: http://lkml.kernel.org/r/20180804130521.11408-13-jolsa@kernel.org Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2018-08-08perf annotate: Add PERCENT_PERIOD_GLOBAL percent valueJiri Olsa
Adding and computing global period percent value for annotation line. Storing it in struct annotation_data percent array under new PERCENT_PERIOD_GLOBAL index. At the moment it's not displayed, it's coming in following patches. Signed-off-by: Jiri Olsa <jolsa@kernel.org> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: David Ahern <dsahern@gmail.com> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Stephane Eranian <eranian@google.com> Link: http://lkml.kernel.org/r/20180804130521.11408-12-jolsa@kernel.org Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>