summaryrefslogtreecommitdiff
path: root/kernel
AgeCommit message (Collapse)Author
2018-03-13perf/core: Implement fast breakpoint modification via _IOC_MODIFY_ATTRIBUTESMilind Chabbi
Problem and motivation: Once a breakpoint perf event (PERF_TYPE_BREAKPOINT) is created, there is no flexibility to change the breakpoint type (bp_type), breakpoint address (bp_addr), or breakpoint length (bp_len). The only option is to close the perf event and configure a new breakpoint event. This inflexibility has a significant performance overhead. For example, sampling-based, lightweight performance profilers (and also concurrency bug detection tools), monitor different addresses for a short duration using PERF_TYPE_BREAKPOINT and change the address (bp_addr) to another address or change the kind of breakpoint (bp_type) from "write" to a "read" or vice-versa or change the length (bp_len) of the address being monitored. The cost of these modifications is prohibitive since it involves unmapping the circular buffer associated with the perf event, closing the perf event, opening another perf event and mmaping another circular buffer. Solution: The new ioctl flag for perf events, PERF_EVENT_IOC_MODIFY_ATTRIBUTES, introduced in this patch takes a pointer to a struct perf_event_attr as an argument to update an old breakpoint event with new address, type, and size. This facility allows retaining a previous mmaped perf events ring buffer and avoids having to close and reopen another perf event. This patch supports only changing PERF_TYPE_BREAKPOINT event type; future implementations can extend this feature. The patch replicates some of its functionality of modify_user_hw_breakpoint() in kernel/events/hw_breakpoint.c. modify_user_hw_breakpoint cannot be called directly since perf_event_ctx_lock() is already held in _perf_ioctl(). Evidence: Experiments show that the baseline (not able to modify an already created breakpoint) costs an order of magnitude (~10x) more than the suggested optimization (having the ability to dynamically modifying a configured breakpoint via ioctl). When the breakpoints typically do not trap, the speedup due to the suggested optimization is ~10x; even when the breakpoints always trap, the speedup is ~4x due to the suggested optimization. Testing: tests posted at https://github.com/linux-contrib/perf_event_modify_bp demonstrate the performance significance of this patch. Tests also check the functional correctness of the patch. Signed-off-by: Milind Chabbi <chabbi.milind@gmail.com> [ Using modify_user_hw_breakpoint_check function. ] [ Reformated PERF_EVENT_IOC_*, so the values are all in one column. ] Signed-off-by: Jiri Olsa <jolsa@kernel.org> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: David Ahern <dsahern@gmail.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Hari Bathini <hbathini@linux.vnet.ibm.com> Cc: Jin Yao <yao.jin@linux.intel.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Kan Liang <kan.liang@intel.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Oleg Nesterov <onestero@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Will Deacon <will.deacon@arm.com> Link: http://lkml.kernel.org/r/20180312134548.31532-8-jolsa@kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-03-13tracing: Unify the "boot" and "mono" tracing clocksThomas Gleixner
Unify the "boot" and "mono" tracing clocks and document the new behaviour. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Dmitry Torokhov <dmitry.torokhov@gmail.com> Cc: John Stultz <john.stultz@linaro.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Kevin Easton <kevin@guarana.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mark Salyzyn <salyzyn@android.com> Cc: Michael Kerrisk <mtk.manpages@gmail.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Petr Mladek <pmladek@suse.com> Cc: Prarit Bhargava <prarit@redhat.com> Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Cc: Steven Rostedt <rostedt@goodmis.org> Link: http://lkml.kernel.org/r/20180301165150.489635255@linutronix.de Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-03-13hrtimer: Unify MONOTONIC and BOOTTIME clock behaviorThomas Gleixner
Now that th MONOTONIC and BOOTTIME clocks are indentical remove all the special casing. The user space visible interfaces still support both clocks, but their behavior is identical. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Dmitry Torokhov <dmitry.torokhov@gmail.com> Cc: John Stultz <john.stultz@linaro.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Kevin Easton <kevin@guarana.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mark Salyzyn <salyzyn@android.com> Cc: Michael Kerrisk <mtk.manpages@gmail.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Petr Mladek <pmladek@suse.com> Cc: Prarit Bhargava <prarit@redhat.com> Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Cc: Steven Rostedt <rostedt@goodmis.org> Link: http://lkml.kernel.org/r/20180301165150.410218515@linutronix.de Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-03-13posix-timers: Unify MONOTONIC and BOOTTIME clock behaviorThomas Gleixner
Now that the MONOTONIC and BOOTTIME clocks are indentical remove all the special casing. The user space visible interfaces still support both clocks, but their behavior is identical. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Dmitry Torokhov <dmitry.torokhov@gmail.com> Cc: John Stultz <john.stultz@linaro.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Kevin Easton <kevin@guarana.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mark Salyzyn <salyzyn@android.com> Cc: Michael Kerrisk <mtk.manpages@gmail.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Petr Mladek <pmladek@suse.com> Cc: Prarit Bhargava <prarit@redhat.com> Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Cc: Steven Rostedt <rostedt@goodmis.org> Link: http://lkml.kernel.org/r/20180301165150.315745557@linutronix.de Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-03-13timekeeping: Remove boot time specific codeThomas Gleixner
Now that the MONOTONIC and BOOTTIME clocks are the same, remove all the special handling from timekeeping. Keep wrappers for the existing users of the *boot* timekeeper interfaces. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Dmitry Torokhov <dmitry.torokhov@gmail.com> Cc: John Stultz <john.stultz@linaro.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Kevin Easton <kevin@guarana.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mark Salyzyn <salyzyn@android.com> Cc: Michael Kerrisk <mtk.manpages@gmail.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Petr Mladek <pmladek@suse.com> Cc: Prarit Bhargava <prarit@redhat.com> Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Cc: Steven Rostedt <rostedt@goodmis.org> Link: http://lkml.kernel.org/r/20180301165150.236279497@linutronix.de Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-03-13timekeeping: Make the MONOTONIC clock behave like the BOOTTIME clockThomas Gleixner
The MONOTONIC clock is not fast forwarded by the time spent in suspend on resume. This is only done for the BOOTTIME clock. The reason why the MONOTONIC clock is not forwarded is historical: the original Linux implementation was using jiffies as a base for the MONOTONIC clock and jiffies have never been advanced after resume. At some point when timekeeping was unified in the core code, the MONONOTIC clock was advanced after resume which also advanced jiffies causing interesting side effects. As a consequence the the MONOTONIC clock forwarding was disabled again and the BOOTTIME clock was introduced, which allows to read time since boot. Back then it was not possible to completely distangle the MONOTONIC clock and jiffies because there were still interfaces which exposed the MONOTONIC clock behaviour based on the timer wheel and therefore jiffies. As of today none of the MONOTONIC clock facilities depends on jiffies anymore so the forwarding can be done seperately. This is achieved by forwarding the variables which are used for the jiffies update after resume before the tick is restarted, In timekeeping resume, the change is rather simple. Instead of updating the offset between the MONOTONIC clock and the REALTIME/BOOTTIME clocks, advance the time keeper base for the MONOTONIC and the MONOTONIC_RAW clocks by the time spent in suspend. The MONOTONIC clock is now the same as the BOOTTIME clock and the offset between the REALTIME and the MONOTONIC clocks is the same as before suspend. There might be side effects in applications, which rely on the (unfortunately) well documented behaviour of the MONOTONIC clock, but the downsides of the existing behaviour are probably worse. There is one obvious issue. Up to now it was possible to retrieve the time spent in suspend by observing the delta between the MONOTONIC clock and the BOOTTIME clock. This is not longer available, but the previously introduced mechanism to read the active non-suspended monotonic time can mitigate that in a detectable fashion. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Dmitry Torokhov <dmitry.torokhov@gmail.com> Cc: John Stultz <john.stultz@linaro.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Kevin Easton <kevin@guarana.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mark Salyzyn <salyzyn@android.com> Cc: Michael Kerrisk <mtk.manpages@gmail.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Petr Mladek <pmladek@suse.com> Cc: Prarit Bhargava <prarit@redhat.com> Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Cc: Steven Rostedt <rostedt@goodmis.org> Link: http://lkml.kernel.org/r/20180301165150.062975504@linutronix.de Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-03-13timekeeping: Add the new CLOCK_MONOTONIC_ACTIVE clockThomas Gleixner
The planned change to unify the behaviour of the MONOTONIC and BOOTTIME clocks vs. suspend removes the ability to retrieve the active non-suspended time of a system. Provide a new CLOCK_MONOTONIC_ACTIVE clock which returns the active non-suspended time of the system via clock_gettime(). This preserves the old behaviour of CLOCK_MONOTONIC before the BOOTTIME/MONOTONIC unification. This new clock also allows applications to detect programmatically that the MONOTONIC and BOOTTIME clocks are identical. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Dmitry Torokhov <dmitry.torokhov@gmail.com> Cc: John Stultz <john.stultz@linaro.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Kevin Easton <kevin@guarana.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mark Salyzyn <salyzyn@android.com> Cc: Michael Kerrisk <mtk.manpages@gmail.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Petr Mladek <pmladek@suse.com> Cc: Prarit Bhargava <prarit@redhat.com> Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Cc: Steven Rostedt <rostedt@goodmis.org> Link: http://lkml.kernel.org/r/20180301165149.965235774@linutronix.de Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-03-13perf/core: Move perf_event_attr::sample_max_stack into perf_copy_attr()Jiri Olsa
Move the sample_max_stack check and setup into perf_copy_attr(), so we have all perf_event_attr initial setup in one place and can easily compare attrs in the new ioctl introduced in following change. Suggested-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Jiri Olsa <jolsa@kernel.org> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: David Ahern <dsahern@gmail.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Hari Bathini <hbathini@linux.vnet.ibm.com> Cc: Jin Yao <yao.jin@linux.intel.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Kan Liang <kan.liang@intel.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Milind Chabbi <chabbi.milind@gmail.com> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Oleg Nesterov <onestero@redhat.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Will Deacon <will.deacon@arm.com> Link: http://lkml.kernel.org/r/20180312134548.31532-7-jolsa@kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-03-13hw_breakpoint: Add perf_event_attr fields check in __modify_user_hw_breakpoint()Jiri Olsa
And rename it to modify_user_hw_breakpoint_check(). We are about to use modify_user_hw_breakpoint_check() for user space breakpoints modification, we must be very strict to check only the fields we can change have changed. As Peter explained: "Suppose someone does: attr = malloc(sizeof(*attr)); // uninitialized memory attr->type = BP; attr->bp_addr = new_addr; attr->bp_type = bp_type; attr->bp_len = bp_len; ioctl(fd, PERF_IOC_MOD_ATTR, &attr); And feeds absolute shite for the rest of the fields. Then we later want to extend IOC_MOD_ATTR to allow changing attr::sample_type but we can't, because that would break the above application." I'm making this check optional because we already export modify_user_hw_breakpoint() and with this check we could break existing users. Suggested-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Jiri Olsa <jolsa@kernel.org> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: David Ahern <dsahern@gmail.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Hari Bathini <hbathini@linux.vnet.ibm.com> Cc: Jin Yao <yao.jin@linux.intel.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Kan Liang <kan.liang@intel.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Milind Chabbi <chabbi.milind@gmail.com> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Oleg Nesterov <onestero@redhat.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Will Deacon <will.deacon@arm.com> Link: http://lkml.kernel.org/r/20180312134548.31532-6-jolsa@kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-03-13hw_breakpoint: Factor out __modify_user_hw_breakpoint() functionJiri Olsa
Moving out all the functionality without the events disabling/enabling calls, because we want to call another disabling/enabling functions in following change. Signed-off-by: Jiri Olsa <jolsa@kernel.org> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: David Ahern <dsahern@gmail.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Hari Bathini <hbathini@linux.vnet.ibm.com> Cc: Jin Yao <yao.jin@linux.intel.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Kan Liang <kan.liang@intel.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Milind Chabbi <chabbi.milind@gmail.com> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Oleg Nesterov <onestero@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Will Deacon <will.deacon@arm.com> Link: http://lkml.kernel.org/r/20180312134548.31532-5-jolsa@kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-03-13hw_breakpoint: Add modify_bp_slot() functionJiri Olsa
Add the modify_bp_slot() function to keep slot numbers correct when changing the breakpoint type. Using existing __release_bp_slot()/__reserve_bp_slot() call sequence to update the slot counts. Signed-off-by: Jiri Olsa <jolsa@kernel.org> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: David Ahern <dsahern@gmail.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Hari Bathini <hbathini@linux.vnet.ibm.com> Cc: Jin Yao <yao.jin@linux.intel.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Kan Liang <kan.liang@intel.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Milind Chabbi <chabbi.milind@gmail.com> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Oleg Nesterov <onestero@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Will Deacon <will.deacon@arm.com> Link: http://lkml.kernel.org/r/20180312134548.31532-4-jolsa@kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-03-13hw_breakpoint: Pass bp_type argument to __reserve_bp_slot|__release_bp_slot()Jiri Olsa
Passing bp_type argument to __reserve_bp_slot() and __release_bp_slot() functions, so we can pass another bp_type than the one defined in bp->attr.bp_type. This will be handy in following change that fixes breakpoint slot counts during its modification. Signed-off-by: Jiri Olsa <jolsa@kernel.org> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: David Ahern <dsahern@gmail.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Hari Bathini <hbathini@linux.vnet.ibm.com> Cc: Jin Yao <yao.jin@linux.intel.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Kan Liang <kan.liang@intel.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Milind Chabbi <chabbi.milind@gmail.com> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Oleg Nesterov <onestero@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Will Deacon <will.deacon@arm.com> Link: http://lkml.kernel.org/r/20180312134548.31532-3-jolsa@kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-03-13hw_breakpoint: Pass bp_type directly as find_slot_idx() argumentJiri Olsa
Pass bp_type directly as a find_slot_idx() argument, so we don't need to have whole event to get the breakpoint slot type. It will be used in following changes. Signed-off-by: Jiri Olsa <jolsa@kernel.org> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: David Ahern <dsahern@gmail.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Hari Bathini <hbathini@linux.vnet.ibm.com> Cc: Jin Yao <yao.jin@linux.intel.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Kan Liang <kan.liang@intel.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Milind Chabbi <chabbi.milind@gmail.com> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Oleg Nesterov <onestero@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Will Deacon <will.deacon@arm.com> Link: http://lkml.kernel.org/r/20180312134548.31532-2-jolsa@kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-03-12error-injection: Fix to prohibit jump optimizationMasami Hiramatsu
Since the kprobe which was optimized by jump can not change the execution path, the kprobe for error-injection must not be optimized. To prohibit it, set a dummy post-handler as officially stated in Documentation/kprobes.txt. Fixes: 4b1a29a7f542 ("error-injection: Support fault injection framework") Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-03-12perf/core: Fix installing cgroup events on CPUleilei.lin
There's two problems when installing cgroup events on CPUs: firstly list_update_cgroup_event() only tries to set cpuctx->cgrp for the first event, if that mismatches on @cgrp we'll not try again for later additions. Secondly, when we install a cgroup event into an active context, only issue an event reprogram when the event matches the current cgroup context. This avoids a pointless event reprogramming. Signed-off-by: leilei.lin <leilei.lin@alibaba-inc.com> [ Improved the changelog and comments. ] Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Stephane Eranian <eranian@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vince Weaver <vincent.weaver@maine.edu> Cc: brendan.d.gregg@gmail.com Cc: eranian@gmail.com Cc: linux-kernel@vger.kernel.org Cc: yang_oliver@hotmail.com Link: http://lkml.kernel.org/r/20180306093637.28247-1-linxiulei@gmail.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-03-12perf/core: Optimize perf_rotate_context() event schedulingPeter Zijlstra
The event schedule order (as per perf_event_sched_in()) is: - cpu pinned - task pinned - cpu flexible - task flexible But perf_rotate_context() will unschedule cpu-flexible even if it doesn't need a rotation. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Stephane Eranian <eranian@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vince Weaver <vincent.weaver@maine.edu> Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-03-12perf/core: Fix tree based event rotationPeter Zijlstra
Similar to how first programming cpu=-1 and then cpu=# is wrong, so is rotating both. It was especially wrong when we were still programming the PMU in this same order, because in that scenario we might never actually end up running cpu=# events at all. Cure this by using the active_list to pick the rotation event; since at programming we already select the left-most event. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Mark Rutland <mark.rutland@arm.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Alexey Budankov <alexey.budankov@linux.intel.com> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: David Carrillo-Cisneros <davidcc@google.com> Cc: Dmitri Prokhorov <Dmitry.Prohorov@intel.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Kan Liang <kan.liang@intel.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Stephane Eranian <eranian@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Valery Cherepennikov <valery.cherepennikov@intel.com> Cc: Vince Weaver <vincent.weaver@maine.edu> Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-03-12perf/core: Simpify perf_event_groups_for_each()Peter Zijlstra
The last argument is, and always must be, the same. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Mark Rutland <mark.rutland@arm.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Alexey Budankov <alexey.budankov@linux.intel.com> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: David Carrillo-Cisneros <davidcc@google.com> Cc: Dmitri Prokhorov <Dmitry.Prohorov@intel.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Kan Liang <kan.liang@intel.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Stephane Eranian <eranian@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Valery Cherepennikov <valery.cherepennikov@intel.com> Cc: Vince Weaver <vincent.weaver@maine.edu> Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-03-12perf/core: Optimize ctx_sched_out()Peter Zijlstra
When an event group contains more events than can be scheduled on the hardware, iterating the full event group for ctx_sched_out is a waste of time. Keep track of the events that got programmed on the hardware, such that we can iterate this smaller list in order to schedule them out. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Mark Rutland <mark.rutland@arm.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Alexey Budankov <alexey.budankov@linux.intel.com> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: David Carrillo-Cisneros <davidcc@google.com> Cc: Dmitri Prokhorov <Dmitry.Prohorov@intel.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Kan Liang <kan.liang@intel.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Stephane Eranian <eranian@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Valery Cherepennikov <valery.cherepennikov@intel.com> Cc: Vince Weaver <vincent.weaver@maine.edu> Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-03-12perf/core: Remove perf_event::group_entryPeter Zijlstra
Now that all the grouping is done with RB trees, we no longer need group_entry and can replace the whole thing with sibling_list. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Mark Rutland <mark.rutland@arm.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Alexey Budankov <alexey.budankov@linux.intel.com> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: David Carrillo-Cisneros <davidcc@google.com> Cc: Dmitri Prokhorov <Dmitry.Prohorov@intel.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Kan Liang <kan.liang@intel.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Stephane Eranian <eranian@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Valery Cherepennikov <valery.cherepennikov@intel.com> Cc: Vince Weaver <vincent.weaver@maine.edu> Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-03-12perf/core: Fix event schedule orderPeter Zijlstra
Scheduling in events with cpu=-1 before events with cpu=# changes semantics and is undesirable in that it would priorize these events. Given that groups->index is across all groups we actually have an inter-group ordering, meaning we can merge-sort two groups, which is just what we need to preserve semantics. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Mark Rutland <mark.rutland@arm.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Alexey Budankov <alexey.budankov@linux.intel.com> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: David Carrillo-Cisneros <davidcc@google.com> Cc: Dmitri Prokhorov <Dmitry.Prohorov@intel.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Kan Liang <kan.liang@intel.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Stephane Eranian <eranian@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Valery Cherepennikov <valery.cherepennikov@intel.com> Cc: Vince Weaver <vincent.weaver@maine.edu> Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-03-12perf/core: Cleanup the rb-tree codePeter Zijlstra
Trivial comment and code fixups.. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Mark Rutland <mark.rutland@arm.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Alexey Budankov <alexey.budankov@linux.intel.com> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: David Carrillo-Cisneros <davidcc@google.com> Cc: Dmitri Prokhorov <Dmitry.Prohorov@intel.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Kan Liang <kan.liang@intel.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Stephane Eranian <eranian@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Valery Cherepennikov <valery.cherepennikov@intel.com> Cc: Vince Weaver <vincent.weaver@maine.edu> Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-03-12perf/cor: Use RB trees for pinned/flexible groupsAlexey Budankov
Change event groups into RB trees sorted by CPU and then by a 64bit index, so that multiplexing hrtimer interrupt handler would be able skipping to the current CPU's list and ignore groups allocated for the other CPUs. New API for manipulating event groups in the trees is implemented as well as adoption on the API in the current implementation. pinned_group_sched_in() and flexible_group_sched_in() API are introduced to consolidate code enabling the whole group from pinned and flexible groups appropriately. Signed-off-by: Alexey Budankov <alexey.budankov@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Mark Rutland <mark.rutland@arm.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: David Carrillo-Cisneros <davidcc@google.com> Cc: Dmitri Prokhorov <Dmitry.Prohorov@intel.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Kan Liang <kan.liang@intel.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Stephane Eranian <eranian@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Valery Cherepennikov <valery.cherepennikov@intel.com> Cc: Vince Weaver <vincent.weaver@maine.edu> Cc: linux-kernel@vger.kernel.org Link: http://lkml.kernel.org/r/372f9c8b-0cfe-4240-e44d-83d863d40813@linux.intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-03-12perf/core: Fix perf_output_read_group()Peter Zijlstra
Mark reported his arm64 perf fuzzer runs sometimes splat like: armv8pmu_read_counter+0x1e8/0x2d8 armpmu_event_update+0x8c/0x188 armpmu_read+0xc/0x18 perf_output_read+0x550/0x11e8 perf_event_read_event+0x1d0/0x248 perf_event_exit_task+0x468/0xbb8 do_exit+0x690/0x1310 do_group_exit+0xd0/0x2b0 get_signal+0x2e8/0x17a8 do_signal+0x144/0x4f8 do_notify_resume+0x148/0x1e8 work_pending+0x8/0x14 which asserts that we only call pmu::read() on ACTIVE events. The above callchain does: perf_event_exit_task() perf_event_exit_task_context() task_ctx_sched_out() // INACTIVE perf_event_exit_event() perf_event_set_state(EXIT) // EXIT sync_child_event() perf_event_read_event() perf_output_read() perf_output_read_group() leader->pmu->read() Which results in doing a pmu::read() on an !ACTIVE event. I _think_ this is 'new' since we added attr.inherit_stat, which added the perf_event_read_event() to the exit path, without that perf_event_read_output() would only trigger from samples and for @event to trigger a sample, it's leader _must_ be ACTIVE too. Still, adding this check makes it consistent with the @sub case for the siblings. Reported-and-Tested-by: Mark Rutland <mark.rutland@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Stephane Eranian <eranian@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vince Weaver <vincent.weaver@maine.edu> Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-03-12Merge tag 'v4.16-rc5' into locking/core, to pick up fixesIngo Molnar
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-03-11Merge branch 'perf-urgent-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull perf updates from Thomas Gleixner: "Another set of perf updates: - Fix a Skylake Uncore event format declaration - Prevent perf pipe mode from crahsing which was caused by a missing buffer allocation - Make the perf top popup message which tells the user that it uses fallback mode on older kernels a debug message. - Make perf context rescheduling work correcctly - Robustify the jump error drawing in perf browser mode so it does not try to create references to NULL initialized offset entries - Make trigger_on() robust so it does not enable the trigger before everything is set up correctly to handle it - Make perf auxtrace respect the --no-itrace option so it does not try to queue AUX data for decoding. - Prevent having different number of field separators in CVS output lines when a counter is not supported. - Make the perf kallsyms man page usage behave like it does for all other perf commands. - Synchronize the kernel headers" * 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: perf/core: Fix ctx_event_type in ctx_resched() perf tools: Fix trigger class trigger_on() perf auxtrace: Prevent decoding when --no-itrace perf stat: Fix CVS output format for non-supported counters tools headers: Sync x86's cpufeatures.h tools headers: Sync copy of kvm UAPI headers perf record: Fix crash in pipe mode perf annotate browser: Be more robust when drawing jump arrows perf top: Fix annoying fallback message on older kernels perf kallsyms: Fix the usage on the man page perf/x86/intel/uncore: Fix Skylake UPI event format
2018-03-11Merge branch 'locking-urgent-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull locking fix from Thomas Gleixner: "rt_mutex_futex_unlock() grew a new irq-off call site, but the function assumes that its always called from irq enabled context. Use (un)lock_irqsafe() to handle the new call site correctly" * 'locking-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: rtmutex: Make rt_mutex_futex_unlock() safe for irq-off callsites
2018-03-11Merge branch 'for-mingo' of ↵Ingo Molnar
git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu into core/rcu Pull RCU updates from Paul E. McKenney: - Miscellaneous fixes, perhaps most notably removing obsolete code whose only purpose in life was to gather information for the now-removed RCU debugfs facility. Other notable changes include removing NO_HZ_FULL_ALL in favor of the nohz_full kernel boot parameter, minor optimizations for expedited grace periods, some added tracing, creating an RCU-specific workqueue using Tejun's new WQ_MEM_RECLAIM flag, and several cleanups to code and comments. - SRCU cleanups and optimizations. - Torture-test updates, perhaps most notably the adding of ARMv8 support, but also including numerous cleanups and usability fixes. Signed-off-by: Ingo Molnar <mingo@kernel.org>
2018-03-10tracing: Embed replace_filter_string() helper functionSteven Rostedt (VMware)
The replace_filter_string() frees the current string and then copies a given string. But in the two locations that it was used, the allocation happened right after the filter was allocated (nothing to replace). There's no need for this to be a helper function. Embedding the allocation in the two places where it was called will make changing the code in the future easier. Also make the variable consistent (always use "filter_string" as the name, as it was used in one instance as "filter_str") Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2018-03-10tracing: Only add filter list when neededSteven Rostedt (VMware)
replace_system_preds() creates a filter list to free even when it doesn't really need to have it. Only save filters that require synchronize_sched() in the filter list to free. This will allow the code to be updated a bit easier in the future. Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2018-03-10tracing: Remove filter allocator helperSteven Rostedt (VMware)
The __alloc_filter() function does nothing more that allocate the filter. There's no reason to have it as a helper function. Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2018-03-10tracing: Use trace_seq instead of open code string appendingSteven Rostedt (VMware)
The filter code does open code string appending to produce an error message. Instead it can be simplified by using trace_seq function helpers. Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2018-03-10tracing: Remove BUG_ON() from append_filter_string()Steven Rostedt (VMware)
There's no reason to BUG if there's a bug in the filtering code. Simply do a warning and return. Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2018-03-10tracing: Add inter-event blurb to HIST_TRIGGERS config optionTom Zanussi
So that users know that inter-event tracing is supported as part of the HIST_TRIGGERS option, include text to that effect in the help text. Link: http://lkml.kernel.org/r/a38e24231d8d980be636b56d35814570acfd167a.1516069914.git.tom.zanussi@linux.intel.com Signed-off-by: Tom Zanussi <tom.zanussi@linux.intel.com> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2018-03-10tracing: Use the ring-buffer nesting to allow synthetic events to be tracedSteven Rostedt (VMware)
Synthetic events can be done within the recording of other events. Notify the ring buffer via ring_buffer_nest_start() and ring_buffer_nest_end() that this is intended and not to block it due to its recursion protection. Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2018-03-10ring-buffer: Add nesting for adding events within eventsSteven Rostedt (VMware)
The ring-buffer code has recusion protection in case tracing ends up tracing itself, the ring-buffer will detect that it was called at the same context (normal, softirq, interrupt or NMI), and not continue to record the event. With the histogram synthetic events, they are called while tracing another event at the same context. The recusion protection triggers because it detects tracing at the same context and stops it. Add ring_buffer_nest_start() and ring_buffer_nest_end() that will notify the ring buffer that a trace is about to happen within another trace and that it is intended, and not to trigger the recursion blocking. Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2018-03-10tracing: Add a clock attribute for hist triggersTom Zanussi
The default clock if timestamps are used in a histogram is "global". If timestamps aren't used, the clock is irrelevant. Use the "clock=" param only if you want to override the default "global" clock for a histogram with timestamps. Link: http://lkml.kernel.org/r/427bed1389c5d22aa40c3e0683e30cc3d151e260.1516069914.git.tom.zanussi@linux.intel.com Signed-off-by: Tom Zanussi <tom.zanussi@linux.intel.com> Signed-off-by: Rajvi Jingar <rajvi.jingar@intel.com> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2018-03-10tracing: Make tracing_set_clock() non-staticTom Zanussi
Allow tracing code outside of trace.c to access tracing_set_clock(). Some applications may require a particular clock in order to function properly, such as latency calculations. Also, add an accessor returning the current clock string. Link: http://lkml.kernel.org/r/6d1c53e9ee2163f54e1849f5376573f54f0e6009.1516069914.git.tom.zanussi@linux.intel.com Signed-off-by: Tom Zanussi <tom.zanussi@linux.intel.com> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2018-03-10tracing: Add 'last error' error facility for hist triggersTom Zanussi
With the addition of variables and actions, it's become necessary to provide more detailed error information to users about syntax errors. Add a 'last error' facility accessible via the erroring event's 'hist' file. Reading the hist file after an error will display more detailed information about what went wrong, if information is available. This extended error information will be available until the next hist trigger command for that event. # echo xxx > /sys/kernel/debug/tracing/events/sched/sched_wakeup/trigger echo: write error: Invalid argument # cat /sys/kernel/debug/tracing/events/sched/sched_wakeup/hist ERROR: Couldn't yyy: zzz Last command: xxx Also add specific error messages for variable and action errors. Link: http://lkml.kernel.org/r/64e9c422fc8aeafcc2f7a3b4328c0cffe7969129.1516069914.git.tom.zanussi@linux.intel.com Signed-off-by: Tom Zanussi <tom.zanussi@linux.intel.com> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2018-03-10tracing: Add hist trigger support for variable reference aliasesTom Zanussi
Add support for alias=$somevar where alias can be used as onmatch.xxx($alias). Aliases are a way of creating a new name for an existing variable, for flexibly in making naming more clear in certain cases. For example in the below the user perhaps feels that using $new_lat in the synthetic event invocation is opaque or doesn't fit well stylistically with previous triggers, so creates an alias of $new_lat named $latency and uses that in the call instead: # echo 'hist:keys=next_pid:new_lat=common_timestamp.usecs' > /sys/kernel/debug/tracing/events/sched/sched_switch/trigger # echo 'hist:keys=pid:latency=$new_lat: onmatch(sched.sched_switch).wake2($latency,pid)' > /sys/kernel/debug/tracing/events/synthetic/wake1/trigger Link: http://lkml.kernel.org/r/ef20a65d921af3a873a6f1e8c71407c926d5586f.1516069914.git.tom.zanussi@linux.intel.com Signed-off-by: Tom Zanussi <tom.zanussi@linux.intel.com> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2018-03-10tracing: Add cpu field for hist triggersTom Zanussi
A common key to use in a histogram is the cpuid - add a new cpu 'synthetic' field named 'cpu' for that purpose. Link: http://lkml.kernel.org/r/89537645bfc957e0d76e2cacf5f0ada88691a6cc.1516069914.git.tom.zanussi@linux.intel.com Signed-off-by: Tom Zanussi <tom.zanussi@linux.intel.com> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2018-03-10tracing: Allow whitespace to surround hist trigger filterTom Zanussi
The existing code only allows for one space before and after the 'if' specifying the filter for a hist trigger. Add code to make that more permissive as far as whitespace goes. Specifically, we want to allow spaces in the trigger itself now that we have additional syntax (onmatch/onmax) where spaces are more natural e.g. spaces after commas in param lists. Link: http://lkml.kernel.org/r/1053090c3c308d4f431accdeb59dff4b511d4554.1516069914.git.tom.zanussi@linux.intel.com Signed-off-by: Tom Zanussi <tom.zanussi@linux.intel.com> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2018-03-10tracing: Add 'onmax' hist trigger action supportTom Zanussi
Add an 'onmax(var).save(field,...)' hist trigger action which is invoked whenever an event exceeds the current maximum. The end result is that the trace event fields or variables specified as the onmax.save() params will be saved if 'var' exceeds the current maximum for that hist trigger entry. This allows context from the event that exhibited the new maximum to be saved for later reference. When the histogram is displayed, additional fields displaying the saved values will be printed. As an example the below defines a couple of hist triggers, one for sched_wakeup and another for sched_switch, keyed on pid. Whenever a sched_wakeup occurs, the timestamp is saved in the entry corresponding to the current pid, and when the scheduler switches back to that pid, the timestamp difference is calculated. If the resulting latency exceeds the current maximum latency, the specified save() values are saved: # echo 'hist:keys=pid:ts0=common_timestamp.usecs \ if comm=="cyclictest"' >> \ /sys/kernel/debug/tracing/events/sched/sched_wakeup/trigger # echo 'hist:keys=next_pid:\ wakeup_lat=common_timestamp.usecs-$ts0:\ onmax($wakeup_lat).save(next_comm,prev_pid,prev_prio,prev_comm) \ if next_comm=="cyclictest"' >> \ /sys/kernel/debug/tracing/events/sched/sched_switch/trigger When the histogram is displayed, the max value and the saved values corresponding to the max are displayed following the rest of the fields: # cat /sys/kernel/debug/tracing/events/sched/sched_switch/hist { next_pid: 3728 } hitcount: 199 \ max: 123 next_comm: cyclictest prev_pid: 0 \ prev_prio: 120 prev_comm: swapper/3 { next_pid: 3730 } hitcount: 1321 \ max: 15 next_comm: cyclictest prev_pid: 0 \ prev_prio: 120 prev_comm: swapper/1 { next_pid: 3729 } hitcount: 1973\ max: 25 next_comm: cyclictest prev_pid: 0 \ prev_prio: 120 prev_comm: swapper/0 Totals: Hits: 3493 Entries: 3 Dropped: 0 Link: http://lkml.kernel.org/r/006907f71b1e839bb059337ec3c496f84fcb71de.1516069914.git.tom.zanussi@linux.intel.com Signed-off-by: Tom Zanussi <tom.zanussi@linux.intel.com> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2018-03-10tracing: Add 'onmatch' hist trigger action supportTom Zanussi
Add an 'onmatch(matching.event).<synthetic_event_name>(param list)' hist trigger action which is invoked with the set of variables or event fields named in the 'param list'. The result is the generation of a synthetic event that consists of the values contained in those variables and/or fields at the time the invoking event was hit. As an example the below defines a simple synthetic event using a variable defined on the sched_wakeup_new event, and shows the event definition with unresolved fields, since the sched_wakeup_new event with the testpid variable hasn't been defined yet: # echo 'wakeup_new_test pid_t pid; int prio' >> \ /sys/kernel/debug/tracing/synthetic_events # cat /sys/kernel/debug/tracing/synthetic_events wakeup_new_test pid_t pid; int prio The following hist trigger both defines a testpid variable and specifies an onmatch() trace action that uses that variable along with a non-variable field to generate a wakeup_new_test synthetic event whenever a sched_wakeup_new event occurs, which because of the 'if comm == "cyclictest"' filter only happens when the executable is cyclictest: # echo 'hist:testpid=pid:keys=$testpid:\ onmatch(sched.sched_wakeup_new).wakeup_new_test($testpid, prio) \ if comm=="cyclictest"' >> \ /sys/kernel/debug/tracing/events/sched/sched_wakeup_new/trigger Creating and displaying a histogram based on those events is now just a matter of using the fields and new synthetic event in the tracing/events/synthetic directory, as usual: # echo 'hist:keys=pid,prio:sort=pid,prio' >> \ /sys/kernel/debug/tracing/events/synthetic/wakeup_new_test/trigger Link: http://lkml.kernel.org/r/8c2a574bcb7530c876629c901ecd23911b14afe8.1516069914.git.tom.zanussi@linux.intel.com Signed-off-by: Tom Zanussi <tom.zanussi@linux.intel.com> Signed-off-by: Rajvi Jingar <rajvi.jingar@intel.com> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2018-03-10tracing: Add support for 'field variables'Tom Zanussi
Users should be able to directly specify event fields in hist trigger 'actions' rather than being forced to explicitly create a variable for that purpose. Add support allowing fields to be used directly in actions, which essentially does just that - creates 'invisible' variables for each bare field specified in an action. If a bare field refers to a field on another (matching) event, it even creates a special histogram for the purpose (since variables can't be defined on an existing histogram after histogram creation). Here's a simple example that demonstrates both. Basically the onmatch() action creates a list of variables corresponding to the parameters of the synthetic event to be generated, and then uses those values to generate the event. So for the wakeup_latency synthetic event 'call' below the first param, $wakeup_lat, is a variable defined explicitly on sched_switch, where 'next_pid' is just a normal field on sched_switch, and prio is a normal field on sched_waking. Since the mechanism works on variables, those two normal fields just have 'invisible' variables created internally for them. In the case of 'prio', which is on another event, we actually need to create an additional hist trigger and define the invisible variable on that, since once a hist trigger is defined, variables can't be added to it later. echo 'wakeup_latency u64 lat; pid_t pid; int prio' >> /sys/kernel/debug/tracing/synthetic_events echo 'hist:keys=pid:ts0=common_timestamp.usecs >> /sys/kernel/debug/tracing/events/sched/sched_waking/trigger echo 'hist:keys=next_pid:wakeup_lat=common_timestamp.usecs-$ts0: onmatch(sched.sched_waking).wakeup_latency($wakeup_lat,next_pid,prio) >> /sys/kernel/debug/tracing/events/sched/sched_switch/trigger Link: http://lkml.kernel.org/r/8e8dcdac1ea180ed7a3689e1caeeccede9dc42b3.1516069914.git.tom.zanussi@linux.intel.com Signed-off-by: Tom Zanussi <tom.zanussi@linux.intel.com> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2018-03-10tracing: Add support for 'synthetic' eventsTom Zanussi
Synthetic events are user-defined events generated from hist trigger variables saved from one or more other events. To define a synthetic event, the user writes a simple specification consisting of the name of the new event along with one or more variables and their type(s), to the tracing/synthetic_events file. For instance, the following creates a new event named 'wakeup_latency' with 3 fields: lat, pid, and prio: # echo 'wakeup_latency u64 lat; pid_t pid; int prio' >> \ /sys/kernel/debug/tracing/synthetic_events Reading the tracing/synthetic_events file lists all the currently-defined synthetic events, in this case the event we defined above: # cat /sys/kernel/debug/tracing/synthetic_events wakeup_latency u64 lat; pid_t pid; int prio At this point, the synthetic event is ready to use, and a histogram can be defined using it: # echo 'hist:keys=pid,prio,lat.log2:sort=pid,lat' >> \ /sys/kernel/debug/tracing/events/synthetic/wakeup_latency/trigger The new event is created under the tracing/events/synthetic/ directory and looks and behaves just like any other event: # ls /sys/kernel/debug/tracing/events/synthetic/wakeup_latency enable filter format hist id trigger Although a histogram can be defined for it, nothing will happen until an action tracing that event via the trace_synth() function occurs. The trace_synth() function is very similar to all the other trace_* invocations spread throughout the kernel, except in this case the trace_ function and its corresponding tracepoint isn't statically generated but defined by the user at run-time. How this can be automatically hooked up via a hist trigger 'action' is discussed in a subsequent patch. Link: http://lkml.kernel.org/r/c68df2284b7d172669daf9be29db62ad49bbc559.1516069914.git.tom.zanussi@linux.intel.com Signed-off-by: Tom Zanussi <tom.zanussi@linux.intel.com> [fix noderef.cocci warnings, sizeof pointer for kcalloc of event->fields] Signed-off-by: Fengguang Wu <fengguang.wu@intel.com> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2018-03-10tracing: Add hist trigger action hookTom Zanussi
Add a hook for executing extra actions whenever a histogram entry is added or updated. The default 'action' when a hist entry is added to a histogram is to update the set of values associated with it. Some applications may want to perform additional actions at that point, such as generate another event, or compare and save a maximum. Add a simple framework for doing that; specific actions will be implemented on top of it in later patches. Link: http://lkml.kernel.org/r/9482ba6a3eaf5ca6e60954314beacd0e25c05b24.1516069914.git.tom.zanussi@linux.intel.com Signed-off-by: Tom Zanussi <tom.zanussi@linux.intel.com> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2018-03-10tracing: Add variable reference handling to hist triggersTom Zanussi
Add the necessary infrastructure to allow the variables defined on one event to be referenced in another. This allows variables set by a previous event to be referenced and used in expressions combining the variable values saved by that previous event and the event fields of the current event. For example, here's how a latency can be calculated and saved into yet another variable named 'wakeup_lat': # echo 'hist:keys=pid,prio:ts0=common_timestamp ... # echo 'hist:keys=next_pid:wakeup_lat=common_timestamp-$ts0 ... In the first event, the event's timetamp is saved into the variable ts0. In the next line, ts0 is subtracted from the second event's timestamp to produce the latency. Further users of variable references will be described in subsequent patches, such as for instance how the 'wakeup_lat' variable above can be displayed in a latency histogram. Link: http://lkml.kernel.org/r/b1d3e6975374e34d501ff417c20189c3f9b2c7b8.1516069914.git.tom.zanussi@linux.intel.com Signed-off-by: Tom Zanussi <tom.zanussi@linux.intel.com> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2018-03-10tracing: Add hist_field 'type' fieldTom Zanussi
Future support for synthetic events requires hist_field 'type' information, so add a field for that. Also, make other hist_field attribute usage consistent (size, is_signed, etc). Link: http://lkml.kernel.org/r/3fd12a2e86316b05151ba0d7c68268e780af2c9d.1516069914.git.tom.zanussi@linux.intel.com Signed-off-by: Tom Zanussi <tom.zanussi@linux.intel.com> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2018-03-10tracing: Pass tracing_map_elt to hist_field accessor functionsTom Zanussi
Some accessor functions, such as for variable references, require access to a corrsponding tracing_map_elt. Add a tracing_map_elt param to the function signature and update the accessor functions accordingly. Link: http://lkml.kernel.org/r/e0f292b068e9e4948da1d5af21b5ae0efa9b5717.1516069914.git.tom.zanussi@linux.intel.com Signed-off-by: Tom Zanussi <tom.zanussi@linux.intel.com> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>