summaryrefslogtreecommitdiff
path: root/kernel/trace/ring_buffer.c
AgeCommit message (Collapse)Author
2017-11-17Merge tag 'trace-v4.15' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace Pull tracing updates from - allow module init functions to be traced - clean up some unused or not used by config events (saves space) - clean up of trace histogram code - add support for preempt and interrupt enabled/disable events - other various clean ups * tag 'trace-v4.15' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: (30 commits) tracing, thermal: Hide cpu cooling trace events when not in use tracing, thermal: Hide devfreq trace events when not in use ftrace: Kill FTRACE_OPS_FL_PER_CPU perf/ftrace: Small cleanup perf/ftrace: Fix function trace events perf/ftrace: Revert ("perf/ftrace: Fix double traces of perf on ftrace:function") tracing, dma-buf: Remove unused trace event dma_fence_annotate_wait_on tracing, memcg, vmscan: Hide trace events when not in use tracing/xen: Hide events that are not used when X86_PAE is not defined tracing: mark trace_test_buffer as __maybe_unused printk: Remove superfluous memory barriers from printk_safe ftrace: Clear hashes of stale ips of init memory tracing: Add support for preempt and irq enable/disable events tracing: Prepare to add preempt and irq trace events ftrace/kallsyms: Have /proc/kallsyms show saved mod init functions ftrace: Add freeing algorithm to free ftrace_mod_maps ftrace: Save module init functions kallsyms symbols for tracing ftrace: Allow module init functions to be traced ftrace: Add a ftrace_free_mem() function for modules to use tracing: Reimplement log2 ...
2017-11-15kmemcheck: remove annotationsLevin, Alexander (Sasha Levin)
Patch series "kmemcheck: kill kmemcheck", v2. As discussed at LSF/MM, kill kmemcheck. KASan is a replacement that is able to work without the limitation of kmemcheck (single CPU, slow). KASan is already upstream. We are also not aware of any users of kmemcheck (or users who don't consider KASan as a suitable replacement). The only objection was that since KASAN wasn't supported by all GCC versions provided by distros at that time we should hold off for 2 years, and try again. Now that 2 years have passed, and all distros provide gcc that supports KASAN, kill kmemcheck again for the very same reasons. This patch (of 4): Remove kmemcheck annotations, and calls to kmemcheck from the kernel. [alexander.levin@verizon.com: correctly remove kmemcheck call from dma_map_sg_attrs] Link: http://lkml.kernel.org/r/20171012192151.26531-1-alexander.levin@verizon.com Link: http://lkml.kernel.org/r/20171007030159.22241-2-alexander.levin@verizon.com Signed-off-by: Sasha Levin <alexander.levin@verizon.com> Cc: Alexander Potapenko <glider@google.com> Cc: Eric W. Biederman <ebiederm@xmission.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Pekka Enberg <penberg@kernel.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Tim Hansen <devtimhansen@gmail.com> Cc: Vegard Nossum <vegardno@ifi.uio.no> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-10-25locking/atomics: COCCINELLE/treewide: Convert trivial ACCESS_ONCE() patterns ↵Mark Rutland
to READ_ONCE()/WRITE_ONCE() Please do not apply this to mainline directly, instead please re-run the coccinelle script shown below and apply its output. For several reasons, it is desirable to use {READ,WRITE}_ONCE() in preference to ACCESS_ONCE(), and new code is expected to use one of the former. So far, there's been no reason to change most existing uses of ACCESS_ONCE(), as these aren't harmful, and changing them results in churn. However, for some features, the read/write distinction is critical to correct operation. To distinguish these cases, separate read/write accessors must be used. This patch migrates (most) remaining ACCESS_ONCE() instances to {READ,WRITE}_ONCE(), using the following coccinelle script: ---- // Convert trivial ACCESS_ONCE() uses to equivalent READ_ONCE() and // WRITE_ONCE() // $ make coccicheck COCCI=/home/mark/once.cocci SPFLAGS="--include-headers" MODE=patch virtual patch @ depends on patch @ expression E1, E2; @@ - ACCESS_ONCE(E1) = E2 + WRITE_ONCE(E1, E2) @ depends on patch @ expression E; @@ - ACCESS_ONCE(E) + READ_ONCE(E) ---- Signed-off-by: Mark Rutland <mark.rutland@arm.com> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: davem@davemloft.net Cc: linux-arch@vger.kernel.org Cc: mpe@ellerman.id.au Cc: shuah@kernel.org Cc: snitzer@redhat.com Cc: thor.thayer@linux.intel.com Cc: tj@kernel.org Cc: viro@zeniv.linux.org.uk Cc: will.deacon@arm.com Link: http://lkml.kernel.org/r/1508792849-3115-19-git-send-email-paulmck@linux.vnet.ibm.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-10-04ring-buffer: Rewrite trace_recursive_(un)lock() to be simplerSteven Rostedt (VMware)
The current method to prevent the ring buffer from entering into a recursize loop is to use a bitmask and set the bit that maps to the current context (normal, softirq, irq or NMI), and if that bit was already set, it is considered a recursive loop. New code is being added that may require the ring buffer to be entered a second time in the current context. The recursive locking prevents that from happening. Instead of mapping a bitmask to the current context, just allow 4 levels of nesting in the ring buffer. This matches the 4 context levels that it can already nest. It is highly unlikely to have more than two levels, thus it should be fine when we add the second entry into the ring buffer. If that proves to be a problem, we can always up the number to 8. An added benefit is that reading preempt_count() to get the current level adds a very slight but noticeable overhead. This removes that need. Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2017-08-02ring-buffer: Have ring_buffer_alloc_read_page() return error on offline CPUSteven Rostedt (VMware)
Chunyu Hu reported: "per_cpu trace directories and files are created for all possible cpus, but only the cpus which have ever been on-lined have their own per cpu ring buffer (allocated by cpuhp threads). While trace_buffers_open, the open handler for trace file 'trace_pipe_raw' is always trying to access field of ring_buffer_per_cpu, and would panic with the NULL pointer. Align the behavior of trace_pipe_raw with trace_pipe, that returns -NODEV when openning it if that cpu does not have trace ring buffer. Reproduce: cat /sys/kernel/debug/tracing/per_cpu/cpu31/trace_pipe_raw (cpu31 is never on-lined, this is a 16 cores x86_64 box) Tested with: 1) boot with maxcpus=14, read trace_pipe_raw of cpu15. Got -NODEV. 2) oneline cpu15, read trace_pipe_raw of cpu15. Get the raw trace data. Call trace: [ 5760.950995] RIP: 0010:ring_buffer_alloc_read_page+0x32/0xe0 [ 5760.961678] tracing_buffers_read+0x1f6/0x230 [ 5760.962695] __vfs_read+0x37/0x160 [ 5760.963498] ? __vfs_read+0x5/0x160 [ 5760.964339] ? security_file_permission+0x9d/0xc0 [ 5760.965451] ? __vfs_read+0x5/0x160 [ 5760.966280] vfs_read+0x8c/0x130 [ 5760.967070] SyS_read+0x55/0xc0 [ 5760.967779] do_syscall_64+0x67/0x150 [ 5760.968687] entry_SYSCALL64_slow_path+0x25/0x25" This was introduced by the addition of the feature to reuse reader pages instead of re-allocating them. The problem is that the allocation of a reader page (which is per cpu) does not check if the cpu is online and set up for the ring buffer. Link: http://lkml.kernel.org/r/1500880866-1177-1-git-send-email-chuhu@redhat.com Cc: stable@vger.kernel.org Fixes: 73a757e63114 ("ring-buffer: Return reader page back into existing ring buffer") Reported-by: Chunyu Hu <chuhu@redhat.com> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2017-07-19tracing/ring_buffer: Try harder to allocateJoel Fernandes
ftrace can fail to allocate per-CPU ring buffer on systems with a large number of CPUs coupled while large amounts of cache happening in the page cache. Currently the ring buffer allocation doesn't retry in the VM implementation even if direct-reclaim made some progress but still wasn't able to find a free page. On retrying I see that the allocations almost always succeed. The retry doesn't happen because __GFP_NORETRY is used in the tracer to prevent the case where we might OOM, however if we drop __GFP_NORETRY, we risk destabilizing the system if OOM killer is triggered. To prevent this situation, use the __GFP_RETRY_MAYFAIL flag introduced recently [1]. Tested the following still succeeds without destabilizing a system with 1GB memory. echo 300000 > /sys/kernel/debug/tracing/buffer_size_kb [1] https://marc.info/?l=linux-mm&m=149820805124906&w=2 Link: http://lkml.kernel.org/r/20170713021416.8897-1-joelaf@google.com Cc: Tim Murray <timmurray@google.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Andrew Morton <akpm@linux-foundation.org> Acked-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Michal Hocko <mhocko@kernel.org> Signed-off-by: Joel Fernandes <joelaf@google.com> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2017-05-03Merge tag 'trace-v4.12' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace Pull tracing updates from Steven Rostedt: "New features for this release: - Pretty much a full rewrite of the processing of function plugins. i.e. echo do_IRQ:stacktrace > set_ftrace_filter - The rewrite was needed to add plugins to be unique to tracing instances. i.e. mkdir instance/foo; cd instances/foo; echo do_IRQ:stacktrace > set_ftrace_filter The old way was written very hacky. This removes a lot of those hacks. - New "function-fork" tracing option. When set, pids in the set_ftrace_pid will have their children added when the processes with their pids listed in the set_ftrace_pid file forks. - Exposure of "maxactive" for kretprobe in kprobe_events - Allow for builtin init functions to be traced by the function tracer (via the kernel command line). Module init function tracing will come in the next release. - Added more selftests, and have selftests also test in an instance" * tag 'trace-v4.12' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: (60 commits) ring-buffer: Return reader page back into existing ring buffer selftests: ftrace: Allow some event trigger tests to run in an instance selftests: ftrace: Have some basic tests run in a tracing instance too selftests: ftrace: Have event tests also run in an tracing instance selftests: ftrace: Make func_event_triggers and func_traceonoff_triggers tests do instances selftests: ftrace: Allow some tests to be run in a tracing instance tracing/ftrace: Allow for instances to trigger their own stacktrace probes tracing/ftrace: Allow for the traceonoff probe be unique to instances tracing/ftrace: Enable snapshot function trigger to work with instances tracing/ftrace: Allow instances to have their own function probes tracing/ftrace: Add a better way to pass data via the probe functions ftrace: Dynamically create the probe ftrace_ops for the trace_array tracing: Pass the trace_array into ftrace_probe_ops functions tracing: Have the trace_array hold the list of registered func probes ftrace: If the hash for a probe fails to update then free what was initialized ftrace: Have the function probes call their own function ftrace: Have each function probe use its own ftrace_ops ftrace: Have unregister_ftrace_function_probe_func() return a value ftrace: Add helper function ftrace_hash_move_and_update_ops() ftrace: Remove data field from ftrace_func_probe structure ...
2017-05-01ring-buffer: Return reader page back into existing ring bufferSteven Rostedt (VMware)
When reading the ring buffer for consuming, it is optimized for splice, where a page is taken out of the ring buffer (zero copy) and sent to the reading consumer. When the read is finished with the page, it calls ring_buffer_free_read_page(), which simply frees the page. The next time the reader needs to get a page from the ring buffer, it must call ring_buffer_alloc_read_page() which allocates and initializes a reader page for the ring buffer to be swapped into the ring buffer for a new filled page for the reader. The problem is that there's no reason to actually free the page when it is passed back to the ring buffer. It can hold it off and reuse it for the next iteration. This completely removes the interaction with the page_alloc mechanism. Using the trace-cmd utility to record all events (causing trace-cmd to require reading lots of pages from the ring buffer, and calling ring_buffer_alloc/free_read_page() several times), and also assigning a stack trace trigger to the mm_page_alloc event, we can see how many times the ring_buffer_alloc_read_page() needed to allocate a page for the ring buffer. Before this change: # trace-cmd record -e all -e mem_page_alloc -R stacktrace sleep 1 # trace-cmd report |grep ring_buffer_alloc_read_page | wc -l 9968 After this change: # trace-cmd record -e all -e mem_page_alloc -R stacktrace sleep 1 # trace-cmd report |grep ring_buffer_alloc_read_page | wc -l 4 Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2017-04-19ring-buffer: Have ring_buffer_iter_empty() return true when emptySteven Rostedt (VMware)
I noticed that reading the snapshot file when it is empty no longer gives a status. It suppose to show the status of the snapshot buffer as well as how to allocate and use it. For example: ># cat snapshot # tracer: nop # # # * Snapshot is allocated * # # Snapshot commands: # echo 0 > snapshot : Clears and frees snapshot buffer # echo 1 > snapshot : Allocates snapshot buffer, if not already allocated. # Takes a snapshot of the main buffer. # echo 2 > snapshot : Clears snapshot buffer (but does not allocate or free) # (Doesn't have to be '2' works with any number that # is not a '0' or '1') But instead it just showed an empty buffer: ># cat snapshot # tracer: nop # # entries-in-buffer/entries-written: 0/0 #P:4 # # _-----=> irqs-off # / _----=> need-resched # | / _---=> hardirq/softirq # || / _--=> preempt-depth # ||| / delay # TASK-PID CPU# |||| TIMESTAMP FUNCTION # | | | |||| | | What happened was that it was using the ring_buffer_iter_empty() function to see if it was empty, and if it was, it showed the status. But that function was returning false when it was empty. The reason was that the iter header page was on the reader page, and the reader page was empty, but so was the buffer itself. The check only tested to see if the iter was on the commit page, but the commit page was no longer pointing to the reader page, but as all pages were empty, the buffer is also. Cc: stable@vger.kernel.org Fixes: 651e22f2701b ("ring-buffer: Always reset iterator to reader page") Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2017-04-05ring-buffer: Fix return value check in test_ringbuffer()Wei Yongjun
In case of error, the function kthread_run() returns ERR_PTR() and never returns NULL. The NULL test in the return value check should be replaced with IS_ERR(). Link: http://lkml.kernel.org/r/1466184839-14927-1-git-send-email-weiyj_lk@163.com Cc: stable@vger.kernel.org Fixes: 6c43e554a ("ring-buffer: Add ring buffer startup selftest") Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2017-03-02sched/headers: Prepare for new header dependencies before moving code to ↵Ingo Molnar
<linux/sched/clock.h> We are going to split <linux/sched/clock.h> out of <linux/sched.h>, which will have to be picked up from other headers and .c files. Create a trivial placeholder <linux/sched/clock.h> file that just maps to <linux/sched.h> to make this patch obviously correct and bisectable. Include the new header in the files that are going to need it. Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-12-15Merge tag 'trace-v4.10' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace Pull tracing updates from Steven Rostedt: "This release has a few updates: - STM can hook into the function tracer - Function filtering now supports more advance glob matching - Ftrace selftests updates and added tests - Softirq tag in traces now show only softirqs - ARM nop added to non traced locations at compile time - New trace_marker_raw file that allows for binary input - Optimizations to the ring buffer - Removal of kmap in trace_marker - Wakeup and irqsoff tracers now adhere to the set_graph_notrace file - Other various fixes and clean ups" * tag 'trace-v4.10' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: (42 commits) selftests: ftrace: Shift down default message verbosity kprobes/trace: Fix kprobe selftest for newer gcc tracing/kprobes: Add a helper method to return number of probe hits tracing/rb: Init the CPU mask on allocation tracing: Use SOFTIRQ_OFFSET for softirq dectection for more accurate results tracing/fgraph: Have wakeup and irqsoff tracers ignore graph functions too fgraph: Handle a case where a tracer ignores set_graph_notrace tracing: Replace kmap with copy_from_user() in trace_marker writing ftrace/x86_32: Set ftrace_stub to weak to prevent gcc from using short jumps to it tracing: Allow benchmark to be enabled at early_initcall() tracing: Have system enable return error if one of the events fail tracing: Do not start benchmark on boot up tracing: Have the reg function allow to fail ring-buffer: Force rb_end_commit() and rb_set_commit_to_write() inline ring-buffer: Froce rb_update_write_stamp() to be inlined ring-buffer: Force inline of hotpath helper functions tracing: Make __buffer_unlock_commit() always_inline tracing: Make tracepoint_printk a static_key ring-buffer: Always inline rb_event_data() ring-buffer: Make rb_reserve_next_event() always inlined ...
2016-12-12tracing/rb: Init the CPU mask on allocationSebastian Andrzej Siewior
Before commit b32614c03413 ("tracing/rb: Convert to hotplug state machine") the allocated cpumask was initialized to the mask of ONLINE or POSSIBLE CPUs. After the CPU hotplug changes the buffer initialisation moved to trace_rb_cpu_prepare() but I forgot to initially set the cpumask to zero. This is done now. Link: http://lkml.kernel.org/r/20161207133133.hzkcqfllxcdi3joz@linutronix.de Fixes: b32614c03413 ("tracing/rb: Convert to hotplug state machine") Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Tested-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2016-12-07tracing/rb: Init the CPU mask on allocationSebastian Andrzej Siewior
Before commit b32614c03413 ("tracing/rb: Convert to hotplug state machine") the allocated cpumask was initialized to the mask of online or possible CPUs. After the CPU hotplug changes the buffer initialization moved to trace_rb_cpu_prepare() but the cpumask is allocated with alloc_cpumask() and therefor has random content. As a consequence the cpu buffers are not initialized and a later access dereferences a NULL pointer. Use zalloc_cpumask() instead so trace_rb_cpu_prepare() initializes the buffers properly. Fixes: b32614c03413 ("tracing/rb: Convert to hotplug state machine") Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: rostedt@goodmis.org Link: http://lkml.kernel.org/r/20161207133133.hzkcqfllxcdi3joz@linutronix.de Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-12-02tracing/rb: Convert to hotplug state machineSebastian Andrzej Siewior
Install the callbacks via the state machine. The notifier in struct ring_buffer is replaced by the multi instance interface. Upon __ring_buffer_alloc() invocation, cpuhp_state_add_instance() will invoke the trace_rb_cpu_prepare() on each CPU. This callback may now fail. This means __ring_buffer_alloc() will fail and cleanup (like previously) and during a CPU up event this failure will not allow the CPU to come up. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: rt@linutronix.de Link: http://lkml.kernel.org/r/20161126231350.10321-7-bigeasy@linutronix.de Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-11-23ring-buffer: Force rb_end_commit() and rb_set_commit_to_write() inlineSteven Rostedt (Red Hat)
Both rb_end_commit() and rb_set_commit_to_write() are in the fast path of the ring buffer recording. Make sure they are always inlined. Link: http://lkml.kernel.org/r/20161121183700.GW26852@two.firstfloor.org Reported-by: Andi Kleen <andi@firstfloor.org> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2016-11-23ring-buffer: Froce rb_update_write_stamp() to be inlinedSteven Rostedt (Red Hat)
The function rb_update_write_stamp() is in the hotpath of the ring buffer recording. Make sure that it is inlined as well. There's not many places that call it. Link: http://lkml.kernel.org/r/20161121183700.GW26852@two.firstfloor.org Reported-by: Andi Kleen <andi@firstfloor.org> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2016-11-23ring-buffer: Force inline of hotpath helper functionsSteven Rostedt (Red Hat)
There's several small helper functions in ring_buffer.c that are used in the hot path. For some reason, even though they are marked inline, gcc tends not to enforce it. Make sure these functions are always inlined. Link: http://lkml.kernel.org/r/20161121183700.GW26852@two.firstfloor.org Reported-by: Andi Kleen <andi@firstfloor.org> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2016-11-23ring-buffer: Always inline rb_event_data()Steven Rostedt (Red Hat)
The rb_event_data() is the fast path of getting the ring buffer data from an event. Externally, ring_buffer_event_data() is used to access this function. But unfortunately, rb_event_data() is not inlined, and calling ring_buffer_event_data() causes that function to be called again. Force rb_event_data() to be inlined to lower the number of operations needed when calling ring_buffer_event_data(). Link: http://lkml.kernel.org/r/20161121183700.GW26852@two.firstfloor.org Reported-by: Andi Kleen <andi@firstfloor.org> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2016-11-23ring-buffer: Make rb_reserve_next_event() always inlinedSteven Rostedt (Red Hat)
The function rb_reserved_next_event() is called by two functions: ring_buffer_lock_reserve() and ring_buffer_write(). This is in a very hot path of the tracing code, and it is best that they are not functions. The two callers are basically wrapers for rb_reserver_next_event(). Removing the function calls can save execution time in the hotpath of tracing. Link: http://lkml.kernel.org/r/20161121183700.GW26852@two.firstfloor.org Reported-by: Andi Kleen <andi@firstfloor.org> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2016-05-13ring-buffer: Prevent overflow of size in ring_buffer_resize()Steven Rostedt (Red Hat)
If the size passed to ring_buffer_resize() is greater than MAX_LONG - BUF_PAGE_SIZE then the DIV_ROUND_UP() will return zero. Here's the details: # echo 18014398509481980 > /sys/kernel/debug/tracing/buffer_size_kb tracing_entries_write() processes this and converts kb to bytes. 18014398509481980 << 10 = 18446744073709547520 and this is passed to ring_buffer_resize() as unsigned long size. size = DIV_ROUND_UP(size, BUF_PAGE_SIZE); Where DIV_ROUND_UP(a, b) is (a + b - 1)/b BUF_PAGE_SIZE is 4080 and here 18446744073709547520 + 4080 - 1 = 18446744073709551599 where 18446744073709551599 is still smaller than 2^64 2^64 - 18446744073709551599 = 17 But now 18446744073709551599 / 4080 = 4521260802379792 and size = size * 4080 = 18446744073709551360 This is checked to make sure its still greater than 2 * 4080, which it is. Then we convert to the number of buffer pages needed. nr_page = DIV_ROUND_UP(size, BUF_PAGE_SIZE) but this time size is 18446744073709551360 and 2^64 - (18446744073709551360 + 4080 - 1) = -3823 Thus it overflows and the resulting number is less than 4080, which makes 3823 / 4080 = 0 an nr_pages is set to this. As we already checked against the minimum that nr_pages may be, this causes the logic to fail as well, and we crash the kernel. There's no reason to have the two DIV_ROUND_UP() (that's just result of historical code changes), clean up the code and fix this bug. Cc: stable@vger.kernel.org # 3.5+ Fixes: 83f40318dab00 ("ring-buffer: Make removal of ring buffer pages atomic") Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2016-05-13ring-buffer: Use long for nr_pages to avoid overflow failuresSteven Rostedt (Red Hat)
The size variable to change the ring buffer in ftrace is a long. The nr_pages used to update the ring buffer based on the size is int. On 64 bit machines this can cause an overflow problem. For example, the following will cause the ring buffer to crash: # cd /sys/kernel/debug/tracing # echo 10 > buffer_size_kb # echo 8556384240 > buffer_size_kb Then you get the warning of: WARNING: CPU: 1 PID: 318 at kernel/trace/ring_buffer.c:1527 rb_update_pages+0x22f/0x260 Which is: RB_WARN_ON(cpu_buffer, nr_removed); Note each ring buffer page holds 4080 bytes. This is because: 1) 10 causes the ring buffer to have 3 pages. (10kb requires 3 * 4080 pages to hold) 2) (2^31 / 2^10 + 1) * 4080 = 8556384240 The value written into buffer_size_kb is shifted by 10 and then passed to ring_buffer_resize(). 8556384240 * 2^10 = 8761737461760 3) The size passed to ring_buffer_resize() is then divided by BUF_PAGE_SIZE which is 4080. 8761737461760 / 4080 = 2147484672 4) nr_pages is subtracted from the current nr_pages (3) and we get: 2147484669. This value is saved in a signed integer nr_pages_to_update 5) 2147484669 is greater than 2^31 but smaller than 2^32, a signed int turns into the value of -2147482627 6) As the value is a negative number, in update_pages_handler() it is negated and passed to rb_remove_pages() and 2147482627 pages will be removed, which is much larger than 3 and it causes the warning because not all the pages asked to be removed were removed. Link: https://bugzilla.kernel.org/show_bug.cgi?id=118001 Cc: stable@vger.kernel.org # 2.6.28+ Fixes: 7a8e76a3829f1 ("tracing: unified trace buffer") Reported-by: Hao Qin <QEver.cn@gmail.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2015-11-25ring-buffer: Process commits whenever moving to a new page.Steven Rostedt (Red Hat)
When crossing over to a new page, commit the current work. This will allow readers to get data with less latency, and also simplifies the work to get timestamps working for interrupted events. Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2015-11-24ring-buffer: Remove redundant update of page timestampSteven Rostedt (Red Hat)
The first commit of a buffer page updates the timestamp of that page. No need to have the update to the next page add the timestamp too. It will only be replaced by the first commit on that page anyway. Only update to a page if it contains an event. Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2015-11-24ring-buffer: Use READ_ONCE() for most tail_page accessSteven Rostedt (Red Hat)
As cpu_buffer->tail_page may be modified by interrupts at almost any time, the flow of logic is very important. Do not let gcc get smart with re-reading cpu_buffer->tail_page by adding READ_ONCE() around most of its accesses. Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2015-11-24ring-buffer: Put back the length if crossed page with add_timestampSteven Rostedt (Red Hat)
Commit fcc742eaad7c "ring-buffer: Add event descriptor to simplify passing data" added a descriptor that holds various data instead of passing around several variables through parameters. The problem was that one of the parameters was modified in a function and the code was designed not to have an effect on that modified parameter. Now that the parameter is a descriptor and any modifications to it are non-volatile, the size of the data could be unnecessarily expanded. Remove the extra space added if a timestamp was added and the event went across the page. Cc: stable@vger.kernel.org # 4.3+ Fixes: fcc742eaad7c "ring-buffer: Add event descriptor to simplify passing data" Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2015-11-24ring-buffer: Update read stamp with first real commit on pageSteven Rostedt (Red Hat)
Do not update the read stamp after swapping out the reader page from the write buffer. If the reader page is swapped out of the buffer before an event is written to it, then the read_stamp may get an out of date timestamp, as the page timestamp is updated on the first commit to that page. rb_get_reader_page() only returns a page if it has an event on it, otherwise it will return NULL. At that point, check if the page being returned has events and has not been read yet. Then at that point update the read_stamp to match the time stamp of the reader page. Cc: stable@vger.kernel.org # 2.6.30+ Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2015-11-02ring-buffer: rb_event_is_commit() can return booleanYaowei Bai
Make rb_event_is_commit() return bool to improve readability due to this particular function only using either one or zero as its return value. No functional change. Link: http://lkml.kernel.org/r/1443537816-5788-7-git-send-email-bywxiaobai@163.com Signed-off-by: Yaowei Bai <bywxiaobai@163.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2015-11-02ring-buffer: rb_per_cpu_empty() can return booleanYaowei Bai
Makes rb_per_cpu_empty() return bool to improve readability. No functional change. Link: http://lkml.kernel.org/r/1443537816-5788-6-git-send-email-bywxiaobai@163.com Signed-off-by: Yaowei Bai <bywxiaobai@163.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2015-11-02ring_buffer: ring_buffer_empty{cpu}() can return booleanYaowei Bai
Make ring_buffer_empty() and ring_buffer_empty_cpu() return bool. No functional change. Link: http://lkml.kernel.org/r/1443537816-5788-5-git-send-email-bywxiaobai@163.com Signed-off-by: Yaowei Bai <bywxiaobai@163.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2015-11-02ring-buffer: rb_is_reader_page() can return booleanYaowei Bai
Make rb_is_reader_page() return bool to improve readability due to this particular function only using either true or false as its return value. No functional change. Link: http://lkml.kernel.org/r/1443537816-5788-4-git-send-email-bywxiaobai@163.com Signed-off-by: Yaowei Bai <bywxiaobai@163.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2015-09-03ring-buffer: Revert "ring-buffer: Get timestamp after event is allocated"Steven Rostedt (Red Hat)
The commit a4543a2fa9ef31 "ring-buffer: Get timestamp after event is allocated" is needed for some future work. But after adding it, there is a race somewhere that causes the saved timestamp to have a slight shift, and get ahead of the actual timestamp and make it look like time goes backwards. I'm still looking into why this happens, but in the mean time, this is holding up other work to get in. I'm reverting the change for now (which makes the problem go away), and will add it back after I know what is wrong and fix it. Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2015-07-20ring-buffer: Reorganize function locationsSteven Rostedt (Red Hat)
Functions in ring-buffer.c have gotten interleaved between different use cases. Move the functions around to get like functions closer together. This may or may not help gcc keep cache locality, but it makes it a little easier to work with the code. Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2015-07-20ring-buffer: Make sure event has enough room for extend and paddingSteven Rostedt (Red Hat)
Now that events only add time extends after it is committed, in case an event comes in before it can discard the allocated event, the time extend needs to be stored within the event. If the event is bigger than then size needed for the time extend, padding must be added. The minimum padding size is 8 bytes. Thus if the event is 12 bytes (size of time extend + 4), there will not be enough room to add both the time extend and padding. Make sure all events are either 8 bytes or 16 or more bytes. Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2015-07-20ring-buffer: Get timestamp after event is allocatedSteven Rostedt (Red Hat)
Move the capturing of the timestamp to after an event is allocated. If the event is not a commit (where it is an event that preempted another event), then no timestamp is needed, because the delta of nested events is always zero. If the event starts on a new page, no delta needs to be calculated as the full timestamp will be added to the page header, and the event will have a delta of zero. Now if the event requires a time extend (the delta does not fit in the 27 bit delta slot in the header), then the event is discarded, the length is extended to hold the TIME_EXTEND event that allows for a 59 bit delta, and the commit is tried again. If the event can't be discarded (another event came in after it), then the TIME_EXTEND is added directly to the allocated event and the rest of the event is given padding. Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2015-07-20ring-buffer: Move the adding of the extended timestamp out of lineSteven Rostedt (Red Hat)
Requiring a extended time stamp is an uncommon occurrence, and it is best to do it out of line when needed. Add a noinline function that handles the extended timestamp and have it called with an unlikely to completely move it out of the fast path. Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2015-07-20ring-buffer: Add event descriptor to simplify passing dataSteven Rostedt (Red Hat)
Add rb_event_info descriptor to pass event info to functions a bit easier than using a bunch of parameters. This will also allow for changing the code around a bit to find better fast paths. Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2015-05-29ring-buffer: Add enum names for the context levelsSteven Rostedt (Red Hat)
Instead of having hard coded numbers for the context levels, use enums to describe them more. Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2015-05-28ring-buffer: Remove useless unused tracing_off_permanent()Steven Rostedt (Red Hat)
The tracing_off_permanent() call is a way to disable all ring_buffers. Nothing uses it and nothing should use it, as tracing_off() and friends are better, as they disable the ring buffers related to tracing. The tracing_off_permanent() even disabled non tracing ring buffers. This is a bit drastic, and was added to handle NMIs doing outputs that could corrupt the ring buffer when only tracing used them. It is now obsolete and adds a little overhead, it should be removed. Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2015-05-28ring-buffer: Give NMIs a chance to lock the reader_lockSteven Rostedt (Red Hat)
Currently, if an NMI does a dump of a ring buffer, it disables all ring buffers from ever doing any writes again. This is because it wont take the locks for the cpu_buffer and this can cause corruption if it preempted a read, or a read happens on another CPU for the current cpu buffer. This is a bit overkill. First, it should at least try to take the lock, and if it fails then disable it. Also, there's no need to disable all ring buffers, even those that are unrelated to what is being read. Only disable the per cpu ring buffer that is being read if it can not get the lock for it. Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2015-05-27ring-buffer: Add trace_recursive checks to ring_buffer_write()Steven Rostedt (Red Hat)
The ring_buffer_write() function isn't protected by the trace recursive writes. Luckily, this function is not used as much and is unlikely to ever recurse. But it should still have the protection, because even a call to ring_buffer_lock_reserve() could cause ring buffer corruption if called when ring_buffer_write() is being used. Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2015-05-27ring-buffer: Allways do the trace_recursive checksSteven Rostedt (Red Hat)
Currently the trace_recursive checks are only done if CONFIG_TRACING is enabled. That was because there use to be a dependency with tracing for the recursive checks (it used the task_struct trace recursive variable). But now it uses its own variable and there is no dependency. Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2015-05-27ring-buffer: Move recursive check to per_cpu descriptorSteven Rostedt (Red Hat)
Instead of using a global per_cpu variable to perform the recursive checks into the ring buffer, use the already existing per_cpu descriptor that is part of the ring buffer itself. Not only does this simplify the code, it also allows for one ring buffer to be used within the guts of the use of another ring buffer. For example trace_printk() can now be used within the ring buffer to record changes done by an instance into the main ring buffer. The recursion checks will prevent the trace_printk() itself from causing recursive issues with the main ring buffer (it is just ignored), but the recursive checks wont prevent the trace_printk() from recording other ring buffers. Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2015-05-21ring-buffer: Add unlikelys to make fast path the defaultSteven Rostedt (Red Hat)
I was running the trace_event benchmark and noticed that the times to record a trace_event was all over the place. I looked at the assembly of the ring_buffer_lock_reserver() and saw this: <ring_buffer_lock_reserve>: 31 c0 xor %eax,%eax 48 83 3d 76 47 bd 00 cmpq $0x1,0xbd4776(%rip) # ffffffff81d10d60 <ring_buffer_flags> 01 55 push %rbp 48 89 e5 mov %rsp,%rbp 75 1d jne ffffffff8113c60d <ring_buffer_lock_reserve+0x2d> 65 ff 05 69 e3 ec 7e incl %gs:0x7eece369(%rip) # a960 <__preempt_count> 8b 47 08 mov 0x8(%rdi),%eax 85 c0 test %eax,%eax +---- 74 12 je ffffffff8113c610 <ring_buffer_lock_reserve+0x30> | 65 ff 0d 5b e3 ec 7e decl %gs:0x7eece35b(%rip) # a960 <__preempt_count> | 0f 84 85 00 00 00 je ffffffff8113c690 <ring_buffer_lock_reserve+0xb0> | 31 c0 xor %eax,%eax | 5d pop %rbp | c3 retq | 90 nop +---> 65 44 8b 05 48 e3 ec mov %gs:0x7eece348(%rip),%r8d # a960 <__preempt_count> 7e 41 81 e0 ff ff ff 7f and $0x7fffffff,%r8d b0 08 mov $0x8,%al 65 8b 0d 58 36 ed 7e mov %gs:0x7eed3658(%rip),%ecx # fc80 <current_context> 41 f7 c0 00 ff 1f 00 test $0x1fff00,%r8d 74 1e je ffffffff8113c64f <ring_buffer_lock_reserve+0x6f> 41 f7 c0 00 00 10 00 test $0x100000,%r8d b0 01 mov $0x1,%al 75 13 jne ffffffff8113c64f <ring_buffer_lock_reserve+0x6f> 41 81 e0 00 00 0f 00 and $0xf0000,%r8d 49 83 f8 01 cmp $0x1,%r8 19 c0 sbb %eax,%eax 83 e0 02 and $0x2,%eax 83 c0 02 add $0x2,%eax 85 c8 test %ecx,%eax 75 ab jne ffffffff8113c5fe <ring_buffer_lock_reserve+0x1e> 09 c8 or %ecx,%eax 65 89 05 24 36 ed 7e mov %eax,%gs:0x7eed3624(%rip) # fc80 <current_context> The arrow is the fast path. After adding the unlikely's, the fast path looks a bit better: <ring_buffer_lock_reserve>: 31 c0 xor %eax,%eax 48 83 3d 76 47 bd 00 cmpq $0x1,0xbd4776(%rip) # ffffffff81d10d60 <ring_buffer_flags> 01 55 push %rbp 48 89 e5 mov %rsp,%rbp 75 7b jne ffffffff8113c66b <ring_buffer_lock_reserve+0x8b> 65 ff 05 69 e3 ec 7e incl %gs:0x7eece369(%rip) # a960 <__preempt_count> 8b 47 08 mov 0x8(%rdi),%eax 85 c0 test %eax,%eax 0f 85 9f 00 00 00 jne ffffffff8113c6a1 <ring_buffer_lock_reserve+0xc1> 65 8b 0d 57 e3 ec 7e mov %gs:0x7eece357(%rip),%ecx # a960 <__preempt_count> 81 e1 ff ff ff 7f and $0x7fffffff,%ecx b0 08 mov $0x8,%al 65 8b 15 68 36 ed 7e mov %gs:0x7eed3668(%rip),%edx # fc80 <current_context> f7 c1 00 ff 1f 00 test $0x1fff00,%ecx 75 50 jne ffffffff8113c670 <ring_buffer_lock_reserve+0x90> 85 d0 test %edx,%eax 75 7d jne ffffffff8113c6a1 <ring_buffer_lock_reserve+0xc1> 09 d0 or %edx,%eax 65 89 05 53 36 ed 7e mov %eax,%gs:0x7eed3653(%rip) # fc80 <current_context> 65 8b 05 fc da ec 7e mov %gs:0x7eecdafc(%rip),%eax # a130 <cpu_number> 89 c2 mov %eax,%edx Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2015-05-13tracing: Rename ftrace_event.h to trace_events.hSteven Rostedt (Red Hat)
The term "ftrace" is really the infrastructure of the function hooks, and not the trace events. Rename ftrace_event.h to trace_events.h to represent the trace_event infrastructure and decouple the term ftrace from it. Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2015-03-30ring-buffer: Remove duplicate use of '&' in recursive codeSteven Rostedt (Red Hat)
A clean up of the recursive protection code changed val = this_cpu_read(current_context); val--; val &= this_cpu_read(current_context); to val = this_cpu_read(current_context); val &= val & (val - 1); Which has a duplicate use of '&' as the above is the same as val = val & (val - 1); Actually, it would be best to remove that line altogether and just add it to where it is used. And Christoph even mentioned that it can be further compacted to just a single line: __this_cpu_and(current_context, __this_cpu_read(current_context) - 1); Link: http://lkml.kernel.org/alpine.DEB.2.11.1503271423580.23114@gentwo.org Suggested-by: Christoph Lameter <cl@linux.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2015-03-25ring-buffer: Replace this_cpu_*() with __this_cpu_*()Steven Rostedt
It has come to my attention that this_cpu_read/write are horrible on architectures other than x86. Worse yet, they actually disable preemption or interrupts! This caused some unexpected tracing results on ARM. 101.356868: preempt_count_add <-ring_buffer_lock_reserve 101.356870: preempt_count_sub <-ring_buffer_lock_reserve The ring_buffer_lock_reserve has recursion protection that requires accessing a per cpu variable. But since preempt_disable() is traced, it too got traced while accessing the variable that is suppose to prevent recursion like this. The generic version of this_cpu_read() and write() are: #define this_cpu_generic_read(pcp) \ ({ typeof(pcp) ret__; \ preempt_disable(); \ ret__ = *this_cpu_ptr(&(pcp)); \ preempt_enable(); \ ret__; \ }) #define this_cpu_generic_to_op(pcp, val, op) \ do { \ unsigned long flags; \ raw_local_irq_save(flags); \ *__this_cpu_ptr(&(pcp)) op val; \ raw_local_irq_restore(flags); \ } while (0) Which is unacceptable for locations that know they are within preempt disabled or interrupt disabled locations. Paul McKenney stated that __this_cpu_() versions produce much better code on other architectures than this_cpu_() does, if we know that the call is done in a preempt disabled location. I also changed the recursive_unlock() to use two local variables instead of accessing the per_cpu variable twice. Link: http://lkml.kernel.org/r/20150317114411.GE3589@linux.vnet.ibm.com Link: http://lkml.kernel.org/r/20150317104038.312e73d1@gandalf.local.home Cc: stable@vger.kernel.org Acked-by: Christoph Lameter <cl@linux.com> Reported-by: Uwe Kleine-Koenig <u.kleine-koenig@pengutronix.de> Tested-by: Uwe Kleine-Koenig <u.kleine-koenig@pengutronix.de> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2015-02-11ring-buffer: Do not wake up a splice waiter when page is not fullSteven Rostedt (Red Hat)
When an application connects to the ring buffer via splice, it can only read full pages. Splice does not work with partial pages. If there is not enough data to fill a page, the splice command will either block or return -EAGAIN (if set to nonblock). Code was added where if the page is not full, to just sleep again. The problem is, it will get woken up again on the next event. That is, when something is written into the ring buffer, if there is a waiter it will wake it up. The waiter would then check the buffer, see that it still does not have enough data to fill a page and go back to sleep. To make matters worse, when the waiter goes back to sleep, it could cause another event, which would wake it back up again to see it doesn't have enough data and sleep again. This produces a tremendous overhead and fills the ring buffer with noise. For example, recording sched_switch on an idle system for 10 seconds produces 25,350,475 events!!! Create another wait queue for those waiters wanting full pages. When an event is written, it only wakes up waiters if there's a full page of data. It does not wake up the waiter if the page is not yet full. After this change, recording sched_switch on an idle system for 10 seconds produces only 800 events. Getting rid of 25,349,675 useless events (99.9969% of events!!), is something to take seriously. Cc: stable@vger.kernel.org # 3.16+ Cc: Rabin Vincent <rabin@rab.in> Fixes: e30f53aad220 "tracing: Do not busy wait in buffer splice" Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2015-01-22tracing: Remove unneeded includes of debugfs.h and fs.hSteven Rostedt (Red Hat)
The creation of tracing files and directories is for the most part encapsulated in helper functions in trace.c. Other files do not need to include debugfs.h or fs.h, as they may have needed to in the past. Remove them from the files that do not need them. Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2014-12-10Merge tag 'trace-3.19' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace Pull tracing updates from Steven Rostedt: "There was a lot of clean ups and minor fixes. One of those clean ups was to the trace_seq code. It also removed the return values to the trace_seq_*() functions and use trace_seq_has_overflowed() to see if the buffer filled up or not. This is similar to work being done to the seq_file code as well in another tree. Some of the other goodies include: - Added some "!" (NOT) logic to the tracing filter. - Fixed the frame pointer logic to the x86_64 mcount trampolines - Added the logic for dynamic trampolines on !CONFIG_PREEMPT systems. That is, the ftrace trampoline can be dynamically allocated and be called directly by functions that only have a single hook to them" * tag 'trace-3.19' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: (55 commits) tracing: Truncated output is better than nothing tracing: Add additional marks to signal very large time deltas Documentation: describe trace_buf_size parameter more accurately tracing: Allow NOT to filter AND and OR clauses tracing: Add NOT to filtering logic ftrace/fgraph/x86: Have prepare_ftrace_return() take ip as first parameter ftrace/x86: Get rid of ftrace_caller_setup ftrace/x86: Have save_mcount_regs macro also save stack frames if needed ftrace/x86: Add macro MCOUNT_REG_SIZE for amount of stack used to save mcount regs ftrace/x86: Simplify save_mcount_regs on getting RIP ftrace/x86: Have save_mcount_regs store RIP in %rdi for first parameter ftrace/x86: Rename MCOUNT_SAVE_FRAME and add more detailed comments ftrace/x86: Move MCOUNT_SAVE_FRAME out of header file ftrace/x86: Have static tracing also use ftrace_caller_setup ftrace/x86: Have static function tracing always test for function graph kprobes: Add IPMODIFY flag to kprobe_ftrace_ops ftrace, kprobes: Support IPMODIFY flag to find IP modify conflict kprobes/ftrace: Recover original IP if pre_handler doesn't change it tracing/trivial: Fix typos and make an int into a bool tracing: Deletion of an unnecessary check before iput() ...