summaryrefslogtreecommitdiff
AgeCommit message (Collapse)Author
2023-04-27ring-buffer: Sync IRQ works before buffer destructionJohannes Berg
If something was written to the buffer just before destruction, it may be possible (maybe not in a real system, but it did happen in ARCH=um with time-travel) to destroy the ringbuffer before the IRQ work ran, leading this KASAN report (or a crash without KASAN): BUG: KASAN: slab-use-after-free in irq_work_run_list+0x11a/0x13a Read of size 8 at addr 000000006d640a48 by task swapper/0 CPU: 0 PID: 0 Comm: swapper Tainted: G W O 6.3.0-rc1 #7 Stack: 60c4f20f 0c203d48 41b58ab3 60f224fc 600477fa 60f35687 60c4f20f 601273dd 00000008 6101eb00 6101eab0 615be548 Call Trace: [<60047a58>] show_stack+0x25e/0x282 [<60c609e0>] dump_stack_lvl+0x96/0xfd [<60c50d4c>] print_report+0x1a7/0x5a8 [<603078d3>] kasan_report+0xc1/0xe9 [<60308950>] __asan_report_load8_noabort+0x1b/0x1d [<60232844>] irq_work_run_list+0x11a/0x13a [<602328b4>] irq_work_tick+0x24/0x34 [<6017f9dc>] update_process_times+0x162/0x196 [<6019f335>] tick_sched_handle+0x1a4/0x1c3 [<6019fd9e>] tick_sched_timer+0x79/0x10c [<601812b9>] __hrtimer_run_queues.constprop.0+0x425/0x695 [<60182913>] hrtimer_interrupt+0x16c/0x2c4 [<600486a3>] um_timer+0x164/0x183 [...] Allocated by task 411: save_stack_trace+0x99/0xb5 stack_trace_save+0x81/0x9b kasan_save_stack+0x2d/0x54 kasan_set_track+0x34/0x3e kasan_save_alloc_info+0x25/0x28 ____kasan_kmalloc+0x8b/0x97 __kasan_kmalloc+0x10/0x12 __kmalloc+0xb2/0xe8 load_elf_phdrs+0xee/0x182 [...] The buggy address belongs to the object at 000000006d640800 which belongs to the cache kmalloc-1k of size 1024 The buggy address is located 584 bytes inside of freed 1024-byte region [000000006d640800, 000000006d640c00) Add the appropriate irq_work_sync() so the work finishes before the buffers are destroyed. Prior to the commit in the Fixes tag below, there was only a single global IRQ work, so this issue didn't exist. Link: https://lore.kernel.org/linux-trace-kernel/20230427175920.a76159263122.I8295e405c44362a86c995e9c2c37e3e03810aa56@changeid Cc: stable@vger.kernel.org Cc: Masami Hiramatsu <mhiramat@kernel.org> Fixes: 15693458c4bc ("tracing/ring-buffer: Move poll wake ups into ring buffer code") Signed-off-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-04-26tracing: Add missing spaces in trace_print_hex_seq()Ken Lin
If the buffer length is larger than 16 and concatenate is set to false, there would be missing spaces every 16 bytes. Example: Before: c5 11 10 50 05 4d 31 40 00 40 00 40 00 4d 31 4000 40 00 After: c5 11 10 50 05 4d 31 40 00 40 00 40 00 4d 31 40 00 40 00 Link: https://lore.kernel.org/linux-trace-kernel/20230426032257.3157247-1-lyenting@google.com Signed-off-by: Ken Lin <lyenting@google.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-04-26ring-buffer: Ensure proper resetting of atomic variables in ↵Tze-nan Wu
ring_buffer_reset_online_cpus In ring_buffer_reset_online_cpus, the buffer_size_kb write operation may permanently fail if the cpu_online_mask changes between two for_each_online_buffer_cpu loops. The number of increases and decreases on both cpu_buffer->resize_disabled and cpu_buffer->record_disabled may be inconsistent, causing some CPUs to have non-zero values for these atomic variables after the function returns. This issue can be reproduced by "echo 0 > trace" while hotplugging cpu. After reproducing success, we can find out buffer_size_kb will not be functional anymore. To prevent leaving 'resize_disabled' and 'record_disabled' non-zero after ring_buffer_reset_online_cpus returns, we ensure that each atomic variable has been set up before atomic_sub() to it. Link: https://lore.kernel.org/linux-trace-kernel/20230426062027.17451-1-Tze-nan.Wu@mediatek.com Cc: stable@vger.kernel.org Cc: <mhiramat@kernel.org> Cc: npiggin@gmail.com Fixes: b23d7a5f4a07 ("ring-buffer: speed up buffer resets by avoiding synchronize_rcu for each CPU") Reviewed-by: Cheng-Jui Wang <cheng-jui.wang@mediatek.com> Signed-off-by: Tze-nan Wu <Tze-nan.Wu@mediatek.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-04-25recordmcount: Fix memory leaks in the uwrite functionHao Zeng
Common realloc mistake: 'file_append' nulled but not freed upon failure Link: https://lkml.kernel.org/r/20230426010527.703093-1-zenghao@kylinos.cn Signed-off-by: Hao Zeng <zenghao@kylinos.cn> Suggested-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-04-25tracing/user_events: Limit max fault-in attemptsBeau Belgrave
When event enablement changes, user_events attempts to update a bit in the user process. If a fault is hit, an attempt to fault-in the page and the write is retried if the page made it in. While this normally requires a couple attempts, it is possible a bad user process could attempt to cause infinite loops. Ensure fault-in attempts either sync or async are limited to a max of 10 attempts for each update. When the max is hit, return -EFAULT so another attempt is not made in all cases. Link: https://lkml.kernel.org/r/20230425225107.8525-5-beaub@linux.microsoft.com Suggested-by: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Beau Belgrave <beaub@linux.microsoft.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-04-25tracing/user_events: Prevent same address and bit per processBeau Belgrave
User processes register an address and bit pair for events. If the same address and bit pair are registered multiple times in the same process, it can cause undefined behavior when events are enabled/disabled. When more than one are used, the bit could be turned off by another event being disabled, while the original event is still enabled. Prevent undefined behavior by checking the current mm to see if any event has already been registered for the address and bit pair. Return EADDRINUSE back to the user process if it's already being used. Update ftrace self-test to ensure this occurs properly. Link: https://lkml.kernel.org/r/20230425225107.8525-4-beaub@linux.microsoft.com Suggested-by: Doug Cook <dcook@linux.microsoft.com> Signed-off-by: Beau Belgrave <beaub@linux.microsoft.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-04-25tracing/user_events: Ensure bit is cleared on unregisterBeau Belgrave
If an event is enabled and a user process unregisters user_events, the bit is left set. Fix this by always clearing the bit in the user process if unregister is successful. Update abi self-test to ensure this occurs properly. Link: https://lkml.kernel.org/r/20230425225107.8525-3-beaub@linux.microsoft.com Suggested-by: Doug Cook <dcook@linux.microsoft.com> Signed-off-by: Beau Belgrave <beaub@linux.microsoft.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-04-25tracing/user_events: Ensure write index cannot be negativeBeau Belgrave
The write index indicates which event the data is for and accesses a per-file array. The index is passed by user processes during write() calls as the first 4 bytes. Ensure that it cannot be negative by returning -EINVAL to prevent out of bounds accesses. Update ftrace self-test to ensure this occurs properly. Link: https://lkml.kernel.org/r/20230425225107.8525-2-beaub@linux.microsoft.com Fixes: 7f5a08c79df3 ("user_events: Add minimal support for trace_event into ftrace") Reported-by: Doug Cook <dcook@linux.microsoft.com> Signed-off-by: Beau Belgrave <beaub@linux.microsoft.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-04-25seq_buf: Add seq_buf_do_printk() helperSergey Senozhatsky
Sometimes we use seq_buf to format a string buffer, which we then pass to printk(). However, in certain situations the seq_buf string buffer can get too big, exceeding the PRINTKRB_RECORD_MAX bytes limit, and causing printk() to truncate the string. Add a new seq_buf helper. This helper prints the seq_buf string buffer line by line, using \n as a delimiter, rather than passing the whole string buffer to printk() at once. Link: https://lkml.kernel.org/r/20230415100110.1419872-1-senozhatsky@chromium.org Cc: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org> Reviewed-by: Petr Mladek <pmladek@suse.com> Tested-by: Yosry Ahmed <yosryahmed@google.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-04-25tracing: Fix print_fields() for __dyn_loc/__rel_locBeau Belgrave
Both print_fields() and print_array() do not handle if dynamic data ends at the last byte of the payload for both __dyn_loc and __rel_loc field types. For __rel_loc, the offset was off by 4 bytes, leading to incorrect strings and data being printed out. In print_array() the buffer pos was missed from being advanced, which results in the first payload byte being used as the offset base instead of the field offset. Advance __rel_loc offset by 4 to ensure correct offset and advance pos to the field offset to ensure correct data is displayed when printing arrays. Change >= to > when checking if data is in-bounds, since it's valid for dynamic data to include the last byte of the payload. Example outputs for event format: field:unsigned short common_type; offset:0; size:2; signed:0; field:unsigned char common_flags; offset:2; size:1; signed:0; field:unsigned char common_preempt_count; offset:3; size:1; signed:0; field:int common_pid; offset:4; size:4; signed:1; field:__rel_loc char text[]; offset:8; size:4; signed:1; Output before: tp_rel_loc: text=<OVERFLOW> Output after: tp_rel_loc: text=Test Link: https://lkml.kernel.org/r/20230419214140.4158-3-beaub@linux.microsoft.com Fixes: 80a76994b2d8 ("tracing: Add "fields" option to show raw trace event fields") Reported-by: Doug Cook <dcook@linux.microsoft.com> Signed-off-by: Beau Belgrave <beaub@linux.microsoft.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-04-25tracing/user_events: Set event filter_type from typeBeau Belgrave
Users expect that events can be filtered by the kernel. User events currently sets all event fields as FILTER_OTHER which limits to binary filters only. When strings are being used, functionality is reduced. Use filter_assign_type() to find the most appropriate filter type for each field in user events to ensure full kernel capabilities. Link: https://lkml.kernel.org/r/20230419214140.4158-2-beaub@linux.microsoft.com Signed-off-by: Beau Belgrave <beaub@linux.microsoft.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-04-25ring-buffer: Clearly check null ptr returned by rb_set_head_page()Zheng Yejian
In error case, 'buffer_page' returned by rb_set_head_page() is NULL, currently check '&buffer_page->list' is equivalent to check 'buffer_page' due to 'list' is the first member of 'buffer_page', but suppose it is not some time, 'head_page' would be wild memory while check would be bypassed. Link: https://lore.kernel.org/linux-trace-kernel/20230414071729.57312-1-zhengyejian1@huawei.com Cc: <mhiramat@kernel.org> Signed-off-by: Zheng Yejian <zhengyejian1@huawei.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-03-29tracing: Unbreak user eventsSteven Rostedt (Google)
The user events was added a bit prematurely, and there were a few kernel developers that had issues with it. The API also needed a bit of work to make sure it would be stable. It was decided to make user events "broken" until this was settled. Now it has a new API that appears to be as stable as it will be without the use of a crystal ball. It's being used within Microsoft as is, which means the API has had some testing in real world use cases. It went through many discussions in the bi-weekly tracing meetings, and there's been no more comments about updates. I feel this is good to go. Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-03-29tracing/user_events: Use print_format_fields() for trace outputSteven Rostedt (Google)
Currently, user events are shown using the "hex" output for "safety" reasons as one cannot trust user events behaving nicely. But the hex output is not the only utility for safe outputting of trace events. The print_event_fields() is just as safe and gives user readable output. Before: example-839 [001] ..... 43.222244: 00000000: b1 06 00 00 47 03 00 00 00 00 00 00 ....G....... example-839 [001] ..... 43.564433: 00000000: b1 06 00 00 47 03 00 00 01 00 00 00 ....G....... example-839 [001] ..... 43.763917: 00000000: b1 06 00 00 47 03 00 00 02 00 00 00 ....G....... example-839 [001] ..... 43.967929: 00000000: b1 06 00 00 47 03 00 00 03 00 00 00 ....G....... After: example-837 [006] ..... 55.739249: test: count=0x0 (0) example-837 [006] ..... 111.104784: test: count=0x1 (1) example-837 [006] ..... 111.268444: test: count=0x2 (2) example-837 [006] ..... 111.416533: test: count=0x3 (3) example-837 [006] ..... 111.542859: test: count=0x4 (4) Link: https://lore.kernel.org/linux-trace-kernel/20230328151413.4770b8d7@gandalf.local.home Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Beau Belgrave <beaub@linux.microsoft.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-03-29tracing/user_events: Align structs with tabs for readabilityBeau Belgrave
Add tabs to make struct members easier to read and unify the style of the code. Link: https://lkml.kernel.org/r/20230328235219.203-13-beaub@linux.microsoft.com Signed-off-by: Beau Belgrave <beaub@linux.microsoft.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-03-29tracing/user_events: Limit global user_event countBeau Belgrave
Operators want to be able to ensure enough tracepoints exist on the system for kernel components as well as for user components. Since there are only up to 64K events, by default allow up to half to be used by user events. Add a kernel sysctl parameter (kernel.user_events_max) to set a global limit that is honored among all groups on the system. This ensures hard limits can be setup to prevent user processes from consuming all event IDs on the system. Link: https://lkml.kernel.org/r/20230328235219.203-12-beaub@linux.microsoft.com Signed-off-by: Beau Belgrave <beaub@linux.microsoft.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-03-29tracing/user_events: Charge event allocs to cgroupsBeau Belgrave
Operators need a way to limit how much memory cgroups use. User events need to be included into that accounting. Fix this by using GFP_KERNEL_ACCOUNT for allocations generated by user programs for user_event tracing. Link: https://lkml.kernel.org/r/20230328235219.203-11-beaub@linux.microsoft.com Signed-off-by: Beau Belgrave <beaub@linux.microsoft.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-03-29tracing/user_events: Update documentation for ABIBeau Belgrave
The ABI for user_events has changed from mmap() based to remote writes. Update the documentation to reflect these changes, add new section for unregistering events since lifetime is now tied to tasks instead of files. Link: https://lkml.kernel.org/r/20230328235219.203-10-beaub@linux.microsoft.com Signed-off-by: Beau Belgrave <beaub@linux.microsoft.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-03-29tracing/user_events: Use write ABI in exampleBeau Belgrave
The ABI has changed to use a remote write approach. Update the example to show the expected use of this new ABI. Also remove debugfs path and use tracefs to ensure example works in more environments. Link: https://lkml.kernel.org/r/20230328235219.203-9-beaub@linux.microsoft.com Signed-off-by: Beau Belgrave <beaub@linux.microsoft.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-03-29tracing/user_events: Add ABI self-testBeau Belgrave
Add ABI specific self-test to ensure enablements work in various scenarios such as fork, VM_CLONE, and basic event enable/disable. Ensure ABI contracts/limits are also being upheld, such as bit limits and data size limits. Link: https://lkml.kernel.org/r/20230328235219.203-8-beaub@linux.microsoft.com Signed-off-by: Beau Belgrave <beaub@linux.microsoft.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-03-29tracing/user_events: Update self-tests to write ABIBeau Belgrave
ABI has been changed to remote writes, update existing test cases to use this new ABI to ensure existing functionality continues to work. Link: https://lkml.kernel.org/r/20230328235219.203-7-beaub@linux.microsoft.com Signed-off-by: Beau Belgrave <beaub@linux.microsoft.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-03-29tracing/user_events: Add ioctl for disabling addressesBeau Belgrave
Enablements are now tracked by the lifetime of the task/mm. User processes need to be able to disable their addresses if tracing is requested to be turned off. Before unmapping the page would suffice. However, we now need a stronger contract. Add an ioctl to enable this. A new flag bit is added, freeing, to user_event_enabler to ensure that if the event is attempted to be removed while a fault is being handled that the remove is delayed until after the fault is reattempted. Link: https://lkml.kernel.org/r/20230328235219.203-6-beaub@linux.microsoft.com Signed-off-by: Beau Belgrave <beaub@linux.microsoft.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-03-29tracing/user_events: Fixup enable faults asynclyBeau Belgrave
When events are enabled within the various tracing facilities, such as ftrace/perf, the event_mutex is held. As events are enabled pages are accessed. We do not want page faults to occur under this lock. Instead queue the fault to a workqueue to be handled in a process context safe way without the lock. The enable address is marked faulting while the async fault-in occurs. This ensures that we don't attempt to fault-in more than is necessary. Once the page has been faulted in, an address write is re-attempted. If the page couldn't fault-in, then we wait until the next time the event is enabled to prevent any potential infinite loops. Link: https://lkml.kernel.org/r/20230328235219.203-5-beaub@linux.microsoft.com Signed-off-by: Beau Belgrave <beaub@linux.microsoft.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-03-29tracing/user_events: Use remote writes for event enablementBeau Belgrave
As part of the discussions for user_events aligned with user space tracers, it was determined that user programs should register a aligned value to set or clear a bit when an event becomes enabled. Currently a shared page is being used that requires mmap(). Remove the shared page implementation and move to a user registered address implementation. In this new model during the event registration from user programs 3 new values are specified. The first is the address to update when the event is either enabled or disabled. The second is the bit to set/clear to reflect the event being enabled. The third is the size of the value at the specified address. This allows for a local 32/64-bit value in user programs to support both kernel and user tracers. As an example, setting bit 31 for kernel tracers when the event becomes enabled allows for user tracers to use the other bits for ref counts or other flags. The kernel side updates the bit atomically, user programs need to also update these values atomically. User provided addresses must be aligned on a natural boundary, this allows for single page checking and prevents odd behaviors such as a enable value straddling 2 pages instead of a single page. Currently page faults are only logged, future patches will handle these. Link: https://lkml.kernel.org/r/20230328235219.203-4-beaub@linux.microsoft.com Suggested-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Signed-off-by: Beau Belgrave <beaub@linux.microsoft.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-03-29tracing/user_events: Track fork/exec/exit for mm lifetimeBeau Belgrave
During tracefs discussions it was decided instead of requiring a mapping within a user-process to track the lifetime of memory descriptors we should hook the appropriate calls. Do this by adding the minimal stubs required for task fork, exec, and exit. Currently this is just a NOP. Future patches will implement these calls fully. Link: https://lkml.kernel.org/r/20230328235219.203-3-beaub@linux.microsoft.com Suggested-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Signed-off-by: Beau Belgrave <beaub@linux.microsoft.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-03-29tracing/user_events: Split header into uapi and kernelBeau Belgrave
The UAPI parts need to be split out from the kernel parts of user_events now that other parts of the kernel will reference it. Do so by moving the existing include/linux/user_events.h into include/uapi/linux/user_events.h. Link: https://lkml.kernel.org/r/20230328235219.203-2-beaub@linux.microsoft.com Signed-off-by: Beau Belgrave <beaub@linux.microsoft.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-03-29tracing: Add "fields" option to show raw trace event fieldsSteven Rostedt (Google)
The hex, raw and bin formats come from the old PREEMPT_RT patch set latency tracer. That actually gave real alternatives to reading the ascii buffer. But they have started to bit rot and they do not give a good representation of the tracing data. Add "fields" option that will read the trace event fields and parse the data from how the fields are defined: With "fields" = 0 (default) echo 1 > events/sched/sched_switch/enable cat trace <idle>-0 [003] d..2. 540.078653: sched_switch: prev_comm=swapper/3 prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=kworker/3:1 next_pid=83 next_prio=120 kworker/3:1-83 [003] d..2. 540.078860: sched_switch: prev_comm=kworker/3:1 prev_pid=83 prev_prio=120 prev_state=I ==> next_comm=swapper/3 next_pid=0 next_prio=120 <idle>-0 [003] d..2. 540.206423: sched_switch: prev_comm=swapper/3 prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=sshd next_pid=807 next_prio=120 sshd-807 [003] d..2. 540.206531: sched_switch: prev_comm=sshd prev_pid=807 prev_prio=120 prev_state=S ==> next_comm=swapper/3 next_pid=0 next_prio=120 <idle>-0 [001] d..2. 540.206597: sched_switch: prev_comm=swapper/1 prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=kworker/u16:4 next_pid=58 next_prio=120 kworker/u16:4-58 [001] d..2. 540.206617: sched_switch: prev_comm=kworker/u16:4 prev_pid=58 prev_prio=120 prev_state=I ==> next_comm=bash next_pid=830 next_prio=120 bash-830 [001] d..2. 540.206678: sched_switch: prev_comm=bash prev_pid=830 prev_prio=120 prev_state=R ==> next_comm=kworker/u16:4 next_pid=58 next_prio=120 kworker/u16:4-58 [001] d..2. 540.206696: sched_switch: prev_comm=kworker/u16:4 prev_pid=58 prev_prio=120 prev_state=I ==> next_comm=bash next_pid=830 next_prio=120 bash-830 [001] d..2. 540.206713: sched_switch: prev_comm=bash prev_pid=830 prev_prio=120 prev_state=R ==> next_comm=kworker/u16:4 next_pid=58 next_prio=120 echo 1 > options/fields <...>-998 [002] d..2. 538.643732: sched_switch: next_prio=0x78 (120) next_pid=0x0 (0) next_comm=swapper/2 prev_state=0x20 (32) prev_prio=0x78 (120) prev_pid=0x3e6 (998) prev_comm=trace-cmd <idle>-0 [001] d..2. 538.643806: sched_switch: next_prio=0x78 (120) next_pid=0x33e (830) next_comm=bash prev_state=0x0 (0) prev_prio=0x78 (120) prev_pid=0x0 (0) prev_comm=swapper/1 bash-830 [001] d..2. 538.644106: sched_switch: next_prio=0x78 (120) next_pid=0x3a (58) next_comm=kworker/u16:4 prev_state=0x0 (0) prev_prio=0x78 (120) prev_pid=0x33e (830) prev_comm=bash kworker/u16:4-58 [001] d..2. 538.644130: sched_switch: next_prio=0x78 (120) next_pid=0x33e (830) next_comm=bash prev_state=0x80 (128) prev_prio=0x78 (120) prev_pid=0x3a (58) prev_comm=kworker/u16:4 bash-830 [001] d..2. 538.644180: sched_switch: next_prio=0x78 (120) next_pid=0x3a (58) next_comm=kworker/u16:4 prev_state=0x0 (0) prev_prio=0x78 (120) prev_pid=0x33e (830) prev_comm=bash kworker/u16:4-58 [001] d..2. 538.644185: sched_switch: next_prio=0x78 (120) next_pid=0x33e (830) next_comm=bash prev_state=0x80 (128) prev_prio=0x78 (120) prev_pid=0x3a (58) prev_comm=kworker/u16:4 bash-830 [001] d..2. 538.644204: sched_switch: next_prio=0x78 (120) next_pid=0x0 (0) next_comm=swapper/1 prev_state=0x1 (1) prev_prio=0x78 (120) prev_pid=0x33e (830) prev_comm=bash <idle>-0 [003] d..2. 538.644211: sched_switch: next_prio=0x78 (120) next_pid=0x327 (807) next_comm=sshd prev_state=0x0 (0) prev_prio=0x78 (120) prev_pid=0x0 (0) prev_comm=swapper/3 sshd-807 [003] d..2. 538.644340: sched_switch: next_prio=0x78 (120) next_pid=0x0 (0) next_comm=swapper/3 prev_state=0x1 (1) prev_prio=0x78 (120) prev_pid=0x327 (807) prev_comm=sshd It traces the data safely without using the trace print formatting. Link: https://lore.kernel.org/linux-trace-kernel/20230328145156.497651be@gandalf.local.home Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Beau Belgrave <beaub@linux.microsoft.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-03-29tools/kvm_stat: use canonical ftrace pathRoss Zwisler
The canonical location for the tracefs filesystem is at /sys/kernel/tracing. But, from Documentation/trace/ftrace.rst: Before 4.1, all ftrace tracing control files were within the debugfs file system, which is typically located at /sys/kernel/debug/tracing. For backward compatibility, when mounting the debugfs file system, the tracefs file system will be automatically mounted at: /sys/kernel/debug/tracing A comment in kvm_stat still refers to this older debugfs path, so let's update it to avoid confusion. Link: https://lkml.kernel.org/r/20230313211746.1541525-3-zwisler@kernel.org Cc: "Tobin C. Harding" <me@tobin.cc> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Shuah Khan <shuah@kernel.org> Cc: Tycho Andersen <tycho@tycho.pizza> Acked-by: Paolo Bonzini <pbonzini@redhat.com> Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org> Reviewed-by: Mukesh Ojha <quic_mojha@quicinc.com> Signed-off-by: Ross Zwisler <zwisler@google.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-03-29leaking_addresses: also skip canonical ftrace pathRoss Zwisler
The canonical location for the tracefs filesystem is at /sys/kernel/tracing. But, from Documentation/trace/ftrace.rst: Before 4.1, all ftrace tracing control files were within the debugfs file system, which is typically located at /sys/kernel/debug/tracing. For backward compatibility, when mounting the debugfs file system, the tracefs file system will be automatically mounted at: /sys/kernel/debug/tracing scripts/leaking_addresses.pl only skipped this older debugfs path, so let's add the canonical path as well. Link: https://lkml.kernel.org/r/20230313211746.1541525-2-zwisler@kernel.org Cc: "Tobin C. Harding" <me@tobin.cc> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Shuah Khan <shuah@kernel.org> Acked-by: Tycho Andersen <tycho@tycho.pizza> Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Ross Zwisler <zwisler@google.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-03-29selftests: use canonical ftrace pathRoss Zwisler
The canonical location for the tracefs filesystem is at /sys/kernel/tracing. But, from Documentation/trace/ftrace.rst: Before 4.1, all ftrace tracing control files were within the debugfs file system, which is typically located at /sys/kernel/debug/tracing. For backward compatibility, when mounting the debugfs file system, the tracefs file system will be automatically mounted at: /sys/kernel/debug/tracing A few spots in tools/testing/selftests still refer to this older debugfs path, so let's update them to avoid confusion. Link: https://lkml.kernel.org/r/20230313211746.1541525-1-zwisler@kernel.org Cc: "Tobin C. Harding" <me@tobin.cc> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Shuah Khan <shuah@kernel.org> Cc: Tycho Andersen <tycho@tycho.pizza> Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org> Reviewed-by: Mukesh Ojha <quic_mojha@quicinc.com> Signed-off-by: Ross Zwisler <zwisler@google.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-03-28docs: tracing: Update fprobe documentationMasami Hiramatsu (Google)
Update fprobe.rst for - the private entry_data argument - the return value of the entry handler - the nr_rethook_node field. Link: https://lkml.kernel.org/r/167526701579.433354.3057889264263546659.stgit@mhiramat.roam.corp.google.com Cc: Florent Revest <revest@chromium.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Will Deacon <will@kernel.org> Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-03-28lib/test_fprobe: Add a testcase for skipping exit_handlerMasami Hiramatsu (Google)
Add a testcase for skipping exit_handler if entry_handler returns !0. Link: https://lkml.kernel.org/r/167526700658.433354.12922388040490848613.stgit@mhiramat.roam.corp.google.com Cc: Florent Revest <revest@chromium.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Will Deacon <will@kernel.org> Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-03-28fprobe: Skip exit_handler if entry_handler returns !0Masami Hiramatsu (Google)
Skip hooking function return and calling exit_handler if the entry_handler() returns !0. Link: https://lkml.kernel.org/r/167526699798.433354.10998365726830117303.stgit@mhiramat.roam.corp.google.com Cc: Florent Revest <revest@chromium.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Will Deacon <will@kernel.org> Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-03-28lib/test_fprobe: Add a test case for nr_maxactiveMasami Hiramatsu (Google)
Add a test case for nr_maxactive. If the number of active functions is more than nr_maxactive, it must be skipped. Link: https://lkml.kernel.org/r/167526698856.433354.4430007340787176666.stgit@mhiramat.roam.corp.google.com Cc: Florent Revest <revest@chromium.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Will Deacon <will@kernel.org> Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-03-28fprobe: Add nr_maxactive to specify rethook_node pool sizeMasami Hiramatsu (Google)
Add nr_maxactive to specify rethook_node pool size. This means the maximum number of actively running target functions concurrently for probing by exit_handler. Note that if the running function is preempted or sleep, it is still counted as 'active'. Link: https://lkml.kernel.org/r/167526697917.433354.17779774988245113106.stgit@mhiramat.roam.corp.google.com Cc: Florent Revest <revest@chromium.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Will Deacon <will@kernel.org> Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-03-28lib/test_fprobe: Add private entry_data testcasesMasami Hiramatsu (Google)
Add test cases for checking whether private entry_data is correctly passed or not. Link: https://lkml.kernel.org/r/167526697074.433354.17790288501657876219.stgit@mhiramat.roam.corp.google.com Cc: Florent Revest <revest@chromium.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Will Deacon <will@kernel.org> Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-03-28fprobe: Pass entry_data to handlersMasami Hiramatsu (Google)
Pass the private entry_data to the entry and exit handlers so that they can share the context data, something like saved function arguments etc. User must specify the private entry_data size by @entry_data_size field before registering the fprobe. Link: https://lkml.kernel.org/r/167526696173.433354.17408372048319432574.stgit@mhiramat.roam.corp.google.com Cc: Florent Revest <revest@chromium.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Will Deacon <will@kernel.org> Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-03-21ftrace: Show a list of all functions that have ever been enabledSteven Rostedt (Google)
When debugging a crash that appears to be related to ftrace, but not for sure, it is useful to know if a function was ever enabled by ftrace or not. It could be that a BPF program was attached to it, or possibly a live patch. We are having crashes in the field where this information is not always known. But having ftrace set a flag if a function has ever been attached since boot up helps tremendously in trying to know if a crash had to do with something using ftrace. For analyzing crashes, the use of a kdump image can have access to the flags. When looking at issues where the kernel did not panic, the touched_functions file can simply be used. Link: https://lore.kernel.org/linux-trace-kernel/20230124095653.6fd1640e@gandalf.local.home Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Tested-by: Mark Rutland <mark.rutland@arm.com> Tested-by: Chris Li <chriscli@google.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-03-21ring_buffer: Use try_cmpxchg instead of cmpxchgUros Bizjak
Use try_cmpxchg instead of cmpxchg (*ptr, old, new) == old. x86 CMPXCHG instruction returns success in ZF flag, so this change saves a compare after cmpxchg (and related move instruction in front of cmpxchg). Also, try_cmpxchg implicitly assigns old *ptr value to "old" when cmpxchg fails. There is no need to re-read the value in the loop. No functional change intended. Link: https://lkml.kernel.org/r/20230305155532.5549-4-ubizjak@gmail.com Cc: Masami Hiramatsu <mhiramat@kernel.org> Signed-off-by: Uros Bizjak <ubizjak@gmail.com> Acked-by: Mukesh Ojha <quic_mojha@quicinc.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-03-21ring_buffer: Change some static functions to boolUros Bizjak
The return values of some functions are of boolean type. Change the type of these function to bool and adjust their return values. Also change type of some internal varibles to bool. No functional change intended. Link: https://lkml.kernel.org/r/20230305155532.5549-3-ubizjak@gmail.com Cc: Masami Hiramatsu <mhiramat@kernel.org> Signed-off-by: Uros Bizjak <ubizjak@gmail.com> Reviewed-by: Mukesh Ojha <quic_mojha@quicinc.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-03-21ring_buffer: Change some static functions to voidUros Bizjak
The results of some static functions are not used. Change the type of these function to void and remove unnecessary returns. No functional change intended. Link: https://lkml.kernel.org/r/20230305155532.5549-2-ubizjak@gmail.com Signed-off-by: Uros Bizjak <ubizjak@gmail.com> Reviewed-by: Masami Hiramatsu <mhiramat@kernel.org> Reviewed-by: Mukesh Ojha <quic_mojha@quicinc.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-03-21ftrace: selftest: remove broken trace_direct_trampMark Rutland
The ftrace selftest code has a trace_direct_tramp() function which it uses as a direct call trampoline. This happens to work on x86, since the direct call's return address is in the usual place, and can be returned to via a RET, but in general the calling convention for direct calls is different from regular function calls, and requires a trampoline written in assembly. On s390, regular function calls place the return address in %r14, and an ftrace patch-site in an instrumented function places the trampoline's return address (which is within the instrumented function) in %r0, preserving the original %r14 value in-place. As a regular C function will return to the address in %r14, using a C function as the trampoline results in the trampoline returning to the caller of the instrumented function, skipping the body of the instrumented function. Note that the s390 issue is not detcted by the ftrace selftest code, as the instrumented function is trivial, and returning back into the caller happens to be equivalent. On arm64, regular function calls place the return address in x30, and an ftrace patch-site in an instrumented function saves this into r9 and places the trampoline's return address (within the instrumented function) in x30. A regular C function will return to the address in x30, but will not restore x9 into x30. Consequently, using a C function as the trampoline results in returning to the trampoline's return address having corrupted x30, such that when the instrumented function returns, it will return back into itself. To avoid future issues in this area, remove the trace_direct_tramp() function, and require that each architecture with direct calls provides a stub trampoline, named ftrace_stub_direct_tramp. This can be written to handle the architecture's trampoline calling convention, and in future could be used elsewhere (e.g. in the ftrace ops sample, to measure the overhead of direct calls), so we may as well always build it in. Link: https://lkml.kernel.org/r/20230321140424.345218-8-revest@chromium.org Signed-off-by: Mark Rutland <mark.rutland@arm.com> Cc: Li Huafei <lihuafei1@huawei.com> Cc: Xu Kuohai <xukuohai@huawei.com> Signed-off-by: Florent Revest <revest@chromium.org> Acked-by: Jiri Olsa <jolsa@kernel.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-03-21ftrace: Make DIRECT_CALLS work WITH_ARGS and !WITH_REGSFlorent Revest
Direct called trampolines can be called in two ways: - either from the ftrace callsite. In this case, they do not access any struct ftrace_regs nor pt_regs - Or, if a ftrace ops is also attached, from the end of a ftrace trampoline. In this case, the call_direct_funcs ops is in charge of setting the direct call trampoline's address in a struct ftrace_regs Since: commit 9705bc709604 ("ftrace: pass fregs to arch_ftrace_set_direct_caller()") The later case no longer requires a full pt_regs. It only needs a struct ftrace_regs so DIRECT_CALLS can work with both WITH_ARGS or WITH_REGS. With architectures like arm64 already abandoning WITH_REGS in favor of WITH_ARGS, it's important to have DIRECT_CALLS work WITH_ARGS only. Link: https://lkml.kernel.org/r/20230321140424.345218-7-revest@chromium.org Signed-off-by: Florent Revest <revest@chromium.org> Co-developed-by: Mark Rutland <mark.rutland@arm.com> Signed-off-by: Mark Rutland <mark.rutland@arm.com> Acked-by: Jiri Olsa <jolsa@kernel.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-03-21ftrace: Store direct called addresses in their opsFlorent Revest
All direct calls are now registered using the register_ftrace_direct API so each ops can jump to only one direct-called trampoline. By storing the direct called trampoline address directly in the ops we can save one hashmap lookup in the direct call ops and implement arm64 direct calls on top of call ops. Link: https://lkml.kernel.org/r/20230321140424.345218-6-revest@chromium.org Signed-off-by: Florent Revest <revest@chromium.org> Acked-by: Jiri Olsa <jolsa@kernel.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-03-21ftrace: Rename _ftrace_direct_multi APIs to _ftrace_direct APIsFlorent Revest
Now that the original _ftrace_direct APIs are gone, the "_multi" suffixes only add confusion. Link: https://lkml.kernel.org/r/20230321140424.345218-5-revest@chromium.org Signed-off-by: Florent Revest <revest@chromium.org> Acked-by: Mark Rutland <mark.rutland@arm.com> Tested-by: Mark Rutland <mark.rutland@arm.com> Acked-by: Jiri Olsa <jolsa@kernel.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-03-21ftrace: Remove the legacy _ftrace_direct APIFlorent Revest
This API relies on a single global ops, used for all direct calls registered with it. However, to implement arm64 direct calls, we need each ops to point to a single direct call trampoline. Link: https://lkml.kernel.org/r/20230321140424.345218-4-revest@chromium.org Signed-off-by: Florent Revest <revest@chromium.org> Acked-by: Mark Rutland <mark.rutland@arm.com> Tested-by: Mark Rutland <mark.rutland@arm.com> Acked-by: Jiri Olsa <jolsa@kernel.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-03-21ftrace: Replace uses of _ftrace_direct APIs with _ftrace_direct_multiFlorent Revest
The _multi API requires that users keep their own ops but can enforce that an op is only associated to one direct call. Link: https://lkml.kernel.org/r/20230321140424.345218-3-revest@chromium.org Signed-off-by: Florent Revest <revest@chromium.org> Acked-by: Mark Rutland <mark.rutland@arm.com> Tested-by: Mark Rutland <mark.rutland@arm.com> Acked-by: Jiri Olsa <jolsa@kernel.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-03-21ftrace: Let unregister_ftrace_direct_multi() call ftrace_free_filter()Florent Revest
A common pattern when using the ftrace_direct_multi API is to unregister the ops and also immediately free its filter. We've noticed it's very easy for users to miss calling ftrace_free_filter(). This adds a "free_filters" argument to unregister_ftrace_direct_multi() to both remind the user they should free filters and also to make their life easier. Link: https://lkml.kernel.org/r/20230321140424.345218-2-revest@chromium.org Suggested-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Florent Revest <revest@chromium.org> Acked-by: Mark Rutland <mark.rutland@arm.com> Acked-by: Jiri Olsa <jolsa@kernel.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2023-03-19Linux 6.3-rc3v6.3-rc3Linus Torvalds
2023-03-19Merge tag 'trace-v6.3-rc2' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace Pull tracing fixes from Steven Rostedt: - Fix setting affinity of hwlat threads in containers Using sched_set_affinity() has unwanted side effects when being called within a container. Use set_cpus_allowed_ptr() instead - Fix per cpu thread management of the hwlat tracer: - Do not start per_cpu threads if one is already running for the CPU - When starting per_cpu threads, do not clear the kthread variable as it may already be set to running per cpu threads - Fix return value for test_gen_kprobe_cmd() On error the return value was overwritten by being set to the result of the call from kprobe_event_delete(), which would likely succeed, and thus have the function return success - Fix splice() reads from the trace file that was broken by commit 36e2c7421f02 ("fs: don't allow splice read/write without explicit ops") - Remove obsolete and confusing comment in ring_buffer.c The original design of the ring buffer used struct page flags for tricks to optimize, which was shortly removed due to them being tricks. But a comment for those tricks remained - Set local functions and variables to static * tag 'trace-v6.3-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: tracing/hwlat: Replace sched_setaffinity with set_cpus_allowed_ptr ring-buffer: remove obsolete comment for free_buffer_page() tracing: Make splice_read available again ftrace: Set direct_ops storage-class-specifier to static trace/hwlat: Do not start per-cpu thread if it is already running trace/hwlat: Do not wipe the contents of per-cpu thread data tracing/osnoise: set several trace_osnoise.c variables storage-class-specifier to static tracing: Fix wrong return in kprobe_event_gen_test.c