summaryrefslogtreecommitdiff
path: root/kernel/trace
AgeCommit message (Collapse)Author
2025-03-15tracing: tprobe-events: Fix leakage of module refcountMasami Hiramatsu (Google)
When enabling the tracepoint at loading module, the target module refcount is incremented by find_tracepoint_in_module(). But it is unnecessary because the module is not unloaded while processing module loading callbacks. Moreover, the refcount is not decremented in that function. To be clear the module refcount handling, move the try_module_get() callsite to trace_fprobe_create_internal(), where it is actually required. Link: https://lore.kernel.org/all/174182761071.83274.18334217580449925882.stgit@devnote2/ Fixes: 57a7e6de9e30 ("tracing/fprobe: Support raw tracepoints on future loaded modules") Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Cc: stable@vger.kernel.org
2025-03-15tracing: tprobe-events: Fix to clean up tprobe correctly when module unloadMasami Hiramatsu (Google)
When unloading module, the tprobe events are not correctly cleaned up. Thus it becomes `fprobe-event` and never be enabled again even if loading the same module again. For example; # cd /sys/kernel/tracing # modprobe trace_events_sample # echo 't:my_tprobe foo_bar' >> dynamic_events # cat dynamic_events t:tracepoints/my_tprobe foo_bar # rmmod trace_events_sample # cat dynamic_events f:tracepoints/my_tprobe foo_bar As you can see, the second time my_tprobe starts with 'f' instead of 't'. This unregisters the fprobe and tracepoint callback when module is unloaded but marks the fprobe-event is tprobe-event. Link: https://lore.kernel.org/all/174158724946.189309.15826571379395619524.stgit@mhiramat.tok.corp.google.com/ Fixes: 57a7e6de9e30 ("tracing/fprobe: Support raw tracepoints on future loaded modules") Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
2025-03-14tracing: Correct the refcount if the hist/hist_debug file fails to openTengda Wu
The function event_{hist,hist_debug}_open() maintains the refcount of 'file->tr' and 'file' through tracing_open_file_tr(). However, it does not roll back these counts on subsequent failure paths, resulting in a refcount leak. A very obvious case is that if the hist/hist_debug file belongs to a specific instance, the refcount leak will prevent the deletion of that instance, as it relies on the condition 'tr->ref == 1' within __remove_instance(). Fix this by calling tracing_release_file_tr() on all failure paths in event_{hist,hist_debug}_open() to correct the refcount. Cc: stable@vger.kernel.org Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Zheng Yejian <zhengyejian1@huawei.com> Link: https://lore.kernel.org/20250314065335.1202817-1-wutengda@huaweicloud.com Fixes: 1cc111b9cddc ("tracing: Fix uaf issue when open the hist or hist_debug file") Signed-off-by: Tengda Wu <wutengda@huaweicloud.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-03-10bpf: Use RCU in all users of __module_text_address().Sebastian Andrzej Siewior
__module_address() can be invoked within a RCU section, there is no requirement to have preemption disabled. Replace the preempt_disable() section around __module_address() with RCU. Cc: Alexei Starovoitov <ast@kernel.org> Cc: Andrii Nakryiko <andrii@kernel.org> Cc: Daniel Borkmann <daniel@iogearbox.net> Cc: Eduard Zingerman <eddyz87@gmail.com> Cc: Hao Luo <haoluo@google.com> Cc: Jiri Olsa <jolsa@kernel.org> Cc: John Fastabend <john.fastabend@gmail.com> Cc: KP Singh <kpsingh@kernel.org> Cc: Martin KaFai Lau <martin.lau@linux.dev> Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Matt Bobrowski <mattbobrowski@google.com> Cc: Song Liu <song@kernel.org> Cc: Stanislav Fomichev <sdf@fomichev.me> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Yonghong Song <yonghong.song@linux.dev> Cc: bpf@vger.kernel.org Cc: linux-trace-kernel@vger.kernel.org Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/r/20250129084751.tH6iidUO@linutronix.de Signed-off-by: Petr Pavlu <petr.pavlu@suse.com>
2025-03-10module: Use RCU in find_module_all().Sebastian Andrzej Siewior
The modules list and module::kallsyms can be accessed under RCU assumption. Remove module_assert_mutex_or_preempt() from find_module_all() so it can be used under RCU protection without warnings. Update its callers to use RCU protection instead of preempt_disable(). Cc: Jiri Kosina <jikos@kernel.org> Cc: Joe Lawrence <joe.lawrence@redhat.com> Cc: Josh Poimboeuf <jpoimboe@kernel.org> Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Miroslav Benes <mbenes@suse.cz> Cc: Petr Mladek <pmladek@suse.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: linux-trace-kernel@vger.kernel.org Cc: live-patching@vger.kernel.org Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Petr Mladek <pmladek@suse.com> Link: https://lore.kernel.org/r/20250108090457.512198-7-bigeasy@linutronix.de Signed-off-by: Petr Pavlu <petr.pavlu@suse.com>
2025-03-07function_graph: Remove the unused variable funcJiapeng Chong
Variable func is not effectively used, so delete it. kernel/trace/trace_functions_graph.c:925:16: warning: variable ‘func’ set but not used. This happened because the variable "func" which came from "call->func" was replaced by "ret_func" coming from "graph_ret->func" but "func" wasn't removed after the replacement. Link: https://lore.kernel.org/20250307021412.119107-1-jiapeng.chong@linux.alibaba.com Reported-by: Abaci Robot <abaci@linux.alibaba.com> Closes: https://bugzilla.openanolis.cn/show_bug.cgi?id=19250 Fixes: ff5c9c576e754 ("ftrace: Add support for function argument to graph tracer") Signed-off-by: Jiapeng Chong <jiapeng.chong@linux.alibaba.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-03-06tracing/user_events: Slightly simplify user_seq_show()Christophe JAILLET
2 seq_puts() calls can be merged. It saves a few lines of code and a few cycles, should it matter. Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Link: https://lore.kernel.org/845caa94b74cea8d72c158bf1994fe250beee28c.1739979791.git.christophe.jaillet@wanadoo.fr Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-03-06tracing/user_events: Don't use %pK through printkThomas Weißschuh
Restricted pointers ("%pK") are not meant to be used through printk(). It can unintentionally expose security sensitive, raw pointer values. Use regular pointer formatting instead. Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Link: https://lore.kernel.org/20250217-restricted-pointers-trace-v1-1-bbe9ea279848@linutronix.de Link: https://lore.kernel.org/lkml/20250113171731-dc10e3c1-da64-4af0-b767-7c7070468023@linutronix.de/ Signed-off-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de> Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-03-06ring-buffer: Fix typo in comment about header page pointerZhouyi Zhou
Fix typo in comment about header page pointer in function rb_get_reader_page. Link: https://lore.kernel.org/20250118012352.3430519-1-zhouzhouyi@gmail.com Signed-off-by: Zhouyi Zhou <zhouzhouyi@gmail.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-03-04osnoise: provide quiescent statesAnkur Arora
To reduce RCU noise for nohz_full configurations, osnoise depends on cond_resched() providing quiescent states for PREEMPT_RCU=n configurations. For PREEMPT_RCU=y configurations -- where cond_resched() is a stub -- we do this by directly calling rcu_momentary_eqs(). With (PREEMPT_LAZY=y, PREEMPT_DYNAMIC=n), however, we have a configuration with (PREEMPTION=y, PREEMPT_RCU=n) where neither of the above can help. Handle that by providing an explicit quiescent state here for all configurations. As mentioned above this is not needed for non-stubbed cond_resched(), but, providing a quiescent state here just pulls in one that a future cond_resched() would provide, so doesn't cause any extra work for this configuration. Cc: Paul E. McKenney <paulmck@kernel.org> Cc: Daniel Bristot de Oliveira <bristot@kernel.org> Cc: Steven Rostedt <rostedt@goodmis.org> Suggested-by: Paul E. McKenney <paulmck@kernel.org> Acked-by: Daniel Bristot de Oliveira <bristot@kernel.org> Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com> Reviewed-by: Frederic Weisbecker <frederic@kernel.org> Acked-by: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
2025-03-04rv: Add license identifiers to monitor filesGabriele Monaco
Some monitor files like the main header and the Kconfig are missing the license identifier. Add it to those and make sure the automatic generation script includes the line in newly created monitors. Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Juri Lelli <juri.lelli@redhat.com> Link: https://lore.kernel.org/20250218123121.253551-3-gmonaco@redhat.com Signed-off-by: Gabriele Monaco <gmonaco@redhat.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-03-04ftrace: Add arguments to function tracerSven Schnelle
Wire up the code to print function arguments in the function tracer. This functionality can be enabled/disabled during runtime with options/func-args. ping-689 [004] b.... 77.170220: dummy_xmit(skb = 0x82904800, dev = 0x882d0000) <-dev_hard_start_xmit Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Albert Ou <aou@eecs.berkeley.edu> Cc: Guo Ren <guoren@kernel.org> Cc: Donglin Peng <dolinux.peng@gmail.com> Cc: Zheng Yejian <zhengyejian@huaweicloud.com> Link: https://lore.kernel.org/20250227185823.154996172@goodmis.org Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Co-developed-by: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Sven Schnelle <svens@linux.ibm.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-03-04ftrace: Have funcgraph-args take affect during tracingSteven Rostedt
Currently, when function_graph is started, it looks at the option funcgraph-args, and if it is set, it will enable tracing of the arguments. But if tracing is already running, and the user enables funcgraph-args, it will have no effect. Instead, it should enable argument tracing when it is enabled, even if it means disabling the function graph tracing for a short time in order to do the transition. Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Albert Ou <aou@eecs.berkeley.edu> Cc: Guo Ren <guoren@kernel.org> Cc: Donglin Peng <dolinux.peng@gmail.com> Cc: Zheng Yejian <zhengyejian@huaweicloud.com> Link: https://lore.kernel.org/20250227185822.978998710@goodmis.org Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-03-04ftrace: Add support for function argument to graph tracerSven Schnelle
Wire up the code to print function arguments in the function graph tracer. This functionality can be enabled/disabled during runtime with options/funcgraph-args. Example usage: 6) | dummy_xmit [dummy](skb = 0x8887c100, dev = 0x872ca000) { 6) | consume_skb(skb = 0x8887c100) { 6) | skb_release_head_state(skb = 0x8887c100) { 6) 0.178 us | sock_wfree(skb = 0x8887c100) 6) 0.627 us | } Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Albert Ou <aou@eecs.berkeley.edu> Cc: Guo Ren <guoren@kernel.org> Cc: Donglin Peng <dolinux.peng@gmail.com> Cc: Zheng Yejian <zhengyejian@huaweicloud.com> Link: https://lore.kernel.org/20250227185822.810321199@goodmis.org Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Co-developed-by: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Sven Schnelle <svens@linux.ibm.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-03-04ftrace: Add print_function_args()Sven Schnelle
Add a function to decode argument types with the help of BTF. Will be used to display arguments in the function and function graph tracer. It can only handle simply arguments and up to FTRACE_REGS_MAX_ARGS number of arguments. When it hits a max, it will print ", ...": page_to_skb(vi=0xffff8d53842dc980, rq=0xffff8d53843a0800, page=0xfffffc2e04337c00, offset=6160, len=64, truesize=1536, ...) And if it hits an argument that is not recognized, it will print the raw value and the type of argument it is: make_vfsuid(idmap=0xffffffff87f99db8, fs_userns=0xffffffff87e543c0, kuid=0x0 (STRUCT)) __pti_set_user_pgtbl(pgdp=0xffff8d5384ab47f8, pgd=0x110e74067 (STRUCT)) Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Albert Ou <aou@eecs.berkeley.edu> Cc: Guo Ren <guoren@kernel.org> Cc: Donglin Peng <dolinux.peng@gmail.com> Cc: Zheng Yejian <zhengyejian@huaweicloud.com> Link: https://lore.kernel.org/20250227185822.639418500@goodmis.org Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Co-developed-by: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Sven Schnelle <svens@linux.ibm.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-03-04ftrace: Have ftrace_free_filter() WARN and exit if ops is activeSteven Rostedt
The ftrace_free_filter() is used to reset the ops filters. But it must be done if the ops is not currently active (tracing). If it is, it will mess up the ftrace accounting of what functions are attached and what is not. WARN and exit the ftrace_free_filter() if the ops is active when it is called. Currently, it doesn't seem if anything does this, but it may in the future. Link: https://lore.kernel.org/all/20250219095330.2e9f171c@gandalf.local.home/ Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Link: https://lore.kernel.org/20250219135040.3a9fbe00@gandalf.local.home Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-03-04fgraph: Correct typo in ftrace_return_to_handler commentHaiyue Wang
Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Link: https://lore.kernel.org/20250218122052.58348-1-haiyuewa@163.com Signed-off-by: Haiyue Wang <haiyuewa@163.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-03-03Merge tag 'v6.14-rc5' into x86/core, to pick up fixesIngo Molnar
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2025-03-03Merge tag 'probes-fixes-v6.14-rc4' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace Pull probe events fixes from Masami Hiramatsu: - probe-events: Remove unused MAX_ARG_BUF_LEN macro - it is not used - fprobe-events: Log error for exceeding the number of entry args. Since the max number of entry args is limited, it should be checked and rejected when the parser detects it. - tprobe-events: Reject invalid tracepoint name If a user specifies an invalid tracepoint name (e.g. including '/') then the new event is not defined correctly in the eventfs. - tprobe-events: Fix a memory leak when tprobe defined with $retval There is a memory leak if tprobe is defined with $retval. * tag 'probes-fixes-v6.14-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: tracing: probe-events: Remove unused MAX_ARG_BUF_LEN macro tracing: fprobe-events: Log error for exceeding the number of entry args tracing: tprobe-events: Reject invalid tracepoint name tracing: tprobe-events: Fix a memory leak when tprobe with $retval
2025-03-03tracing: probe-events: Remove unused MAX_ARG_BUF_LEN macroMasami Hiramatsu (Google)
Commit 18b1e870a496 ("tracing/probes: Add $arg* meta argument for all function args") introduced MAX_ARG_BUF_LEN but it is not used. Remove it. Link: https://lore.kernel.org/all/174055075876.4079315.8805416872155957588.stgit@mhiramat.tok.corp.google.com/ Fixes: 18b1e870a496 ("tracing/probes: Add $arg* meta argument for all function args") Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-02-27ftrace: Avoid potential division by zero in function_stat_show()Nikolay Kuratov
Check whether denominator expression x * (x - 1) * 1000 mod {2^32, 2^64} produce zero and skip stddev computation in that case. For now don't care about rec->counter * rec->counter overflow because rec->time * rec->time overflow will likely happen earlier. Cc: stable@vger.kernel.org Cc: Wen Yang <wenyang@linux.alibaba.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Link: https://lore.kernel.org/20250206090156.1561783-1-kniv@yandex-team.ru Fixes: e31f7939c1c27 ("ftrace: Avoid potential division by zero in function profiler") Signed-off-by: Nikolay Kuratov <kniv@yandex-team.ru> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-02-27tracing: Fix bad hist from corrupting named_triggers listSteven Rostedt
The following commands causes a crash: ~# cd /sys/kernel/tracing/events/rcu/rcu_callback ~# echo 'hist:name=bad:keys=common_pid:onmax(bogus).save(common_pid)' > trigger bash: echo: write error: Invalid argument ~# echo 'hist:name=bad:keys=common_pid' > trigger Because the following occurs: event_trigger_write() { trigger_process_regex() { event_hist_trigger_parse() { data = event_trigger_alloc(..); event_trigger_register(.., data) { cmd_ops->reg(.., data, ..) [hist_register_trigger()] { data->ops->init() [event_hist_trigger_init()] { save_named_trigger(name, data) { list_add(&data->named_list, &named_triggers); } } } } ret = create_actions(); (return -EINVAL) if (ret) goto out_unreg; [..] ret = hist_trigger_enable(data, ...) { list_add_tail_rcu(&data->list, &file->triggers); <<<---- SKIPPED!!! (this is important!) [..] out_unreg: event_hist_unregister(.., data) { cmd_ops->unreg(.., data, ..) [hist_unregister_trigger()] { list_for_each_entry(iter, &file->triggers, list) { if (!hist_trigger_match(data, iter, named_data, false)) <- never matches continue; [..] test = iter; } if (test && test->ops->free) <<<-- test is NULL test->ops->free(test) [event_hist_trigger_free()] { [..] if (data->name) del_named_trigger(data) { list_del(&data->named_list); <<<<-- NEVER gets removed! } } } } [..] kfree(data); <<<-- frees item but it is still on list The next time a hist with name is registered, it causes an u-a-f bug and the kernel can crash. Move the code around such that if event_trigger_register() succeeds, the next thing called is hist_trigger_enable() which adds it to the list. A bunch of actions is called if get_named_trigger_data() returns false. But that doesn't need to be called after event_trigger_register(), so it can be moved up, allowing event_trigger_register() to be called just before hist_trigger_enable() keeping them together and allowing the file->triggers to be properly populated. Cc: stable@vger.kernel.org Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Link: https://lore.kernel.org/20250227163944.1c37f85f@gandalf.local.home Fixes: 067fe038e70f6 ("tracing: Add variable reference handling to hist triggers") Reported-by: Tomas Glozar <tglozar@redhat.com> Tested-by: Tomas Glozar <tglozar@redhat.com> Reviewed-by: Tom Zanussi <zanussi@kernel.org> Closes: https://lore.kernel.org/all/CAP4=nvTsxjckSBTz=Oe_UYh8keD9_sZC4i++4h72mJLic4_W4A@mail.gmail.com/ Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-02-26trace/osnoise: Add trace events for samplesTomas Glozar
Add trace events that fire at osnoise and timerlat sample generation, in addition to the already existing noise and threshold events. This allows processing the samples directly in the kernel, either with ftrace triggers or with BPF. Cc: John Kacur <jkacur@redhat.com> Cc: Luis Goncalves <lgoncalv@redhat.com> Link: https://lore.kernel.org/20250203090418.1458923-1-tglozar@redhat.com Signed-off-by: Tomas Glozar <tglozar@redhat.com> Tested-by: Gabriele Monaco <gmonaco@redhat.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-02-27tracing: fprobe-events: Log error for exceeding the number of entry argsMasami Hiramatsu (Google)
Add error message when the number of entry argument exceeds the maximum size of entry data. This is currently checked when registering fprobe, but in this case no error message is shown in the error_log file. Link: https://lore.kernel.org/all/174055074269.4079315.17809232650360988538.stgit@mhiramat.tok.corp.google.com/ Fixes: 25f00e40ce79 ("tracing/probes: Support $argN in return probe (kprobe and fprobe)") Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-02-27tracing: tprobe-events: Reject invalid tracepoint nameMasami Hiramatsu (Google)
Commit 57a7e6de9e30 ("tracing/fprobe: Support raw tracepoints on future loaded modules") allows user to set a tprobe on non-exist tracepoint but it does not check the tracepoint name is acceptable. So it leads tprobe has a wrong character for events (e.g. with subsystem prefix). In this case, the event is not shown in the events directory. Reject such invalid tracepoint name. The tracepoint name must consist of alphabet or digit or '_'. Link: https://lore.kernel.org/all/174055073461.4079315.15875502830565214255.stgit@mhiramat.tok.corp.google.com/ Fixes: 57a7e6de9e30 ("tracing/fprobe: Support raw tracepoints on future loaded modules") Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org> Cc: stable@vger.kernel.org
2025-02-27tracing: tprobe-events: Fix a memory leak when tprobe with $retvalMasami Hiramatsu (Google)
Fix a memory leak when a tprobe is defined with $retval. This combination is not allowed, but the parse_symbol_and_return() does not free the *symbol which should not be used if it returns the error. Thus, it leaks the *symbol memory in that error path. Link: https://lore.kernel.org/all/174055072650.4079315.3063014346697447838.stgit@mhiramat.tok.corp.google.com/ Fixes: ce51e6153f77 ("tracing: fprobe-event: Fix to check tracepoint event and return") Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org> Cc: stable@vger.kernel.org
2025-02-26perf: Remove unnecessary parameter of security checkLuo Gengkun
It seems that the attr parameter was never been used in security checks since it was first introduced by: commit da97e18458fb ("perf_event: Add support for LSM and SELinux checks") so remove it. Signed-off-by: Luo Gengkun <luogengkun@huaweicloud.com> Reviewed-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Paul Moore <paul@paul-moore.com>
2025-02-26bpf: Fix deadlock between rcu_tasks_trace and event_mutex.Alexei Starovoitov
Fix the following deadlock: CPU A _free_event() perf_kprobe_destroy() mutex_lock(&event_mutex) perf_trace_event_unreg() synchronize_rcu_tasks_trace() There are several paths where _free_event() grabs event_mutex and calls sync_rcu_tasks_trace. Above is one such case. CPU B bpf_prog_test_run_syscall() rcu_read_lock_trace() bpf_prog_run_pin_on_cpu() bpf_prog_load() bpf_tracing_func_proto() trace_set_clr_event() mutex_lock(&event_mutex) Delegate trace_set_clr_event() to workqueue to avoid such lock dependency. Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20250224221637.4780-1-alexei.starovoitov@gmail.com
2025-02-25tracing: Add traceoff_after_boot optionSteven Rostedt
Sometimes tracing is used to debug issues during the boot process. Since the trace buffer has a limited amount of storage, it may be prudent to disable tracing after the boot is finished, otherwise the critical information may be overwritten. With this option, the main tracing buffer will be turned off at the end of the boot process. Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Borislav Petkov <bp@alien8.de> Link: https://lore.kernel.org/20250208103017.48a7ec83@batman.local.home Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-02-25ftrace: Check against is_kernel_text() instead of kaslr_offset()Steven Rostedt
As kaslr_offset() is architecture dependent and also may not be defined by all architectures, when zeroing out unused weak functions, do not check against kaslr_offset(), but instead check if the address is within the kernel text sections. If KASLR added a shift to the zeroed out function, it would still not be located in the kernel text. This is a more robust way to test if the text is valid or not. Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Masahiro Yamada <masahiroy@kernel.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Will Deacon <will@kernel.org> Cc: "Arnd Bergmann" <arnd@arndb.de> Link: https://lore.kernel.org/20250225182054.471759017@goodmis.org Fixes: ef378c3b8233 ("scripts/sorttable: Zero out weak functions in mcount_loc table") Reported-by: Nathan Chancellor <nathan@kernel.org> Reported-by: Mark Brown <broonie@kernel.org> Tested-by: Nathan Chancellor <nathan@kernel.org> Closes: https://lore.kernel.org/all/20250224180805.GA1536711@ax162/ Closes: https://lore.kernel.org/all/5225b07b-a9b2-4558-9d5f-aa60b19f6317@sirena.org.uk/ Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-02-25ftrace: Test mcount_loc addr before calling ftrace_call_addr()Steven Rostedt
The addresses in the mcount_loc can be zeroed and then moved by KASLR making them invalid addresses. ftrace_call_addr() for ARM 64 expects a valid address to kernel text. If the addr read from the mcount_loc section is invalid, it must not call ftrace_call_addr(). Move the addr check before calling ftrace_call_addr() in ftrace_process_locs(). Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Masahiro Yamada <masahiroy@kernel.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Will Deacon <will@kernel.org> Cc: Mark Brown <broonie@kernel.org> Link: https://lore.kernel.org/20250225182054.290128736@goodmis.org Fixes: ef378c3b8233 ("scripts/sorttable: Zero out weak functions in mcount_loc table") Reported-by: Nathan Chancellor <nathan@kernel.org> Reported-by: "Arnd Bergmann" <arnd@arndb.de> Tested-by: Nathan Chancellor <nathan@kernel.org> Closes: https://lore.kernel.org/all/20250225025631.GA271248@ax162/ Closes: https://lore.kernel.org/all/91523154-072b-437b-bbdc-0b70e9783fd0@app.fastmail.com/ Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-02-21tracing: Fix memory leak when reading set_event fileAdrian Huang
kmemleak reports the following memory leak after reading set_event file: # cat /sys/kernel/tracing/set_event # cat /sys/kernel/debug/kmemleak unreferenced object 0xff110001234449e0 (size 16): comm "cat", pid 13645, jiffies 4294981880 hex dump (first 16 bytes): 01 00 00 00 00 00 00 00 a8 71 e7 84 ff ff ff ff .........q...... backtrace (crc c43abbc): __kmalloc_cache_noprof+0x3ca/0x4b0 s_start+0x72/0x2d0 seq_read_iter+0x265/0x1080 seq_read+0x2c9/0x420 vfs_read+0x166/0xc30 ksys_read+0xf4/0x1d0 do_syscall_64+0x79/0x150 entry_SYSCALL_64_after_hwframe+0x76/0x7e The issue can be reproduced regardless of whether set_event is empty or not. Here is an example about the valid content of set_event. # cat /sys/kernel/tracing/set_event sched:sched_process_fork sched:sched_switch sched:sched_wakeup *:*:mod:trace_events_sample The root cause is that s_next() returns NULL when nothing is found. This results in s_stop() attempting to free a NULL pointer because its parameter is NULL. Fix the issue by freeing the memory appropriately when s_next() fails to find anything. Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Link: https://lore.kernel.org/20250220031528.7373-1-ahuang12@lenovo.com Fixes: b355247df104 ("tracing: Cache ":mod:" events for modules not loaded yet") Signed-off-by: Adrian Huang <ahuang12@lenovo.com> Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-02-21ftrace: Correct preemption accounting for function tracing.Sebastian Andrzej Siewior
The function tracer should record the preemption level at the point when the function is invoked. If the tracing subsystem decrement the preemption counter it needs to correct this before feeding the data into the trace buffer. This was broken in the commit cited below while shifting the preempt-disabled section. Use tracing_gen_ctx_dec() which properly subtracts one from the preemption counter on a preemptible kernel. Cc: stable@vger.kernel.org Cc: Wander Lairson Costa <wander@redhat.com> Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/20250220140749.pfw8qoNZ@linutronix.de Fixes: ce5e48036c9e7 ("ftrace: disable preemption when recursion locked") Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Tested-by: Wander Lairson Costa <wander@redhat.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-02-21fprobe: Fix accounting of when to unregister from function graphSteven Rostedt
When adding a new fprobe, it will update the function hash to the functions the fprobe is attached to and register with function graph to have it call the registered functions. The fprobe_graph_active variable keeps track of the number of fprobes that are using function graph. If two fprobes attach to the same function, it increments the fprobe_graph_active for each of them. But when they are removed, the first fprobe to be removed will see that the function it is attached to is also used by another fprobe and it will not remove that function from function_graph. The logic will skip decrementing the fprobe_graph_active variable. This causes the fprobe_graph_active variable to not go to zero when all fprobes are removed, and in doing so it does not unregister from function graph. As the fgraph ops hash will now be empty, and an empty filter hash means all functions are enabled, this triggers function graph to add a callback to the fprobe infrastructure for every function! # echo "f:myevent1 kernel_clone" >> /sys/kernel/tracing/dynamic_events # echo "f:myevent2 kernel_clone%return" >> /sys/kernel/tracing/dynamic_events # cat /sys/kernel/tracing/enabled_functions kernel_clone (1) tramp: 0xffffffffc0024000 (ftrace_graph_func+0x0/0x60) ->ftrace_graph_func+0x0/0x60 # > /sys/kernel/tracing/dynamic_events # cat /sys/kernel/tracing/enabled_functions trace_initcall_start_cb (1) tramp: 0xffffffffc0026000 (function_trace_call+0x0/0x170) ->function_trace_call+0x0/0x170 run_init_process (1) tramp: 0xffffffffc0026000 (function_trace_call+0x0/0x170) ->function_trace_call+0x0/0x170 try_to_run_init_process (1) tramp: 0xffffffffc0026000 (function_trace_call+0x0/0x170) ->function_trace_call+0x0/0x170 x86_pmu_show_pmu_cap (1) tramp: 0xffffffffc0026000 (function_trace_call+0x0/0x170) ->function_trace_call+0x0/0x170 cleanup_rapl_pmus (1) tramp: 0xffffffffc0026000 (function_trace_call+0x0/0x170) ->function_trace_call+0x0/0x170 uncore_free_pcibus_map (1) tramp: 0xffffffffc0026000 (function_trace_call+0x0/0x170) ->function_trace_call+0x0/0x170 uncore_types_exit (1) tramp: 0xffffffffc0026000 (function_trace_call+0x0/0x170) ->function_trace_call+0x0/0x170 uncore_pci_exit.part.0 (1) tramp: 0xffffffffc0026000 (function_trace_call+0x0/0x170) ->function_trace_call+0x0/0x170 kvm_shutdown (1) tramp: 0xffffffffc0026000 (function_trace_call+0x0/0x170) ->function_trace_call+0x0/0x170 vmx_dump_msrs (1) tramp: 0xffffffffc0026000 (function_trace_call+0x0/0x170) ->function_trace_call+0x0/0x170 [..] # cat /sys/kernel/tracing/enabled_functions | wc -l 54702 If a fprobe is being removed and all its functions are also traced by other fprobes, still decrement the fprobe_graph_active counter. Cc: stable@vger.kernel.org Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Link: https://lore.kernel.org/20250220202055.565129766@goodmis.org Fixes: 4346ba1604093 ("fprobe: Rewrite fprobe on function-graph tracer") Closes: https://lore.kernel.org/all/20250217114918.10397-A-hca@linux.ibm.com/ Reported-by: Heiko Carstens <hca@linux.ibm.com> Tested-by: Heiko Carstens <hca@linux.ibm.com> Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-02-21fprobe: Always unregister fgraph function from opsSteven Rostedt
When the last fprobe is removed, it calls unregister_ftrace_graph() to remove the graph_ops from function graph. The issue is when it does so, it calls return before removing the function from its graph ops via ftrace_set_filter_ips(). This leaves the last function lingering in the fprobe's fgraph ops and if a probe is added it also enables that last function (even though the callback will just drop it, it does add unneeded overhead to make that call). # echo "f:myevent1 kernel_clone" >> /sys/kernel/tracing/dynamic_events # cat /sys/kernel/tracing/enabled_functions kernel_clone (1) tramp: 0xffffffffc02f3000 (ftrace_graph_func+0x0/0x60) ->ftrace_graph_func+0x0/0x60 # echo "f:myevent2 schedule_timeout" >> /sys/kernel/tracing/dynamic_events # cat /sys/kernel/tracing/enabled_functions kernel_clone (1) tramp: 0xffffffffc02f3000 (ftrace_graph_func+0x0/0x60) ->ftrace_graph_func+0x0/0x60 schedule_timeout (1) tramp: 0xffffffffc02f3000 (ftrace_graph_func+0x0/0x60) ->ftrace_graph_func+0x0/0x60 # > /sys/kernel/tracing/dynamic_events # cat /sys/kernel/tracing/enabled_functions # echo "f:myevent3 kmem_cache_free" >> /sys/kernel/tracing/dynamic_events # cat /sys/kernel/tracing/enabled_functions kmem_cache_free (1) tramp: 0xffffffffc0219000 (ftrace_graph_func+0x0/0x60) ->ftrace_graph_func+0x0/0x60 schedule_timeout (1) tramp: 0xffffffffc0219000 (ftrace_graph_func+0x0/0x60) ->ftrace_graph_func+0x0/0x60 The above enabled a fprobe on kernel_clone, and then on schedule_timeout. The content of the enabled_functions shows the functions that have a callback attached to them. The fprobe attached to those functions properly. Then the fprobes were cleared, and enabled_functions was empty after that. But after adding a fprobe on kmem_cache_free, the enabled_functions shows that the schedule_timeout was attached again. This is because it was still left in the fprobe ops that is used to tell function graph what functions it wants callbacks from. Cc: stable@vger.kernel.org Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Link: https://lore.kernel.org/20250220202055.393254452@goodmis.org Fixes: 4346ba1604093 ("fprobe: Rewrite fprobe on function-graph tracer") Tested-by: Heiko Carstens <hca@linux.ibm.com> Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-02-21ftrace: Do not add duplicate entries in subops manager opsSteven Rostedt
Check if a function is already in the manager ops of a subops. A manager ops contains multiple subops, and if two or more subops are tracing the same function, the manager ops only needs a single entry in its hash. Cc: stable@vger.kernel.org Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Link: https://lore.kernel.org/20250220202055.226762894@goodmis.org Fixes: 4f554e955614f ("ftrace: Add ftrace_set_filter_ips function") Tested-by: Heiko Carstens <hca@linux.ibm.com> Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-02-21ftrace: Fix accounting of adding subops to a manager opsSteven Rostedt
Function graph uses a subops and manager ops mechanism to attach to ftrace. The manager ops connects to ftrace and the functions it connects to is defined by a list of subops that it manages. The function hash that defines what the above ops attaches to limits the functions to attach if the hash has any content. If the hash is empty, it means to trace all functions. The creation of the manager ops hash is done by iterating over all the subops hashes. If any of the subops hashes is empty, it means that the manager ops hash must trace all functions as well. The issue is in the creation of the manager ops. When a second subops is attached, a new hash is created by starting it as NULL and adding the subops one at a time. But the NULL ops is mistaken as an empty hash, and once an empty hash is found, it stops the loop of subops and just enables all functions. # echo "f:myevent1 kernel_clone" >> /sys/kernel/tracing/dynamic_events # cat /sys/kernel/tracing/enabled_functions kernel_clone (1) tramp: 0xffffffffc0309000 (ftrace_graph_func+0x0/0x60) ->ftrace_graph_func+0x0/0x60 # echo "f:myevent2 schedule_timeout" >> /sys/kernel/tracing/dynamic_events # cat /sys/kernel/tracing/enabled_functions trace_initcall_start_cb (1) tramp: 0xffffffffc0309000 (ftrace_graph_func+0x0/0x60) ->ftrace_graph_func+0x0/0x60 run_init_process (1) tramp: 0xffffffffc0309000 (ftrace_graph_func+0x0/0x60) ->ftrace_graph_func+0x0/0x60 try_to_run_init_process (1) tramp: 0xffffffffc0309000 (ftrace_graph_func+0x0/0x60) ->ftrace_graph_func+0x0/0x60 x86_pmu_show_pmu_cap (1) tramp: 0xffffffffc0309000 (ftrace_graph_func+0x0/0x60) ->ftrace_graph_func+0x0/0x60 cleanup_rapl_pmus (1) tramp: 0xffffffffc0309000 (ftrace_graph_func+0x0/0x60) ->ftrace_graph_func+0x0/0x60 uncore_free_pcibus_map (1) tramp: 0xffffffffc0309000 (ftrace_graph_func+0x0/0x60) ->ftrace_graph_func+0x0/0x60 uncore_types_exit (1) tramp: 0xffffffffc0309000 (ftrace_graph_func+0x0/0x60) ->ftrace_graph_func+0x0/0x60 uncore_pci_exit.part.0 (1) tramp: 0xffffffffc0309000 (ftrace_graph_func+0x0/0x60) ->ftrace_graph_func+0x0/0x60 kvm_shutdown (1) tramp: 0xffffffffc0309000 (ftrace_graph_func+0x0/0x60) ->ftrace_graph_func+0x0/0x60 vmx_dump_msrs (1) tramp: 0xffffffffc0309000 (ftrace_graph_func+0x0/0x60) ->ftrace_graph_func+0x0/0x60 vmx_cleanup_l1d_flush (1) tramp: 0xffffffffc0309000 (ftrace_graph_func+0x0/0x60) ->ftrace_graph_func+0x0/0x60 [..] Fix this by initializing the new hash to NULL and if the hash is NULL do not treat it as an empty hash but instead allocate by copying the content of the first sub ops. Then on subsequent iterations, the new hash will not be NULL, but the content of the previous subops. If that first subops attached to all functions, then new hash may assume that the manager ops also needs to attach to all functions. Cc: stable@vger.kernel.org Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Link: https://lore.kernel.org/20250220202055.060300046@goodmis.org Fixes: 5fccc7552ccbc ("ftrace: Add subops logic to allow one ops to manage many") Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-02-20bpf: Use preempt_count() directly in bpf_send_signal_common()Hou Tao
bpf_send_signal_common() uses preemptible() to check whether or not the current context is preemptible. If it is preemptible, it will use irq_work to send the signal asynchronously instead of trying to hold a spin-lock, because spin-lock is sleepable under PREEMPT_RT. However, preemptible() depends on CONFIG_PREEMPT_COUNT. When CONFIG_PREEMPT_COUNT is turned off (e.g., CONFIG_PREEMPT_VOLUNTARY=y), !preemptible() will be evaluated as 1 and bpf_send_signal_common() will use irq_work unconditionally. Fix it by unfolding "!preemptible()" and using "preempt_count() != 0 || irqs_disabled()" instead. Fixes: 87c544108b61 ("bpf: Send signals asynchronously if !preemptible") Signed-off-by: Hou Tao <houtao1@huawei.com> Link: https://lore.kernel.org/r/20250220042259.1583319-1-houtao@huaweicloud.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-02-18ftrace: Have ftrace pages output reflect freed pagesSteven Rostedt
The amount of memory that ftrace uses to save the descriptors to manage the functions it can trace is shown at output. But if there are a lot of functions that are skipped because they were weak or the architecture added holes into the tables, then the extra pages that were allocated are freed. But these freed pages are not reflected in the numbers shown, and they can even be inconsistent with what is reported: ftrace: allocating 57482 entries in 225 pages ftrace: allocated 224 pages with 3 groups The above shows the number of original entries that are in the mcount_loc section and the pages needed to save them (225), but the second output reflects the number of pages that were actually used. The two should be consistent as: ftrace: allocating 56739 entries in 224 pages ftrace: allocated 224 pages with 3 groups The above also shows the accurate number of entires that were actually stored and does not include the entries that were removed. Cc: bpf <bpf@vger.kernel.org> Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Masahiro Yamada <masahiroy@kernel.org> Cc: Nathan Chancellor <nathan@kernel.org> Cc: Nicolas Schier <nicolas@fjasle.eu> Cc: Zheng Yejian <zhengyejian1@huawei.com> Cc: Martin Kelly <martin.kelly@crowdstrike.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Will Deacon <will@kernel.org> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Link: https://lore.kernel.org/20250218200023.221100846@goodmis.org Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-02-18ftrace: Update the mcount_loc check of skipped entriesSteven Rostedt
Now that weak functions turn into skipped entries, update the check to make sure the amount that was allocated would fit both the entries that were allocated as well as those that were skipped. Cc: bpf <bpf@vger.kernel.org> Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Masahiro Yamada <masahiroy@kernel.org> Cc: Nathan Chancellor <nathan@kernel.org> Cc: Nicolas Schier <nicolas@fjasle.eu> Cc: Zheng Yejian <zhengyejian1@huawei.com> Cc: Martin Kelly <martin.kelly@crowdstrike.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Will Deacon <will@kernel.org> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Link: https://lore.kernel.org/20250218200023.055162048@goodmis.org Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-02-18scripts/sorttable: Zero out weak functions in mcount_loc tableSteven Rostedt
When a function is annotated as "weak" and is overridden, the code is not removed. If it is traced, the fentry/mcount location in the weak function will be referenced by the "__mcount_loc" section. This will then be added to the available_filter_functions list. Since only the address of the functions are listed, to find the name to show, a search of kallsyms is used. Since kallsyms will return the function by simply finding the function that the address is after but before the next function, an address of a weak function will show up as the function before it. This is because kallsyms does not save names of weak functions. This has caused issues in the past, as now the traced weak function will be listed in available_filter_functions with the name of the function before it. At best, this will cause the previous function's name to be listed twice. At worse, if the previous function was marked notrace, it will now show up as a function that can be traced. Note that it only shows up that it can be traced but will not be if enabled, which causes confusion. https://lore.kernel.org/all/20220412094923.0abe90955e5db486b7bca279@kernel.org/ The commit b39181f7c6907 ("ftrace: Add FTRACE_MCOUNT_MAX_OFFSET to avoid adding weak function") was a workaround to this by checking the function address before printing its name. If the address was too far from the function given by the name then instead of printing the name it would print: __ftrace_invalid_address___<invalid-offset> The real issue is that these invalid addresses are listed in the ftrace table look up which available_filter_functions is derived from. A place holder must be listed in that file because set_ftrace_filter may take a series of indexes into that file instead of names to be able to do O(1) lookups to enable filtering (many tools use this method). Even if kallsyms saved the size of the function, it does not remove the need of having these place holders. The real solution is to not add a weak function into the ftrace table in the first place. To solve this, the sorttable.c code that sorts the mcount regions during the build is modified to take a "nm -S vmlinux" input, sort it, and any function listed in the mcount_loc section that is not within a boundary of the function list given by nm is considered a weak function and is zeroed out. Note, this does not mean they will remain zero when booting as KASLR will still shift those addresses. To handle this, the entries in the mcount_loc section will be ignored if they are zero or match the kaslr_offset() value. Before: ~# grep __ftrace_invalid_address___ /sys/kernel/tracing/available_filter_functions | wc -l 551 After: ~# grep __ftrace_invalid_address___ /sys/kernel/tracing/available_filter_functions | wc -l 0 Cc: bpf <bpf@vger.kernel.org> Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Masahiro Yamada <masahiroy@kernel.org> Cc: Nathan Chancellor <nathan@kernel.org> Cc: Nicolas Schier <nicolas@fjasle.eu> Cc: Zheng Yejian <zhengyejian1@huawei.com> Cc: Martin Kelly <martin.kelly@crowdstrike.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Will Deacon <will@kernel.org> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Link: https://lore.kernel.org/20250218200022.883095980@goodmis.org Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-02-18Merge tag 'v6.14-rc3' into x86/core, to pick up fixesIngo Molnar
Pick up upstream x86 fixes before applying new patches. Signed-off-by: Ingo Molnar <mingo@kernel.org>
2025-02-18tracing/osnoise: Switch to use hrtimer_setup()Nam Cao
hrtimer_setup() takes the callback function pointer as argument and initializes the timer completely. Replace hrtimer_init() and the open coded initialization of hrtimer::function with the new setup mechanism. Patch was created by using Coccinelle. Signed-off-by: Nam Cao <namcao@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Steven Rostedt (Google) <rostedt@goodmis.org> Link: https://lore.kernel.org/all/ff8e6e11df5f928b2b97619ac847b4fa045376a1.1738746821.git.namcao@linutronix.de
2025-02-15ring-buffer: Update pages_touched to reflect persistent buffer contentSteven Rostedt
The pages_touched field represents the number of subbuffers in the ring buffer that have content that can be read. This is used in accounting of "dirty_pages" and "buffer_percent" to allow the user to wait for the buffer to be filled to a certain amount before it reads the buffer in blocking mode. The persistent buffer never updated this value so it was set to zero, and this accounting would take it as it had no content. This would cause user space to wait for content even though there's enough content in the ring buffer that satisfies the buffer_percent. Cc: stable@vger.kernel.org Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Vincent Donnefort <vdonnefort@google.com> Link: https://lore.kernel.org/20250214123512.0631436e@gandalf.local.home Fixes: 5f3b6e839f3ce ("ring-buffer: Validate boot range memory events") Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-02-15tracing: Do not allow mmap() of persistent ring bufferSteven Rostedt
When trying to mmap a trace instance buffer that is attached to reserve_mem, it would crash: BUG: unable to handle page fault for address: ffffe97bd00025c8 #PF: supervisor read access in kernel mode #PF: error_code(0x0000) - not-present page PGD 2862f3067 P4D 2862f3067 PUD 0 Oops: Oops: 0000 [#1] PREEMPT_RT SMP PTI CPU: 4 UID: 0 PID: 981 Comm: mmap-rb Not tainted 6.14.0-rc2-test-00003-g7f1a5e3fbf9e-dirty #233 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2 04/01/2014 RIP: 0010:validate_page_before_insert+0x5/0xb0 Code: e2 01 89 d0 c3 cc cc cc cc 66 66 2e 0f 1f 84 00 00 00 00 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 0f 1f 44 00 00 <48> 8b 46 08 a8 01 75 67 66 90 48 89 f0 8b 50 34 85 d2 74 76 48 89 RSP: 0018:ffffb148c2f3f968 EFLAGS: 00010246 RAX: ffff9fa5d3322000 RBX: ffff9fa5ccff9c08 RCX: 00000000b879ed29 RDX: ffffe97bd00025c0 RSI: ffffe97bd00025c0 RDI: ffff9fa5ccff9c08 RBP: ffffb148c2f3f9f0 R08: 0000000000000004 R09: 0000000000000004 R10: 0000000000000000 R11: 0000000000000200 R12: 0000000000000000 R13: 00007f16a18d5000 R14: ffff9fa5c48db6a8 R15: 0000000000000000 FS: 00007f16a1b54740(0000) GS:ffff9fa73df00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: ffffe97bd00025c8 CR3: 00000001048c6006 CR4: 0000000000172ef0 Call Trace: <TASK> ? __die_body.cold+0x19/0x1f ? __die+0x2e/0x40 ? page_fault_oops+0x157/0x2b0 ? search_module_extables+0x53/0x80 ? validate_page_before_insert+0x5/0xb0 ? kernelmode_fixup_or_oops.isra.0+0x5f/0x70 ? __bad_area_nosemaphore+0x16e/0x1b0 ? bad_area_nosemaphore+0x16/0x20 ? do_kern_addr_fault+0x77/0x90 ? exc_page_fault+0x22b/0x230 ? asm_exc_page_fault+0x2b/0x30 ? validate_page_before_insert+0x5/0xb0 ? vm_insert_pages+0x151/0x400 __rb_map_vma+0x21f/0x3f0 ring_buffer_map+0x21b/0x2f0 tracing_buffers_mmap+0x70/0xd0 __mmap_region+0x6f0/0xbd0 mmap_region+0x7f/0x130 do_mmap+0x475/0x610 vm_mmap_pgoff+0xf2/0x1d0 ksys_mmap_pgoff+0x166/0x200 __x64_sys_mmap+0x37/0x50 x64_sys_call+0x1670/0x1d70 do_syscall_64+0xbb/0x1d0 entry_SYSCALL_64_after_hwframe+0x77/0x7f The reason was that the code that maps the ring buffer pages to user space has: page = virt_to_page((void *)cpu_buffer->subbuf_ids[s]); And uses that in: vm_insert_pages(vma, vma->vm_start, pages, &nr_pages); But virt_to_page() does not work with vmap()'d memory which is what the persistent ring buffer has. It is rather trivial to allow this, but for now just disable mmap() of instances that have their ring buffer from the reserve_mem option. If an mmap() is performed on a persistent buffer it will return -ENODEV just like it would if the .mmap field wasn't defined in the file_operations structure. Cc: stable@vger.kernel.org Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Vincent Donnefort <vdonnefort@google.com> Link: https://lore.kernel.org/20250214115547.0d7287d3@gandalf.local.home Fixes: 9b7bdf6f6ece6 ("tracing: Have trace_printk not use binary prints if boot buffer") Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-02-14ring-buffer: Validate the persistent meta data subbuf arraySteven Rostedt
The meta data for a mapped ring buffer contains an array of indexes of all the subbuffers. The first entry is the reader page, and the rest of the entries lay out the order of the subbuffers in how the ring buffer link list is to be created. The validator currently makes sure that all the entries are within the range of 0 and nr_subbufs. But it does not check if there are any duplicates. While working on the ring buffer, I corrupted this array, where I added duplicates. The validator did not catch it and created the ring buffer link list on top of it. Luckily, the corruption was only that the reader page was also in the writer path and only presented corrupted data but did not crash the kernel. But if there were duplicates in the writer side, then it could corrupt the ring buffer link list and cause a crash. Create a bitmask array with the size of the number of subbuffers. Then clear it. When walking through the subbuf array checking to see if the entries are within the range, test if its bit is already set in the subbuf_mask. If it is, then there is duplicates and fail the validation. If not, set the corresponding bit and continue. Cc: stable@vger.kernel.org Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Vincent Donnefort <vdonnefort@google.com> Link: https://lore.kernel.org/20250214102820.7509ddea@gandalf.local.home Fixes: c76883f18e59b ("ring-buffer: Add test if range of boot buffer is valid") Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-02-14tracing: Have the error of __tracing_resize_ring_buffer() passed to userSteven Rostedt
Currently if __tracing_resize_ring_buffer() returns an error, the tracing_resize_ringbuffer() returns -ENOMEM. But it may not be a memory issue that caused the function to fail. If the ring buffer is memory mapped, then the resizing of the ring buffer will be disabled. But if the user tries to resize the buffer, it will get an -ENOMEM returned, which is confusing because there is plenty of memory. The actual error returned was -EBUSY, which would make much more sense to the user. Cc: stable@vger.kernel.org Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Vincent Donnefort <vdonnefort@google.com> Link: https://lore.kernel.org/20250213134132.7e4505d7@gandalf.local.home Fixes: 117c39200d9d7 ("ring-buffer: Introducing ring-buffer mapping functions") Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org> Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
2025-02-14ring-buffer: Unlock resize on mmap errorSteven Rostedt
Memory mapping the tracing ring buffer will disable resizing the buffer. But if there's an error in the memory mapping like an invalid parameter, the function exits out without re-enabling the resizing of the ring buffer, preventing the ring buffer from being resized after that. Cc: stable@vger.kernel.org Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Vincent Donnefort <vdonnefort@google.com> Link: https://lore.kernel.org/20250213131957.530ec3c5@gandalf.local.home Fixes: 117c39200d9d7 ("ring-buffer: Introducing ring-buffer mapping functions") Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-02-14x86/ibt: Clean up is_endbr()Peter Zijlstra
Pretty much every caller of is_endbr() actually wants to test something at an address and ends up doing get_kernel_nofault(). Fold the lot into a more convenient helper. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Sami Tolvanen <samitolvanen@google.com> Acked-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: "Masami Hiramatsu (Google)" <mhiramat@kernel.org> Link: https://lore.kernel.org/r/20250207122546.181367417@infradead.org
2025-02-08fgraph: Fix set_graph_notrace with setting TRACE_GRAPH_NOTRACE_BITSteven Rostedt
The code was restructured where the function graph notrace code, that would not trace a function and all its children is done by setting a NOTRACE flag when the function that is not to be traced is hit. There's a TRACE_GRAPH_NOTRACE_BIT which defines the bit in the flags and a TRACE_GRAPH_NOTRACE which is the mask with that bit set. But the restructuring used TRACE_GRAPH_NOTRACE_BIT when it should have used TRACE_GRAPH_NOTRACE. For example: # cd /sys/kernel/tracing # echo set_track_prepare stack_trace_save > set_graph_notrace # echo function_graph > current_tracer # cat trace [..] 0) | __slab_free() { 0) | free_to_partial_list() { 0) | arch_stack_walk() { 0) | __unwind_start() { 0) 0.501 us | get_stack_info(); Where a non filter trace looks like: # echo > set_graph_notrace # cat trace 0) | free_to_partial_list() { 0) | set_track_prepare() { 0) | stack_trace_save() { 0) | arch_stack_walk() { 0) | __unwind_start() { Where the filter should look like: # cat trace 0) | free_to_partial_list() { 0) | _raw_spin_lock_irqsave() { 0) 0.350 us | preempt_count_add(); 0) 0.351 us | do_raw_spin_lock(); 0) 2.440 us | } Cc: stable@vger.kernel.org Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Link: https://lore.kernel.org/20250208001511.535be150@batman.local.home Fixes: b84214890a9bc ("function_graph: Move graph notrace bit to shadow stack global var") Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>