summaryrefslogtreecommitdiff
path: root/kernel
AgeCommit message (Collapse)Author
2025-04-22PM: sleep: Use two lines for "Restarting..." / "done" messagesAndrew Sayers
Other messages are occasionally printed between these two, for example: [203104.106534] Restarting tasks ... [203104.106559] mei_hdcp 0000:00:16.0-b638ab7e-94e2-4ea2-a552-d1c54b627f04: bound 0000:00:02.0 (ops i915_hdcp_ops [i915]) [203104.112354] done. This seems to be a timing issue, seen in two of the eleven hibernation exits in my current `dmesg` output. When printed on its own, the "done" message has the default log level. This makes the output of `dmesg --level=warn` quite misleading. Add enough context for the "done" messages to make sense on their own, and use the same log level for all messages. Change the messages to "<event>..." / "Done <event>.", unlike a449dfbfc089 which uses "<event>..." / "<event> completed.". Front-loading the unique part of the message makes it easier to scan the log, and reduces ambiguity for users who aren't confident in their English comprehension. Reviewed-by: Lucas De Marchi <lucas.demarchi@intel.com> Signed-off-by: Andrew Sayers <kernel.org@pileofstuff.org> Link: https://patch.msgid.link/20250411152632.2806038-1-kernel.org@pileofstuff.org Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2025-04-21Merge tag 'sched_ext-for-6.15-rc3-fixes' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext Pull sched_ext fixes from Tejun Heo: - Use kvzalloc() so that large exit_dump buffer allocations don't fail easily - Remove cpu.weight / cpu.idle unimplemented warnings which are more annoying than helpful. This makes SCX_OPS_HAS_CGROUP_WEIGHT unnecessary. Mark it for deprecation * tag 'sched_ext-for-6.15-rc3-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext: sched_ext: Mark SCX_OPS_HAS_CGROUP_WEIGHT for deprecation sched_ext: Remove cpu.weight / cpu.idle unimplemented warnings sched_ext: Use kvzalloc for large exit_dump allocation
2025-04-21Merge tag 'cgroup-for-6.15-rc3-fixes' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup Pull cgroup fixes from Tejun Heo: - Fix compilation in CONFIG_LOCKDEP && !CONFIG_PROVE_RCU configurations - Allow "cpuset_v2_mode" mount option for "cpuset" filesystem type to make life easier for android * tag 'cgroup-for-6.15-rc3-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: cgroup/cpuset-v1: Add missing support for cpuset_v2_mode cgroup: Fix compilation issue due to cgroup_mutex not being exported
2025-04-21Merge tag 'for-netdev' of ↵Jakub Kicinski
https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next Martin KaFai Lau says: ==================== pull-request: bpf-next 2025-04-17 We've added 12 non-merge commits during the last 9 day(s) which contain a total of 18 files changed, 1748 insertions(+), 19 deletions(-). The main changes are: 1) bpf qdisc support, from Amery Hung. A qdisc can be implemented in bpf struct_ops programs and can be used the same as other existing qdiscs in the "tc qdisc" command. 2) Add xsk tail adjustment tests, from Tushar Vyavahare. * tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: selftests/bpf: Test attaching bpf qdisc to mq and non root selftests/bpf: Add a bpf fq qdisc to selftest selftests/bpf: Add a basic fifo qdisc test libbpf: Support creating and destroying qdisc bpf: net_sched: Disable attaching bpf qdisc to non root bpf: net_sched: Support updating bstats bpf: net_sched: Add a qdisc watchdog timer bpf: net_sched: Add basic bpf qdisc kfuncs bpf: net_sched: Support implementation of Qdisc_ops in bpf bpf: Prepare to reuse get_ctx_arg_idx selftests/xsk: Add tail adjustment tests and support check selftests/xsk: Add packet stream replacement function ==================== Link: https://patch.msgid.link/20250417184338.3152168-1-martin.lau@linux.dev Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-21cgroup: fix pointer check in css_rstat_init()JP Kobryn
In css_rstat_init() allocations are done for the cgroup's pointers rstat_cpu and rstat_base_cpu. Make sure the allocation checks are consistent with what they are allocating. Signed-off-by: JP Kobryn <inwardvessel@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-04-21Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf after rc3Alexei Starovoitov
Cross-merge bpf and other fixes after downstream PRs. No conflicts. Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-04-19Merge tag 'vfs-6.15-rc3.fixes.2' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull vfs fixes from Christian Brauner: - Revert the hfs{plus} deprecation warning that's also included in this pull request. The commit introducing the deprecation warning resides rather early in this branch. So simply dropping it would've rebased all other commits which I decided to avoid. Hence the revert in the same branch [ Background - the deprecation warning discussion resulted in people stepping up, and so hfs{plus} will have a maintainer taking care of it after all.. - Linus ] - Switch CONFIG_SYSFS_SYCALL default to n and decouple from CONFIG_EXPERT - Fix an audit bug caused by changes to our kernel path lookup helpers this cycle. Audit needs the parent path even if the dentry it tried to look up is negative - Ensure that the kernel path lookup helpers leave the passed in path argument clean when they return an error. This is consistent with all our other helpers - Ensure that vfs_getattr_nosec() calls bdev_statx() so the relevant information is available to kernel consumers as well - Don't set a timer and call schedule() if the timer will expire immediately in epoll - Make netfs lookup tables with __nonstring * tag 'vfs-6.15-rc3.fixes.2' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: Revert "hfs{plus}: add deprecation warning" fs: move the bdex_statx call to vfs_getattr_nosec netfs: Mark __nonstring lookup tables eventpoll: Set epoll timeout if it's in the future fs: ensure that *path_locked*() helpers leave passed path pristine fs: add kern_path_locked_negative() hfs{plus}: add deprecation warning Kconfig: switch CONFIG_SYSFS_SYCALL default to n
2025-04-19Merge tag 'trace-v6.15-rc2' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace Pull tracing fixes from Steven Rostedt: - Initialize hash variables in ftrace subops logic The fix that simplified the ftrace subops logic opened a path where some variables could be used without being initialized, and done subtly where the compiler did not catch it. Initialize those variables to the EMPTY_HASH, which is the default hash. - Reinitialize the hash pointers after they are freed Some of the hash pointers in the subop logic were freed but may still be referenced later. To prevent use-after-free bugs, initialize them back to the EMPTY_HASH. - Free the ftrace hashes when they are replaced The fix that simplified the subops logic updated some hash pointers, but left the original hash that they were pointing to where they are no longer used. This caused a memory leak. Free the hashes that are pointed to by the pointers when they are replaced. - Fix size initialization of ftrace direct function hash The ftrace direct function hash used by BPF initialized the hash size incorrectly. It checked the size of items to a hard coded 32, which made the hash bit size of 5. The hash size is supposed to be limited by the bit size of the hash, as the bitmask is allowed to be greater than 5. Rework the size check to first pass the number of elements to fls() and then compare that to FTRACE_HASH_MAX_BITS before allocating the hash. - Fix format output of ftrace_graph_ent_entry event The field depth of the ftrace_graph_ent_entry event is of size 4 but the output showed it as unsigned long and use "%lu". Change it to unsigned int and use "%u" in the print format that is displayed to user space. - Fix the trace event filter on strings Events can be filtered on numbers or string values. The return value checked from strncpy_from_kernel_nofault() and strncpy_from_user_nofault() was used to determine if reading the strings would fault or not. It would return fault if the value was non zero, which is basically meant that it was always considering the read as a fault. - Add selftest to test trace event string filtering In order to catch the breakage of the string filtering, add a self test to make sure that it continues to work. * tag 'trace-v6.15-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: tracing: selftests: Add testing a user string to filters tracing: Fix filter string testing ftrace: Fix type of ftrace_graph_ent_entry.depth ftrace: fix incorrect hash size in register_ftrace_direct() ftrace: Free ftrace hashes after they are replaced in the subops code ftrace: Reinitialize hash to EMPTY_HASH after freeing ftrace: Initialize variables for ftrace_startup/shutdown_subops()
2025-04-18sched_ext: add helper for refill task with default sliceHonglei Wang
Add helper for refilling task with default slice and event statistics accordingly. Signed-off-by: Honglei Wang <jameshongleiwang@126.com> Acked-by: Changwoo Min <changwoo@igalia.com> Acked-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-04-18sched_ext: change the variable name for slice refill eventHonglei Wang
SCX_EV_ENQ_SLICE_DFL gives the impression that the event only occurs when the tasks were enqueued, which seems not accurate. What it actually means is the refilling with defalt slice, and this can occur either when enqueue or pick_task. Let's change the variable to SCX_EV_REFILL_SLICE_DFL. Signed-off-by: Honglei Wang <jameshongleiwang@126.com> Acked-by: Changwoo Min <changwoo@igalia.com> Acked-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-04-18vhost_task: fix vhost_task_create() documentationStefano Garzarella
Commit cb380909ae3b ("vhost: return task creation error instead of NULL") changed the return value of vhost_task_create(), but did not update the documentation. Reflect the change in the documentation: on an error, vhost_task_create() returns an ERR_PTR() and no longer NULL. Signed-off-by: Stefano Garzarella <sgarzare@redhat.com> Message-Id: <20250327124435.142831-1-sgarzare@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2025-04-17tracing: Fix filter string testingSteven Rostedt
The filter string testing uses strncpy_from_kernel/user_nofault() to retrieve the string to test the filter against. The if() statement was incorrect as it considered 0 as a fault, when it is only negative that it faulted. Running the following commands: # cd /sys/kernel/tracing # echo "filename.ustring ~ \"/proc*\"" > events/syscalls/sys_enter_openat/filter # echo 1 > events/syscalls/sys_enter_openat/enable # ls /proc/$$/maps # cat trace Would produce nothing, but with the fix it will produce something like: ls-1192 [007] ..... 8169.828333: sys_openat(dfd: ffffffffffffff9c, filename: 7efc18359904, flags: 80000, mode: 0) Link: https://lore.kernel.org/all/CAEf4BzbVPQ=BjWztmEwBPRKHUwNfKBkS3kce-Rzka6zvbQeVpg@mail.gmail.com/ Cc: stable@vger.kernel.org Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andrew Morton <akpm@linux-foundation.org> Link: https://lore.kernel.org/20250417183003.505835fb@gandalf.local.home Fixes: 77360f9bbc7e5 ("tracing: Add test for user space strings when filtering on string pointers") Reported-by: Andrii Nakryiko <andrii.nakryiko@gmail.com> Reported-by: Mykyta Yatsenko <mykyta.yatsenko5@gmail.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-04-17Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski
Cross-merge networking fixes after downstream PR (net-6.15-rc3). No conflicts. Adjacent changes: tools/net/ynl/pyynl/ynl_gen_c.py 4d07bbf2d456 ("tools: ynl-gen: don't declare loop iterator in place") 7e8ba0c7de2b ("tools: ynl: don't use genlmsghdr in classic netlink") Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-17ftrace: Fix type of ftrace_graph_ent_entry.depthIlya Leoshkevich
ftrace_graph_ent.depth is int, but ftrace_graph_ent_entry.depth is unsigned long. This confuses trace-cmd on 64-bit big-endian systems and makes it print a huge amount of spaces. Fix this by using unsigned int, which has a matching size, instead. Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Link: https://lore.kernel.org/20250412221847.17310-2-iii@linux.ibm.com Fixes: ff5c9c576e75 ("ftrace: Add support for function argument to graph tracer") Signed-off-by: Ilya Leoshkevich <iii@linux.ibm.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-04-17ftrace: fix incorrect hash size in register_ftrace_direct()Menglong Dong
The maximum of the ftrace hash bits is made fls(32) in register_ftrace_direct(), which seems illogical. So, we fix it by making the max hash bits FTRACE_HASH_MAX_BITS instead. Link: https://lore.kernel.org/20250413014444.36724-1-dongml2@chinatelecom.cn Fixes: d05cb470663a ("ftrace: Fix modification of direct_function hash while in use") Signed-off-by: Menglong Dong <dongml2@chinatelecom.cn> Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-04-17ftrace: Free ftrace hashes after they are replaced in the subops codeSteven Rostedt
The subops processing creates new hashes when adding and removing subops. There were some places that the old hashes that were replaced were not freed and this caused some memory leaks. Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Link: https://lore.kernel.org/20250417135939.245b128d@gandalf.local.home Fixes: 0ae6b8ce200d ("ftrace: Fix accounting of subop hashes") Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-04-17ftrace: Reinitialize hash to EMPTY_HASH after freeingSteven Rostedt
There's several locations that free a ftrace hash pointer but may be referenced again. Reset them to EMPTY_HASH so that a u-a-f bug doesn't happen. Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Link: https://lore.kernel.org/20250417110933.20ab718b@gandalf.local.home Fixes: 0ae6b8ce200d ("ftrace: Fix accounting of subop hashes") Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-04-17ftrace: Initialize variables for ftrace_startup/shutdown_subops()Steven Rostedt
The reworking to fix and simplify the ftrace_startup_subops() and the ftrace_shutdown_subops() made it possible for the filter_hash and notrace_hash variables to be used uninitialized in a way that the compiler did not catch it. Initialize both filter_hash and notrace_hash to the EMPTY_HASH as that is what they should be if they never are used. Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Link: https://lore.kernel.org/20250417104017.3aea66c2@gandalf.local.home Reported-by: Venkat Rao Bagalkote <venkat88@linux.ibm.com> Tested-by: Venkat Rao Bagalkote <venkat88@linux.ibm.com> Fixes: 0ae6b8ce200d ("ftrace: Fix accounting of subop hashes") Closes: https://lore.kernel.org/all/1db64a42-626d-4b3a-be08-c65e47333ce2@linux.ibm.com/ Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-04-17bpf: Prepare to reuse get_ctx_arg_idxAmery Hung
Rename get_ctx_arg_idx to bpf_ctx_arg_idx, and allow others to call it. No functional change. Signed-off-by: Amery Hung <ameryhung@gmail.com> Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org> Acked-by: Toke Høiland-Jørgensen <toke@redhat.com> Link: https://patch.msgid.link/20250409214606.2000194-2-ameryhung@gmail.com
2025-04-17cgroup/cpuset-v1: Add missing support for cpuset_v2_modeT.J. Mercier
Android has mounted the v1 cpuset controller using filesystem type "cpuset" (not "cgroup") since 2015 [1], and depends on the resulting behavior where the controller name is not added as a prefix for cgroupfs files. [2] Later, a problem was discovered where cpu hotplug onlining did not affect the cpuset/cpus files, which Android carried an out-of-tree patch to address for a while. An attempt was made to upstream this patch, but the recommendation was to use the "cpuset_v2_mode" mount option instead. [3] An effort was made to do so, but this fails with "cgroup: Unknown parameter 'cpuset_v2_mode'" because commit e1cba4b85daa ("cgroup: Add mount flag to enable cpuset to use v2 behavior in v1 cgroup") did not update the special cased cpuset_mount(), and only the cgroup (v1) filesystem type was updated. Add parameter parsing to the cpuset filesystem type so that cpuset_v2_mode works like the cgroup filesystem type: $ mkdir /dev/cpuset $ mount -t cpuset -ocpuset_v2_mode none /dev/cpuset $ mount|grep cpuset none on /dev/cpuset type cgroup (rw,relatime,cpuset,noprefix,cpuset_v2_mode,release_agent=/sbin/cpuset_release_agent) [1] https://cs.android.com/android/_/android/platform/system/core/+/b769c8d24fd7be96f8968aa4c80b669525b930d3 [2] https://cs.android.com/android/platform/superproject/main/+/main:system/core/libprocessgroup/setup/cgroup_map_write.cpp;drc=2dac5d89a0f024a2d0cc46a80ba4ee13472f1681;l=192 [3] https://lore.kernel.org/lkml/f795f8be-a184-408a-0b5a-553d26061385@redhat.com/T/ Fixes: e1cba4b85daa ("cgroup: Add mount flag to enable cpuset to use v2 behavior in v1 cgroup") Signed-off-by: T.J. Mercier <tjmercier@google.com> Acked-by: Waiman Long <longman@redhat.com> Reviewed-by: Kamalesh Babulal <kamalesh.babulal@oracle.com> Acked-by: Michal Koutný <mkoutny@suse.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-04-17cgroup: Fix compilation issue due to cgroup_mutex not being exportedgaoxu
When adding folio_memcg function call in the zram module for Android16-6.12, the following error occurs during compilation: ERROR: modpost: "cgroup_mutex" [../soc-repo/zram.ko] undefined! This error is caused by the indirect call to lockdep_is_held(&cgroup_mutex) within folio_memcg. The export setting for cgroup_mutex is controlled by the CONFIG_PROVE_RCU macro. If CONFIG_LOCKDEP is enabled while CONFIG_PROVE_RCU is not, this compilation error will occur. To resolve this issue, add a parallel macro CONFIG_LOCKDEP control to ensure cgroup_mutex is properly exported when needed. Signed-off-by: gao xu <gaoxu2@honor.com> Acked-by: Michal Koutný <mkoutny@suse.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-04-17cpufreq/sched: Set need_freq_update in ignore_dl_rate_limit()Rafael J. Wysocki
Notice that ignore_dl_rate_limit() need not piggy back on the limits_changed handling to achieve its goal (which is to enforce a frequency update before its due time). Namely, if sugov_should_update_freq() is updated to check sg_policy->need_freq_update and return 'true' if it is set when sg_policy->limits_changed is not set, ignore_dl_rate_limit() may set the former directly instead of setting the latter, so it can avoid hitting the memory barrier in sugov_should_update_freq(). Update the code accordingly. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Reviewed-by: Christian Loehle <christian.loehle@arm.com> Link: https://patch.msgid.link/10666429.nUPlyArG6x@rjwysocki.net
2025-04-17cpufreq/sched: Explicitly synchronize limits_changed flag handlingRafael J. Wysocki
The handling of the limits_changed flag in struct sugov_policy needs to be explicitly synchronized to ensure that cpufreq policy limits updates will not be missed in some cases. Without that synchronization it is theoretically possible that the limits_changed update in sugov_should_update_freq() will be reordered with respect to the reads of the policy limits in cpufreq_driver_resolve_freq() and in that case, if the limits_changed update in sugov_limits() clobbers the one in sugov_should_update_freq(), the new policy limits may not take effect for a long time. Likewise, the limits_changed update in sugov_limits() may theoretically get reordered with respect to the updates of the policy limits in cpufreq_set_policy() and if sugov_should_update_freq() runs between them, the policy limits change may be missed. To ensure that the above situations will not take place, add memory barriers preventing the reordering in question from taking place and add READ_ONCE() and WRITE_ONCE() annotations around all of the limits_changed flag updates to prevent the compiler from messing up with that code. Fixes: 600f5badb78c ("cpufreq: schedutil: Don't skip freq update when limits change") Cc: 5.3+ <stable@vger.kernel.org> # 5.3+ Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Reviewed-by: Christian Loehle <christian.loehle@arm.com> Link: https://patch.msgid.link/3376719.44csPzL39Z@rjwysocki.net
2025-04-17cpufreq/sched: Fix the usage of CPUFREQ_NEED_UPDATE_LIMITSRafael J. Wysocki
Commit 8e461a1cb43d ("cpufreq: schedutil: Fix superfluous updates caused by need_freq_update") modified sugov_should_update_freq() to set the need_freq_update flag only for drivers with CPUFREQ_NEED_UPDATE_LIMITS set, but that flag generally needs to be set when the policy limits change because the driver callback may need to be invoked for the new limits to take effect. However, if the return value of cpufreq_driver_resolve_freq() after applying the new limits is still equal to the previously selected frequency, the driver callback needs to be invoked only in the case when CPUFREQ_NEED_UPDATE_LIMITS is set (which means that the driver specifically wants its callback to be invoked every time the policy limits change). Update the code accordingly to avoid missing policy limits changes for drivers without CPUFREQ_NEED_UPDATE_LIMITS. Fixes: 8e461a1cb43d ("cpufreq: schedutil: Fix superfluous updates caused by need_freq_update") Closes: https://lore.kernel.org/lkml/Z_Tlc6Qs-tYpxWYb@linaro.org/ Reported-by: Stephan Gerhold <stephan.gerhold@linaro.org> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Reviewed-by: Christian Loehle <christian.loehle@arm.com> Link: https://patch.msgid.link/3010358.e9J7NaK4W3@rjwysocki.net
2025-04-17perf/core: Fix event timekeeping mergePeter Zijlstra
Due to an oversight in merging: da916e96e2de ("perf: Make perf_pmu_unregister() useable") on top of: a3c3c66670ce ("perf/core: Fix child_total_time_enabled accounting bug at task exit") the timekeeping fix from this latter patch got undone. Redo it. Fixes: da916e96e2de ("perf: Make perf_pmu_unregister() useable") Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lkml.kernel.org/r/20250417080815.GI38216@noisy.programming.kicks-ass.net
2025-04-17perf/core: Fix event->parent life-time issuePeter Zijlstra
Due to an oversight in merging: da916e96e2de ("perf: Make perf_pmu_unregister() useable") on top of: 56799bc03565 ("perf: Fix hang while freeing sigtrap event") .. it is now possible to hit put_event(EVENT_TOMBSTONE), which makes the computer sad. This also means that for the event->parent == EVENT_TOMBSTONE, the put_event() matching inherit_event() has gone missing. Previously this was done in perf_event_release_kernel() after calling perf_remove_from_context(), but with it delegated to put_event(), this case is now entirely missed, leading to leaks. Fixes: da916e96e2de ("perf: Make perf_pmu_unregister() useable") Reported-by: kernel test robot <oliver.sang@intel.com> Tested-by: kernel test robot <oliver.sang@intel.com> Tested-by: James Clark <james.clark@linaro.org> Tested-by: Venkat Rao Bagalkote <venkat88@linux.ibm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Closes: https://lore.kernel.org/oe-lkp/202504131701.941039cd-lkp@intel.com Link: https://lkml.kernel.org/r/20250415131446.GN5600@noisy.programming.kicks-ass.net
2025-04-17perf/core: Fix put_ctx() orderingFrederic Weisbecker
So there are three situations: * If perf_event_free_task() has removed all the children from the parent list before perf_event_release_kernel() got a chance to even iterate them, then it's all good as there is no get_ctx() pending. * If perf_event_release_kernel() iterates a child event, but it gets freed meanwhile by perf_event_free_task() while the mutexes are temporarily unlocked, it's all good because while locking again the ctx mutex, perf_event_release_kernel() observes TASK_TOMBSTONE. * But if perf_event_release_kernel() frees the child event before perf_event_free_task() got a chance, we may face this scenario: perf_event_release_kernel() perf_event_free_task() -------------------------- ------------------------ mutex_lock(&event->child_mutex) get_ctx(child->ctx) mutex_unlock(&event->child_mutex) mutex_lock(ctx->mutex) mutex_lock(&event->child_mutex) perf_remove_from_context(child) mutex_unlock(&event->child_mutex) mutex_unlock(ctx->mutex) // This lock acquires ctx->refcount == 2 // visibility mutex_lock(ctx->mutex) ctx->task = TASK_TOMBSTONE mutex_unlock(ctx->mutex) wait_var_event() // enters prepare_to_wait() since // ctx->refcount == 2 // is guaranteed to be seen set_current_state(TASK_INTERRUPTIBLE) smp_mb() if (ctx->refcount != 1) schedule() put_ctx() // NOT fully ordered! Only RELEASE semantics refcount_dec_and_test() atomic_fetch_sub_release() // So TASK_TOMBSTONE is not guaranteed to be seen if (ctx->task == TASK_TOMBSTONE) wake_up_var() Basically it's a broken store buffer: perf_event_release_kernel() perf_event_free_task() -------------------------- ------------------------ ctx->task = TASK_TOMBSTONE smp_store_release(&ctx->refcount, ctx->refcount - 1) smp_mb() READ_ONCE(ctx->refcount) READ_ONCE(ctx->task) So we need a smp_mb__after_atomic() before looking at ctx->task. Fixes: 59f3aa4a3ee2 ("perf: Simplify perf_event_free_task() wait") Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lkml.kernel.org/r/Z_ZvmEhjkAhplCBE@localhost.localdomain
2025-04-17perf/core: Fix perf-stat / read()Peter Zijlstra
In the zeal to adjust all event->state checks to include the new REVOKED state, one adjustment was made in error. Notably it resulted in read() on the perf filedesc to stop working for any state lower than ERROR, specifically EXIT. This leads to problems with (among others) perf-stat, which wants to read the counts after a program has finished execution. Fixes: da916e96e2de ("perf: Make perf_pmu_unregister() useable") Reported-by: "Mi, Dapeng" <dapeng1.mi@linux.intel.com> Reported-by: James Clark <james.clark@linaro.org> Tested-by: James Clark <james.clark@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lkml.kernel.org/r/77036114-8723-4af9-a068-1d535f4e2e81@linaro.org Link: https://lore.kernel.org/r/20250417080725.GH38216@noisy.programming.kicks-ass.net
2025-04-17Merge branch 'perf/urgent' into perf/core, to pick up fixesIngo Molnar
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2025-04-16sched/fair: Adhere to place_entity() constraintsPeter Zijlstra
Mike reports that commit 6d71a9c61604 ("sched/fair: Fix EEVDF entity placement bug causing scheduling lag") relies on commit 4423af84b297 ("sched/fair: optimize the PLACE_LAG when se->vlag is zero") to not trip a WARN in place_entity(). What happens is that the lag of the very last entity is 0 per definition -- the average of one element matches the value of that element. Therefore place_entity() will match the condition skipping the lag adjustment: if (sched_feat(PLACE_LAG) && cfs_rq->nr_queued && se->vlag) { Without the 'se->vlag' condition -- it will attempt to adjust the zero lag even though we're inserting into an empty tree. Notably, we should have failed the 'cfs_rq->nr_queued' condition, but don't because they didn't get updated. Additionally, move update_load_add() after placement() as is consistent with other place_entity() users -- this change is non-functional, place_entity() does not use cfs_rq->load. Fixes: 6d71a9c61604 ("sched/fair: Fix EEVDF entity placement bug causing scheduling lag") Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reported-by: Mike Galbraith <efault@gmx.de> Signed-off-by: "Peter Zijlstra (Intel)" <peterz@infradead.org> Signed-off-by: Mike Galbraith <efault@gmx.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/c216eb4ef0e0e0029c600aefc69d56681cee5581.camel@gmx.de
2025-04-16sched/debug: Print the local group's asym_prefer_cpuK Prateek Nayak
Add a file to read local group's "asym_prefer_cpu" from debugfs. This information was useful when debugging issues where "asym_prefer_cpu" was incorrectly set to a CPU with a lower asym priority. Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20250409053446.23367-5-kprateek.nayak@amd.com
2025-04-16sched/topology: Introduce sched_update_asym_prefer_cpu()K Prateek Nayak
A subset of AMD Processors supporting Preferred Core Rankings also feature the ability to dynamically switch these rankings at runtime to bias load balancing towards or away from the LLC domain with larger cache. To support dynamically updating "sg->asym_prefer_cpu" without needing to rebuild the sched domain, introduce sched_update_asym_prefer_cpu() which recomutes the "asym_prefer_cpu" when the core-ranking of a CPU changes. sched_update_asym_prefer_cpu() swaps the "sg->asym_prefer_cpu" with the CPU whose ranking has changed if the new ranking is greater than that of the "asym_prefer_cpu". If CPU whose ranking has changed is the current "asym_prefer_cpu", it scans the CPUs of the sched groups to find the new "asym_prefer_cpu" and sets it accordingly. get_group() for non-overlapping sched domains returns the sched group for the first CPU in the sched_group_span() which ensures all CPUs in the group see the updated value of "asym_prefer_cpu". Overlapping groups are allocated differently and will require moving the "asym_prefer_cpu" to "sg->sgc" but since the current implementations do not set "SD_ASYM_PACKING" at NUMA domains, skip additional indirection and place a SCHED_WARN_ON() to alert any future users. Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20250409053446.23367-3-kprateek.nayak@amd.com
2025-04-16sched/fair: Use READ_ONCE() to read sg->asym_prefer_cpuK Prateek Nayak
Subsequent commits add the support to dynamically update the sched_group struct's "asym_prefer_cpu" member from a remote CPU. Use READ_ONCE() when reading the "sg->asym_prefer_cpu" to ensure load balancer always reads the latest value. Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20250409053446.23367-2-kprateek.nayak@amd.com
2025-04-16kernel: globalize lookup_or_create_module_kobject()Shyam Saini
lookup_or_create_module_kobject() is marked as static and __init, to make it global drop static keyword. Since this function can be called from non-init code, use __modinit instead of __init, __modinit marker will make it __init if CONFIG_MODULES is not defined. Suggested-by: Rasmus Villemoes <linux@rasmusvillemoes.dk> Signed-off-by: Shyam Saini <shyamsaini@linux.microsoft.com> Link: https://lore.kernel.org/r/20250227184930.34163-4-shyamsaini@linux.microsoft.com Signed-off-by: Petr Pavlu <petr.pavlu@suse.com>
2025-04-16kernel: refactor lookup_or_create_module_kobject()Shyam Saini
In the unlikely event of the allocation failing, it is better to let the machine boot with a not fully populated sysfs than to kill it with this BUG_ON(). All callers are already prepared for lookup_or_create_module_kobject() returning NULL. This is also preparation for calling this function from non __init code, where using BUG_ON for allocation failure handling is not acceptable. Since we are here, also start using IS_ENABLED instead of #ifdef construct. Suggested-by: Thomas Weißschuh <linux@weissschuh.net> Suggested-by: Rasmus Villemoes <linux@rasmusvillemoes.dk> Signed-off-by: Shyam Saini <shyamsaini@linux.microsoft.com> Link: https://lore.kernel.org/r/20250227184930.34163-3-shyamsaini@linux.microsoft.com Signed-off-by: Petr Pavlu <petr.pavlu@suse.com>
2025-04-16kernel: param: rename locate_module_kobjectShyam Saini
The locate_module_kobject() function looks up an existing module_kobject for a given module name. If it cannot find the corresponding module_kobject, it creates one for the given name. This commit renames locate_module_kobject() to lookup_or_create_module_kobject() to better describe its operations. This doesn't change anything functionality wise. Suggested-by: Rasmus Villemoes <linux@rasmusvillemoes.dk> Signed-off-by: Shyam Saini <shyamsaini@linux.microsoft.com> Link: https://lore.kernel.org/r/20250227184930.34163-2-shyamsaini@linux.microsoft.com Signed-off-by: Petr Pavlu <petr.pavlu@suse.com>
2025-04-16genirq/irqdesc: Use sysfs_emit() to instead of s*printf()Andy Shevchenko
Follow the advice of the Documentation/filesystems/sysfs.rst that show() should only use sysfs_emit() or sysfs_emit_at() when formatting the value to be returned to user space. Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/all/20250416101651.2128688-1-andriy.shevchenko@linux.intel.com
2025-04-15fs: add kern_path_locked_negative()Christian Brauner
The audit code relies on the fact that kern_path_locked() returned a path even for a negative dentry. If it doesn't find a valid dentry it immediately calls: audit_find_parent(d_backing_inode(parent_path.dentry)); which assumes that parent_path.dentry is still valid. But it isn't since kern_path_locked() has been changed to path_put() also for a negative dentry. Fix this by adding a helper that implements the required audit semantics and allows us to fix the immediate bleeding. We can find a unified solution for this afterwards. Link: https://lore.kernel.org/20250414-rennt-wimmeln-f186c3a780f1@brauner Fixes: 1c3cb50b58c3 ("VFS: change kern_path_locked() and user_path_locked_at() to never return negative dentry") Reported-and-tested-by: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-04-14release_task: kill the no longer needed get/put_pid(thread_pid)Oleg Nesterov
After the commit 7903f907a2260 ("pid: perform free_pid() calls outside of tasklist_lock") __unhash_process() -> detach_pid() no longer calls free_pid(), proc_flush_pid() can just use p->thread_pid without the now pointless get_pid() + put_pid(). Signed-off-by: Oleg Nesterov <oleg@redhat.com> Link: https://lore.kernel.org/20250411121857.GA10550@redhat.com Reviewed-by: Mateusz Guzik <mjguzik@gmail.com> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-04-14workqueue: Better document teardown for delayed_workPhilipp Stanner
destroy_workqueue() does not ensure that non-pending work submitted with queue_delayed_work() gets cancelled. The caller has to ensure that manually. Add this information about delayed_work in destroy_workqueue()'s docstring. Add a TODO for destroy_workqueue() to wait for all delayed_work. Signed-off-by: Philipp Stanner <phasta@kernel.org> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-04-14dma/mapping.c: dev_dbg support for dma_addressing_limitedBalbir Singh
In the debug and resolution of an issue involving forced use of bounce buffers, 7170130e4c72 ("x86/mm/init: Handle the special case of device private pages in add_pages(), to not increase max_pfn and trigger dma_addressing_limited() bounce buffers"). It would have been easier to debug the issue if dma_addressing_limited() had debug information about the device not being able to address all of memory and thus forcing all accesses through a bounce buffer. Please see[2] Implement dev_dbg to debug the potential use of bounce buffers when we hit the condition. When swiotlb is used, dma_addressing_limited() is used to determine the size of maximum dma buffer size in dma_direct_max_mapping_size(). The debug prints could be triggered in that check as well (when enabled). Link: https://lore.kernel.org/lkml/20250401000752.249348-1-balbirs@nvidia.com/ [1] Link: https://lore.kernel.org/lkml/20250310112206.4168-1-spasswolf@web.de/ [2] Cc: Marek Szyprowski <m.szyprowski@samsung.com> Cc: Robin Murphy <robin.murphy@arm.com> Cc: "Christian König" <christian.koenig@amd.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Kees Cook <kees@kernel.org> Cc: Bjorn Helgaas <bhelgaas@google.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Alex Deucher <alexander.deucher@amd.com> Cc: Bert Karwatzki <spasswolf@web.de> Cc: Christoph Hellwig <hch@infradead.org> Signed-off-by: Balbir Singh <balbirs@nvidia.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com> Link: https://lore.kernel.org/r/20250414113752.3298276-1-balbirs@nvidia.com
2025-04-14sysctl: move u8 register test to lib/test_sysctl.cJoel Granados
If the test added in commit b5ffbd139688 ("sysctl: move the extra1/2 boundary check of u8 to sysctl_check_table_array") is run as a module, a lingering reference to the module is left behind, and a 'sysctl -a' leads to a panic. To reproduce CONFIG_KUNIT=y CONFIG_SYSCTL_KUNIT_TEST=m Then run these commands: modprobe sysctl-test rmmod sysctl-test sysctl -a The panic varies but generally looks something like this: BUG: unable to handle page fault for address: ffffa4571c0c7db4 #PF: supervisor read access in kernel mode #PF: error_code(0x0000) - not-present page PGD 100000067 P4D 100000067 PUD 100351067 PMD 114f5e067 PTE 0 Oops: Oops: 0000 [#1] SMP NOPTI ... ... ... RIP: 0010:proc_sys_readdir+0x166/0x2c0 ... ... ... Call Trace: <TASK> iterate_dir+0x6e/0x140 __se_sys_getdents+0x6e/0x100 do_syscall_64+0x70/0x150 entry_SYSCALL_64_after_hwframe+0x76/0x7e Move the test to lib/test_sysctl.c where the registration reference is handled on module exit Fixes: b5ffbd139688 ("sysctl: move the extra1/2 boundary check of u8 to sysctl_check_table_array") Reviewed-by: Kees Cook <kees@kernel.org> Signed-off-by: Joel Granados <joel.granados@kernel.org>
2025-04-12Merge tag 'trace-v6.15-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace Pull tracing fixes from Steven Rostedt: - Hide get_vm_area() from MMUless builds The function get_vm_area() is not defined when CONFIG_MMU is not defined. Hide that function within #ifdef CONFIG_MMU. - Fix output of synthetic events when they have dynamic strings The print fmt of the synthetic event's format file use to have "%.*s" for dynamic size strings even though the user space exported arguments had only __get_str() macro that provided just a nul terminated string. This was fixed so that user space could parse this properly. But the reason that it had "%.*s" was because internally it provided the maximum size of the string as one of the arguments. The fix that replaced "%.*s" with "%s" caused the trace output (when the kernel reads the event) to write "(efault)" as it would now read the length of the string as "%s". As the string provided is always nul terminated, there's no reason for the internal code to use "%.*s" anyway. Just remove the length argument to match the "%s" that is now in the format. - Fix the ftrace subops hash logic of the manager ops hash The function_graph uses the ftrace subops code. The subops code is a way to have a single ftrace_ops registered with ftrace to determine what functions will call the ftrace_ops callback. More than one user of function graph can register a ftrace_ops with it. The function graph infrastructure will then add this ftrace_ops as a subops with the main ftrace_ops it registers with ftrace. This is because the functions will always call the function graph callback which in turn calls the subops ftrace_ops callbacks. The main ftrace_ops must add a callback to all the functions that the subops want a callback from. When a subops is registered, it will update the main ftrace_ops hash to include the functions it wants. This is the logic that was broken. The ftrace_ops hash has a "filter_hash" and a "notrace_hash" where all the functions in the filter_hash but not in the notrace_hash are attached by ftrace. The original logic would have the main ftrace_ops filter_hash be a union of all the subops filter_hashes and the main notrace_hash would be a intersect of all the subops filter hashes. But this was incorrect because the notrace hash depends on the filter_hash it is associated to and not the union of all filter_hashes. Instead, when a subops is added, just include all the functions of the subops hash that are in its filter_hash but not in its notrace_hash. The main subops hash should not use its notrace hash, unless all of its subops hashes have an empty filter_hash (which means to attach to all functions), and then, and only then, the main ftrace_ops notrace hash can be the intersect of all the subops hashes. This not only fixes the bug, but also simplifies the code. - Add a selftest to better test the subops filtering Add a selftest that would catch the bug fixed by the above change. - Fix extra newline printed in function tracing with retval The function parameter code changed the output logic slightly and called print_graph_retval() and also printed a newline. The print_graph_retval() also prints a newline which caused blank lines to be printed in the function graph tracer when retval was added. This caused one of the selftests to fail if retvals were enabled. Instead remove the new line output from print_graph_retval() and have the callers always print the new line so that it doesn't have to do special logic if it calls print_graph_retval() or not. - Fix out-of-bound memory access in the runtime verifier When rv_is_container_monitor() is called on the last entry on the link list it references the next entry, which is the list head and causes an out-of-bound memory access. * tag 'trace-v6.15-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: rv: Fix out-of-bound memory access in rv_is_container_monitor() ftrace: Do not have print_graph_retval() add a newline tracing/selftest: Add test to better test subops filtering of function graph ftrace: Fix accounting of subop hashes ftrace: Properly merge notrace hashes tracing: Do not add length to print format in synthetic events tracing: Hide get_vm_area() from MMUless builds
2025-04-12Merge tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpfLinus Torvalds
Pull bpf fixes from Alexei Starovoitov: - Followup fixes for resilient spinlock (Kumar Kartikeya Dwivedi): - Make res_spin_lock test less verbose, since it was spamming BPF CI on failure, and make the check for AA deadlock stronger - Fix rebasing mistake and use architecture provided res_smp_cond_load_acquire - Convert BPF maps (queue_stack and ringbuf) to resilient spinlock to address long standing syzbot reports - Make sure that classic BPF load instruction from SKF_[NET|LL]_OFF offsets works when skb is fragmeneted (Willem de Bruijn) * tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf: bpf: Convert ringbuf map to rqspinlock bpf: Convert queue_stack map to rqspinlock bpf: Use architecture provided res_smp_cond_load_acquire selftests/bpf: Make res_spin_lock AA test condition stronger selftests/net: test sk_filter support for SKF_NET_OFF on frags bpf: support SKF_NET_OFF and SKF_LL_OFF on skb frags selftests/bpf: Make res_spin_lock test less verbose
2025-04-12rv: Fix out-of-bound memory access in rv_is_container_monitor()Nam Cao
When rv_is_container_monitor() is called on the last monitor in rv_monitors_list, KASAN yells: BUG: KASAN: global-out-of-bounds in rv_is_container_monitor+0x101/0x110 Read of size 8 at addr ffffffff97c7c798 by task setup/221 The buggy address belongs to the variable: rv_monitors_list+0x18/0x40 This is due to list_next_entry() is called on the last entry in the list. It wraps around to the first list_head, and the first list_head is not embedded in struct rv_monitor_def. Fix it by checking if the monitor is last in the list. Cc: stable@vger.kernel.org Cc: Gabriele Monaco <gmonaco@redhat.com> Fixes: cb85c660fcd4 ("rv: Add option for nested monitors and include sched") Link: https://lore.kernel.org/e85b5eeb7228bfc23b8d7d4ab5411472c54ae91b.1744355018.git.namcao@linutronix.de Signed-off-by: Nam Cao <namcao@linutronix.de> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-04-12ftrace: Do not have print_graph_retval() add a newlineSteven Rostedt
The retval and retaddr options for function_graph tracer will add a comment at the end of a function for both leaf and non leaf functions that looks like: __wake_up_common(); /* ret=0x1 */ } /* pick_next_task_fair ret=0x0 */ The function print_graph_retval() adds a newline after the "*/". But if that's not called, the caller function needs to make sure there's a newline added. This is confusing and when the function parameters code was added, it added a newline even when calling print_graph_retval() as the fact that the print_graph_retval() function prints a newline isn't obvious. This caused an extra newline to be printed and that made it fail the selftests when the retval option was set, as the selftests were not expecting blank lines being injected into the trace. Instead of having print_graph_retval() print a newline, just have the caller always print the newline regardless if it calls print_graph_retval() or not. This not only fixes this bug, but it also simplifies the code. Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Link: https://lore.kernel.org/20250411133015.015ca393@gandalf.local.home Reported-by: Mark Brown <broonie@kernel.org> Tested-by: Mark Brown <broonie@kernel.org> Closes: https://lore.kernel.org/all/ccc40f2b-4b9e-4abd-8daf-d22fce2a86f0@sirena.org.uk/ Fixes: ff5c9c576e754 ("ftrace: Add support for function argument to graph tracer") Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-04-12pidfs: ensure consistent ENOENT/ESRCH reportingChristian Brauner
In a prior patch series we tried to cleanly differentiate between: (1) The task has already been reaped. (2) The caller requested a pidfd for a thread-group leader but the pid actually references a struct pid that isn't used as a thread-group leader. as this was causing issues for non-threaded workloads. But there's cases where the current simple logic is wrong. Specifically, if the pid was a leader pid and the check races with __unhash_process(). Stabilize this by using the pidfd waitqueue lock. Link: https://lore.kernel.org/20250411-work-pidfs-enoent-v2-2-60b2d3bb545f@kernel.org Reviewed-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-04-12exit: move wake_up_all() pidfd waiters into __unhash_process()Christian Brauner
Move the pidfd notification out of __change_pid() and into __unhash_process(). The only valid call to __change_pid() with a NULL argument and PIDTYPE_PID is from __unhash_process(). This is a lot more obvious than calling it from __change_pid(). Link: https://lore.kernel.org/20250411-work-pidfs-enoent-v2-1-60b2d3bb545f@kernel.org Reviewed-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-04-11ftrace: Fix accounting of subop hashesSteven Rostedt
The function graph infrastructure uses ftrace to hook to functions. It has a single ftrace_ops to manage all the users of function graph. Each individual user (tracing, bpf, fprobes, etc) has its own ftrace_ops to track the functions it will have its callback called from. These ftrace_ops are "subops" to the main ftrace_ops of the function graph infrastructure. Each ftrace_ops has a filter_hash and a notrace_hash that is defined as: Only trace functions that are in the filter_hash but not in the notrace_hash. If the filter_hash is empty, it means to trace all functions. If the notrace_hash is empty, it means do not disable any function. The function graph main ftrace_ops needs to be a superset containing all the functions to be traced by all the subops it has. The algorithm to perform this merge was incorrect. When the first subops was added to the main ops, it simply made the main ops a copy of the subops (same filter_hash and notrace_hash). When a second ops was added, it joined the new subops filter_hash with the main ops filter_hash as a union of the two sets. The intersect between the new subops notrace_hash and the main ops notrace_hash was created as the new notrace_hash of the main ops. The issue here is that it would then start tracing functions than no subops were tracing. For example if you had two subops that had: subops 1: filter_hash = '*sched*' # trace all functions with "sched" in it notrace_hash = '*time*' # except do not trace functions with "time" subops 2: filter_hash = '*lock*' # trace all functions with "lock" in it notrace_hash = '*clock*' # except do not trace functions with "clock" The intersect of '*time*' functions with '*clock*' functions could be the empty set. That means the main ops will be tracing all functions with '*time*' and all "*clock*" in it! Instead, modify the algorithm to be a bit simpler and correct. First, when adding a new subops, even if it's the first one, do not add the notrace_hash if the filter_hash is not empty. Instead, just add the functions that are in the filter_hash of the subops but not in the notrace_hash of the subops into the main ops filter_hash. There's no reason to add anything to the main ops notrace_hash. The notrace_hash of the main ops should only be non empty iff all subops filter_hashes are empty (meaning to trace all functions) and all subops notrace_hashes include the same functions. That is, the main ops notrace_hash is empty if any subops filter_hash is non empty. The main ops notrace_hash only has content in it if all subops filter_hashes are empty, and the content are only functions that intersect all the subops notrace_hashes. If any subops notrace_hash is empty, then so is the main ops notrace_hash. Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Shuah Khan <skhan@linuxfoundation.org> Cc: Andy Chiu <andybnac@gmail.com> Link: https://lore.kernel.org/20250409152720.216356767@goodmis.org Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-04-11ftrace: Properly merge notrace hashesAndy Chiu
The global notrace hash should be jointly decided by the intersection of each subops's notrace hash, but not the filter hash. Cc: stable@vger.kernel.org Link: https://lore.kernel.org/20250408160258.48563-1-andybnac@gmail.com Fixes: 5fccc7552ccb ("ftrace: Add subops logic to allow one ops to manage many") Signed-off-by: Andy Chiu <andybnac@gmail.com> [ fixed removing of freeing of filter_hash ] Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>