summaryrefslogtreecommitdiff
path: root/kernel
AgeCommit message (Collapse)Author
2025-05-19cgroup: compare css to cgroup::self in helper for distingushing cssJP Kobryn
Adjust the implementation of css_is_cgroup() so that it compares the given css to cgroup::self. Rename the function to css_is_self() in order to reflect that. Change the existing css->ss NULL check to a warning in the true branch. Finally, adjust call sites to use the new function name. Signed-off-by: JP Kobryn <inwardvessel@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-05-19cgroup: warn on rstat usage by early init subsystemsJP Kobryn
An early init subsystem that attempts to make use of rstat can lead to failures during early boot. The reason for this is the timing in which the css's of the root cgroup have css_online() invoked on them. At the point of this call, there is a stated assumption that a cgroup has "successfully completed all allocations" [0]. An example of a subsystem that relies on the previously mentioned assumption [0] is the memory subsystem. Within its implementation of css_online(), work is queued to asynchronously begin flushing via rstat. In the early init path for a given subsystem, having rstat enabled leads to this sequence: cgroup_init_early() for_each_subsys(ss, ssid) if (ss->early_init) cgroup_init_subsys(ss, true) cgroup_init_subsys(ss, early_init) css = ss->css_alloc(...) init_and_link_css(css, ss, ...) ... online_css(css) online_css(css) ss = css->ss ss->css_online(css) Continuing to use the memory subsystem as an example, the issue with this sequence is that css_rstat_init() has not been called yet. This means there is now a race between the pending async work to flush rstat and the call to css_rstat_init(). So a flush can occur within the given cgroup while the rstat fields are not initialized. Since we are in the early init phase, the rstat fields cannot be initialized because they require per-cpu allocations. So it's not possible to have css_rstat_init() called early enough (before online_css()). This patch treats the combination of early init and rstat the same as as other invalid conditions. [0] Documentation/admin-guide/cgroup-v1/cgroups.rst (section: css_online) Signed-off-by: JP Kobryn <inwardvessel@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-05-19bpf: WARN_ONCE on verifier bugsPaul Chaignon
Throughout the verifier's logic, there are multiple checks for inconsistent states that should never happen and would indicate a verifier bug. These bugs are typically logged in the verifier logs and sometimes preceded by a WARN_ONCE. This patch reworks these checks to consistently emit a verifier log AND a warning when CONFIG_DEBUG_KERNEL is enabled. The consistent use of WARN_ONCE should help fuzzers (ex. syzkaller) expose any situation where they are actually able to reach one of those buggy verifier states. Acked-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com> Link: https://lore.kernel.org/r/aCs1nYvNNMq8dAWP@mail.gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-05-19padata: do not leak refcount in reorder_workDominik Grzegorzek
A recent patch that addressed a UAF introduced a reference count leak: the parallel_data refcount is incremented unconditionally, regardless of the return value of queue_work(). If the work item is already queued, the incremented refcount is never decremented. Fix this by checking the return value of queue_work() and decrementing the refcount when necessary. Resolves: Unreferenced object 0xffff9d9f421e3d80 (size 192): comm "cryptomgr_probe", pid 157, jiffies 4294694003 hex dump (first 32 bytes): 80 8b cf 41 9f 9d ff ff b8 97 e0 89 ff ff ff ff ...A............ d0 97 e0 89 ff ff ff ff 19 00 00 00 1f 88 23 00 ..............#. backtrace (crc 838fb36): __kmalloc_cache_noprof+0x284/0x320 padata_alloc_pd+0x20/0x1e0 padata_alloc_shell+0x3b/0xa0 0xffffffffc040a54d cryptomgr_probe+0x43/0xc0 kthread+0xf6/0x1f0 ret_from_fork+0x2f/0x50 ret_from_fork_asm+0x1a/0x30 Fixes: dd7d37ccf6b1 ("padata: avoid UAF for reorder_work") Cc: <stable@vger.kernel.org> Signed-off-by: Dominik Grzegorzek <dominik.grzegorzek@oracle.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2025-05-17Merge tag 'mm-hotfixes-stable-2025-05-17-09-41' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull hotfixes from Andrew Morton: "Nine singleton hotfixes, all MM. Four are cc:stable" * tag 'mm-hotfixes-stable-2025-05-17-09-41' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: mm: userfaultfd: correct dirty flags set for both present and swap pte zsmalloc: don't underflow size calculation in zs_obj_write() mm/page_alloc: fix race condition in unaccepted memory handling mm/page_alloc: ensure try_alloc_pages() plays well with unaccepted memory MAINTAINERS: add mm GUP section mm/codetag: move tag retrieval back upfront in __free_pages() mm/memory: fix mapcount / refcount sanity check for mTHP reuse kernel/fork: only call untrack_pfn_clear() on VMAs duplicated for fork() mm: hugetlb: fix incorrect fallback for subpool
2025-05-17perf/core: Add the is_event_in_freq_mode() helper to simplify the codeKan Liang
Add a helper to check if an event is in freq mode to improve readability. No functional changes. Signed-off-by: Kan Liang <kan.liang@linux.intel.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Link: https://lore.kernel.org/r/20250516182853.2610284-2-kan.liang@linux.intel.com
2025-05-16PM: freezer: Rewrite restarting tasks log to remove stray *done.*Paul Menzel
`pr_cont()` unfortunately does not work here, as other parts of the Linux kernel log between the two log lines: [18445.295056] r8152-cfgselector 4-1.1.3: USB disconnect, device number 5 [18445.295112] OOM killer enabled. [18445.295115] Restarting tasks ... [18445.295185] usb 3-1: USB disconnect, device number 2 [18445.295193] usb 3-1.1: USB disconnect, device number 3 [18445.296262] usb 3-1.5: USB disconnect, device number 4 [18445.297017] done. [18445.297029] random: crng reseeded on system resumption `pr_cont()` also uses the default log level, normally warning, if the corresponding log line is interrupted. Therefore, replace the `pr_cont()`, and explicitly log it as a separate line with log level info: Restarting tasks: Starting […] Restarting tasks: Done Signed-off-by: Paul Menzel <pmenzel@molgen.mpg.de> Link: https://patch.msgid.link/20250511174648.950430-1-pmenzel@molgen.mpg.de [ rjw: Rebase on top of an earlier analogous change ] Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2025-05-16genirq/msi: Add helper for creating MSI-parent irq domainsMarc Zyngier
Creating an irq domain that serves as an MSI parent requires a substantial amount of esoteric boiler-plate code, some of which is often provided twice (such as the bus token). To make things a bit simpler for the unsuspecting MSI tinkerer, provide a helper that does it for them, and serves as documentation of what needs to be provided. Signed-off-by: Marc Zyngier <maz@kernel.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/all/20250513172819.2216709-3-maz@kernel.org
2025-05-16irqdomain: Fix kernel-doc and add it to DocumentationJiri Slaby (SUSE)
irqdomain.c's kernel-doc exists, but is not plugged into Documentation/ yet. Before plugging it in, fix it first: irq_domain_get_irq_data() and irq_domain_set_info() were documented twice. Identically, by both definitions for CONFIG_IRQ_DOMAIN_HIERARCHY and !CONFIG_IRQ_DOMAIN_HIERARCHY. Therefore, switch the second kernel-doc into an ordinary comment -- change "/**" to simple "/*". This avoids sphinx's: WARNING: Duplicate C declaration Next, in commit b7b377332b96 ("irqdomain: Fix the kernel-doc and plug it into Documentation"), irqdomain.h's (header) kernel-doc was added into core-api/genericirq.rst. But given the amount of irqdomain functions and structures, move all these to core-api/irq/irq-domain.rst now. Finally, add these newly fixed irqdomain.c's (source) docs there as well. Signed-off-by: Jiri Slaby (SUSE) <jirislaby@kernel.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/all/20250319092951.37667-58-jirislaby@kernel.org
2025-05-16irqdomain: Drop irq_domain_add_*() functionsJiri Slaby (SUSE)
Most irq_domain_add_*() functions are unused now, so drop them. The remaining ones are moved to the deprecated section and will be removed during the merge window after the patches in various trees have been merged. Note: The Chinese docs are touched but unfinished. I cannot parse those. [ tglx: Remove the leftover in irq-domain.rst and handle merge logistics ] Signed-off-by: Jiri Slaby (SUSE) <jirislaby@kernel.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/all/20250319092951.37667-41-jirislaby@kernel.org
2025-05-16irqdomain: Make irq_domain_create_hierarchy() an inlineJiri Slaby (SUSE)
There is no reason to export the function as an extra symbol. It is simple enough and is just a wrapper to already exported functions. Therefore, switch the exported function to an inline. Signed-off-by: Jiri Slaby (SUSE) <jirislaby@kernel.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/all/20250319092951.37667-13-jirislaby@kernel.org
2025-05-16irqdomain: Drop of_node_to_fwnode()Jiri Slaby (SUSE)
All uses of of_node_to_fwnode() in non-irqdomain code were changed to "officially" defined of_fwnode_handle(). Therefore, the former can be dropped along with the last uses in the irqdomain code. Due to merge logistics the inline cannot be dropped immediately. Move it to a deprecated section, which will be removed during the merge window. [ tglx: Handle merge logistics ] Signed-off-by: Jiri Slaby (SUSE) <jirislaby@kernel.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/all/20250319092951.37667-12-jirislaby@kernel.org
2025-05-16Merge branches 'rcu/misc-for-6.16', 'rcu/seq-counters-for-6.16' and ↵Joel Fernandes
'rcu/torture-for-6.16' into rcu/for-next
2025-05-16rcutorture: Perform more frequent testing of ->gpwrapJoel Fernandes
Currently, the ->gpwrap is not tested (at all per my testing) due to the requirement of a large delta between a CPU's rdp->gp_seq and its node's rnp->gpseq. This results in no testing of ->gpwrap being set. This patch by default adds 5 minutes of testing with ->gpwrap forced by lowering the delta between rdp->gp_seq and rnp->gp_seq to just 8 GPs. All of this is configurable, including the active time for the setting and a full testing cycle. By default, the first 25 minutes of a test will have the _default_ behavior there is right now (ULONG_MAX / 4) delta. Then for 5 minutes, we switch to a smaller delta causing 1-2 wraps in 5 minutes. I believe this is reasonable since we at least add a little bit of testing for usecases where ->gpwrap is set. [ Apply fix for Dan Carpenter's bug report on init path cleanup. ] [ Apply kernel doc warning fix from Akira Yokosawa. ] Tested-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Joel Fernandes <joelagnelf@nvidia.com>
2025-05-16rcutorture: Comment invocations of tick_dep_set_task()Paul E. McKenney
The rcu_torture_reader() and rcu_torture_fwd_prog_cr() functions run CPU-bound for extended periods of time (tens or even hundreds of milliseconds), so they invoke tick_dep_set_task() and tick_dep_clear_task() to ensure that the scheduling-clock tick helps move grace periods forward. So why doesn't rcu_torture_fwd_prog_nr() also invoke tick_dep_set_task() and tick_dep_clear_task()? Because the point of this function is to test RCU's ability to (eventually) force grace periods forward even when the tick has been disabled during long CPU-bound kernel execution. Signed-off-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Joel Fernandes <joelagnelf@nvidia.com>
2025-05-16rcu/nocb: Add Safe checks for access offloaded rdpZqiang
For built with CONFIG_PROVE_RCU=y and CONFIG_PREEMPT_RT=y kernels, Disable BH does not change the SOFTIRQ corresponding bits in preempt_count(), but change current->softirq_disable_cnt, this resulted in the following splat: WARNING: suspicious RCU usage kernel/rcu/tree_plugin.h:36 Unsafe read of RCU_NOCB offloaded state! stack backtrace: CPU: 0 UID: 0 PID: 22 Comm: rcuc/0 Call Trace: [ 0.407907] <TASK> [ 0.407910] dump_stack_lvl+0xbb/0xd0 [ 0.407917] dump_stack+0x14/0x20 [ 0.407920] lockdep_rcu_suspicious+0x133/0x210 [ 0.407932] rcu_rdp_is_offloaded+0x1c3/0x270 [ 0.407939] rcu_core+0x471/0x900 [ 0.407942] ? lockdep_hardirqs_on+0xd5/0x160 [ 0.407954] rcu_cpu_kthread+0x25f/0x870 [ 0.407959] ? __pfx_rcu_cpu_kthread+0x10/0x10 [ 0.407966] smpboot_thread_fn+0x34c/0xa50 [ 0.407970] ? trace_preempt_on+0x54/0x120 [ 0.407977] ? __pfx_smpboot_thread_fn+0x10/0x10 [ 0.407982] kthread+0x40e/0x840 [ 0.407990] ? __pfx_kthread+0x10/0x10 [ 0.407994] ? rt_spin_unlock+0x4e/0xb0 [ 0.407997] ? rt_spin_unlock+0x4e/0xb0 [ 0.408000] ? __pfx_kthread+0x10/0x10 [ 0.408006] ? __pfx_kthread+0x10/0x10 [ 0.408011] ret_from_fork+0x40/0x70 [ 0.408013] ? __pfx_kthread+0x10/0x10 [ 0.408018] ret_from_fork_asm+0x1a/0x30 [ 0.408042] </TASK> Currently, triggering an rdp offloaded state change need the corresponding rdp's CPU goes offline, and at this time the rcuc kthreads has already in parking state. this means the corresponding rcuc kthreads can safely read offloaded state of rdp while it's corresponding cpu is online. This commit therefore add softirq_count() check for Preempt-RT kernels. Suggested-by: Joel Fernandes <joelagnelf@nvidia.com> Reviewed-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Zqiang <qiang.zhang1211@gmail.com> Signed-off-by: Joel Fernandes <joelagnelf@nvidia.com>
2025-05-16rcuscale: using kcalloc() to relpace kmalloc()Su Hui
It's safer to using kcalloc() because it can prevent overflow problem. Reviewed-by: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Su Hui <suhui@nfschina.com> Signed-off-by: Joel Fernandes <joelagnelf@nvidia.com>
2025-05-16Revert "rcu/nocb: Fix rcuog wake-up from offline softirq"Frederic Weisbecker
This reverts commit f7345ccc62a4b880cf76458db5f320725f28e400. swake_up_one_online() has been removed because hrtimers can now assign a proper online target to hrtimers queued from offline CPUs. Therefore remove the related hackery. Link: https://lore.kernel.org/all/20241231170712.149394-4-frederic@kernel.org/ Reviewed-by: Usama Arif <usamaarif642@gmail.com> Reviewed-by: Joel Fernandes <joelagnelf@nvidia.com> Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Joel Fernandes <joelagnelf@nvidia.com>
2025-05-16rcu/cpu_stall_cputime: fix the hardirq count for x86 architectureYongliang Gao
When counting the number of hardirqs in the x86 architecture, it is essential to add arch_irq_stat_cpu to ensure accuracy. For example, a CPU loop within the rcu_read_lock function. Before: [ 70.910184] rcu: INFO: rcu_preempt self-detected stall on CPU [ 70.910436] rcu: 3-....: (4999 ticks this GP) idle=*** [ 70.910711] rcu: hardirqs softirqs csw/system [ 70.910870] rcu: number: 0 657 0 [ 70.911024] rcu: cputime: 0 0 2498 ==> 2498(ms) [ 70.911278] rcu: (t=5001 jiffies g=3677 q=29 ncpus=8) After: [ 68.046132] rcu: INFO: rcu_preempt self-detected stall on CPU [ 68.046354] rcu: 2-....: (4999 ticks this GP) idle=*** [ 68.046628] rcu: hardirqs softirqs csw/system [ 68.046793] rcu: number: 2498 663 0 [ 68.046951] rcu: cputime: 0 0 2496 ==> 2496(ms) [ 68.047244] rcu: (t=5000 jiffies g=3825 q=4 ncpus=8) Fixes: be42f00b73a0 ("rcu: Add RCU stall diagnosis information") Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/oe-kbuild-all/202501090842.SfI6QPGS-lkp@intel.com/ Signed-off-by: Yongliang Gao <leonylgao@tencent.com> Reviewed-by: Neeraj Upadhyay <Neeraj.Upadhyay@amd.com> Link: https://lore.kernel.org/r/20250216084109.3109837-1-leonylgao@gmail.com Signed-off-by: Boqun Feng <boqun.feng@gmail.com> Signed-off-by: Joel Fernandes <joelagnelf@nvidia.com>
2025-05-16rcu: Remove swake_up_one_online() bandaidFrederic Weisbecker
It's now ok to perform a wake-up from an offline CPU because the resulting armed scheduler bandwidth hrtimers are now correctly targeted by hrtimer infrastructure. Remove the obsolete hackerry. Link: https://lore.kernel.org/all/20241231170712.149394-3-frederic@kernel.org/ Reviewed-by: Usama Arif <usamaarif642@gmail.com> Reviewed-by: Joel Fernandes <joelagnelf@nvidia.com> Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Joel Fernandes <joelagnelf@nvidia.com>
2025-05-16futex: Fix kernel-doc commentsBorislav Petkov (AMD)
Fix those: ./kernel/futex/futex.h:208: warning: Function parameter or struct member 'drop_hb_ref' not described in 'futex_q' ./kernel/futex/waitwake.c:343: warning: expecting prototype for futex_wait_queue(). Prototype was for futex_do_wait() instead ./kernel/futex/waitwake.c:594: warning: Function parameter or struct member 'task' not described in 'futex_wait_setup' Fixes: 93f1b6d79a73 ("futex: Move futex_queue() into futex_wait_setup()") Reported-by: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Signed-off-by: Ingo Molnar <mingo@kernel.org> Acked-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Peter Zijlstra <peterz@infradead.org> Link: https://lore.kernel.org/r/20250512185641.0450a99b@canb.auug.org.au # report Link: https://lore.kernel.org/r/20250515171641.24073-1-bp@kernel.org # submission
2025-05-15Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski
Cross-merge networking fixes after downstream PR (net-6.15-rc7). Conflicts: tools/testing/selftests/drivers/net/hw/ncdevmem.c 97c4e094a4b2 ("tests/ncdevmem: Fix double-free of queue array") 2f1a805f32ba ("selftests: ncdevmem: Implement devmem TCP TX") https://lore.kernel.org/20250514122900.1e77d62d@canb.auug.org.au Adjacent changes: net/core/devmem.c net/core/devmem.h 0afc44d8cdf6 ("net: devmem: fix kernel panic when netlink socket close after module unload") bd61848900bf ("net: devmem: Implement TX path") Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-05-15perf/aux: Allocate non-contiguous AUX pages by defaultYabin Cui
perf always allocates contiguous AUX pages based on aux_watermark. However, this contiguous allocation doesn't benefit all PMUs. For instance, ARM SPE and TRBE operate with virtual pages, and Coresight ETR allocates a separate buffer. For these PMUs, allocating contiguous AUX pages unnecessarily exacerbates memory fragmentation. This fragmentation can prevent their use on long-running devices. This patch modifies the perf driver to be memory-friendly by default, by allocating non-contiguous AUX pages. For PMUs requiring contiguous pages (Intel BTS and some Intel PT), the existing PERF_PMU_CAP_AUX_NO_SG capability can be used. For PMUs that don't require but can benefit from contiguous pages (some Intel PT), a new capability, PERF_PMU_CAP_AUX_PREFER_LARGE, is added to maintain their existing behavior. Signed-off-by: Yabin Cui <yabinc@google.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: James Clark <james.clark@linaro.org> Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Namhyung Kim <namhyung@kernel.org> Link: https://lore.kernel.org/r/20250508232642.148767-1-yabinc@google.com
2025-05-15genirq: Retain disable depth for managed interrupts across CPU hotplugBrian Norris
Affinity-managed interrupts can be shut down and restarted during CPU hotunplug/plug. Thereby the interrupt may be left in an unexpected state. Specifically: 1. Interrupt is affine to CPU N 2. disable_irq() -> depth is 1 3. CPU N goes offline 4. irq_shutdown() -> depth is set to 1 (again) 5. CPU N goes online 6. irq_startup() -> depth is set to 0 (BUG! driver expects that the interrupt still disabled) 7. enable_irq() -> depth underflow / unbalanced enable_irq() warning This is only a problem for managed interrupts and CPU hotplug, all other cases like request()/free()/request() truly needs to reset a possibly stale disable depth value. Provide a startup function, which takes the disable depth into account, and invoked it for the managed interrupts in the CPU hotplug path. This requires to change irq_shutdown() to do a depth increment instead of setting it to 1, which allows to retain the disable depth, but is harmless for the other code paths using irq_startup(), which will still reset the disable depth unconditionally to keep the original correct behaviour. A kunit tests will be added separately to cover some of these aspects. [ tglx: Massaged changelog ] Suggested-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Brian Norris <briannorris@chromium.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/all/20250514201353.3481400-2-briannorris@chromium.org
2025-05-15genirq: Bump the size of the local variable for sprintf()Andy Shevchenko
GCC is not happy about a sprintf() call on a buffer that might be too small for the given formatting string. kernel/irq/debugfs.c:233:26: warning: 'sprintf' may write a terminating nul past the end of the destination [-Wformat-overflow=] Fix this by bumping the size of the local variable for sprintf(). Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/all/20250515085516.2913290-1-andriy.shevchenko@linux.intel.com Closes: https://lore.kernel.org/oe-kbuild-all/202505151057.xbyXAbEn-lkp@intel.com/
2025-05-14Merge tag 'kbuild-fixes-v6.15' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild Pull Kbuild fixes from Masahiro Yamada: - Add proper pahole version dependency to CONFIG_GENDWARFKSYMS to avoid module loading errors - Fix UAPI header tests for the OpenRISC architecture - Add dependency on the libdw package in Debian and RPM packages - Disable -Wdefault-const-init-unsafe warnings on Clang - Make "make clean ARCH=um" also clean the arch/x86/ directory - Revert the use of -fmacro-prefix-map=, which causes issues with debugger usability * tag 'kbuild-fixes-v6.15' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild: kbuild: fix typos "module.builtin" to "modules.builtin" Revert "kbuild, rust: use -fremap-path-prefix to make paths relative" Revert "kbuild: make all file references relative to source root" kbuild: fix dependency on sorttable init: remove unused CONFIG_CC_CAN_LINK_STATIC um: let 'make clean' properly clean underlying SUBARCH as well kbuild: Disable -Wdefault-const-init-unsafe kbuild: rpm-pkg: Add (elfutils-devel or libdw-devel) to BuildRequires kbuild: deb-pkg: Add libdw-dev:native to Build-Depends-Arch usr/include: openrisc: don't HDRTEST bpf_perf_event.h kbuild: Require pahole <v1.28 or >v1.29 with GENDWARFKSYMS on X86
2025-05-14bpf: Pass the same orig_call value to trampoline functionsIlya Leoshkevich
There is currently some confusion in the s390x JIT regarding whether orig_call can be NULL and what that means. Originally the NULL value was used to distinguish the struct_ops case, but this was superseded by BPF_TRAMP_F_INDIRECT (see commit 0c970ed2f87c ("s390/bpf: Fix indirect trampoline generation"). The remaining reason to have this check is that NULL can actually be passed to the arch_bpf_trampoline_size() call - but not to the respective arch_prepare_bpf_trampoline()! call - by bpf_struct_ops_prepare_trampoline(). Remove this asymmetry by passing stub_func to both functions, so that JITs may rely on orig_call never being NULL. Signed-off-by: Ilya Leoshkevich <iii@linux.ibm.com> Acked-by: Martin KaFai Lau <martin.lau@kernel.org> Link: https://lore.kernel.org/r/20250512221911.61314-2-iii@linux.ibm.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-05-14Merge tag 'trace-v6.15-rc6' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace Pull tracing fixes from Steven Rostedt: - Fix sample code that uses trace_array_printk() The sample code for in kernel use of trace_array (that creates an instance for use within the kernel) and shows how to use trace_array_printk() that writes into the created instance, used trace_printk_init_buffers(). But that function is used to initialize normal trace_printk() and produces the NOTICE banner which is not needed for use of trace_array_printk(). The function to initialize that is trace_array_init_printk() that takes the created trace array instance as a parameter. Update the sample code to reflect the proper usage. - Fix preemption count output for stacktrace event The tracing buffer shows the preempt count level when an event executes. Because writing the event itself disables preemption, this needs to be accounted for when recording. The stacktrace event did not account for this so the output of the stacktrace event showed preemption was disabled while the event that triggered the stacktrace shows preemption is enabled and this leads to confusion. Account for preemption being disabled for the stacktrace event. The same happened for stack traces triggered by function tracer. - Fix persistent ring buffer when trace_pipe is used The ring buffer swaps the reader page with the next page to read from the write buffer when trace_pipe is used. If there's only a page of data in the ring buffer, this swap will cause the "commit" pointer (last data written) to be on the reader page. If more data is written to the buffer, it is added to the reader page until it falls off back into the write buffer. If the system reboots and the commit pointer is still on the reader page, even if new data was written, the persistent buffer validator will miss finding the commit pointer because it only checks the write buffer and does not check the reader page. This causes the validator to fail the validation and clear the buffer, where the new data is lost. There was a check for this, but it checked the "head pointer", which was incorrect, because the "head pointer" always stays on the write buffer and is the next page to swap out for the reader page. Fix the logic to catch this case and allow the user to still read the data after reboot. * tag 'trace-v6.15-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: ring-buffer: Fix persistent buffer when commit page is the reader page ftrace: Fix preemption accounting for stacktrace filter command ftrace: Fix preemption accounting for stacktrace trigger command tracing: samples: Initialize trace_array_printk() with the correct function
2025-05-14ring-buffer: Fix persistent buffer when commit page is the reader pageSteven Rostedt
The ring buffer is made up of sub buffers (sometimes called pages as they are by default PAGE_SIZE). It has the following "pages": "tail page" - this is the page that the next write will write to "head page" - this is the page that the reader will swap the reader page with. "reader page" - This belongs to the reader, where it will swap the head page from the ring buffer so that the reader does not race with the writer. The writer may end up on the "reader page" if the ring buffer hasn't written more than one page, where the "tail page" and the "head page" are the same. The persistent ring buffer has meta data that points to where these pages exist so on reboot it can re-create the pointers to the cpu_buffer descriptor. But when the commit page is on the reader page, the logic is incorrect. The check to see if the commit page is on the reader page checked if the head page was the reader page, which would never happen, as the head page is always in the ring buffer. The correct check would be to test if the commit page is on the reader page. If that's the case, then it can exit out early as the commit page is only on the reader page when there's only one page of data in the buffer. There's no reason to iterate the ring buffer pages to find the "commit page" as it is already found. To trigger this bug: # echo 1 > /sys/kernel/tracing/instances/boot_mapped/events/syscalls/sys_enter_fchownat/enable # touch /tmp/x # chown sshd /tmp/x # reboot On boot up, the dmesg will have: Ring buffer meta [0] is from previous boot! Ring buffer meta [1] is from previous boot! Ring buffer meta [2] is from previous boot! Ring buffer meta [3] is from previous boot! Ring buffer meta [4] commit page not found Ring buffer meta [5] is from previous boot! Ring buffer meta [6] is from previous boot! Ring buffer meta [7] is from previous boot! Where the buffer on CPU 4 had a "commit page not found" error and that buffer is cleared and reset causing the output to be empty and the data lost. When it works correctly, it has: # cat /sys/kernel/tracing/instances/boot_mapped/trace_pipe <...>-1137 [004] ..... 998.205323: sys_enter_fchownat: __syscall_nr=0x104 (260) dfd=0xffffff9c (4294967196) filename=(0xffffc90000a0002c) user=0x3e8 (1000) group=0xffffffff (4294967295) flag=0x0 (0 Cc: stable@vger.kernel.org Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Link: https://lore.kernel.org/20250513115032.3e0b97f7@gandalf.local.home Fixes: 5f3b6e839f3ce ("ring-buffer: Validate boot range memory events") Reported-by: Tasos Sahanidis <tasos@tasossah.com> Tested-by: Tasos Sahanidis <tasos@tasossah.com> Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-05-14ftrace: Fix preemption accounting for stacktrace filter commandpengdonglin
The preemption count of the stacktrace filter command to trace ksys_read is consistently incorrect: $ echo ksys_read:stacktrace > set_ftrace_filter <...>-453 [004] ...1. 38.308956: <stack trace> => ksys_read => do_syscall_64 => entry_SYSCALL_64_after_hwframe The root cause is that the trace framework disables preemption when invoking the filter command callback in function_trace_probe_call: preempt_disable_notrace(); probe_ops->func(ip, parent_ip, probe_opsbe->tr, probe_ops, probe->data); preempt_enable_notrace(); Use tracing_gen_ctx_dec() to account for the preempt_disable_notrace(), which will output the correct preemption count: $ echo ksys_read:stacktrace > set_ftrace_filter <...>-410 [006] ..... 31.420396: <stack trace> => ksys_read => do_syscall_64 => entry_SYSCALL_64_after_hwframe Cc: stable@vger.kernel.org Fixes: 36590c50b2d07 ("tracing: Merge irqflags + preempt counter.") Link: https://lore.kernel.org/20250512094246.1167956-2-dolinux.peng@gmail.com Signed-off-by: pengdonglin <dolinux.peng@gmail.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-05-14ftrace: Fix preemption accounting for stacktrace trigger commandpengdonglin
When using the stacktrace trigger command to trace syscalls, the preemption count was consistently reported as 1 when the system call event itself had 0 ("."). For example: root@ubuntu22-vm:/sys/kernel/tracing/events/syscalls/sys_enter_read $ echo stacktrace > trigger $ echo 1 > enable sshd-416 [002] ..... 232.864910: sys_read(fd: a, buf: 556b1f3221d0, count: 8000) sshd-416 [002] ...1. 232.864913: <stack trace> => ftrace_syscall_enter => syscall_trace_enter => do_syscall_64 => entry_SYSCALL_64_after_hwframe The root cause is that the trace framework disables preemption in __DO_TRACE before invoking the trigger callback. Use the tracing_gen_ctx_dec() that will accommodate for the increase of the preemption count in __DO_TRACE when calling the callback. The result is the accurate reporting of: sshd-410 [004] ..... 210.117660: sys_read(fd: 4, buf: 559b725ba130, count: 40000) sshd-410 [004] ..... 210.117662: <stack trace> => ftrace_syscall_enter => syscall_trace_enter => do_syscall_64 => entry_SYSCALL_64_after_hwframe Cc: stable@vger.kernel.org Fixes: ce33c845b030c ("tracing: Dump stacktrace trigger to the corresponding instance") Link: https://lore.kernel.org/20250512094246.1167956-1-dolinux.peng@gmail.com Signed-off-by: pengdonglin <dolinux.peng@gmail.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-05-14sched_ext: Explain the temporary situation around scx_root dereferencesTejun Heo
Naked scx_root dereferences are being used as temporary markers to indicate that they need to be updated to point to the right scheduler instance. Explain the situation. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Andrea Righi <arighi@nvidia.com>
2025-05-14tracing: Record trace_clock and recover when rebootMasami Hiramatsu (Google)
Record trace_clock information in the trace_scratch area and recover the trace_clock when boot, so that reader can docode the timestamp correctly. Note that since most trace_clocks records the timestamp in nano- seconds, this is not a bug. But some trace_clock, like counter and tsc will record the counter value. Only for those trace_clock user needs this information. Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Link: https://lore.kernel.org/174720625803.1925039.1815089037443798944.stgit@mhiramat.tok.corp.google.com Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-05-14tracing: Cleanup upper_empty() in pid_listYury Norov
Instead of find_first_bit() use the dedicated bitmap_empty(), and make upper_empty() a nice one-liner. While there, fix opencoded BITS_PER_TYPE(). Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Link: https://lore.kernel.org/20250429195119.620204-1-yury.norov@gmail.com Signed-off-by: Yury Norov <yury.norov@gmail.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-05-14sched_ext: Add @sch to SCX_CALL_OP*()Tejun Heo
In preparation of hierarchical scheduling support, add @sch to scx_exit() and friends: - scx_exit/error() updated to take explicit @sch instead of assuming scx_root. - scx_kf_exit/error() added. These are to be used from kfuncs, don't take @sch and internally determine the scx_sched instance to abort. Currently, it's always scx_root but once multiple scheduler support is in place, it will be the scx_sched instance that invoked the kfunc. This simplifies many callsites and defers scx_sched lookup until error is triggered. - @sch is propagated to ops_cpu_valid() and ops_sanitize_err(). The CPU validity conditions in ops_cpu_valid() are factored into __cpu_valid() to implement kf_cpu_valid() which is the counterpart to scx_kf_exit/error(). - All users are converted. Most conversions are straightforward. check_rq_for_timeouts() and scx_softlockup() are updated to use explicit rcu_dereference*(scx_root) for safety as they may execute asynchronous to the exit path. scx_tick() is also updated to use rcu_dereference(). While not strictly necessary due to the preceding scx_enabled() test and IRQ disabled context, this removes the subtlety at no noticeable cost. No behavior changes intended. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Andrea Righi <arighi@nvidia.com>
2025-05-14sched_ext: Cleanup [__]scx_exit/error*()Tejun Heo
__scx_exit() is the base exit implementation and there are three wrappers on top of it - scx_exit(), __scx_error() and scx_error(). This is more confusing than helpful especially given that there are only a couple users of scx_exit() and __scx_error(). To simplify the situation: - Make __scx_exit() take va_list and rename it to scx_vexit(). This is to ease implementing more complex extensions on top. - Make scx_exit() a varargs wrapper around __scx_exit(). scx_exit() now takes both @kind and @exit_code. - Convert existing scx_exit() and __scx_error() users to use the new scx_exit(). - scx_error() remains unchanged. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Andrea Righi <arighi@nvidia.com>
2025-05-14sched_ext: Add @sch to SCX_CALL_OP*()Tejun Heo
In preparation of hierarchical scheduling support, make SCX_CALL_OP*() take explicit @sch instead of assuming scx_root. As scx_root is still the only scheduler instance, this patch doesn't make any functional changes. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Andrea Righi <arighi@nvidia.com>
2025-05-14sched_ext: Clean up scx_root usagesTejun Heo
- Always cache scx_root into local variable sch before using. - Don't use scx_root if cached sch is available. - Wrap !sch test with unlikely(). - Pass @scx into scx_cgroup_init/exit(). No behavior changes intended. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Andrea Righi <arighi@nvidia.com>
2025-05-14sched,livepatch: Untangle cond_resched() and live-patchingPeter Zijlstra
With the goal of deprecating / removing VOLUNTARY preempt, live-patch needs to stop relying on cond_resched() to make forward progress. Instead, rely on schedule() with TASK_FREEZABLE set. Just like live-patching, the freezer needs to be able to stop tasks in a safe / known state. [bigeasy: use likely() in __klp_sched_try_switch() and update comments] Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Petr Mladek <pmladek@suse.com> Tested-by: Petr Mladek <pmladek@suse.com> Tested-by: Miroslav Benes <mbenes@suse.cz> Acked-by: Miroslav Benes <mbenes@suse.cz> Acked-by: Josh Poimboeuf <jpoimboe@kernel.org> Link: https://lore.kernel.org/r/20250509113659.wkP_HJ5z@linutronix.de
2025-05-14genirq/msi: Engage the .msi_teardown() callback on domain removalMarc Zyngier
Kindly inform the MSI driver that the domain is torn down, providing the allocation context previously populated on domain creation. Signed-off-by: Marc Zyngier <maz@kernel.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/all/20250513163144.2215824-5-maz@kernel.org
2025-05-14genirq/msi: Move prepare() call to per-device allocationMarc Zyngier
The current device MSI infrastructure is subtly broken, as it will issue an .msi_prepare() callback into the MSI controller driver every time it needs to allocate an MSI. That's pretty wrong, as the contract (or unwarranted assumption, depending who you ask) between the MSI controller and the core code is that .msi_prepare() is called exactly once per device. This leads to some subtle breakage in some MSI controller drivers, as it gives the impression that there are multiple endpoints sharing a bus identifier (RID in PCI parlance, DID for GICv3+). It implies that whatever allocation the ITS driver (for example) has done on behalf of these devices cannot be undone, as there is no way to track the shared state. This is particularly bad for wire-MSI devices, for which .msi_prepare() is called for each input line. To address this issue, move the call to .msi_prepare() to take place at the point of irq domain allocation, which is the only place that makes sense. The msi_alloc_info_t structure is made part of the msi_domain_template, so that its life-cycle is that of the domain as well. Finally, the msi_info::alloc_data field is made to point at this allocation tracking structure, ensuring that it is carried around the block. This is all pretty straightforward, except for the non-device-MSI leftovers, which still have to call .msi_prepare() at the old spot. One day... Signed-off-by: Marc Zyngier <maz@kernel.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/all/20250513163144.2215824-4-maz@kernel.org
2025-05-14irqchip/gic-v3-its: Implement .msi_teardown() callbackMarc Zyngier
The ITS driver currently nukes the structure representing an endpoint device translating via an ITS on freeing the last LPI allocated for it. That's an unfortunate state of affair, as it is pretty common for a driver to allocate a single MSI, do something clever, teardown this MSI, and reallocate a whole bunch of them. The NVME driver does exactly that, amongst others. What happens in that case is that the core code is accidentaly issuing another .msi_prepare() call, even if it shouldn't. This luckily cancels the above behaviour and hides the problem. In order to fix the core code, start by implementing the new .msi_teardown() callback. Nothing calls it yet, so a side effect is that the its_dev structure will not be freed and that the DID will stay mapped. Not a big deal, and this will be solved in following patches. Signed-off-by: Marc Zyngier <maz@kernel.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/all/20250513163144.2215824-3-maz@kernel.org
2025-05-14genirq/msi: Add .msi_teardown() callback as the reverse of .msi_prepare()Marc Zyngier
While the MSI ops do have a .msi_prepare() callback that is responsible for setting up the relevant (usually per-device) allocation, there is no callback reversing this setup. For this purpose, add .msi_teardown() callback. In order to avoid breaking the ITS driver that suffers from related issues, do not call the callback just yet. Signed-off-by: Marc Zyngier <maz@kernel.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/all/20250513163144.2215824-2-maz@kernel.org
2025-05-14genirq/manage: Use the correct lock guard in irq_set_irq_wake()Jon Hunter
Commit 8589e325ba4f ("genirq/manage: Rework irq_set_irq_wake()") updated the irq_set_irq_wake() to use the new guards for locking the interrupt descriptor. However, in doing so it inadvertently changed irq_set_irq_wake() such that the 'chip_bus_lock' is no longer acquired. This has caused system suspend tests to fail on some Tegra platforms. Fix this by correcting the guard used in irq_set_irq_wake() to ensure the 'chip_bus_lock' is held. Fixes: 8589e325ba4f ("genirq/manage: Rework irq_set_irq_wake()") Signed-off-by: Jon Hunter <jonathanh@nvidia.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/all/20250514095041.1109783-1-jonathanh@nvidia.com
2025-05-13bpf: Add support for __prog argument suffix to pass in prog->auxKumar Kartikeya Dwivedi
Instead of hardcoding the list of kfuncs that need prog->aux passed to them with a combination of fixup_kfunc_call adjustment + __ign suffix, combine both in __prog suffix, which ignores the argument passed in, and fixes it up to the prog->aux. This allows kfuncs to have the prog->aux passed into them without having to touch the verifier. Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20250513142812.1021591-1-memxor@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-05-13PM: sleep: Introduce pm_sleep_transition_in_progress()Rafael J. Wysocki
The "suspend in progress" check in device_wakeup_enable() does not cover hibernation, but arguably it should do that, so introduce pm_sleep_transition_in_progress() covering transitions during both system suspend and hibernation to use in there and use it also in pm_debug_messages_should_print(). Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Reviewed-by: Mario Limonciello <mario.limonciello@amd.com> Link: https://patch.msgid.link/7820474.EvYhyI6sBW@rjwysocki.net [ rjw: Move the new function definition under CONFIG_PM_SLEEP ] Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2025-05-13Merge tag 'probes-fixes-v6.15-rc6' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace Pull probes fixes from Masami Hiramatsu: - fprobe: Fix RCU warning message in list traversal fprobe_module_callback() using hlist_for_each_entry_rcu() traverse the fprobe list but it locks fprobe_mutex() instead of rcu lock because it is enough. So add lockdep_is_held() to avoid warning. - tracing: eprobe: Add missing trace_probe_log_clear for eprobe __trace_eprobe_create() uses trace_probe_log but forgot to clear it at exit. Add trace_probe_log_clear() calls. - tracing: probes: Fix possible race in trace_probe_log APIs trace_probe_log APIs are used in probe event (dynamic_events, kprobe_events and uprobe_events) creation. Only dynamic_events uses the dyn_event_ops_mutex mutex to serialize it. This makes kprobe and uprobe events to lock the same mutex to serialize its creation to avoid race in trace_probe_log APIs. * tag 'probes-fixes-v6.15-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: tracing: probes: Fix a possible race in trace_probe_log APIs tracing: add missing trace_probe_log_clear for eprobes tracing: fprobe: Fix RCU warning message in list traversal
2025-05-13bpf: Fix WARN() in get_bpf_raw_tp_regsTao Chen
syzkaller reported an issue: WARNING: CPU: 3 PID: 5971 at kernel/trace/bpf_trace.c:1861 get_bpf_raw_tp_regs+0xa4/0x100 kernel/trace/bpf_trace.c:1861 Modules linked in: CPU: 3 UID: 0 PID: 5971 Comm: syz-executor205 Not tainted 6.15.0-rc5-syzkaller-00038-g707df3375124 #0 PREEMPT(full) Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2~bpo12+1 04/01/2014 RIP: 0010:get_bpf_raw_tp_regs+0xa4/0x100 kernel/trace/bpf_trace.c:1861 RSP: 0018:ffffc90003636fa8 EFLAGS: 00010293 RAX: 0000000000000000 RBX: 0000000000000003 RCX: ffffffff81c6bc4c RDX: ffff888032efc880 RSI: ffffffff81c6bc83 RDI: 0000000000000005 RBP: ffff88806a730860 R08: 0000000000000005 R09: 0000000000000003 R10: 0000000000000004 R11: 0000000000000000 R12: 0000000000000004 R13: 0000000000000001 R14: ffffc90003637008 R15: 0000000000000900 FS: 0000000000000000(0000) GS:ffff8880d6cdf000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007f7baee09130 CR3: 0000000029f5a000 CR4: 0000000000352ef0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: <TASK> ____bpf_get_stack_raw_tp kernel/trace/bpf_trace.c:1934 [inline] bpf_get_stack_raw_tp+0x24/0x160 kernel/trace/bpf_trace.c:1931 bpf_prog_ec3b2eefa702d8d3+0x43/0x47 bpf_dispatcher_nop_func include/linux/bpf.h:1316 [inline] __bpf_prog_run include/linux/filter.h:718 [inline] bpf_prog_run include/linux/filter.h:725 [inline] __bpf_trace_run kernel/trace/bpf_trace.c:2363 [inline] bpf_trace_run3+0x23f/0x5a0 kernel/trace/bpf_trace.c:2405 __bpf_trace_mmap_lock_acquire_returned+0xfc/0x140 include/trace/events/mmap_lock.h:47 __traceiter_mmap_lock_acquire_returned+0x79/0xc0 include/trace/events/mmap_lock.h:47 __do_trace_mmap_lock_acquire_returned include/trace/events/mmap_lock.h:47 [inline] trace_mmap_lock_acquire_returned include/trace/events/mmap_lock.h:47 [inline] __mmap_lock_do_trace_acquire_returned+0x138/0x1f0 mm/mmap_lock.c:35 __mmap_lock_trace_acquire_returned include/linux/mmap_lock.h:36 [inline] mmap_read_trylock include/linux/mmap_lock.h:204 [inline] stack_map_get_build_id_offset+0x535/0x6f0 kernel/bpf/stackmap.c:157 __bpf_get_stack+0x307/0xa10 kernel/bpf/stackmap.c:483 ____bpf_get_stack kernel/bpf/stackmap.c:499 [inline] bpf_get_stack+0x32/0x40 kernel/bpf/stackmap.c:496 ____bpf_get_stack_raw_tp kernel/trace/bpf_trace.c:1941 [inline] bpf_get_stack_raw_tp+0x124/0x160 kernel/trace/bpf_trace.c:1931 bpf_prog_ec3b2eefa702d8d3+0x43/0x47 Tracepoint like trace_mmap_lock_acquire_returned may cause nested call as the corner case show above, which will be resolved with more general method in the future. As a result, WARN_ON_ONCE will be triggered. As Alexei suggested, remove the WARN_ON_ONCE first. Fixes: 9594dc3c7e71 ("bpf: fix nested bpf tracepoints with per-cpu data") Reported-by: syzbot+45b0c89a0fc7ae8dbadc@syzkaller.appspotmail.com Suggested-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Tao Chen <chen.dylane@linux.dev> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20250513042747.757042-1-chen.dylane@linux.dev Closes: https://lore.kernel.org/bpf/8bc2554d-1052-4922-8832-e0078a033e1d@gmail.com
2025-05-13clocksource: Fix the CPUs' choice in the watchdog per CPU verificationGuilherme G. Piccoli
Right now, if the clocksource watchdog detects a clocksource skew, it might perform a per CPU check, for example in the TSC case on x86. In other words: supposing TSC is detected as unstable by the clocksource watchdog running at CPU1, as part of marking TSC unstable the kernel will also run a check of TSC readings on some CPUs to be sure it is synced between them all. But that check happens only on some CPUs, not all of them; this choice is based on the parameter "verify_n_cpus" and in some random cpumask calculation. So, the watchdog runs such per CPU checks on up to "verify_n_cpus" random CPUs among all online CPUs, with the risk of repeating CPUs (that aren't double checked) in the cpumask random calculation. But if "verify_n_cpus" > num_online_cpus(), it should skip the random calculation and just go ahead and check the clocksource sync between all online CPUs, without the risk of skipping some CPUs due to duplicity in the random cpumask calculation. Tests in a 4 CPU laptop with TSC skew detected led to some cases of the per CPU verification skipping some CPU even with verify_n_cpus=8, due to the duplicity on random cpumask generation. Skipping the randomization when the number of online CPUs is smaller than verify_n_cpus, solves that. Suggested-by: Thadeu Lima de Souza Cascardo <cascardo@igalia.com> Signed-off-by: Guilherme G. Piccoli <gpiccoli@igalia.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Paul E. McKenney <paulmck@kernel.org> Link: https://lore.kernel.org/all/20250323173857.372390-1-gpiccoli@igalia.com
2025-05-13tracing: probes: Fix a possible race in trace_probe_log APIsMasami Hiramatsu (Google)
Since the shared trace_probe_log variable can be accessed and modified via probe event create operation of kprobe_events, uprobe_events, and dynamic_events, it should be protected. In the dynamic_events, all operations are serialized by `dyn_event_ops_mutex`. But kprobe_events and uprobe_events interfaces are not serialized. To solve this issue, introduces dyn_event_create(), which runs create() operation under the mutex, for kprobe_events and uprobe_events. This also uses lockdep to check the mutex is held when using trace_probe_log* APIs. Link: https://lore.kernel.org/all/174684868120.551552.3068655787654268804.stgit@devnote2/ Reported-by: Paul Cacheux <paulcacheux@gmail.com> Closes: https://lore.kernel.org/all/20250510074456.805a16872b591e2971a4d221@kernel.org/ Fixes: ab105a4fb894 ("tracing: Use tracing error_log with probe events") Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>