path: root/kernel
2025-06-20sched_ext: Merge branch 'sched/core' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip into for-6.17 Tejun Heo
Pull tip/sched/core to receive the following commits: d403a3689af5 ("sched/fair: Move max_cfs_quota_period decl and default_cfs_period() def from fair.c to sched.h") de4c80c6963e ("sched/core: Relocate tg_get_cfs_*() and cpu_cfs_*_read_*()") 43e33f53e256 ("sched/core: Reorganize cgroup bandwidth control interface file reads") 5bc34be478d0 ("sched/core: Reorganize cgroup bandwidth control interface file writes") These will be used to implement CPU bandwidth interface support in sched_ext. Signed-off-by: Tejun Heo <tj@kernel.org>
2025-06-20rcu: Return early if callback is not specifiedUladzislau Rezki (Sony)
Currently the call_rcu() API does not check whether a callback pointer is NULL. If NULL is passed, rcu_core() will try to invoke it, resulting in a NULL pointer dereference and a kernel crash. To prevent this and improve debuggability, this patch adds a check for NULL and emits a kernel stack trace to help identify a faulty caller. Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Reviewed-by: Joel Fernandes <joelagnelf@nvidia.com> Signed-off-by: Joel Fernandes <joelagnelf@nvidia.com>
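A minimal sketch of the guard described above, assuming it sits at the top of call_rcu() and uses WARN_ONCE() to emit the stack trace; the message text and placement are assumptions, not the exact patch:

    /* Sketch only: bail out early on a NULL callback instead of letting
     * rcu_core() dereference it later. */
    void call_rcu(struct rcu_head *head, rcu_callback_t func)
    {
        if (WARN_ONCE(!func, "call_rcu(): NULL callback pointer\n"))
            return;
        /* ... normal callback queueing path continues here ... */
    }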
2025-06-19kho: initialize tail pages for higher order folios properlyPratyush Yadav
Currently, when restoring higher order folios, kho_restore_folio() only calls prep_compound_page() on all the pages. That is not enough to properly initialize the folios. The managed page count does not get updated, the reserved flag does not get dropped, and page count does not get initialized properly. Restoring a higher order folio with it results in the following BUG with CONFIG_DEBUG_VM when attempting to free the folio: BUG: Bad page state in process test pfn:104e2b page: refcount:1 mapcount:0 mapping:0000000000000000 index:0xffffffffffffffff pfn:0x104e2b flags: 0x2fffff80000000(node=0|zone=2|lastcpupid=0x1fffff) raw: 002fffff80000000 0000000000000000 00000000ffffffff 0000000000000000 raw: ffffffffffffffff 0000000000000000 00000001ffffffff 0000000000000000 page dumped because: nonzero _refcount [...] Call Trace: <TASK> dump_stack_lvl+0x4b/0x70 bad_page.cold+0x97/0xb2 __free_frozen_pages+0x616/0x850 [...] Combine the path for 0-order and higher order folios, initialize the tail pages with a count of zero, and call adjust_managed_page_count() to account for all the pages instead of just missing them. In addition, since all the KHO-preserved pages get marked with MEMBLOCK_RSRV_NOINIT by deserialize_bitmap(), the reserved flag is not actually set (as can also be seen from the flags of the dumped page in the logs above). So drop the ClearPageReserved() calls. [ptyadav@amazon.de: declare i in the loop instead of at the top] Link: https://lkml.kernel.org/r/20250613125916.39272-1-pratyush@kernel.org Link: https://lkml.kernel.org/r/20250605171143.76963-1-pratyush@kernel.org Fixes: fc33e4b44b27 ("kexec: enable KHO support for memory preservation") Signed-off-by: Pratyush Yadav <ptyadav@amazon.de> Reviewed-by: Pasha Tatashin <pasha.tatashin@soleen.com> Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Cc: Alexander Graf <graf@amazon.com> Cc: Baoquan He <bhe@redhat.com> Cc: Changyuan Lyu <changyuanl@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-06-19ntp: Use ktime_get_ntp_seconds()Thomas Gleixner
Use ktime_get_ntp_seconds() to prepare for auxiliary clocks so that the readout becomes per timekeeper. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/all/20250519083026.472512636@linutronix.de
2025-06-19pidfs: persist informationChristian Brauner
Persist exit and coredump information independently of whether anyone currently holds a pidfd for the struct pid. The current scheme allocated pidfs dentries on-demand repeatedly. This scheme is reaching its limits as it makes it impossible to pin information that needs to be available after the task has exited or coredumped and that should not be lost simply because the pidfd got closed temporarily. The next opener should still see the stashed information. This is also a prerequisite for supporting extended attributes on pidfds to allow attaching meta information to them. If someone opens a pidfd for a struct pid, a pidfs dentry is allocated and stashed in pid->stashed. Once the last pidfd for the struct pid is closed the pidfs dentry is released and removed from pid->stashed. So if 10 callers create a pidfs dentry for the same struct pid sequentially, i.e., each closing the pidfd before the next one creates a new one, then a new pidfs dentry is allocated every time. Because multiple tasks acquiring and releasing a pidfd for the same struct pid can race with one another, a task may still find a valid pidfs entry from the previous task in pid->stashed and reuse it. Or it might find a dead dentry in there, fail to reuse it, and so stash a new pidfs dentry. Multiple tasks may race to stash a new pidfs dentry but only one will succeed; the other ones will put their dentry. The current scheme aims to ensure that a pidfs dentry for a struct pid can only be created if the task is still alive or if a pidfs dentry already existed before the task was reaped and so exit information had been stashed in the pidfs inode. That's great except that it's buggy. If a pidfs dentry is stashed in pid->stashed after pidfs_exit() but before __unhash_process() is called, we will return a pidfd for a reaped task without exit information being available. The pidfds_pid_valid() check does not guard against this race as it doesn't sync at all with pidfs_exit(). The pid_has_task() check might be successful simply because we're before __unhash_process() but after pidfs_exit(). Introduce a new scheme where the lifetime of information associated with a pidfs entry (coredump and exit information) isn't bound to the lifetime of the pidfs inode but to the struct pid itself. The first time a pidfs dentry is allocated for a struct pid, a struct pidfs_attr will be allocated which will be used to store exit and coredump information. If all pidfds for the pidfs dentry are closed, the dentry and inode can be cleaned up but the struct pidfs_attr will stick around until the struct pid itself is freed. This ensures minimal memory usage while persisting relevant information. The new scheme has various advantages. First, it closes the race where we end up handing out a pidfd for a reaped task for which no exit information is available. Second, it minimizes memory usage. Third, it allows removing complex lifetime tracking via dentries when registering a struct pid with pidfs. There's no need to get or put a reference. Instead, the lifetime of exit and coredump information associated with a struct pid is bound to the lifetime of struct pid itself. Link: https://lore.kernel.org/20250618-work-pidfs-persistent-v2-5-98f3456fd552@kernel.org Reviewed-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com> Signed-off-by: Christian Brauner <brauner@kernel.org>
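A rough sketch of the lifetime split described above; struct pidfs_attr and pid->stashed are named in the text, while the field layout below is an assumption:

    /* Sketch: exit/coredump state hangs off struct pid, not the pidfs inode. */
    struct pidfs_attr {
        struct pidfs_exit_info exit_info;          /* illustrative field names */
        struct pidfs_coredump_info coredump_info;
    };

    struct pid {
        /* ... existing fields ... */
        void *stashed;               /* pidfs dentry; comes and goes with pidfds */
        struct pidfs_attr *attr;     /* allocated on first use, freed with the pid */
    };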
2025-06-19timekeeping: Provide ktime_get_ntp_seconds()Thomas Gleixner
ntp_adjtimex() requires access to the actual time keeper per timekeeper ID. Provide an interface. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/all/20250519083026.411809421@linutronix.de
2025-06-19timekeeping: Introduce auxiliary timekeepersAnna-Maria Behnsen
Provide timekeepers for auxiliary clocks and initialize them during boot. Signed-off-by: Anna-Maria Behnsen <anna-maria@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: John Stultz <jstultz@google.com> Link: https://lore.kernel.org/all/20250519083026.350061049@linutronix.de
2025-06-19timekeeping: Add clock_valid flag to timekeeperThomas Gleixner
In preparation for supporting independent auxiliary timekeepers, add a clock valid field and set it to true for the system timekeeper. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: John Stultz <jstultz@google.com> Link: https://lore.kernel.org/all/20250519083026.287145536@linutronix.de
2025-06-19timekeeping: Prepare timekeeping_update_from_shadow()Thomas Gleixner
Don't invoke the VDSO and paravirt updates when utilized for auxiliary clocks. This is a temporary workaround until the VDSO and paravirt interfaces have been worked out. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/all/20250519083026.223876435@linutronix.de
2025-06-19timekeeping: Make __timekeeping_advance() reusableAnna-Maria Behnsen
In __timekeeping_advance() the pointer to struct tk_data is hardcoded by the use of &tk_core. As long as there is only a single timekeeper (tk_core), this is not a problem. But when __timekeeping_advance() will be reused for per auxiliary timekeepers, __timekeeping_advance() needs to be generalized. Add a pointer to struct tk_data as function argument of __timekeeping_advance() and adapt all call sites. Signed-off-by: Anna-Maria Behnsen <anna-maria@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: John Stultz <jstultz@google.com> Link: https://lore.kernel.org/all/20250519083026.160967312@linutronix.de
2025-06-19ntp: Rename __do_adjtimex() to ntp_adjtimex()Thomas Gleixner
Clean up the name space. No functional change. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: John Stultz <jstultz@google.com> Link: https://lore.kernel.org/all/20250519083026.095637820@linutronix.de
2025-06-19ntp: Add timekeeper ID arguments to public functionsThomas Gleixner
In preparation for supporting auxiliary POSIX clocks, add a timekeeper ID to the relevant functions. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: John Stultz <jstultz@google.com> Link: https://lore.kernel.org/all/20250519083026.032425931@linutronix.de
2025-06-19ntp: Add support for auxiliary timekeepersThomas Gleixner
If auxiliary clocks are enabled, provide an array of NTP data so that the auxiliary timekeepers can be steered independently of the core timekeeper. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: John Stultz <jstultz@google.com> Link: https://lore.kernel.org/all/20250519083025.969000914@linutronix.de
2025-06-19time: Introduce auxiliary POSIX clocksAnna-Maria Behnsen
To support auxiliary timekeeping and the related user space interfaces, it's required to define a clock ID range for them. Reserve 8 auxiliary clock IDs after the regular timekeeping clock ID space. This is the maximum number of auxiliary clocks the kernel can support. The actual number of supported clocks obviously depends on the presence of related devices and might be constrained by the available VDSO space. Add the corresponding timekeeper IDs as well. Signed-off-by: Anna-Maria Behnsen <anna-maria@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: John Stultz <jstultz@google.com> Link: https://lore.kernel.org/all/20250519083025.905800695@linutronix.de
2025-06-19timekeeping: Introduce timekeeper IDAnna-Maria Behnsen
As long as there is only a single timekeeper, there is no need to clarify which timekeeper is used. But with the upcoming reuse of the timekeeper infrastructure for auxiliary clock timekeepers, an ID is required to differentiate them. Introduce an enum for timekeeper IDs, introduce a field in struct tk_data to store this timekeeper ID and also add initialization. The id struct field is added at the end of the second cacheline, as there is a 4 byte hole anyway. Signed-off-by: Anna-Maria Behnsen <anna-maria@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: John Stultz <jstultz@google.com> Link: https://lore.kernel.org/all/20250519083025.842476378@linutronix.de
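As a rough illustration of the ID scheme (the enumerator names below are assumptions, not the actual identifiers):

    /* Sketch: one ID for the core timekeeper plus the auxiliary ones. */
    enum timekeeper_ids {
        TIMEKEEPER_CORE,        /* the existing system timekeeper (tk_core) */
        TIMEKEEPER_AUX_FIRST,   /* auxiliary clock timekeepers follow */
        TIMEKEEPER_AUX_LAST = TIMEKEEPER_AUX_FIRST + 7,
        TIMEKEEPERS_MAX,
    };

    struct tk_data {
        /* ... existing fields ... */
        enum timekeeper_ids id; /* stored in the 4 byte hole noted above */
    };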
2025-06-19timekeeping: Avoid double notification in do_adjtimex()Thomas Gleixner
Consolidate do_adjtimex() so that it does not notify about clock changes twice. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/all/20250519083025.779267274@linutronix.de
2025-06-19timekeeping: Cleanup kernel doc of __ktime_get_real_seconds()Thomas Gleixner
Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: John Stultz <jstultz@google.com> Link: https://lore.kernel.org/all/20250519083025.715836017@linutronix.de
2025-06-19timekeeping: Remove hardcoded access to tk_coreThomas Gleixner
This was overlooked in the initial conversion. Use the provided pointer to access the shadow timekeeper. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: John Stultz <jstultz@google.com> Link: https://lore.kernel.org/all/20250519083025.652611452@linutronix.de
2025-06-18bpf: Adjust free target to avoid global starvation of LRU mapWillem de Bruijn
BPF_MAP_TYPE_LRU_HASH can recycle the most recent elements well before the map is full, due to percpu reservations and force shrink before neighbor stealing. Once a CPU is unable to borrow from the global map, it will steal one element from a neighbor once and, after that, each time flush this one element to the global list and immediately recycle it. A batch value of LOCAL_FREE_TARGET (128) will exhaust a 10K element map with 79 CPUs. CPU 79 will observe this behavior even while its neighbors hold 78 * 127 + 1 * 15 == 9921 free elements (99%). CPUs need not be active concurrently. The issue can appear with affinity migration, e.g., irqbalance. Each CPU can reserve and then hold onto its 128 elements indefinitely. Avoid global list exhaustion by limiting aggregate percpu caches to half of the map size, by adjusting LOCAL_FREE_TARGET based on CPU count. This change has no effect on sufficiently large tables. Similar to LOCAL_NR_SCANS and lru->nr_scans, introduce a map variable lru->free_target. The extra field fits in a hole in struct bpf_lru. The cacheline is already warm where read in the hot path. The field is only accessed with the lru lock held. Tested-by: Anton Protopopov <a.s.protopopov@gmail.com> Signed-off-by: Willem de Bruijn <willemb@google.com> Acked-by: Stanislav Fomichev <sdf@fomichev.me> Link: https://lore.kernel.org/r/20250618215803.3587312-1-willemdebruijn.kernel@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
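One plausible reading of the clamping described above, shown as a small standalone calculation (the exact formula in the patch may differ):

    #include <stdio.h>

    #define LOCAL_FREE_TARGET 128

    /* Cap the per-CPU batch so all CPUs together can cache at most half of
     * the map's elements (assumed interpretation of the change). */
    static unsigned int free_target(unsigned int map_size, unsigned int nr_cpus)
    {
        unsigned int cap = map_size / (2 * nr_cpus);

        return cap < LOCAL_FREE_TARGET ? cap : LOCAL_FREE_TARGET;
    }

    int main(void)
    {
        /* 10K-element map, 79 CPUs: the batch drops from 128 to 63, so idle
         * per-CPU caches can no longer exhaust the global free list. */
        printf("%u\n", free_target(10000, 79));
        return 0;
    }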
2025-06-18Merge tag 'cgroup-for-6.16-rc2-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup Linus Torvalds
Pull cgroup fix from Tejun Heo: - In cgroup1 freezer, a task migrating into a frozen cgroup might not get frozen immediately due to the wrong operation order. Fix it. * tag 'cgroup-for-6.16-rc2-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: cgroup,freezer: fix incomplete freezing when attaching tasks
2025-06-18Merge tag 'wq-for-6.16-rc2-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq Linus Torvalds
Pull workqueue fix from Tejun Heo: - Fix missed early init of wq_isolated_cpumask * tag 'wq-for-6.16-rc2-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq: workqueue: Initialize wq_isolated_cpumask in workqueue_init_early()
2025-06-18Merge tag 'sched_ext-for-6.16-rc2-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext Linus Torvalds
Pull sched_ext fixes from Tejun Heo: - Fix a couple bugs in cgroup cpu.weight support - Add the new sched-ext@lists.linux.dev to MAINTAINERS * tag 'sched_ext-for-6.16-rc2-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext: sched_ext, sched/core: Don't call scx_group_set_weight() prematurely from sched_create_group() sched_ext: Make scx_group_set_weight() always update tg->scx.weight sched_ext: Update mailing list entry in MAINTAINERS
2025-06-18cgroup,freezer: fix incomplete freezing when attaching tasksChen Ridong
An issue was found: # cd /sys/fs/cgroup/freezer/ # mkdir test # echo FROZEN > test/freezer.state # cat test/freezer.state FROZEN # sleep 1000 & [1] 863 # echo 863 > test/cgroup.procs # cat test/freezer.state FREEZING When tasks are migrated to a frozen cgroup, the freezer fails to immediately freeze the tasks, causing the cgroup to remain in the "FREEZING" state. The freeze_task() function is called before clearing the CGROUP_FROZEN flag. This causes the freezing() check to incorrectly return false, preventing __freeze_task() from being invoked for the migrated task. To fix this issue, clear the CGROUP_FROZEN state before calling freeze_task(). Fixes: f5d39b020809 ("freezer,sched: Rewrite core freezer logic") Cc: stable@vger.kernel.org # v6.1+ Reported-by: Zhong Jiawei <zhongjiawei1@huawei.com> Signed-off-by: Chen Ridong <chenridong@huawei.com> Acked-by: Michal Koutný <mkoutny@suse.com> Signed-off-by: Tejun Heo <tj@kernel.org>
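A sketch of the reordering described above; freeze_task(), __freeze_task(), freezing() and CGROUP_FROZEN come from the commit message, while the surrounding helper is hypothetical:

    /* Sketch of the attach path after the fix. */
    static void freezer_attach_one(struct task_struct *task)
    {
        /* Clear the stale CGROUP_FROZEN state first ... */
        clear_cgroup_frozen_state(task);        /* hypothetical helper */

        /* ... so the freezing() check no longer returns false and
         * freeze_task() actually invokes __freeze_task() for the
         * migrated task. */
        freeze_task(task);
    }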
2025-06-18sched/core: Reorganize cgroup bandwidth control interface file writesTejun Heo
- Move input parameter validation from tg_set_cfs_bandwidth() to the new outer function tg_set_bandwidth(). The outer function handles parameters in usecs, validates them and calls tg_set_cfs_bandwidth() which converts them into nsecs. This matches tg_bandwidth() on the read side. - max/min_cfs_* consts are now used by tg_set_bandwidth(). Relocate, convert into usecs and drop "cfs" from the names. - Reimplement cpu_cfs_{period|quota|burst}_write_*() using tg_bandwidth() and tg_set_bandwidth() and replace "cfs" in the names with "bw". - Update cpu_max_write() to use tg_set_bandwidth(). cpu_period_quota_parse() is updated to drop nsec conversion accordingly. This aligns the behavior with cfs_period_quota_print(). - Drop now unused tg_set_cfs_{period|quota|burst}(). - While at it, for consistency, rename default_cfs_period() to default_bw_period_us() and make it return usecs. This is to prepare for adding bandwidth control support to sched_ext. tg_set_bandwidth() will be used as the muxing point. No functional changes intended. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20250614012346.2358261-5-tj@kernel.org
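A minimal sketch of the layering described above, assuming the outer function takes and validates microsecond values and only converts to nanoseconds when calling into the CFS-specific setter (the limit-constant names and the inner signature are assumptions):

    /* Sketch: usec-level outer setter wrapping the nsec-level CFS setter. */
    static int tg_set_bandwidth(struct task_group *tg,
                                u64 period_us, u64 quota_us, u64 burst_us)
    {
        if (period_us < min_bw_period_us || period_us > max_bw_period_us)
            return -EINVAL;                 /* validation stays in usecs */

        return tg_set_cfs_bandwidth(tg, period_us * NSEC_PER_USEC,
                                    quota_us * NSEC_PER_USEC,
                                    burst_us * NSEC_PER_USEC);
    }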
2025-06-18sched/core: Reorganize cgroup bandwidth control interface file readsTejun Heo
- Update tg_get_cfs_*() to return u64 values. These are now used as the low level accessors to the fair's bandwidth configuration parameters. Translation to usecs takes place in these functions. - Add tg_bandwidth() which reads all three bandwidth parameters using tg_get_cfs_*(). - Reimplement cgroup interface read functions using tg_bandwidth(). Drop cfs from the function names. This is to prepare for adding bandwidth control support to sched_ext. tg_bandwidth() will be used as the muxing point similar to tg_weight(). No functional changes. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20250614012346.2358261-4-tj@kernel.org
2025-06-18sched/core: Relocate tg_get_cfs_*() and cpu_cfs_*_read_*()Tejun Heo
Collect the getters, relocate the trivial interface file wrappers, and put all of them in period, quota, burst order to prepare for future changes. Pure reordering. No functional changes. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20250614012346.2358261-3-tj@kernel.org
2025-06-18sched/fair: Move max_cfs_quota_period decl and default_cfs_period() def from fair.c to sched.h Tejun Heo
max_cfs_quota_period is defined in core.c but has a declaration in fair.c. Move the declaration to kernel/sched/sched.h. Also, move default_cfs_period() from fair.c to sched.h. No functional changes. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20250614012346.2358261-2-tj@kernel.org
2025-06-18fgraph: Do not enable function_graph tracer when setting funcgraph-argsSteven Rostedt
When setting the funcgraph-args option when function graph tracer is net enabled, it incorrectly enables it. Worse, it unregisters itself when it was never registered. Then when it gets enabled again, it will register itself a second time causing a WARNing. ~# echo 1 > /sys/kernel/tracing/options/funcgraph-args ~# head -20 /sys/kernel/tracing/trace # tracer: nop # # entries-in-buffer/entries-written: 813/26317372 #P:8 # # _-----=> irqs-off/BH-disabled # / _----=> need-resched # | / _---=> hardirq/softirq # || / _--=> preempt-depth # ||| / _-=> migrate-disable # |||| / delay # TASK-PID CPU# ||||| TIMESTAMP FUNCTION # | | | ||||| | | <idle>-0 [007] d..4. 358.966010: 7) 1.692 us | fetch_next_timer_interrupt(basej=4294981640, basem=357956000000, base_local=0xffff88823c3ae040, base_global=0xffff88823c3af300, tevt=0xffff888100e47cb8); <idle>-0 [007] d..4. 358.966012: 7) | tmigr_cpu_deactivate(nextexp=357988000000) { <idle>-0 [007] d..4. 358.966013: 7) | _raw_spin_lock(lock=0xffff88823c3b2320) { <idle>-0 [007] d..4. 358.966014: 7) 0.981 us | preempt_count_add(val=1); <idle>-0 [007] d..5. 358.966017: 7) 1.058 us | do_raw_spin_lock(lock=0xffff88823c3b2320); <idle>-0 [007] d..4. 358.966019: 7) 5.824 us | } <idle>-0 [007] d..5. 358.966021: 7) | tmigr_inactive_up(group=0xffff888100cb9000, child=0x0, data=0xffff888100e47bc0) { <idle>-0 [007] d..5. 358.966022: 7) | tmigr_update_events(group=0xffff888100cb9000, child=0x0, data=0xffff888100e47bc0) { Notice the "tracer: nop" at the top there. The current tracer is the "nop" tracer, but the content is obviously the function graph tracer. Enabling function graph tracing will cause it to register again and trigger a warning in the accounting: ~# echo function_graph > /sys/kernel/tracing/current_tracer -bash: echo: write error: Device or resource busy With the dmesg of: ------------[ cut here ]------------ WARNING: CPU: 7 PID: 1095 at kernel/trace/ftrace.c:3509 ftrace_startup_subops+0xc1e/0x1000 Modules linked in: kvm_intel kvm irqbypass CPU: 7 UID: 0 PID: 1095 Comm: bash Not tainted 6.16.0-rc2-test-00006-gea03de4105d3 #24 PREEMPT Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2 04/01/2014 RIP: 0010:ftrace_startup_subops+0xc1e/0x1000 Code: 48 b8 22 01 00 00 00 00 ad de 49 89 84 24 88 01 00 00 8b 44 24 08 89 04 24 e9 c3 f7 ff ff c7 04 24 ed ff ff ff e9 b7 f7 ff ff <0f> 0b c7 04 24 f0 ff ff ff e9 a9 f7 ff ff c7 04 24 f4 ff ff ff e9 RSP: 0018:ffff888133cff948 EFLAGS: 00010202 RAX: 0000000000000001 RBX: 1ffff1102679ff31 RCX: 0000000000000000 RDX: 1ffffffff0b27a60 RSI: ffffffff8593d2f0 RDI: ffffffff85941140 RBP: 00000000000c2041 R08: ffffffffffffffff R09: ffffed1020240221 R10: ffff88810120110f R11: ffffed1020240214 R12: ffffffff8593d2f0 R13: ffffffff8593d300 R14: ffffffff85941140 R15: ffffffff85631100 FS: 00007f7ec6f28740(0000) GS:ffff8882b5251000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007f7ec6f181c0 CR3: 000000012f1d0005 CR4: 0000000000172ef0 Call Trace: <TASK> ? __pfx_ftrace_startup_subops+0x10/0x10 ? find_held_lock+0x2b/0x80 ? ftrace_stub_direct_tramp+0x10/0x10 ? ftrace_stub_direct_tramp+0x10/0x10 ? trace_preempt_on+0xd0/0x110 ? __pfx_trace_graph_entry_args+0x10/0x10 register_ftrace_graph+0x4d2/0x1020 ? tracing_reset_online_cpus+0x14b/0x1e0 ? __pfx_register_ftrace_graph+0x10/0x10 ? ring_buffer_record_enable+0x16/0x20 ? tracing_reset_online_cpus+0x153/0x1e0 ? __pfx_tracing_reset_online_cpus+0x10/0x10 ? 
__pfx_trace_graph_return+0x10/0x10 graph_trace_init+0xfd/0x160 tracing_set_tracer+0x500/0xa80 ? __pfx_tracing_set_tracer+0x10/0x10 ? lock_release+0x181/0x2d0 ? _copy_from_user+0x26/0xa0 tracing_set_trace_write+0x132/0x1e0 ? __pfx_tracing_set_trace_write+0x10/0x10 ? ftrace_graph_func+0xcc/0x140 ? ftrace_stub_direct_tramp+0x10/0x10 ? ftrace_stub_direct_tramp+0x10/0x10 ? ftrace_stub_direct_tramp+0x10/0x10 vfs_write+0x1d0/0xe90 ? __pfx_vfs_write+0x10/0x10 Have the setting of the funcgraph-args check if function_graph tracer is the current tracer of the instance, and if not, do nothing, as there's nothing to do (the option is checked when function_graph tracing starts). Cc: stable@vger.kernel.org Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Mark Rutland <mark.rutland@arm.com> Link: https://lore.kernel.org/20250618073801.057ea636@gandalf.local.home Fixes: c7a60a733c373 ("ftrace: Have funcgraph-args take affect during tracing") Closes: https://lore.kernel.org/all/4ab1a7bdd0174ab09c7b0d68cdbff9a4@huawei.com/ Reported-by: Changbin Du <changbin.du@huawei.com> Tested-by: Changbin Du <changbin.du@huawei.com> Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-06-17bpf: Fix key serial argument of bpf_lookup_user_key()James Bottomley
The underlying lookup_user_key() function uses a signed 32 bit integer for key serial numbers because legitimate serial numbers are positive (and > 3) and keyrings are negative. Using a u32 for the keyring in the bpf function doesn't currently cause any conversion problems but will start to trip the signed to unsigned conversion warnings when the kernel enables them, so convert the argument to signed (and update the tests accordingly) before it acquires more users. Signed-off-by: James Bottomley <James.Bottomley@HansenPartnership.com> Reviewed-by: Roberto Sassu <roberto.sassu@huawei.com> Link: https://lore.kernel.org/r/84cdb0775254d297d75e21f577089f64abdfbd28.camel@HansenPartnership.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
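For context, key serials are signed: positive values name real keys while negative values address the caller's special keyrings (e.g. KEY_SPEC_SESSION_KEYRING is -3), so a signed argument preserves that convention. A hedged sketch of the resulting prototype (the exact declaration may differ):

    /* Sketch: kfunc argument switched from u32 to s32. */
    struct bpf_key *bpf_lookup_user_key(s32 serial, u64 flags);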
2025-06-17bpf: Get rid of redundant 3rd argument of prepare_seq_file()Al Viro
Remove 3rd argument in prepare_seq_file() to clean up the code a bit. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Acked-by: Yonghong Song <yonghong.song@linux.dev> Link: https://lore.kernel.org/r/20250615004719.GE3011112@ZenIV Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-06-17cgroup: remove per-cpu per-subsystem locksShakeel Butt
The rstat update side used to insert the cgroup whose stats are updated into the update tree, and the read side flushed the update tree to get the latest up-to-date stats. The per-cpu per-subsystem locks were used to synchronize the update and flush sides. However, now the update side does not access the update tree but uses per-cpu lockless lists, so there is no need for locks to synchronize the update and flush sides. Let's remove them. Suggested-by: JP Kobryn <inwardvessel@gmail.com> Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev> Tested-by: JP Kobryn <inwardvessel@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-06-17cgroup: make css_rstat_updated nmi safeShakeel Butt
To make css_rstat_updated() able to safely run in nmi context, let's move the rstat update tree creation to the flush side and use per-cpu lockless lists in struct cgroup_subsys to track the css whose stats are updated on that cpu. The struct cgroup_subsys_state now has a per-cpu lnode which needs to be inserted into the corresponding per-cpu lhead of struct cgroup_subsys. Since we want the insertion to be nmi safe, there can be multiple inserters on the same cpu for the same lnode. Here multiple inserters are from stacked contexts like softirq, hardirq and nmi. The current llist does not provide a function to protect against the scenario where multiple inserters can use the same lnode. So, using llist_node() out of the box is not safe for this scenario. However, we can protect against multiple inserters using the same lnode by using the fact that an llist node points to itself when not on the llist: atomically reset it and select the winner as the single inserter. Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev> Tested-by: JP Kobryn <inwardvessel@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org>
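A self-contained model of the claim scheme described above, using C11 atomics in place of the kernel primitives; names and structure are illustrative only:

    #include <stdatomic.h>
    #include <stdbool.h>
    #include <stdio.h>

    struct lnode {
        _Atomic(struct lnode *) next;
    };

    /* A node that is not on any list points to itself.  The first context
     * (task, softirq, hardirq or nmi) to atomically flip next away from the
     * self-pointer wins and becomes the single inserter. */
    static bool claim_for_insert(struct lnode *n)
    {
        struct lnode *expected = n;    /* self-pointer == off-list */

        return atomic_compare_exchange_strong(&n->next, &expected, NULL);
    }

    int main(void)
    {
        struct lnode n;

        atomic_init(&n.next, &n);                             /* start off-list */
        printf("first claim:  %d\n", claim_for_insert(&n));   /* 1: winner */
        printf("second claim: %d\n", claim_for_insert(&n));   /* 0: loser  */
        return 0;
    }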
2025-06-17cgroup: support to enable nmi-safe css_rstat_updatedShakeel Butt
Add necessary infrastructure to enable the nmi-safe execution of css_rstat_updated(). Currently css_rstat_updated() takes a per-cpu per-css raw spinlock to add the given css in the per-cpu per-css update tree. However the kernel can not spin in nmi context, so we need to remove the spinning on the raw spinlock in css_rstat_updated(). To support lockless css_rstat_updated(), let's add necessary data structures in the css and ss structures. Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev> Tested-by: JP Kobryn <inwardvessel@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-06-17workqueue: Initialize wq_isolated_cpumask in workqueue_init_early()Chuyi Zhou
Currently, when isolcpus is enabled via the cmdline, wq_isolated_cpumask does not include these isolated CPUs, even though wq_unbound_cpumask has already excluded them. It is only when we successfully configure an isolated cpuset partition that wq_isolated_cpumask gets overwritten by workqueue_unbound_exclude_cpumask(), including both the cmdline-specified isolated CPUs and the isolated CPUs within the cpuset partitions. Fix this issue by initializing wq_isolated_cpumask properly in workqueue_init_early(). Fixes: fe28f631fa94 ("workqueue: Add workqueue_unbound_exclude_cpumask() to exclude CPUs from wq_unbound_cpumask") Signed-off-by: Chuyi Zhou <zhouchuyi@bytedance.com> Reviewed-by: Waiman Long <longman@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>
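One plausible shape of the early initialization described above, assuming the cmdline-isolated CPUs are exactly those outside the HK_TYPE_DOMAIN housekeeping mask (the actual patch may express this differently):

    /* Sketch: seed wq_isolated_cpumask in workqueue_init_early() instead of
     * leaving it empty until a cpuset partition rewrites it. */
    cpumask_andnot(wq_isolated_cpumask, cpu_possible_mask,
                   housekeeping_cpumask(HK_TYPE_DOMAIN));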
2025-06-17Merge branch 'WQ_PERCPU' into for-6.17Tejun Heo
2025-06-17workqueue: Add system_percpu_wq and system_dfl_wqMarco Crivellari
Currently, if a user enqueues a work item using schedule_delayed_work(), the workqueue used is "system_wq" (a per-cpu wq), while queue_delayed_work() uses WORK_CPU_UNBOUND (used when a cpu is not specified). The same applies to schedule_work(), which uses system_wq, and queue_work(), which again makes use of WORK_CPU_UNBOUND. This lack of consistency cannot be addressed without refactoring the API. system_wq is a per-CPU workqueue, yet nothing in its name says anything about that CPU affinity constraint, which is very often not required by users. Make it clear by adding a system_percpu_wq. system_unbound_wq should be the default workqueue so as not to enforce locality constraints for random work whenever it's not required. Add system_dfl_wq to encourage its use when unbound work should be used. Suggested-by: Tejun Heo <tj@kernel.org> Signed-off-by: Marco Crivellari <marco.crivellari@suse.com> Signed-off-by: Tejun Heo <tj@kernel.org>
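A short usage sketch with the two new names introduced by this patch; the work item and handler below are illustrative:

    static void my_handler(struct work_struct *work) { /* ... */ }
    static DECLARE_WORK(my_work, my_handler);

    static void submit_examples(void)
    {
        /* Explicitly per-CPU behaviour (what system_wq has always provided): */
        queue_work(system_percpu_wq, &my_work);

        /* Default choice when no CPU locality is required: */
        queue_work(system_dfl_wq, &my_work);
    }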
2025-06-17sched_ext, sched/core: Don't call scx_group_set_weight() prematurely from sched_create_group() Tejun Heo
During task_group creation, sched_create_group() calls scx_group_set_weight() with CGROUP_WEIGHT_DFL to initialize the sched_ext portion. This is premature and ends up calling ops.cgroup_set_weight() with an incorrect @cgrp before ops.cgroup_init() is called. sched_create_group() should just initialize SCX related fields in the new task_group. Fix it by factoring out scx_tg_init() from sched_init() and making sched_create_group() call that function instead of scx_group_set_weight(). v2: Retain CONFIG_EXT_GROUP_SCHED ifdef in sched_init() as removing it leads to build failures on !CONFIG_GROUP_SCHED configs. Signed-off-by: Tejun Heo <tj@kernel.org> Fixes: 819513666966 ("sched_ext: Add cgroup support") Cc: stable@vger.kernel.org # v6.12+
2025-06-17sched_ext: Make scx_group_set_weight() always update tg->scx.weightTejun Heo
Otherwise, tg->scx.weight can go out of sync while scx_cgroup is not enabled and ops.cgroup_init() may be called with a stale weight value. Signed-off-by: Tejun Heo <tj@kernel.org> Fixes: 819513666966 ("sched_ext: Add cgroup support") Cc: stable@vger.kernel.org # v6.12+
2025-06-17bpf: Mark dentry->d_inode as trusted_or_nullSong Liu
LSM hooks such as security_path_mknod() and security_inode_rename() have access to newly allocated negative dentry, which has NULL d_inode. Therefore, it is necessary to do the NULL pointer check for d_inode. Also add selftests that checks the verifier enforces the NULL pointer check. Signed-off-by: Song Liu <song@kernel.org> Reviewed-by: Matt Bobrowski <mattbobrowski@google.com> Link: https://lore.kernel.org/r/20250613052857.1992233-1-song@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-06-17sysfs: treewide: switch back to attribute_group::bin_attrsThomas Weißschuh
The normal bin_attrs field can now handle const pointers. This makes the _new variant unnecessary. Switch all users back. Signed-off-by: Thomas Weißschuh <linux@weissschuh.net> Link: https://lore.kernel.org/r/20250530-sysfs-const-bin_attr-final-v3-4-724bfcf05b99@weissschuh.net Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-06-17sysfs: treewide: switch back to bin_attribute::read()/write()Thomas Weißschuh
The bin_attribute argument of bin_attribute::read() is now const. This makes the _new() callbacks unnecessary. Switch all users back. Signed-off-by: Thomas Weißschuh <linux@weissschuh.net> Link: https://lore.kernel.org/r/20250530-sysfs-const-bin_attr-final-v3-3-724bfcf05b99@weissschuh.net Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-06-16audit,module: restore audit logging in load failure caseRichard Guy Briggs
Moving the module sanity check earlier skipped the audit logging call in the failure case and moved it to a place where the previously used context is unavailable. Add an audit logging call for the module loading failure case and get the module name when possible. Link: https://issues.redhat.com/browse/RHEL-52839 Fixes: 02da2cbab452 ("module: move check_modinfo() early to early_mod_check()") Signed-off-by: Richard Guy Briggs <rgb@redhat.com> Reviewed-by: Petr Pavlu <petr.pavlu@suse.com> Signed-off-by: Paul Moore <paul@paul-moore.com>
2025-06-16workqueue: Basic memory allocation profiling supportKent Overstreet
Hook alloc_workqueue and alloc_workqueue_attrs() so that they're accounted to the callsite. Since we're doing allocations on behalf of another subsystem, this helps when using memory allocation profiling to check for leaks. Cc: Tejun Heo <tj@kernel.org> Cc: Lai Jiangshan <jiangshanlai@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-06-16sched_ext: Return NULL in llc_spanCheng-Yang Chou
Use NULL instead of 0 to signal no LLC domain, matching numa_span() and the function comment. No functional change. Signed-off-by: Cheng-Yang Chou <yphbchou0911@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-06-16coredump: rename do_coredump() to vfs_coredump()Christian Brauner
Align the naming with the rest of our helpers exposed outside of core vfs. Link: https://lore.kernel.org/20250612-work-coredump-massage-v1-9-315c0c34ba94@kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-06-14clocksource: Use cpumask_next_wrap() in clocksource_watchdog()Yury Norov [NVIDIA]
cpumask_next_wrap() is less verbose and more efficient compared to cpumask_next() followed by cpumask_first(). Signed-off-by: Yury Norov [NVIDIA] <yury.norov@gmail.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: John Stultz <jstultz@google.com> Link: https://lore.kernel.org/all/20250614155031.340988-3-yury.norov@gmail.com
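A sketch of the pattern being replaced, assuming the current two-argument form of cpumask_next_wrap():

    /* Before: open-coded wrap-around scan of the online mask. */
    cpu = cpumask_next(cpu, cpu_online_mask);
    if (cpu >= nr_cpu_ids)
        cpu = cpumask_first(cpu_online_mask);

    /* After: a single call that wraps around for us. */
    cpu = cpumask_next_wrap(cpu, cpu_online_mask);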
2025-06-14clocksource: Use cpumask_any_but() in clocksource_verify_choose_cpus()Yury Norov [NVIDIA]
cpumask_any_but() is less verbose than cpumask_first() followed by cpumask_next(). Use it in clocksource_verify_choose_cpus(). Signed-off-by: Yury Norov [NVIDIA] <yury.norov@gmail.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: John Stultz <jstultz@google.com> Link: https://lore.kernel.org/all/20250614155031.340988-2-yury.norov@gmail.com
2025-06-13sched_ext: Always use SMP versions in kernel/sched/ext_idle.hCheng-Yang Chou
Simplify the scheduler by making formerly SMP-only primitives and data structures unconditional. tj: Updated subject for clarity. Signed-off-by: Cheng-Yang Chou <yphbchou0911@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-06-13sched_ext: Always use SMP versions in kernel/sched/ext_idle.cCheng-Yang Chou
Simplify the scheduler by making formerly SMP-only primitives and data structures unconditional. tj: Updated subject for clarity. Fixed stray #else block which wasn't removed causing build failure. Signed-off-by: Cheng-Yang Chou <yphbchou0911@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-06-13sched_ext: Always use SMP versions in kernel/sched/ext.hCheng-Yang Chou
Simplify the scheduler by making formerly SMP-only primitives and data structures unconditional. tj: Updated subject for clarity. Replace #if defined() with #ifdef. Signed-off-by: Cheng-Yang Chou <yphbchou0911@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org>