path: root/kernel
2025-07-11  bpf: Remove attach_type in bpf_tracing_link  [Tao Chen]
Use attach_type in bpf_link, and remove it in bpf_tracing_link. Signed-off-by: Tao Chen <chen.dylane@linux.dev> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Jiri Olsa <jolsa@kernel.org> Link: https://lore.kernel.org/bpf/20250710032038.888700-7-chen.dylane@linux.dev
2025-07-11  bpf: Remove attach_type in bpf_netns_link  [Tao Chen]
Use attach_type in bpf_link, and remove it in bpf_netns_link. And move netns_type field to the end to fill the byte hole. Signed-off-by: Tao Chen <chen.dylane@linux.dev> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com> Acked-by: Jiri Olsa <jolsa@kernel.org> Acked-by: John Fastabend <john.fastabend@gmail.com> Link: https://lore.kernel.org/bpf/20250710032038.888700-6-chen.dylane@linux.dev
2025-07-11  bpf: Remove location field in tcx_link  [Tao Chen]
Use attach_type in bpf_link to replace the location field, and remove the location field in tcx_link. Signed-off-by: Tao Chen <chen.dylane@linux.dev> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Jiri Olsa <jolsa@kernel.org> Acked-by: John Fastabend <john.fastabend@gmail.com> Link: https://lore.kernel.org/bpf/20250710032038.888700-5-chen.dylane@linux.dev
2025-07-11  bpf: Remove attach_type in bpf_cgroup_link  [Tao Chen]
Use attach_type in bpf_link, and remove it in bpf_cgroup_link. Signed-off-by: Tao Chen <chen.dylane@linux.dev> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Jiri Olsa <jolsa@kernel.org> Link: https://lore.kernel.org/bpf/20250710032038.888700-3-chen.dylane@linux.dev
2025-07-11  bpf: Add attach_type field to bpf_link  [Tao Chen]
The attach_type will be set when a link is created by the user. It is better to record attach_type in bpf_link generically and have it available universally for all link types. So add the attach_type field to bpf_link and move the sleepable field to avoid unnecessary gap padding. Signed-off-by: Tao Chen <chen.dylane@linux.dev> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Jiri Olsa <jolsa@kernel.org> Link: https://lore.kernel.org/bpf/20250710032038.888700-2-chen.dylane@linux.dev
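The shape of the change, as a minimal sketch (only attach_type and the moved sleepable come from the commit text; member order and surrounding fields are assumptions):

    /* include/linux/bpf.h (sketch, abridged) */
    struct bpf_link {
            atomic64_t refcnt;
            u32 id;
            enum bpf_link_type type;
            const struct bpf_link_ops *ops;
            struct bpf_prog *prog;
            /* new: generic attach type, recorded at link creation */
            enum bpf_attach_type attach_type;
            /* moved next to other small members to avoid gap padding */
            bool sleepable;
            struct work_struct work;
    };

The patches above then drop the per-link-type copies (bpf_tracing_link, bpf_netns_link, bpf_cgroup_link) and the tcx_link location field in favor of this generic one.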
2025-07-11  bpf: Forget ranges when refining tnum after JSET  [Paul Chaignon]
Syzbot reported a kernel warning due to a range invariant violation on the following BPF program.

  0: call bpf_get_netns_cookie
  1: if r0 == 0 goto <exit>
  2: if r0 & 0xffffffff goto <exit>

The issue is on the path where we fall through both jumps. That path is unreachable at runtime: after insn 1, we know r0 != 0, but with the sign extension on the jset, we would only fall through insn 2 if r0 == 0. Unfortunately, is_branch_taken() isn't currently able to figure this out, so the verifier walks all branches. The verifier then refines the register bounds using the second condition and we end up with inconsistent bounds on this unreachable path:

  1: if r0 == 0 goto <exit>
     r0: u64=[0x1, 0xffffffffffffffff] var_off=(0, 0xffffffffffffffff)
  2: if r0 & 0xffffffff goto <exit>
     r0 before reg_bounds_sync: u64=[0x1, 0xffffffffffffffff] var_off=(0, 0)
     r0 after reg_bounds_sync:  u64=[0x1, 0] var_off=(0, 0)

Improving the range refinement for JSET to cover all cases is tricky. We also don't expect many users to rely on JSET given LLVM doesn't generate those instructions. So instead of improving the range refinement for JSETs, Eduard suggested we forget the ranges whenever we're narrowing tnums after a JSET. This patch implements that approach.

Reported-by: syzbot+c711ce17dd78e5d4fdcf@syzkaller.appspotmail.com
Suggested-by: Eduard Zingerman <eddyz87@gmail.com>
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com>
Link: https://lore.kernel.org/r/9d4fd6432a095d281f815770608fdcd16028ce0b.1752171365.git.paul.chaignon@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
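The approach, as an illustrative sketch (__mark_reg_unbounded() and reg_bounds_sync() are existing verifier helpers; their exact use here is an assumption, not the literal diff):

    /* kernel/bpf/verifier.c (sketch): once a JSET narrows the tnum,
     * the previously derived ranges may contradict it on unreachable
     * paths, so drop them and re-derive everything from the tnum. */
    reg->var_off = narrowed_tnum;   /* tnum refined by the JSET */
    __mark_reg_unbounded(reg);      /* forget u64/s64/u32/s32 ranges */
    reg_bounds_sync(reg);           /* recompute bounds from var_off */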
2025-07-11  bpf/arena: add bpf_arena_reserve_pages kfunc  [Emil Tsalapatis]
Add a new BPF arena kfunc for reserving a range of arena virtual addresses without backing them with pages. This prevents the range from being populated using bpf_arena_alloc_pages(). Acked-by: Yonghong Song <yonghong.song@linux.dev> Signed-off-by: Emil Tsalapatis <emil@etsalapatis.com> Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Link: https://lore.kernel.org/r/20250709191312.29840-2-emil@etsalapatis.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
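A hedged usage sketch from the BPF program side (the kfunc name comes from the commit; its exact signature, ARENA_BASE, and the map declaration are assumptions):

    /* sketch: reserve arena VA so it cannot be populated later */
    struct {
            __uint(type, BPF_MAP_TYPE_ARENA);
            __uint(map_flags, BPF_F_MMAPABLE);
            __uint(max_entries, 100);          /* arena size in pages */
    } arena SEC(".maps");

    SEC("syscall")
    int reserve_two_pages(void *ctx)
    {
            void *addr = (void *)ARENA_BASE;   /* hypothetical page-aligned VA */

            /* assumed signature: (arena map, address, page count);
             * bpf_arena_alloc_pages() can then not populate the range */
            return bpf_arena_reserve_pages(&arena, addr, 2);
    }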
2025-07-11  cpu: Define attack vectors  [David Kaplan]
Define 4 new attack vectors that are used for controlling CPU speculation mitigations. These may be individually disabled as part of the mitigations= command line. Attack vector controls are combined with global options such as 'auto' or 'auto,nosmt', e.g. 'mitigations=auto,no_user_kernel'. The global options come first in the mitigations= string. Cross-thread mitigations can either remain fully enabled, including potentially disabling SMT ('auto,nosmt'), remain enabled except for disabling SMT ('auto'), or be entirely disabled through the new 'no_cross_thread' attack vector option. The default settings for these attack vectors are consistent with existing kernel defaults, other than the automatic disabling of VM-based attack vectors if KVM support is not present. Signed-off-by: David Kaplan <david.kaplan@amd.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/20250707183316.1349127-3-david.kaplan@amd.com
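Hedged command-line examples (only 'no_user_kernel' and 'no_cross_thread' appear in the text above; the combinations follow the stated "global options come first" rule):

    # global 'auto' policy, but skip user-to-kernel mitigations
    mitigations=auto,no_user_kernel

    # keep defaults but disable cross-thread mitigations entirely
    mitigations=auto,no_cross_thread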
2025-07-11  Merge tag 'dma-mapping-6.16-2025-07-11' of git://git.kernel.org/pub/scm/linux/kernel/git/mszyprowski/linux  [Linus Torvalds]
Pull dma-mapping fix from Marek Szyprowski:

 - small fix relevant to arm64 server and custom CMA configuration (Feng Tang)

* tag 'dma-mapping-6.16-2025-07-11' of git://git.kernel.org/pub/scm/linux/kernel/git/mszyprowski/linux:
  dma-contiguous: hornor the cma address limit setup by user
2025-07-11  futex: Remove support for IMMUTABLE  [Sebastian Andrzej Siewior]
The FH_FLAG_IMMUTABLE flag was meant to avoid the reference counting on the private hash and so to avoid the performance regression on big machines. With the switch to per-CPU counters this is no longer needed. That flag was never usable on any released kernel. Remove any support for IMMUTABLE while preserving the flags argument and enforcing it to be zero. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20250710110011.384614-5-bigeasy@linutronix.de
2025-07-11  futex: Make futex_private_hash_get() static  [Sebastian Andrzej Siewior]
futex_private_hash_get() is not used outside of its compilation unit. Make it static. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20250710110011.384614-4-bigeasy@linutronix.de
2025-07-11  futex: Use RCU-based per-CPU reference counting instead of rcuref_t  [Peter Zijlstra]
The use of rcuref_t for reference counting introduces a performance bottleneck when accessed concurrently by multiple threads during futex operations. Replace rcuref_t with specially crafted per-CPU reference counters. The lifetime logic remains the same. The newly allocated private hash starts in FR_PERCPU state. In this state, each futex operation that requires the private hash uses a per-CPU counter (an unsigned int) for incrementing or decrementing the reference count. When the private hash is about to be replaced, the per-CPU counters are migrated to an atomic_t counter mm_struct::futex_atomic. The migration process:

- Wait for one RCU grace period to ensure all users observe the current private hash. This can be skipped if a grace period elapsed since the private hash was assigned.
- futex_private_hash::state is set to FR_ATOMIC, forcing all users to use mm_struct::futex_atomic for reference counting.
- After a RCU grace period, all users are guaranteed to be using the atomic counter. The per-CPU counters can now be summed up and added to the atomic_t counter. If the resulting count is zero, the hash can be safely replaced. Otherwise, active users still hold a valid reference.
- Once the atomic reference count drops to zero, the next futex operation will switch to the new private hash.

call_rcu_hurry() is used to speed up the transition which otherwise might be delayed with RCU_LAZY. There is nothing wrong with using call_rcu(). The side effects would be that with auto scaling the new hash is used later and the SET_SLOTS prctl() will block longer.

[bigeasy: commit description + mm get/put_async]

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20250710110011.384614-3-bigeasy@linutronix.de
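A minimal sketch of the FR_PERCPU fast path under assumed names (the real code additionally handles the migration corner cases described above):

    /* sketch: take a reference on the private hash */
    static bool futex_ref_get(struct futex_private_hash *fph)
    {
            bool ok = true;

            rcu_read_lock();
            if (READ_ONCE(fph->state) == FR_PERCPU) {
                    /* common case: uncontended per-CPU increment */
                    this_cpu_inc(*fph->percpu_ref);
            } else {
                    /* FR_ATOMIC: hash is being replaced, fall back
                     * to the shared atomic counter */
                    ok = atomic_inc_not_zero(&fph->mm->futex_atomic);
            }
            rcu_read_unlock();
            return ok;
    }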
2025-07-11  irqdomain: Export irq_domain_free_irqs_top()  [Nam Cao]
Export irq_domain_free_irqs_top(), making it usable for drivers compiled as modules. Reviewed-by: Michael Kelley <mhklinux@outlook.com> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Nam Cao <namcao@linutronix.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2025-07-11  Merge back earlier changes related to system suspend and hibernation  [Rafael J. Wysocki]
2025-07-10  Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net  [Jakub Kicinski]
Cross-merge networking fixes after downstream PR (net-6.16-rc6). No conflicts.

Adjacent changes:

  Documentation/devicetree/bindings/net/allwinner,sun8i-a83t-emac.yaml
    0a12c435a1d6 ("dt-bindings: net: sun8i-emac: Add A100 EMAC compatible")
    b3603c0466a8 ("dt-bindings: net: sun8i-emac: Rename A523 EMAC0 to GMAC0")

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-10  PM: hibernate: shrink shmem pages after dev_pm_ops.prepare()  [Samuel Zhang]
When hibernating with data center dGPUs, a huge amount of VRAM data is moved to shmem during dev_pm_ops.prepare(). These shmem pages take so much system memory that there is not enough free memory left for creating the hibernation image, which causes hibernation to fail and abort. After dev_pm_ops.prepare(), call shrink_all_memory() to force shmem pages to be moved to the swap disk and reclaimed, so that there is enough system memory for the hibernation image and fewer pages need to be copied into it. This patch can only flush and free about half of the shmem pages. It would be better to flush and free more pages, even all of the shmem pages, so that fewer pages have to be copied to the hibernation image and the overall hibernation time can be reduced. Signed-off-by: Samuel Zhang <guoqing.zhang@amd.com> Acked-by: Rafael J. Wysocki <rafael@kernel.org> Link: https://lore.kernel.org/r/20250710062313.3226149-4-guoqing.zhang@amd.com Signed-off-by: Mario Limonciello <mario.limonciello@amd.com>
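A sketch of the idea (the call-site function name and the reclaim target are assumptions; shrink_all_memory() is the existing hibernation reclaim helper):

    /* kernel/power/ (sketch): after dev_pm_ops.prepare() has moved
     * VRAM into shmem, push those shmem pages out to swap so the
     * image has room to be built and fewer pages to copy */
    static void hibernate_shrink_shmem(void)
    {
            unsigned long reclaimed;

            reclaimed = shrink_all_memory(ULONG_MAX);
            pr_debug("reclaimed %lu pages for hibernation\n", reclaimed);
    }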
2025-07-10  PM: sleep: add kernel parameter to disable asynchronous suspend/resume  [Tudor Ambarus]
On some platforms, device dependencies are not properly represented by device links, which can cause issues when asynchronous power management is enabled. While it is possible to disable this via sysfs, doing so at runtime can race with the first system suspend event. This patch introduces a kernel command-line parameter, "pm_async", which can be set to "off" to globally disable asynchronous suspend and resume operations from early boot. It effectively provides a way to set the initial value of the existing pm_async sysfs knob at boot time. This offers a robust method to fall back to synchronous (sequential) operation, which can stabilize platforms with problematic dependencies and also serve as a useful debugging tool. The default behavior remains unchanged (asynchronous enabled). To disable it, boot the kernel with the "pm_async=off" parameter. Signed-off-by: Tudor Ambarus <tudor.ambarus@linaro.org> Acked-by: Randy Dunlap <rdunlap@infradead.org> Link: https://patch.msgid.link/20250709-pm-async-off-v3-1-cb69a6fc8d04@linaro.org Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
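The boot-time plumbing is small; a sketch under assumed names (pm_async_enabled is the variable behind the existing sysfs knob; the setup-function name is hypothetical):

    /* kernel/power/main.c (sketch) */
    static int __init pm_async_setup(char *str)
    {
            if (!strcmp(str, "off"))
                    pm_async_enabled = 0;
            return 1;
    }
    __setup("pm_async=", pm_async_setup);

Booting with pm_async=off then makes every suspend/resume phase run synchronously from the very first system suspend.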
2025-07-09  kthread: update comment for __to_kthread  [Jiazi Li]
With commit 343f4c49f243 ("kthread: Don't allocate kthread_struct for init and umh") and commit 753550eb0ce1 ("fork: Explicitly set PF_KTHREAD"), umh tasks no longer have a struct kthread and the PF_KTHREAD flag. Update the comment to describe what the current rules are to detect whether something is a kthread. Link: https://lkml.kernel.org/r/20250620100801.23185-1-jqqlijiazi@gmail.com Signed-off-by: Jiazi Li <jqqlijiazi@gmail.com> Signed-off-by: mingzhu.wang <mingzhu.wang@transsion.com> Suggested-by: Eric W. Biederman <ebiederm@xmission.com> Reviewed-by: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-09  fork: clean up ifdef logic around stack allocation  [Pasha Tatashin]
There is an unneeded OR in the ifdef functions that are used to allocate and free kernel stacks based on direct map or vmap. Adding dynamic stack support would complicate this logic even further. Therefore, clean up by changing the order so OR is no longer needed. Link: https://lkml.kernel.org/r/20250618-fork-fixes-v4-1-2e05a2e1f5fc@linaro.org Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com> Link: https://lore.kernel.org/20240311164638.2015063-3-pasha.tatashin@soleen.com Signed-off-by: Linus Walleij <linus.walleij@linaro.org> Cc: Mateusz Guzik <mjguzik@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-09  uprobes: revert ref_ctr_offset in uprobe_unregister error path  [Jiri Olsa]
There's an error path that could lead to an inactive uprobe:

1) uprobe_register succeeds - updates instruction to int3 and changes ref_ctr from 0 to 1
2) uprobe_unregister fails - int3 stays in place, but ref_ctr is changed to 0 (it's not restored to 1 in the fail path), uprobe is leaked
3) another uprobe_register comes and re-uses the leaked uprobe and succeeds - but int3 is already in place, so the ref_ctr update is skipped and it stays 0 - uprobe CAN NOT be triggered now
4) uprobe_unregister fails because the ref_ctr value is unexpected

Fix this by reverting the updated ref_ctr value back to 1 in step 2), which is the case when uprobe_unregister fails (int3 stays in place), but we have already updated ref_ctr. The new scenario will go as follows:

1) uprobe_register succeeds - updates instruction to int3 and changes ref_ctr from 0 to 1
2) uprobe_unregister fails - int3 stays in place and ref_ctr is reverted to 1, uprobe is leaked
3) another uprobe_register comes and re-uses the leaked uprobe and succeeds - but int3 is already in place, so the ref_ctr update is skipped and it stays 1 - uprobe CAN be triggered now
4) uprobe_unregister succeeds

Link: https://lkml.kernel.org/r/20250514101809.2010193-1-jolsa@kernel.org
Fixes: 1cc33161a83d ("uprobes: Support SDT markers having reference count (semaphore)")
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Acked-by: David Hildenbrand <david@redhat.com>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Suggested-by: Oleg Nesterov <oleg@redhat.com>
Cc: Andrii Nakryiko <andrii@kernel.org>
Cc: "Masami Hiramatsu (Google)" <mhiramat@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-09  exit: fix misleading comment in forget_original_parent()  [Fushuai Wang]
The commit 482a3767e508 ("exit: reparent: call forget_original_parent() under tasklist_lock") moved the comment from exit_notify() to forget_original_parent(). However, the forget_original_parent() only handles (A), while (B) is handled in kill_orphaned_pgrp(). So remove the unrelated part. Link: https://lkml.kernel.org/r/20250615030930.58051-1-wangfushuai@baidu.com Signed-off-by: Fushuai Wang <wangfushuai@baidu.com> Acked-by: Oleg Nesterov <oleg@redhat.com> Cc: Andrii Nakryiko <andrii@kernel.org> Cc: Christian Brauner <brauner@kernel.org> Cc: Mateusz Guzik <mjguzik@gmail.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Pasha Tatashin <pasha.tatashin@soleen.com> Cc: wangfushuai <wangfushuai@baidu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-09  kcov: fix typo in comment of kcov_fault_in_area  [Wei Nanxin]
change '__santizer_cov_trace_pc()' to '__sanitizer_cov_trace_pc()' Link: https://lkml.kernel.org/r/20250615123237.110144-1-n9winx@163.com Signed-off-by: Wei Nanxin <n9winx@163.com> Cc: Andrey Konovalov <andreyknvl@gmail.com> Cc: Macro Elver <elver@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-09  relayfs: support a counter tracking if data is too big to write  [Jason Xing]
It really doesn't matter if the user/admin knows what the last too-big value is. Recording how many times this case is triggered would be helpful. Solve the existing issue where relay_reset() doesn't restore the value. Store the counter in the per-cpu buffer structure instead of the global buffer structure. It also solves the race condition which is likely to happen when a few per-cpu buffers encounter the too-big data case and then access the global field last_toobig without lock protection. Remove the printk in relay_close() since kernel modules can directly call relay_stats() as they want. Link: https://lkml.kernel.org/r/20250612061201.34272-6-kerneljasonxing@gmail.com Signed-off-by: Jason Xing <kernelxing@tencent.com> Reviewed-by: Yushan Zhou <katrinzhou@tencent.com> Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Cc: Jens Axboe <axboe@kernel.dk> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-09  blktrace: use rbuf->stats.full as a drop indicator in relayfs  [Jason Xing]
Replace internal subbuf_start in blktrace with the default policy in relayfs. Remove dropped field from struct blktrace. Correspondingly, call the common helper in relay. By incrementing full_count to keep track of how many times we encountered a full buffer issue, user space will know how many events were lost. Link: https://lkml.kernel.org/r/20250612061201.34272-5-kerneljasonxing@gmail.com Signed-off-by: Jason Xing <kernelxing@tencent.com> Reviewed-by: Yushan Zhou <katrinzhou@tencent.com> Reviewed-by: Jens Axboe <axboe@kernel.dk> Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-09  relayfs: introduce getting relayfs statistics function  [Jason Xing]
This version only supports getting the counter for buffer-full events and implements the framework of how it works. Users can pass a certain flag to fetch the field/statistics they expect to know. Each call returns only one result, so do not pass multiple flags. Link: https://lkml.kernel.org/r/20250612061201.34272-4-kerneljasonxing@gmail.com Signed-off-by: Jason Xing <kernelxing@tencent.com> Reviewed-by: Yushan Zhou <katrinzhou@tencent.com> Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Cc: Jens Axboe <axboe@kernel.dk> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
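A hedged usage sketch from a kernel module (the relay_stats() name comes from the commit subject; the flag name and return type are assumptions following the buffer-full counter it exposes):

    /* sketch: query how often chan's per-cpu buffers went full */
    size_t full = relay_stats(chan, RELAY_STATS_BUF_FULL);

    pr_info("relay: buffer-full events so far: %zu\n", full);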
2025-07-09  relayfs: support a counter tracking if per-cpu buffers is full  [Jason Xing]
When using the relay mechanism, we often encounter the case where new data are lost or old unconsumed data are overwritten because of a slow reader. Add a 'full' field in the per-cpu buffer structure to detect if the above case is happening. Relay has two modes: 1) non-overwrite mode, 2) overwrite mode. So the buffer being full here respectively means: 1) relayfs doesn't intend to accept new data and simply drops it, or 2) relayfs is going to start over again and overwrite old unread data with new data. Note: this counter doesn't need any explicit lock to protect it from being modified by different threads, for performance reasons. Writers calling __relay_write/relay_write should consider how to use the lock and ensure they perform under lock protection, thus it's not necessary to add a new small lock here. Link: https://lkml.kernel.org/r/20250612061201.34272-3-kerneljasonxing@gmail.com Signed-off-by: Jason Xing <kernelxing@tencent.com> Reviewed-by: Yushan Zhou <katrinzhou@tencent.com> Reviewed-by: Jens Axboe <axboe@kernel.dk> Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
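A sketch of the bookkeeping (the struct name is an assumption; the field spelling stats.full matches the blktrace entry above):

    /* sketch: per-cpu relay buffer statistics */
    struct rchan_buf_stats {
            unsigned int full;      /* buffer-full events */
    };

    /* write path (unlocked on purpose, see the note above) */
    if (buf_is_full)
            buf->stats.full++;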
2025-07-09  relayfs: abolish prev_padding  [Jason Xing]
Patch series "relayfs: misc changes", v5. The series mostly focuses on the error counters which helps every user debug their own kernel module. This patch (of 5): prev_padding represents the unused space of certain subbuffer. If the content of a call of relay_write() exceeds the limit of the remainder of this subbuffer, it will skip storing in the rest space and record the start point as buf->prev_padding in relay_switch_subbuf(). Since the buf is a per-cpu big buffer, the point of prev_padding as a global value for the whole buffer instead of a single subbuffer (whose padding info is stored in buf->padding[]) seems meaningless from the real use cases, so we don't bother to record it any more. Link: https://lkml.kernel.org/r/20250612061201.34272-1-kerneljasonxing@gmail.com Link: https://lkml.kernel.org/r/20250612061201.34272-2-kerneljasonxing@gmail.com Signed-off-by: Jason Xing <kernelxing@tencent.com> Reviewed-by: Yushan Zhou <katrinzhou@tencent.com> Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Cc: Jens Axboe <axboe@kernel.dk> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-09  kernel: relay: use __GFP_ZERO in relay_alloc_buf  [Elijah Wright]
Passing the __GFP_ZERO flag to alloc_page should result in less overhead than using memset(). Link: https://lkml.kernel.org/r/20250610225639.314970-3-git@elijahs.space Signed-off-by: Elijah Wright <git@elijahs.space> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
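The before/after shape, sketched (relay_alloc_buf() is the real function; the exact original code is inferred from the commit text):

    /* before (sketch): zero by hand after allocating */
    page = alloc_page(GFP_KERNEL);
    if (page)
            memset(page_address(page), 0, PAGE_SIZE);

    /* after: ask the allocator for a zeroed page directly */
    page = alloc_page(GFP_KERNEL | __GFP_ZERO);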
2025-07-09  fork: define a local GFP_VMAP_STACK  [Linus Walleij]
The current allocation of VMAP stack memory is using (THREADINFO_GFP & ~__GFP_ACCOUNT) which is a complicated way of saying (GFP_KERNEL | __GFP_ZERO):

  <linux/thread_info.h>: #define THREADINFO_GFP (GFP_KERNEL_ACCOUNT | __GFP_ZERO)
  <linux/gfp_types.h>:   #define GFP_KERNEL_ACCOUNT (GFP_KERNEL | __GFP_ACCOUNT)

This is an unfortunate side-effect of independent changes blurring the picture: commit 19809c2da28aee5860ad9a2eff760730a0710df0 changed (THREADINFO_GFP | __GFP_HIGHMEM) to just THREADINFO_GFP since highmem became implicit. commit 9b6f7e163cd0f468d1b9696b785659d3c27c8667 then added stack caching and rewrote the allocation to (THREADINFO_GFP & ~__GFP_ACCOUNT) as cached stacks need to be accounted separately. However that code, when it eventually accounts the memory, does this:

  ret = memcg_kmem_charge(vm->pages[i], GFP_KERNEL, 0)

so the memory is charged as a GFP_KERNEL allocation. Define a unique GFP_VMAP_STACK to use GFP_KERNEL | __GFP_ZERO and move the comment there.

Link: https://lkml.kernel.org/r/20250509-gfp-stack-v1-1-82f6f7efc210@linaro.org
Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
Reported-by: Mateusz Guzik <mjguzik@gmail.com>
Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
Cc: Mike Rapoport (Microsoft) <rppt@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
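The resulting definition is then simply (a sketch grounded in the text above; placement in kernel/fork.c assumed):

    /* vmap stacks are charged to memcg only after allocation via
     * memcg_kmem_charge(), so __GFP_ACCOUNT must not be set here */
    #define GFP_VMAP_STACK (GFP_KERNEL | __GFP_ZERO)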
2025-07-09  fork: clean-up naming of vm_stack/vm_struct variables in vmap stacks code  [Pasha Tatashin]
There are two data types: "struct vm_struct" and "struct vm_stack" that have the same local variable names: vm_stack, or vm, or s, which makes the code confusing to read. Change the code so the naming is consistent: struct vm_struct is always called vm_area struct vm_stack is always called vm_stack One change altering vfree(vm_stack) to vfree(vm_area->addr) may look like a semantic change but it is not: vm_area->addr points to the vm_stack. This was done to improve readability. [linus.walleij@linaro.org: rebased and added new users of the variable names, address review comments] Link: https://lore.kernel.org/20240311164638.2015063-4-pasha.tatashin@soleen.com Link: https://lkml.kernel.org/r/20250509-fork-fixes-v3-2-e6c69dd356f2@linaro.org Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com> Signed-off-by: Linus Walleij <linus.walleij@linaro.org> Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Cc: Mateusz Guzik <mjguzik@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-09  mm: filter zone device pages returned from folio_walk_start()  [Alistair Popple]
Previously dax pages were skipped by the pagewalk code as pud_special() or vm_normal_page{_pmd}() would be false for DAX pages. Now that dax pages are refcounted normally that is no longer the case, so the pagewalk code will start returning them. Most callers already explicitly filter for DAX or zone device pages so don't need updating. However some don't, so add checks to those callers. Link: https://lkml.kernel.org/r/4ecb7b357fc5b435588024770b88bbb695c30090.1750323463.git-series.apopple@nvidia.com Signed-off-by: Alistair Popple <apopple@nvidia.com> Cc: Balbir Singh <balbirs@nvidia.com> Cc: Björn Töpel <bjorn@kernel.org> Cc: Björn Töpel <bjorn@rivosinc.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Chunyan Zhang <zhang.lyra@gmail.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: Deepak Gupta <debug@rivosinc.com> Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com> Cc: Inki Dae <m.szyprowski@samsung.com> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: John Groves <john@groves.net> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-09  mm: use folio_expected_ref_count() helper for reference counting  [Shivank Garg]
Replace open-coded folio reference count calculations with the folio_expected_ref_count(). No functional changes intended. Link: https://lkml.kernel.org/r/20250611052706.515408-2-shivankg@amd.com Signed-off-by: Shivank Garg <shivankg@amd.com> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Ian Rogers <irogers@google.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Kan Liang <kan.liang@linux.intel.com> Cc: Marc Rutland <mark.rutland@arm.com> Cc: "Masami Hiramatsu (Google)" <mhiramat@kernel.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Namhyung kim <namhyung@kernel.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
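A hedged example of the substitution (folio_expected_ref_count() is the real helper; the open-coded form shown is a typical instance, not necessarily an exact call site from this patch):

    /* before (typical open-coded form): mappings plus our own ref */
    if (folio_ref_count(folio) != folio_mapcount(folio) + 1)
            return false;

    /* after: the helper accounts for mappings, page cache, private
     * data, etc.; callers add the references they hold themselves */
    if (folio_ref_count(folio) != folio_expected_ref_count(folio) + 1)
            return false;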
2025-07-09Revert "sched/numa: add statistics of numa balance task"Chen Yu
This reverts commit ad6b26b6a0a79166b53209df2ca1cf8636296382. That commit introduced per-memcg/task NUMA balance statistics, but unfortunately it also introduced a NULL pointer exception due to the following race condition: after a swap task candidate was chosen, its mm_struct pointer was set to NULL due to task exit. Later, when performing the actual task swapping, the NULL p->mm caused the problem.

  CPU0                                          CPU1
  task_numa_migrate
    task_numa_find_cpu
      task_numa_compare
        # a normal task p is chosen
        env->best_task = p
                                                # p exit:
                                                exit_signals(p);
                                                p->flags |= PF_EXITING
                                                exit_mm
                                                p->mm = NULL;
  migrate_swap_stop
    __migrate_swap_task(arg->src_task, arg->dst_cpu)
      count_memcg_event_mm(p->mm, NUMA_TASK_SWAP)  # p->mm is NULL

task_lock() should be held and the PF_EXITING flag needs to be checked to prevent this from happening. After discussion, the conclusion was that adding a lock is not worthwhile for some statistics calculations. Revert the change and rely on the tracepoint for this purpose.

Link: https://lkml.kernel.org/r/20250704135620.685752-1-yu.c.chen@intel.com
Link: https://lkml.kernel.org/r/20250708064917.BBD13C4CEED@smtp.kernel.org
Fixes: ad6b26b6a0a7 ("sched/numa: add statistics of numa balance task")
Signed-off-by: Chen Yu <yu.c.chen@intel.com>
Reported-by: Jirka Hladky <jhladky@redhat.com>
Closes: https://lore.kernel.org/all/CAE4VaGBLJxpd=NeRJXpSCuw=REhC5LWJpC29kDy-Zh2ZDyzQZA@mail.gmail.com/
Reported-by: Srikanth Aithal <Srikanth.Aithal@amd.com>
Reported-by: Suneeth D <Suneeth.D@amd.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jiri Hladky <jhladky@redhat.com>
Cc: Libo Chen <libo.chen@oracle.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-09  fgraph: Make pid_str size match the comment  [Artem Sadovnikov]
The comment above buffer mentions sign, 10 bytes width for number and null terminator, but buffer itself isn't large enough to hold that much data. This is a cosmetic change, since PID cannot be negative, other than -1. Found by Linux Verification Center (linuxtesting.org) with SVACE. Link: https://lore.kernel.org/20250617152110.2530-1-a.sadovnikov@ispras.ru Signed-off-by: Artem Sadovnikov <a.sadovnikov@ispras.ru> Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
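The sizing the comment describes, as a sketch (the buffer holds a decimal PID; the exact declaration site is in the fgraph code):

    /* sign (1) + up to 10 decimal digits + '\0' = 12 bytes */
    char pid_str[12];

    snprintf(pid_str, sizeof(pid_str), "%d", pid);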
2025-07-09  tracing: ring_buffer: Rewind persistent ring buffer on reboot  [Masami Hiramatsu (Google)]
Rewind persistent ring buffer pages which have been read in the previous boot. Those pages are very likely to be lost before being written to the disk if the previous kernel crashed. In this case, the trace data is kept on the persistent ring buffer, but it cannot be read because its commit size has been reset after the read. This change skips clearing the commit size of each sub-buffer and recovers it after reboot. Note: if you read the previous boot data via trace_pipe, it is not accessible at that time. But if you reboot without clearing (or reusing) the read data, the read data is recovered again on the next boot. Thus, when you read the previous boot data, clear it by `echo > trace`. Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Link: https://lore.kernel.org/174899582116.955054.773265393511190051.stgit@mhiramat.tok.corp.google.com Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-07-09  rv: Allow to configure the number of per-task monitor  [Nam Cao]
Now that there are 2 monitors for real-time applications, users may want to enable both of them simultaneously. Make the number of per-task monitors configurable. Default it to 2 for now. Cc: John Ogness <john.ogness@linutronix.de> Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Link: https://lore.kernel.org/93e83313fc4ba7f6e66f4abe80ca5f5494d658d0.1752088709.git.namcao@linutronix.de Reviewed-by: Gabriele Monaco <gmonaco@redhat.com> Signed-off-by: Nam Cao <namcao@linutronix.de> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-07-09  rv: Add rtapp_sleep monitor  [Nam Cao]
Add a monitor for checking that real-time tasks do not go to sleep in a manner that may cause undesirable latency. Also change RV "depends on TRACING" to RV "select TRACING" to avoid the following recursive dependency:

  error: recursive dependency detected!
  symbol TRACING is selected by PREEMPTIRQ_TRACEPOINTS
  symbol PREEMPTIRQ_TRACEPOINTS depends on TRACE_IRQFLAGS
  symbol TRACE_IRQFLAGS is selected by RV_MON_SLEEP
  symbol RV_MON_SLEEP depends on RV
  symbol RV depends on TRACING

Cc: John Ogness <john.ogness@linutronix.de>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://lore.kernel.org/75bc5bcc741d153aa279c95faf778dff35c5c8ad.1752088709.git.namcao@linutronix.de
Reviewed-by: Gabriele Monaco <gmonaco@redhat.com>
Signed-off-by: Nam Cao <namcao@linutronix.de>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-07-09  rv: Add rtapp_pagefault monitor  [Nam Cao]
Userspace real-time applications may have design flaws where they raise page faults in real-time threads and thus have unexpected latencies. Add a linear temporal logic monitor to detect this scenario. Cc: John Ogness <john.ogness@linutronix.de> Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Link: https://lore.kernel.org/78fea8a2de6d058241d3c6502c1a92910772b0ed.1752088709.git.namcao@linutronix.de Reviewed-by: Gabriele Monaco <gmonaco@redhat.com> Signed-off-by: Nam Cao <namcao@linutronix.de> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-07-09  rv: Add rtapp container monitor  [Nam Cao]
Add the container "rtapp" which is the monitor collection for detecting problems with real-time applications. The monitors will be added in the follow-up commits. Cc: John Ogness <john.ogness@linutronix.de> Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Link: https://lore.kernel.org/fb18b87631d386271de00959d8d4826f23fcd1cd.1752088709.git.namcao@linutronix.de Reviewed-by: Gabriele Monaco <gmonaco@redhat.com> Signed-off-by: Nam Cao <namcao@linutronix.de> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-07-09  rv: Add support for LTL monitors  [Nam Cao]
While attempting to implement DA monitors for some complex specifications, a deterministic automaton was found to be inappropriate as the specification language. The automaton is complicated, hard to understand, and error-prone. For these cases, linear temporal logic is more suitable as the specification language. Add support for linear temporal logic runtime verification monitors. Cc: John Ogness <john.ogness@linutronix.de> Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Gabriele Monaco <gmonaco@redhat.com> Link: https://lore.kernel.org/d366c1fed60ed4e8f6451f3c15a99755f2740b5f.1752088709.git.namcao@linutronix.de Signed-off-by: Nam Cao <namcao@linutronix.de> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-07-09  rv: rename CONFIG_DA_MON_EVENTS to CONFIG_RV_MON_EVENTS  [Nam Cao]
CONFIG_DA_MON_EVENTS is not specific to deterministic automaton. It could be used for other monitor types. Therefore rename it to CONFIG_RV_MON_EVENTS. This prepares for the introduction of linear temporal logic monitor. Cc: John Ogness <john.ogness@linutronix.de> Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Link: https://lore.kernel.org/507210517123d887c1d208aa2fd45ec69765d3f0.1752088709.git.namcao@linutronix.de Reviewed-by: Gabriele Monaco <gmonaco@redhat.com> Signed-off-by: Nam Cao <namcao@linutronix.de> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-07-09  rv: Let the reactors take care of buffers  [Nam Cao]
Each RV monitor has one static buffer to send to the reactors. If multiple errors are detected simultaneously, the one buffer could be overwritten. Instead, leave it to the reactors to handle buffering. Reviewed-by: Gabriele Monaco <gmonaco@redhat.com> Signed-off-by: Nam Cao <namcao@linutronix.de> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-07-09  panic: Add vpanic()  [Nam Cao]
vpanic() is useful for implementing runtime verification reactors. Add it. Signed-off-by: Nam Cao <namcao@linutronix.de> Reviewed-by: Petr Mladek <pmladek@suse.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
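The interface mirrors panic(), taking a va_list; a sketch of the assumed declaration (following the usual v* convention):

    /* include/linux/panic.h (sketch) */
    __printf(1, 0)
    void vpanic(const char *fmt, va_list args) __noreturn __cold;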
2025-07-09  printk: Make vprintk_deferred() public  [Nam Cao]
vprintk_deferred() is useful for implementing runtime verification reactors. Make it public. Signed-off-by: Nam Cao <namcao@linutronix.de> Reviewed-by: Petr Mladek <pmladek@suse.com> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
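A sketch of how a reactor can forward its arguments (the wrapper is hypothetical; vprintk_deferred() defers console output, making it usable from contexts where a direct printk() could recurse or deadlock):

    /* sketch: a printk-based RV reactor */
    static void rv_reaction_printk(const char *fmt, ...)
    {
            va_list args;

            va_start(args, fmt);
            vprintk_deferred(fmt, args);
            va_end(args);
    }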
2025-07-09  rv: Add #undef TRACE_INCLUDE_FILE  [Nam Cao]
Without "#undef TRACE_INCLUDE_FILE", there could be a build error due to TRACE_INCLUDE_FILE being redefined. Therefore add it. Also fix a typo while at it. Cc: John Ogness <john.ogness@linutronix.de> Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Link: https://lore.kernel.org/f805e074581e927bb176c742c981fa7675b6ebe5.1752088709.git.namcao@linutronix.de Reviewed-by: Gabriele Monaco <gmonaco@redhat.com> Signed-off-by: Nam Cao <namcao@linutronix.de> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2025-07-09  sched/fair: Always trigger resched at the end of a protected period  [Vincent Guittot]
Always trigger a resched after a protected period even if the entity is still eligible. It can happen that an entity remains eligible at the end of the protected period but must let an entity with a shorter slice run in order to keep its lag shorter than its slice. This is particularly true with run to parity which tries to maximize the lag. Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20250708165630.1948751-7-vincent.guittot@linaro.org
2025-07-09  sched/fair: Fix entity's lag with run to parity  [Vincent Guittot]
When an entity is enqueued without preempting current, we must ensure that the slice protection is updated to take into account the slice duration of the newly enqueued task so that its lag will not exceed its slice (+ tick). Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20250708165630.1948751-6-vincent.guittot@linaro.org
2025-07-09  sched/fair: Limit run to parity to the min slice of enqueued entities  [Vincent Guittot]
Run to parity ensures that current will get a chance to run its full slice in one go, but this can create large latency and/or lag for entities with a shorter slice that have exhausted their previous slice and are waiting to run their next one. Clamp the run to parity to the shortest slice of all enqueued entities. Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20250708165630.1948751-5-vincent.guittot@linaro.org
2025-07-09  sched/fair: Remove spurious shorter slice preemption  [Vincent Guittot]
Even if the waking task can preempt current, it might not be the one selected by pick_task_fair. Check that the waking task will be selected if we cancel the slice protection before doing so. Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20250708165630.1948751-4-vincent.guittot@linaro.org
2025-07-09  sched/fair: Fix NO_RUN_TO_PARITY case  [Vincent Guittot]
EEVDF expects the scheduler to allocate a time quantum to the selected entity and then pick a new entity for next quantum. Although this notion of time quantum is not strictly doable in our case, we can ensure a minimum runtime for each task most of the time and pick a new entity after a minimum time has elapsed. Reuse the slice protection of run to parity to ensure such runtime quantum. Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20250708165630.1948751-3-vincent.guittot@linaro.org